Tag Archives: Deep Web
Crawling The Deep Web
Posted on 20. Jun, 2009 by search_junkie.
Today I played around with a little site called Pipl.com. It's a people search engine.
But the cool thing about it is that it crawls the “deep Web.” The deep Web basically refers to pages that the search engines cannot find or index, usually because nothing links to them or because crawlers are kept out. Think password-protected pages, large hidden databases of content, and pages or directories blocked by the robots.txt file.
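To make the robots.txt part concrete, here is a minimal sketch of how a well-behaved crawler decides whether it may fetch a page. It uses Python's standard `urllib.robotparser` module. The robots.txt rules, the example.com URLs, and the `ExampleBot` user agent are made up for illustration, and `is_crawlable` is my own helper name, not something Pipl or Google publishes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The home page is allowed; anything under /private/ is blocked.
print(is_crawlable(ROBOTS_TXT, "http://example.com/index.html"))
print(is_crawlable(ROBOTS_TXT, "http://example.com/private/report.html"))
```

A crawler that honors these rules never requests the blocked page, which is one way content ends up in the deep Web.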
Now, because Pipl does such a deep crawl, it unearths what many might consider TOO much information about people. All of your social profiles will show up and, of course, public deep Web pages. It really is scary. Go ahead, try it.
But this whole deep Web thing started to spark my interest, especially when I learned that the deep Web is estimated to be several orders of magnitude larger than the Web that we all know.
It got me thinking about what Google thinks of this un-indexed content. After all, their mission is to organize the world’s information. Well, guess what came out not too long ago? A Python script that allows one to create a full XML sitemap, meaning that it will find and crawl everything on your server. This is great news to many who used to have problems getting their large sites completely indexed (many people developed their own custom Python scripts).
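For anyone curious what such a custom script might look like, here is a rough sketch. It is not Google's tool, just my own minimal take under a few assumptions: the site is a folder of static files, every file should be listed, and `build_sitemap`, the folder path, and the example.com base URL are all hypothetical. The only standard piece is the namespace from the sitemaps.org protocol.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(doc_root, base_url):
    """Walk doc_root and return sitemap XML with one <url> entry per file."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for folder, _subdirs, files in os.walk(doc_root):
        for name in sorted(files):
            # Turn the file's path on disk into a URL path relative to the site root.
            rel_path = os.path.relpath(os.path.join(folder, name), doc_root)
            url_path = quote(rel_path.replace(os.sep, "/"))
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = base_url.rstrip("/") + "/" + url_path
    return ET.tostring(urlset, encoding="unicode")
```

Calling something like `build_sitemap("/var/www/html", "http://www.example.com")` returns the XML as a string, which you would save as sitemap.xml and submit to the search engine. Because every file gets listed whether or not any page links to it, a sitemap like this can surface pages a link-following crawler would never reach.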
To me, this seems like Google’s attempt at indexing some of the deep Web. Their index should grow substantially with this new sitemap building tool. What do you think? Is this Google simply trying to help webmasters or is it part of a bigger plan to crawl more of the deep Web?
Crawling The Deep Web
Posted on 15. Feb, 2009 by search_junkie.



