Tag Archives: Google
Crawling The Deep Web
Posted on 20. Jun, 2009 by search_junkie.
Today I played around with a little site called Pipl.com, a people search engine.
But the cool thing about it is that it crawls the “deep Web.” The deep Web basically refers to content that the standard search engines cannot find or do not index: pages that nothing links to, password-protected pages, large databases of content reachable only through a search form, and pages or directories blocked by the robots.txt file.
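To make the robots.txt part concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the `is_crawlable` helper are all made up for illustration; a real crawler would fetch the file from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a polite crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/about.html",
                "http://example.com/private/report.html"):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Anything under a `Disallow` path is skipped by crawlers that honor the file, which is one way a page ends up in the deep Web.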
Now, because Pipl crawls so deeply, it unearths what many might consider TOO much information about people. All of your social profiles will show up, along with any public deep Web pages that mention you. It really is scary. Go ahead, try it.
But this whole deep Web thing started to spark my interest, especially when I learned that the deep Web is estimated to be several orders of magnitude larger than the Web that we all know.
It got me thinking about what Google makes of all this un-indexed content. After all, their mission is to organize the world’s information. Well, guess what came out not too long ago? A Python script that lets you create a full XML sitemap, meaning that it will find and list everything on your server so it can be crawled. This is great news for the many people who used to have problems getting their large sites completely indexed (many of them developed their own custom Python scripts).
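For a rough idea of what such a script does, here is a minimal sketch. It is not Google's tool. It assumes a site made of static `.html` files under one document root, and the function names and example.com base URL are hypothetical. The `urlset`/`url`/`loc` elements and the namespace come from the sitemaps.org protocol.

```python
import os
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def urls_from_docroot(docroot: str, base_url: str) -> list[str]:
    """Map every .html file under docroot to a URL (assumed static layout)."""
    urls = []
    for dirpath, _dirnames, filenames in os.walk(docroot):
        for name in filenames:
            if name.endswith(".html"):
                rel = os.path.relpath(os.path.join(dirpath, name), docroot)
                urls.append(base_url.rstrip("/") + "/" + rel.replace(os.sep, "/"))
    return sorted(urls)


def build_sitemap(urls: list[str]) -> str:
    """Build a sitemap document with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Both the path and the base URL are placeholders.
    print(build_sitemap(urls_from_docroot("/var/www/html", "http://example.com")))
```

The point is that the sitemap lists pages by walking the server's files, so even a page that nothing links to gets handed to the search engine.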
To me, this seems like Google’s attempt at indexing some of the deep Web. Their index should grow substantially with this new sitemap building tool. What do you think? Is this Google simply trying to help webmasters or is it part of a bigger plan to crawl more of the deep Web?
Case Sensitivity in SERPs
Posted on 04. Mar, 2009 by search_junkie.
I noticed today that entering the same keyword phrase with and without capitals pulled up different SERPs in Google. This was a revelation to me because in all my years I have never seen this happen, except for abbreviations or some really odd cases with specific brand names. But for a generic keyword phrase, never! Go ahead, test it out if you don’t believe me.
Try a search for “search marketing” and then try searching for “Search Marketing.” Bam! Different results. Make sure that you are signed out of your Google account when you search, so that personalization does not skew the experiment. The results are not drastically different, but they are different nonetheless! Please tell me if this has been going on for a while and I just never noticed it. From what I can tell, though, this is new.
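If you want to record the difference instead of eyeballing it, a small sketch like this compares two result lists that you have copied down by hand. The `compare_serps` helper and the example domains are placeholders, not real search results.

```python
def compare_serps(lower: list[str], capitalized: list[str]):
    """Compare two ranked result lists for the same phrase.

    Returns (only_in_lower, only_in_capitalized, moved), where `moved`
    maps each shared URL whose rank changed to its (old, new) positions,
    counting from 1.
    """
    only_lower = [u for u in lower if u not in capitalized]
    only_capitalized = [u for u in capitalized if u not in lower]
    moved = {
        u: (lower.index(u) + 1, capitalized.index(u) + 1)
        for u in lower
        if u in capitalized and lower.index(u) != capitalized.index(u)
    }
    return only_lower, only_capitalized, moved


if __name__ == "__main__":
    # Placeholder data, typed in by hand from two searches.
    serp_lower = ["a.example", "b.example", "c.example"]
    serp_caps = ["b.example", "a.example", "d.example"]
    print(compare_serps(serp_lower, serp_caps))
```

Three empty results would mean the two SERPs are identical.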
What does this mean for SEOs around the world? Well, things just got more complicated. Do you build links with anchor text both with and without caps? Probably. But that just doubled your cost of link building. Is this a tactic for Google to deal a blow to the paid linking community, which it cannot control? Probably. Does this serve up the most relevant SERP to the end user? I think not. I mean really, can one assume that a user’s intent is any different depending on whether they capitalize? Not really.
Oh Google, please come to your senses and rectify this.
Crawling The Deep Web
Posted on 15. Feb, 2009 by search_junkie.