SEO

False Reports About Yahoo! Blocking Googlebot On del.icio.us

February 21, 2008

A recent post on web developer Colin Cochrane’s blog, which got more attention than it deserved on Sphinn and the SitePro newsletter, mistakenly claims that Yahoo! has decided to play hardball with the competition.

Over the past weekend Colin found that the robots.txt file on Yahoo!’s social bookmarking property del.icio.us blocked search engine spiders, including Googlebot, from crawling certain directories. The extract from that robots.txt file pasted below shows the “offending” code:

User-agent: Googlebot
Allow: /
Disallow: /inbox
Disallow: /subscriptions
Disallow: /network
Disallow: /search
Disallow: /post
Disallow: /login
Disallow: /rss
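
For readers wondering how the Allow and Disallow lines above interact: under Google’s published robots.txt handling, the longest matching rule wins, so the blanket Allow: / does not cancel the narrower Disallow entries. The short Python sketch below illustrates that interpretation (the paths in the test calls are hypothetical examples, not taken from the post):

# Longest-matching-rule interpretation of the del.icio.us rules for Googlebot.
# Note: Python's urllib.robotparser applies first-match semantics instead,
# so it is deliberately not used here.
RULES = [
    ("Allow", "/"),
    ("Disallow", "/inbox"),
    ("Disallow", "/subscriptions"),
    ("Disallow", "/network"),
    ("Disallow", "/search"),
    ("Disallow", "/post"),
    ("Disallow", "/login"),
    ("Disallow", "/rss"),
]

def allowed(path):
    # Collect every rule whose path is a prefix of the requested path,
    # then let the most specific (longest) match decide.
    matches = [(len(p), kind) for kind, p in RULES if path.startswith(p)]
    if not matches:
        return True
    _, kind = max(matches)
    return kind == "Allow"

print(allowed("/post"))         # False - one of the blocked directories
print(allowed("/url/abc123"))   # True - ordinary bookmark pages remain crawlable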

Colin then spoofed the Googlebot user-agent to see what was being served to the spider when it requested one of those pages, and found that he was served a 404 error.
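
Reproducing that test takes only a few lines. The sketch below (the URL and user-agent strings are illustrative assumptions) requests the same page once with a normal browser user-agent and once pretending to be Googlebot, and prints the HTTP status returned for each:

import urllib.error
import urllib.request

# Hypothetical reproduction of Colin's test: fetch one of the robots.txt-listed
# pages with two different User-Agent headers and compare the responses.
URL = "http://del.icio.us/post"

for label, ua in [
    ("browser", "Mozilla/5.0 (Windows; U; Windows NT 5.1)"),
    ("spoofed Googlebot", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
]:
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        resp = urllib.request.urlopen(req)
        print(label, "->", resp.getcode())
    except urllib.error.HTTPError as err:
        print(label, "->", err.code)   # Colin reported a 404 for the spoofed request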

Colin’s erroneous post

Yahoo! has recently tested featuring information about del.icio.us bookmarks on its search results pages. Colin, along with a bunch of other SEO enthusiasts, immediately jumped to the conclusion that Yahoo! is exercising its right to prevent competitors from benefiting from del.icio.us.

Colin and the others are overlooking three extremely important facts:

  1. The directories being blocked by the robots.txt contain general administrative pages such as the “Add a URL” page, which do not need to be indexed by the search engines.
  2. The 404 result is delivered because the user is spoofing a search engine spider; del.icio.us is most likely smart enough to detect this and refuses to serve its pages in order to prevent content scraping (one common detection technique is sketched after this list).
  3. Del.icio.us, like most other websites, relies on Google for traffic, and blocking the spider from accessing its content would amount to del.icio.us shooting itself in the foot!
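
On the second point: the standard way a site distinguishes the real Googlebot from a spoofer is a reverse DNS lookup on the requesting IP followed by a forward confirmation, since genuine Googlebot requests resolve to googlebot.com hosts. Whether del.icio.us uses exactly this check is an assumption, but the sketch below shows the general technique:

import socket

def is_genuine_googlebot(client_ip):
    # Reverse-DNS / forward-confirm check: real Googlebot IPs resolve to
    # *.googlebot.com (or *.google.com) hosts that resolve back to the same IP.
    try:
        host = socket.gethostbyaddr(client_ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return client_ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False

# A server could return a 404 (or 403) to any request claiming a Googlebot
# user-agent that fails this check - which would explain what Colin saw.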

The fastest way to test the validity of Colin’s claims is to check whether Google has been able to spider and cache any pages from del.icio.us since Colin’s original discovery…

Pages spidered by Google from del.icio.us in the past 24 hours

Clicking on the link above will immediately show that the Googlebot continues to access del.icio.us.
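
For reference, that link is simply a date-restricted site: query. A minimal sketch of how such a URL can be built, assuming Google’s as_qdr=d (“past day”) parameter; the exact parameters of the original link may differ:

from urllib.parse import urlencode

# Build a Google query for pages on del.icio.us indexed within the past day.
params = {"q": "site:del.icio.us", "as_qdr": "d"}
print("http://www.google.com/search?" + urlencode(params))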

The Problem With SEOs

While the topic of del.icio.us blocking or allowing Googlebot is relatively minor, the buzz surrounding this post highlights a much bigger problem in the industry: a self-proclaimed pundit can cry “wolf”, and in no time a whole bunch of clueless “sheep” will run scared, turning what should otherwise have been someone’s minor error into SEO gospel.

2 Comments

  • Colin Cochrane says:

    I felt it would be prudent to respond myself.

    1) The directories being blocked were not the issue. The robots.txt reference was used solely as a list of user-agents to test against.

    2) If del.icio.us is serving these 404s to prevent spoofing, then it is being done to prevent proxy hijacking, not content-scraping. A content scraper could just spoof a normal Mozilla user-agent if it was worried about getting caught.

    On a final note: I’m not sure at what point I became a “self-proclaimed pundit”. I simply encountered unusual behaviour from del.icio.us, did a little investigating, and wrote about what I found. People took from that what they did.

  • Kamrul Hasan says:

    Hi, I have a question. I think you will be able to help me.
    I am finding lots of bookmarking sites with user-agent disallows. For example, connotea.org. When I click on |info| for any bookmark it takes me to its permalink, and older permalinks (bookmarked by someone) on those sites do not have any Google cache. Yet people say those are dofollow bookmarking sites. I am confused.
    How can those sites help our sites’ SEO?