robots.txt: People Don’t Always Want Search Engines to Crawl Their Content
May 18 2007 | Search Engines, SEO

What goes through your mind when you read about the silly lawsuits against Google for accessing portions of a website? What do you think when you visit the Internet Wayback Machine and find hundreds of pages of your site in (almost) their full form? Most of you wonder what is going on in the minds of these clueless people. Don’t they understand how the web works?
That’s right, folks. In case you’re not in the know, the web works in a certain way. A brand new site generally does not get indexed in search engines for a period of months. Over time, the search spiders find your site and your interlinked pages begin getting crawled. Eventually, someone will search for something and your website will hopefully come up.
Not everyone is happy with these search results, and oddly, some people just don’t want to be found. In fact, on websites that are relatively large, the spiders crawl so many pages at once that people have written the silliest robots.txt files. For example, check the hilton.com robots.txt file (which I learned about during the robots.txt Summit at Search Engine Strategies last month). Note its first two lines.
# Daytime instructions for search engines
# Do not visit Hilton.com during the day!
Dear hilton.com webmaster: Search engine spiders have no understanding of English. In a robots.txt file, any line that begins with # is a comment, and crawlers skip it. Look at the picture that accompanies this blog post. The spiders run through a site, grabbing pieces of data to spin into a web (the World Wide Web), and then they move on. They do not pay attention to your personalized messages (but on behalf of the spiders, thanks for the attention). Spiders do not understand anything. Spiders are robots.
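If you want to convince yourself that those two lines do nothing, here is a minimal sketch in Python using the standard library's urllib.robotparser module. It feeds the parser only the two comment lines quoted above (the rest of the Hilton file is not reproduced here), and the crawler name and path are placeholders picked for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the two comment lines quoted in the post; the remainder of the
# real hilton.com file is not reproduced here.
comment_only_lines = [
    "# Daytime instructions for search engines",
    "# Do not visit Hilton.com during the day!",
]

parser = RobotFileParser()
parser.parse(comment_only_lines)  # parse in-memory lines, no network access

# "Googlebot" and "/rooms" are illustrative placeholders.
comments_block_nothing = parser.can_fetch("Googlebot", "/rooms")
print(comments_block_nothing)  # True: comments impose no restriction
```

The parser throws away everything after a # before it looks for directives, so a file made only of comments behaves like an empty file.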
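If you want more information about robots.txt, you can visit one of the premier sites for the Robots Exclusion Standard. Understanding the implementation is important. If you have personal information that you don’t want the search engines to find, you can ask them to stay away from it. Bear in mind that robots.txt is a request that well-behaved crawlers honor, not a lock: the file itself is publicly readable, so truly private material needs real access control.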
Simply use the following:
User-agent: *
Disallow: /mysecretdirectory
This code tells every compliant crawler not to fetch any URL whose path begins with /mysecretdirectory. This is also helpful if you have concerns about duplicate content. The typical example is printer-friendly versions of pages on your site: you want them available to visitors, but you don’t want to be penalized by Google for the duplication. You could create a robots.txt file with the following code:
User-agent: Googlebot
Disallow: /printer-friendly
You’d obviously replace /printer-friendly with the directory your printer-friendly documents reside in.
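Before uploading rules like these, you can check how a parser reads them. The following is a rough sketch using Python's standard urllib.robotparser; the helper name build_parser, the sample paths, and the crawler name "SomeOtherBot" are all made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The two rule sets from the post, each treated as its own file.
all_bots = build_parser("User-agent: *\nDisallow: /mysecretdirectory\n")
google_only = build_parser("User-agent: Googlebot\nDisallow: /printer-friendly\n")

# The wildcard group applies to any crawler name.
print(all_bots.can_fetch("Googlebot", "/mysecretdirectory/notes.html"))  # False
print(all_bots.can_fetch("Googlebot", "/index.html"))  # True

# The Googlebot-only group leaves other crawlers unrestricted.
print(google_only.can_fetch("Googlebot", "/printer-friendly/page.html"))  # False
print(google_only.can_fetch("SomeOtherBot", "/printer-friendly/page.html"))  # True
```

One thing this makes visible: Disallow values are matched as path prefixes, so /mysecretdirectory also covers a URL such as /mysecretdirectory-old.html. Add a trailing slash if you only mean the directory.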
There are additional applications of robots.txt. Some search engines, such as Google, now let you specify the full URL of your sitemap in the robots.txt file, like so:
Sitemap: http://www.mysite.com/sitemap.xml
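To confirm that a Sitemap line is being picked up, a small Python sketch along these lines can help. It assumes Python 3.8 or later, where urllib.robotparser gained the site_maps() method, and it parses the example line from a string rather than fetching anything from www.mysite.com.

```python
from urllib.robotparser import RobotFileParser

# The example Sitemap line from the post, parsed from memory.
robots_txt = "Sitemap: http://www.mysite.com/sitemap.xml\n"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemap_urls = parser.site_maps()
print(sitemap_urls)  # ['http://www.mysite.com/sitemap.xml']
```

The Sitemap line is independent of any User-agent group, which is why it parses fine on its own here.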
You can also be selective and block specific crawlers, including the one behind the Internet Wayback Machine, as discussed earlier. This way, old versions of your site should no longer be accessible through the archive.
User-agent: ia_archiver
Disallow: /
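Here is a hedged sketch of how a crawler-specific group sits alongside a wildcard group in one file, again using Python's urllib.robotparser. The combined file and the helper name allowed are my own illustration, not a file taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative combined file: one group for ia_archiver, one for everyone else.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /mysecretdirectory
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if this parser would let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


print(allowed("ia_archiver", "/index.html"))  # False: whole site is off limits
print(allowed("Googlebot", "/index.html"))  # True: falls back to the * group
print(allowed("Googlebot", "/mysecretdirectory/a.html"))  # False
```

A crawler that finds a group naming it uses that group; everyone else falls through to the wildcard group.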
There are a variety of other bots out there that crawl your site on a regular basis. It’s not just about Google, MSN, Yahoo, or Ask. You can get an idea of what works and what doesn’t by experimenting. Note that if you block pages that are already indexed, it takes time for them to drop out of the search results.
As I mentioned in my post about the Google Webmaster Central tool, you can change the crawl rate of the Google spider. Unfortunately, you can’t be any more specific and invite the spider only during late-night hours. But you can set the spider to crawl your site at a slower rate so that it doesn’t negatively impact your web server or website performance.
Two helpful tools on robots.txt are the Robots.txt Generator and the Robots.txt Builder Tool. You are also encouraged to read the additional helpful documentation at the Google Webmaster Help Center.
Before you get mad at the search engines and start frivolous lawsuits, realize that it is also your responsibility to prevent the search engines from accessing your web pages, if that’s your desire.
Posted by Tamar Weinberg at 7:39 pm
6 Responses to “robots.txt: People Don’t Always Want Search Engines to Crawl Their Content”
Comments:
-
anty says:
May 19th, 2007 at 3:26 am
This line in the robots.txt file must be a joke. I don’t think a webmaster can be so stupid as to build a website without knowing that spiders aren’t human.
Nonetheless, a funny and helpful post!
-
anty says:
May 19th, 2007 at 3:32 am
Another idea why the webmaster might have added the comment lines about daytime:
Maybe this robots.txt file is served dynamically: during the daytime (whatever daytime means on the internet) this file is displayed, and otherwise robots are allowed to crawl the site.
Just a thought, but this would explain the comment.
-
Tamar Weinberg says:
May 20th, 2007 at 12:48 am
anty: I’ve seen this example a number of times. I don’t think so.
It’s late Saturday night (Sunday morning?) and I see the same file!
-
Elmer Cagape says:
May 22nd, 2007 at 4:51 am
Is this just link bait to make people wonder what reason they could have for this message?




