Robots.txt - The Often Forgotten File
This is one of those files that almost any web bot will look for on your server. If it is there, it can be useful. If it isn’t, your log file will be filled with 404 errors. So, what is it? Well, it is literally a file on your server called robots.txt. It is a plain text file that is placed into the root directory of your website. When a search engine crawls your site, it will look for the robots.txt file for instructions.
What’s it good for?
The file is used to control what search engine bots do when they come to your site. One common use is to block a bot from crawling certain directories on the website, such as an images folder or a folder containing scripts that you may not want indexed. The file must use correct syntax; otherwise it could adversely affect the way bots interact with your site. Each rule names a bot in a User-agent line (or uses * for all bots), followed by one or more Disallow lines. So, for example, to block every bot from your entire site, you use the syntax:
User-agent: *
Disallow: /
Now let’s say you want to keep only Google’s Image Search from indexing your site’s images. You would name that bot in your robots.txt file:
User-agent: Googlebot-Image
Disallow: /
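If you want to see how a bot might read that rule, here is a minimal sketch using Python's standard-library urllib.robotparser. The example.com URL and the SomeOtherBot name are made up for illustration, and real crawlers may interpret rules a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above, fed to the parser as text.
rules = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical image URL on a placeholder domain.
image_url = "https://example.com/images/logo.png"

# The named bot is blocked everywhere; a bot with no matching rule is not.
image_bot_allowed = parser.can_fetch("Googlebot-Image", image_url)
other_bot_allowed = parser.can_fetch("SomeOtherBot", image_url)
print(image_bot_allowed, other_bot_allowed)
```

The rule only affects the bot it names, so every other bot is still free to crawl the images.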
You can view a list of all search bots here or here to help you know what to specify for User-Agent. To see a list of unsafe bots, check out this list of 135. To disallow ALL bots from certain folders on your site, do something like the following:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
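As a quick sanity check, the folder rules above can be tested offline before you upload the file. This is only a sketch with Python's standard-library urllib.robotparser; the ExampleBot name, the example.com domain and the sample paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The folder rules from above, one command per line.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)


def is_allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) bot may fetch this path."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Made-up sample paths: one outside the blocked folders, three inside them.
sample_paths = ["/index.html", "/tmp/cache.dat", "/private/notes.txt", "/cgi-bin/form.pl"]
results = {path: is_allowed(path) for path in sample_paths}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Each Disallow value is treated as a path prefix, so anything inside the three listed folders is blocked and the rest of the site stays open.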
If, for some reason, you wanted to ban all bots from your whole site, you would use:
User-agent: *
Disallow: /
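Under the original robots.txt standard, you can’t use a wildcard in the Disallow field. The * wildcard can be used in the User-agent field, but not the other. Some major crawlers, including Googlebot, do understand * and $ in Disallow paths as an extension, but you can’t count on every bot to honor them. You rarely need one anyway, because each Disallow value already matches every path that starts with it.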
Do you need a robots.txt file? No, but like I said, not having one will lead to a bunch of 404 errors for it in your server log file. That said, your site will function just fine without one. Without a robots.txt file, a search bot will simply assume that it is OK to crawl everything on your site.
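Here is a small sketch of that "no rules means crawl everything" default, again using Python's standard-library urllib.robotparser. An empty rule list stands in for a site with no robots.txt, and the bot name and URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # no rules at all, standing in for a missing robots.txt

# With nothing disallowed, any path is fair game for any bot.
allowed = parser.can_fetch("ExampleBot", "https://example.com/any/page.html")
print(allowed)
```

When this parser fetches a live robots.txt and gets a 404 back, it likewise treats the whole site as allowed.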
The other rules to keep in mind are:
- One command per line (you can’t stack user-agents or disallows on a single line)
- Only one robots.txt per domain, located in the site’s root
- File must be called robots.txt (all lowercase)
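The rules above are simple enough to check with a few lines of code. The sketch below is not an official validator. The function names are invented for this example, and it only knows the two fields covered in this article. It shows where a bot would look for the file and flags lines that are not a single "Field: value" command.

```python
from urllib.parse import urlsplit, urlunsplit

# Only the two fields discussed in this article; an assumption of this sketch.
KNOWN_FIELDS = {"user-agent", "disallow"}


def robots_url(page_url):
    """Return the one place a bot looks for rules: /robots.txt at the site root."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def malformed_lines(text):
    """Return (line_number, line) pairs with a missing colon or an unknown field."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue  # blank lines are fine
        field, separator, _value = line.partition(":")
        if not separator or field.strip().lower() not in KNOWN_FIELDS:
            problems.append((number, raw))
    return problems


print(robots_url("https://example.com/blog/2007/post.html?page=2"))
print(malformed_lines("User-agent: *\nDisallow /\n"))
```

The second call flags "Disallow /" because the colon is missing, which is an easy mistake to make when typing the file by hand.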
If you don’t want to mess around with manually creating this file (as if it’s that hard), you can use a tool to create one for you, such as:
- Yellowpipe robots.txt Generator
- Advanced Robots.txt Generator (a paid program)
Why would you want one?
- Block unwanted bots, like image search
- You can direct certain bots to certain content. For example, you might want to control which bots crawl your foreign language content, or steer bots from specialized engines toward the content they target.
- You can prevent unwanted bots from overworking your server
Remember to validate your robots.txt here.
Hopefully that takes any potential mystery out of this little file. I recommend putting up one if you haven’t already. Even if you simply put it up there to allow all bots, at least its presence will spare your log files from all the 404 errors.






