December 1, 2010
Web robots, also known as spiders or crawlers, are programs that search engines such as Google and Yahoo use to crawl the vast expanse of the World Wide Web and index web content.
Robots.txt is a text file that many people put on their websites to tell robots which pages of the site they should stay away from. Please keep in mind that having a robots.txt file doesn’t mean robots have to follow your directions. It is more like leaving a friendly “Do Not Disturb” note on an unlocked door: good robots (such as search engine robots) will turn and keep walking, but malware robots will most likely just open that dang door anyway. How rude! So if you have really sensitive data, be aware that you cannot rely on your robots.txt file alone.
Your robots.txt file must be located in the main (root) directory of your site (e.g. http://mydomain.com/robots.txt) or search engine robots will not be able to find it. If they do not find it there, they will not look anywhere else; instead, they will assume that your site does not have one and index everything they find.
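To make the location rule concrete, here is a minimal Python sketch that works out where a robot would look for the file, given any page on a site. The function name robots_url is made up for illustration, and the sketch assumes the simple case where the file sits at the root of the same host as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt,
    # no matter how deep the original page sits in the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://mydomain.com/blog/2010/12/post.html?page=2"))
```

Whatever page you start from, the result is the same root-level address, which is why a copy placed in a subdirectory is never consulted.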
A basic “robots.txt” that instructs all robots not to crawl any of your site’s pages needs only two lines: “User-agent: *” followed by “Disallow: /”.
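If you would like to see how a well-behaved robot reads a block-everything file, the following is a minimal sketch that assumes Python's standard-library urllib.robotparser as a stand-in for such a robot. The helper name is_allowed, the robot names in the loop, and the example URL are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Conventional "keep everyone out" rules: every robot, every path.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the parser whether user_agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Every robot that honors the file should be refused.
    for agent in ("Googlebot", "Slurp", "SomeOtherBot"):
        print(agent, is_allowed(BLOCK_ALL, agent, "http://mydomain.com/index.html"))
```

Remember that this only models polite robots; a rude one never runs a check like this at all.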
To prevent Google’s image bot from crawling your site’s images, a robots.txt file would typically name that robot with “User-agent: Googlebot-Image” and follow it with “Disallow: /”.
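A rule aimed at a single robot should leave every other robot alone. The sketch below checks that, again assuming Python's standard-library urllib.robotparser as a stand-in for a polite robot. Googlebot-Image is the name Google's image crawler goes by; the “Disallow: /” path, the helper name may_fetch, and the image URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed form of the rule: shut Google's image robot out of the whole site.
BLOCK_GOOGLE_IMAGES = """\
User-agent: Googlebot-Image
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent is permitted to fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    image_url = "http://mydomain.com/images/logo.png"
    # Only the image robot is named, so only it should be turned away.
    for agent in ("Googlebot-Image", "Googlebot", "Slurp"):
        print(agent, may_fetch(BLOCK_GOOGLE_IMAGES, agent, image_url))
```

In this sketch only the named image robot is refused, while the regular Googlebot and other robots are still let in.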
Please remember that you need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything on your site, you don’t need a robots.txt file (not even an empty one).