The robots.txt file, a creation of the Robots Exclusion Protocol, is stored in a website's root directory (e.g., example.com/robots.txt) and provides crawl instructions to the automated web crawlers that visit your website. Webmasters use it to tell crawlers which parts of the site they would like to disallow from crawling. The file can also set crawl-delay parameters.
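For illustration, a minimal robots.txt might look like the following; the directory paths and the bot name are hypothetical examples, not recommendations for any particular site:

```
# Apply these rules to all crawlers
User-agent: *
# Keep crawlers out of these (hypothetical) directories
Disallow: /admin/
Disallow: /tmp/
# Ask compliant crawlers to wait 10 seconds between requests
Crawl-delay: 10

# Block one specific (hypothetical) crawler from the entire site
User-agent: ExampleBot
Disallow: /
```

Note that Crawl-delay is a non-standard extension: some crawlers honor it, while others ignore it or use their own pacing settings.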
How does a robots.txt file work?
Search engines and crawlers have two main jobs:
Crawling or “spidering” the world wide web to discover relevant content;
Indexing the content it finds so that it can be served up to searchers who are looking for information.
During this process, crawlers follow links from page to page and site to site, ultimately crawling across many billions of links and websites.
When a search crawler arrives at a website, the first thing it does, before it begins spidering, is look for a robots.txt file. If it finds one, the crawler reads that file before scanning any pages. Because the robots.txt file describes how search engines should crawl the site, the information found there guides the crawler's further activity on that particular site. If the robots.txt file contains no directives that disallow a user-agent's activity, or if the site has no robots.txt file at all, the crawler proceeds to crawl the rest of the site.
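As a rough sketch of this check from the crawler's side, Python's standard urllib.robotparser module can fetch a site's robots.txt and answer whether a given user-agent may fetch a given URL. The site URL, path, and user-agent name below are placeholders for illustration only:

```python
import urllib.robotparser

# Placeholder site and user-agent, for illustration only
SITE = "https://example.com"
USER_AGENT = "ExampleBot"

# Fetch and parse the site's robots.txt before crawling anything else
parser = urllib.robotparser.RobotFileParser()
parser.set_url(SITE + "/robots.txt")
parser.read()

# Check whether a specific URL is allowed for this user-agent
url = SITE + "/admin/settings"
if parser.can_fetch(USER_AGENT, url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)

# Respect a crawl-delay directive if one is present (may be None)
delay = parser.crawl_delay(USER_AGENT)
if delay:
    print("Requested delay between requests:", delay, "seconds")
```

If the site serves no robots.txt at all, can_fetch() treats every URL as allowed, which mirrors the behavior described above.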
Note: Not all web spiders follow robots.txt directives; malicious bots can be programmed to ignore them entirely.