Different types of web robots, also known as Internet bots, routinely and automatically “crawl” the Internet, fetching and indexing website pages. There is a long-standing convention that regulates this activity by letting website owners tell robots which parts of a site they may not access. It is known as the Robots Exclusion Standard, but where did it come from, and how does it work?
HISTORY
In 1994, the main communication channel for World Wide Web-related activities was a mailing list named www-talk. It was here that Dutch software engineer Martijn Koster—then working for UK computer security company Nexor—proposed a standard to help solve the problems that robots cause. Although Koster acknowledged their “very useful services,” he also noted that “robots are one of the few aspects of the web that cause operational problems and cause people grief.” The standard he proposed was meant to maximize the benefits of robots and minimize their problems. With major robot writers voting in favor, the Robots Exclusion Standard was created by consensus in June of that year; it became the system that web crawlers have followed ever since.
BASIC DESCRIPTION
The Robots Exclusion Standard is implemented as a plain text file named robots.txt. If you are a website owner who wants to give instructions to robots, you place the file at the root of the website hierarchy, so that its URL reads like this:
https://www.website.com/robots.txt
When robots arrive at a website, they read the contents of the robots.txt file before fetching anything else. If there is no robots.txt file, robots assume that you have no instructions to give and will crawl the entire website. Also note that compliance is voluntary: some robots simply ignore robots.txt.
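If you would rather make that permission explicit, the equivalent robots.txt contains a single record with an empty Disallow value, which tells every robot that nothing is off-limits:
User-agent: *
Disallow: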
USER AGENT AND DIRECTIVES
The instructions in a robots.txt file follow a simple format: each record begins with a User-agent line naming the robot (or robots) the rules apply to, followed by directives telling those robots which files or directories to skip when crawling the site. The user agent is the name a robot identifies itself with, and the wildcard * matches every robot. For instance, if you want all robots to stay away from the image.html file in the “pictures” folder, the record should look like this:
User-agent: *
Disallow: /pictures/image.html
You can also block specific robots, or all robots, from certain files or directories of your website, or shut them out altogether. For example, this record tells a particular robot (“EvilBot”) to stay away from the “private” folder:
User-agent: EvilBot    # replace "EvilBot" with the bot's actual user-agent name
Disallow: /private
And for shutting out all robots, use the following syntax:
User-agent: *
Disallow: /
NON-STANDARD EXTENSIONS
Many major crawlers also support the Allow directive, which counteracts Disallow. This is useful when you want robots to crawl and index a document inside a directory that you have otherwise told them to avoid. For example, if you want to grant robots access to myfile.html while keeping them away from the folder in which it is stored (“personal”), the record should read like this, with Allow listed first so that crawlers applying the first matching rule honor the exception:
User-agent: *
Allow: /personal/myfile.html
Disallow: /personal/
Another non-standard extension is the Crawl-delay parameter, which asks a robot to wait a set number of seconds between successive requests to the server. It is typed like this (here every robot is asked to wait 5 seconds between requests):
User-agent: *
Crawl-delay: 5
NOINDEX
Also popular is the “noindex” rule, usually applied as a robots meta tag in a page’s HTML, which tells robots not to include that page in their search index. It is particularly useful if you are operating a website with very transitory pages, a very large database, a separate mobile-friendly version of your content, or material you would rather keep out of search results.
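For example, placing the following tag in the head section of a page asks compliant robots not to index it:
<meta name="robots" content="noindex">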
In addition to internet security, Blake Pickett writes on computer software, web development, mobile apps, gadget accessories, video games, and other interesting tech topics.