This Wednesday everything revolves around the topic Robots.txt. We want to look at how to create a Robots.txt, how to avoid the most common mistakes and which alternatives there are. Before I start, however, I'd like to explain a few basics.
With Robots.txt the webmaster has the possibility to define which subpages and directories of his website should not be indexed by the search engines. There are a number of reasons why pages or directories are excluded from indexing. For example, no pages should be indexed that are still under construction or are only used for private purposes.
To make this possible, the Robots Exclusion Standard was established in 1994 by an independent group. The protocol has since become generally accepted and can be regarded as a quasi-standard.
The protocol specifies that a user agent (robot) first searches for a file called robots.txt in the root directory of the domain and then reads and interprets it.
!!Important!! - The file name must be written completely in lower case letters.
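As a rough illustration of this lookup step, the following Python sketch derives the location of the robots.txt from any page URL. The function name robots_txt_url and the example URL are assumptions made for this illustration, not part of the protocol.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the URL where a robot would look for robots.txt for page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.site.com/example-directory/file-1.html"))
```

Whatever subpage the robot starts from, the file is always expected directly in the root directory and with a lower-case name.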
In this file you can define whether and how the website may be visited by a robot. The protocol is purely indicative and therefore dependent on the cooperation of the robots. The known search engines usually follow the instructions in the Robots.txt, as long as they are syntactically correct.
The exclusion of certain URLs of a web presence by the protocol does not guarantee secrecy. To really keep a document secret, you should use other methods such as HTTP authentication, an Access Control List (ACL) or a similar variant.
Now that I've gone a little bit into the basics, let's look at the structure of Robots.txt. A Robots.txt is easy to create; all you need is a text editor. There are now also some free tools for webmasters that automate the process. Google's webmaster tools also include a Robots.txt generator, although a Google account is needed to use it.
The Robots.txt consists of different records, which are structured according to a very specific scheme. A record consists of two parts. The first part specifies the robots (user agents) for which the following instructions should apply. In the second part, the instructions themselves are noted down:
User-agent: Googlebot
Disallow:
With the User Agent we have therefore determined that this data record only applies to the Googlebot. In the next line we find an empty Disallow entry. If you do not specify a file or a directory for a disallow, this means that all pages may be included in the index.
A single slash (/) has the opposite effect and excludes the entire website from indexing:
User-agent: Googlebot
Disallow: /
If you want to exclude certain files or directories for all robots, you can use the so-called wildcard (*), a placeholder that applies to all robots:
User-agent: *
Disallow: /example-directory/
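To see how a rule-abiding robot might read these three records, here is a small sketch using Python's standard urllib.robotparser module. The helper may_fetch and the www.site.com URLs are made up for this example, and real search engine crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

ALLOW_ALL = "User-agent: Googlebot\nDisallow:\n"
BLOCK_ALL = "User-agent: Googlebot\nDisallow: /\n"
BLOCK_DIR = "User-agent: *\nDisallow: /example-directory/\n"

page = "http://www.site.com/example-directory/file-1.html"
print(may_fetch(ALLOW_ALL, "Googlebot", page))     # empty Disallow: nothing is blocked
print(may_fetch(BLOCK_ALL, "Googlebot", page))     # single slash: everything is blocked
print(may_fetch(BLOCK_DIR, "SomeOtherBot", page))  # * applies to every robot
```

Note that a record addressed only to Googlebot says nothing about other robots, so a different crawler would still be allowed to fetch the page under the second record.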
Of course, it can happen that we want to formulate a rule that only applies to the Googlebot and the Yahoo! web crawler, for example. The Robots.txt therefore also allows multiple User-agent entries in one record. You can find the names of the different web crawlers (robots) on robotstxt.org, for example. If you want more detail, you can also look at the complete data for each robot there.
Some important user agents I have compiled in a small list:
| Msnbot / bingbot | MSN / Bing |
User-agent: googlebot
User-agent: slurp
Disallow: /example-directory/
If you want to exclude several pages from indexing, a separate disallow line must be created for each file or directory. Specifying several paths in a disallow line leads to errors.
User-agent: googlebot
Disallow: /example-directory/
Disallow: /example-directory-2/
Disallow: /example-file.html
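If you generate the file with a script, the one-path-per-line rule is easy to enforce. The following is only a sketch; build_record is a hypothetical helper invented for this example.

```python
def build_record(agents, paths):
    """Build one robots.txt record: one line per user agent, one per path."""
    lines = [f"User-agent: {agent}" for agent in agents]
    # An empty path list yields an empty Disallow, which blocks nothing.
    lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
    return "\n".join(lines) + "\n"

print(build_record(
    ["googlebot", "slurp"],
    ["/example-directory/", "/example-directory-2/", "/example-file.html"],
))
```

Because every path gets its own Disallow line, the mistake of putting several paths into one line cannot occur.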
The Robots.txt does not allow regular expressions, but there is a way to exclude all files whose path begins with a certain string:
User-agent: *
Disallow: /example
This rule would result in all URLs that start with /example not being included in the index. It does not matter whether it is a file (/example.html) or a directory (/example-directory/file-1.html).
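The prefix behaviour described here can be sketched in a few lines of Python. This is a simplified model of the matching, not the code of any real crawler, and is_disallowed is a name chosen for this example.

```python
def is_disallowed(path: str, disallow_prefixes) -> bool:
    """True if path starts with any non-empty Disallow value (simple prefix rule)."""
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)

rules = ["/example"]
for path in ["/example.html", "/example-directory/file-1.html", "/other/example.html"]:
    print(path, is_disallowed(path, rules))
```

The last path is not affected, because the string /example only appears in the middle of the path and not at its beginning.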
The last general rule I'm talking about is excluding files with certain file extensions:
User-agent: *
Disallow: /*.jpg$
At this point, the asterisk serves as a placeholder for any character string. The dollar sign at the end means that nothing more may follow after the file extension. We therefore have a means to exclude different file types, such as images, program files or log files from indexing.
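How a robot that understands these two special characters might evaluate such a rule can be sketched by translating it into a regular expression. This is an assumption about one possible implementation, not how any particular search engine does it, and rule_matches is a hypothetical helper.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Check a path against a Disallow value that may contain * and a final $."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * stand for any string.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"  # nothing may follow the pattern
    # re.match anchors at the start, so an unanchored rule acts as a prefix.
    return re.match(regex, path) is not None

for path in ["/images/photo.jpg", "/images/photo.jpg?size=2", "/photo.jpeg"]:
    print(path, rule_matches("/*.jpg$", path))
```

Only the first path is matched. The second one continues after the file extension, which the dollar sign forbids.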
There are a few more very interesting rules, but not all robots can interpret them. I will therefore state all of the following rules for the Googlebot, which is able to understand them.
If you want to specifically exclude directories that start with a certain string, the following rule can be applied:
User-agent: Googlebot
Disallow: /example-directory*/
For example, the directories /example-directory-1/ and /example-directory-2/ would not be indexed.
It often happens that the same page appears several times in the index of the search engines due to the use of parameters. This can happen, for example, through the use of forms or certain filter functions:
User-agent: Googlebot
Disallow: /*?
This rule excludes all paths that contain a question mark in the URL from indexing.
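To get a feeling for which URLs of a site such a rule would affect, you can run a list of URLs through a simple check. The sketch below only approximates the rule by looking for a question mark; has_query_marker and the shop URLs are invented for this example.

```python
def has_query_marker(url: str) -> bool:
    """Roughly what `Disallow: /*?` targets: a '?' in the URL before any fragment."""
    without_fragment = url.split("#", 1)[0]  # the fragment is never sent to the server
    return "?" in without_fragment

urls = [
    "http://www.site.com/shoes",
    "http://www.site.com/shoes?color=red",
    "http://www.site.com/shoes?color=red&sort=price",
]
still_crawlable = [url for url in urls if not has_query_marker(url)]
print(still_crawlable)
```

Of the three variants of the same page, only the one without parameters remains available to the crawler.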
Another entry often found in Robots.txt is a sitemap:
This entry tells the robot where to find the sitemap of the page. All sitemaps of a page should be listed at this point.
Multiple entries should be specified as follows:
Sitemap: http://www.site.com/sitemap.xml
Sitemap: http://www.site.com/sitemap-bilder.xml
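A robot reads these entries independently of the user agent records. As a small sketch, Python's standard urllib.robotparser module (from Python 3.8 on) can list them; the surrounding record in ROBOTS_TXT is just sample content for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /example-directory/
Sitemap: http://www.site.com/sitemap.xml
Sitemap: http://www.site.com/sitemap-bilder.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
sitemaps = parser.site_maps()  # returns None if no Sitemap entry exists
for sitemap_url in sitemaps or []:
    print(sitemap_url)
```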
In addition to the Disallow statement, the IETF (Internet Engineering Task Force) introduced the Allow statement, which is not yet supported by every robot. You should therefore do without it and limit yourself to Disallow statements.
Of course, if you have longer rules, errors can sneak in quickly, so you should have the rules checked again. One possibility is offered by the Google webmaster tools (Website configuration -> Crawler access); other tools can be found here and here. For the latter two tools, the Robots.txt must already be on the server.
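For a quick first check before uploading, a few lines of Python can catch two of the mistakes mentioned in this article: a misspelled field name and several paths in one Disallow line. This is only a minimal sketch and no replacement for the tools above; lint and KNOWN_FIELDS are names invented for this example.

```python
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}

def lint(robots_txt: str) -> list:
    """Return a list of human-readable problems found in robots_txt."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        field, separator, value = line.partition(":")
        field = field.strip().lower()
        if not separator or field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field {field!r}")
        elif field in {"disallow", "allow"} and len(value.split()) > 1:
            problems.append(f"line {number}: more than one path in {field}")
    return problems

for problem in lint("User agent: googlebot\nDisallow: /a/ /b/\n"):
    print(problem)
```

Here the first line is reported because the hyphen in User-agent is missing, the second because it lists two paths.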
Now that we have gone into the creation of a Robots.txt in detail, we want to look at another alternative. Robots.txt is not the only way to tell search engines which pages can be included in the index. An alternative is the Robots meta tag, which is defined like the other meta tags in the head area of a page. This variant is useful to exclude individual pages from indexing. The exclusion of whole directories is not possible here. However, if you want to be sure that a page does not appear in the index of search engines, this is the safer option.
<meta name="robots" content="noindex, follow" />
With this entry, we can tell the search engine robots that the page should not be indexed, but that the links on this page should be visited by the crawler.
If you also want to prohibit the search engines from archiving a page, you can add a third value:
<meta name="robots" content="noindex, nofollow, noarchive" />
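If you want to check which values a page actually delivers, the tag can be read with Python's standard html.parser module. The following is a sketch under the assumption that the tag is written with plain straight quotes; RobotsMetaParser and robots_directives are names made up for this example.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the values of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def robots_directives(html: str) -> set:
    """Return the set of robots meta values found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
print("noindex" in robots_directives(page))
```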
Finally I would like to say a few words about the Robots.txt. What you always have to keep in mind is that an entry in Robots.txt does not guarantee that a page will not be indexed. If you really want to be sure, you should set the corresponding page to noindex via the Robots meta tag. In this short video Matt Cutts deals exactly with this problem:
Finally, I would like to give a few hints that should be considered when using Robots.txt: