Many webmasters and bloggers still don’t know much about Robots.txt and the vast importance it holds for any website.
Obviously, there are many resources on the internet about Robots.txt, but most of them offer ready-made templates which you can just copy and paste for your own site. Easy stuff, isn’t it? But such a template may not work the way you want it to, as it has not been built with your particular site in mind.
Each and every site on the Web is different, so a ready-made Robots.txt can never be a perfect solution. You need to design a custom Robots.txt file focusing on the SEO, Security and Server environment of your site.
So, in this article I will try to guide you to learn and understand how the Robots.txt file actually works and how you can create your own perfectly optimized Robots.txt.
(Image Source: blog.bootstability.com)
What is Robots.txt?
Robots.txt is simply a TEXT file containing rules that control the way search engines behave with your site. It can be created using any Text Editor such as Notepad, and it must be present in the root directory of your site.
In WordPress, there is also a provision for a virtual Robots.txt file, which is a bit different from a physical Robots.txt. But we will get into the details of it in the later sections.
Generally, search engines are designed to index as much information as they can, and the Robots.txt file is essentially a way to restrict their access to your site. Now you may ask: why would you want to restrict a search engine robot from crawling and indexing your site openly?
It is a myth that the more the search engine robots crawl your site, the better your chances of ranking in the SERPs. That is not necessarily true. Therefore, in the next section, I discuss some of the reasons why you would want to put a certain level of restrictions on the search engine bots crawling your site.
The Need of Robots.txt for Your Site
There are three major reasons for having a Robots.txt file for your site:
i) Search Engine Optimization
You are surely not going to get your site ranked better in the search engines by letting the bots crawl and index more and more pages of your site. Allowing the bots to index anything and everything in your site can actually harm your site’s rankings.
For instance, take the duplicate content problems in WordPress. WordPress is designed so that the same content can be reached through multiple paths, such as the Categories, Tags, Date-based archives, Author archives or the search feature of your site.
So many paths or URLs, all pointing to a single piece of content, can lead to huge duplicate content problems in your site. There have been many cases where websites got heavily penalised by the search engines only because of duplicate content problems in WordPress. It’s not the fault of WordPress; it’s simply the way it has been structured and designed. It’s your duty to take care of your site in your own way.
ii) Security
There are some strong security reasons too for restricting the bots from accessing each and every corner of your site. There may be some places in your site storing private and confidential data which you never want to show up in the search engine results pages.
To simplify things, there are two types of robots: good robots and bad robots. If you give unrestricted access to your site, then not only the search engines but also various bad bots will get a fine chance to access and steal confidential information from your site.
iii) Server Performance
Giving unrestricted access to your site can waste a huge amount of bandwidth and can slow down your site for real users. There are various pages of your site which really need not be indexed at all.
You may argue that your webhost gives you “unlimited bandwidth”, but technically speaking, there is nothing called “unlimited”, as everything has its own limits. Not only would you be wasting your precious bandwidth and slowing down your entire site, but you would also run a real risk of overloading your server CPU.
There are many cases where webmasters had their accounts suspended by their webhosts only because of this problem.
Virtual vs. Real Robots.txt in WordPress
Now, WordPress has its own way of dealing with robots.txt. It has functions to create a “virtual” robots.txt. The word virtual means that this file does not exist on the server. The only way to access this file is from the URL, like t. If you open the root directory of your server via FTP, you won’t find any file named Robots.txt.
A real robots.txt, on the other hand, is something that you need to create and upload to the root directory of your server via FTP. The virtual robots.txt is basically a fake file which serves the purpose of a real one to some extent. WordPress creates this virtual file only after you publish your first post with WordPress (and not before that).
If you already have a real robots.txt file on your server, then WordPress won’t even bother to create another one for you. The virtual robots.txt that WordPress creates can essentially do two basic things for you: either it can grant full access to the search engines or it can completely restrict all of them (obviously only the ones which obey the Robots.txt).
Understanding Robots.txt and its Directives
You cannot just copy and paste a basic template from some site and use it as your Robots.txt file. Every single site on the web is different, and the approach to creating a Robots.txt for each of them is different too.
So you need to learn how Robots.txt and its various directives work. Once you understand them (which is really easy to do), you can create a fully customized Robots.txt file for your own site, optimized in terms of SEO, Security and Server Performance.
The Basic Operations
You can either grant complete access to your site or block it entirely. It’s your choice entirely. If you have selected the option, “I would like my site to be visible to everyone, including search engines (like Google, Bing, Technorati) and archivers” from the Privacy Settings option in your WordPress Admin panel, then your Robots.txt would look something like this:
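A minimal sketch of that allow-everything file, assuming WordPress emits nothing beyond the two basic lines:

```text
# Applies to every robot; an empty Disallow value restricts nothing
User-agent: *
Disallow:
```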
Or if you have selected “I would like to block search engines, but allow normal visitors”, then your Robots.txt should look something like this:
User-agent: *
Disallow: /
If you look carefully, there is a slight but very important difference between the two snippets which completely changes the behaviour of the two Robots.txt files.
The first one has the “Disallow” part left blank, while the second one has a “/” in it. The “*” in the “User-agent” line tells all search engines to follow these rules. The first snippet doesn’t disallow (or restrict) anything for any search engine bot, while the second one restricts everything for every robot.
Blocking the Essential WordPress Directories
There are three standard directories in every WordPress installation: wp-content, wp-admin and wp-includes. You certainly don’t want the search engines to index the core WordPress files of your site, which may also contain sensitive data about your installation.
But the wp-content folder contains a subfolder named “uploads”, which holds the media files of your site, and there is no reason for you to block this folder. So, instead of directly blocking the entire wp-content folder, you have to use separate directives for its other subfolders.
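As a rough sketch, such rules might look like the following. The plugins, themes and cache subfolder names are assumptions based on a typical installation; adjust them to what actually exists under your wp-content folder.

```text
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
# Block wp-content subfolders one by one so that /wp-content/uploads/ stays crawlable
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /wp-content/cache/
```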
If you look carefully, there is a trailing slash at the end of every statement. Many sites never mention the importance of the trailing slash, but leaving it out can be a problem if your permalinks are set to “/%postname%/”.
Without the trailing slash, every post or page whose URL starts with “wp-admin”, “wp-includes”, etc. will never be indexed. Many people don’t pay much attention to this, mainly because there is very little chance of your posts starting with these obscure names.
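To make the difference concrete, here is a small comparison of the two forms. The post slug in the comment is purely hypothetical.

```text
# Without the trailing slash this is a plain prefix match, so it would
# also block a post published at /wp-admin-tips/ (hypothetical slug)
Disallow: /wp-admin

# With the trailing slash only URLs inside the /wp-admin/ directory are blocked
Disallow: /wp-admin/
```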
Blocking on the Basis of Your Site Structure
Every blog is structured differently, and you need to understand your own structure, as this alone is capable of creating a huge duplicate content mess in your site.
Every blog can be structured in various ways and some of them are listed below:
i) On the basis of Categories
ii) On the basis of Tags
iii) On the basis of both Categories and Tags
iv) None of them
So, if your site is Category structured, then there is no need to get the Tag archives indexed. First you need to find your tag base, which you can find on the Permalinks options page under the Settings menu. If the field is left blank, then the tag base is simply “tag”.
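Assuming the default tag base, a rule along these lines would keep the tag archives out of the index (replace “tag” with your own tag base if you have changed it):

```text
User-agent: *
Disallow: /tag/
```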
If your site is Tag structured, then you need to block the Category archives. Find your Category base the same way you did before and use this code to block them permanently.
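A minimal sketch, assuming the default category base of “category”:

```text
User-agent: *
Disallow: /category/
```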
If you use both Categories and Tags to structure your blog, then you need to do nothing at all. If you use neither of them, then you need to block both by combining the two steps mentioned above.
Your site may also be structured on the basis of Date-based archives. This is another thing which can cause duplicate content issues for your site. You can block the date-based archives from being indexed in the following ways:
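One possible form is sketched below, with one rule per year. The years are hypothetical; list the years in which your blog actually has posts. Note that this assumes your post permalinks do not themselves begin with the year, otherwise the posts would be blocked as well.

```text
User-agent: *
# One rule per archive year, each ending with a trailing slash
Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
```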
There is an important thing that you need to notice here. We have used separate directives to block date-based archives instead of using,
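Judging by the “20” prefix discussed here, the shortcut in question would presumably be a single rule like this:

```text
# Prefix match: blocks every URL whose path begins with /20
Disallow: /20
```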
If we had used this instead, then we would have blocked every single page or post whose URL starts with “20”, which you certainly don’t want, as many posts can have URLs starting with “20”, like “20-best-plugins-for-wordpress-seo”.
Blocking Files Separately
If you have any other files in your server which you don’t want the search engines to index, then you can block them too in the following ways:
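The sketch below shows two styles of file rule. The file name and the extensions are only illustrative assumptions, and the “*” wildcard is an extension that not every bot understands.

```text
User-agent: *
# Block one specific file (hypothetical name)
Disallow: /private-notes.html
# Block every URL that ends in the given extension
Disallow: /*.php$
Disallow: /*.inc$
```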
Don’t get scared by the $ sign at the end, as it is used for pattern matching. Many bots respect the use of “$” for matching the end of a URL.
Allowing Specific User Agents
You won’t always want to block bots from accessing your site. Sometimes you may want some specific bots to crawl and index your site openly and independently. The “Allow” directive is your friend in this case.
For example, if you have taken part in the Google Adsense program, then you would certainly want the Adsense bots to access your site fully and retrieve any information they need. There is simply no reason to block these bots from accessing your site, as you want to provide them with as much information as they may need.
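A minimal sketch of such a group, assuming the Adsense crawler identifies itself with the user-agent token Mediapartners-Google:

```text
# Rules that apply only to the Adsense crawler
User-agent: Mediapartners-Google
Disallow:
Allow: /
```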
We have kept the “Disallow” part blank, which tells the bot that we don’t want to restrict it from crawling anything. Strictly speaking, that alone already permits everything, but adding the “Allow” directive states explicitly that you want it to crawl everything.
Some Miscellaneous Blocking Instructions
There are various other miscellaneous things that you need to block in order to prevent duplicate content problems from occurring in your site.
You can block the search results page URL of your WordPress site, which looks something like this:
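For illustration, with example.com standing in for your own domain and “keyword” for whatever the visitor searched for, a default WordPress search URL takes this shape:

```text
http://example.com/?s=keyword
```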
This search results page is generated automatically by WordPress and can lead to duplicate content issues anytime, so it’s better to block these URLs using,
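A rule along the following lines should cover them, assuming your site uses the default “?s=” search parameter:

```text
User-agent: *
# Matches any URL that begins with /?s=
Disallow: /?s=
```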
Also consider blocking the RSS Feed URLs, as they are nothing but purely duplicate pages of your site. Use the following lines to block them permanently:
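A sketch assuming the default WordPress feed paths. The second rule relies on the “*” wildcard, which not every bot supports:

```text
User-agent: *
# Main site feed
Disallow: /feed/
# Per-post, per-category and comment feeds
Disallow: /*/feed/
```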
The same thing goes for Trackback URLs, which are also nothing but duplicate pages of your original posts. Don’t forget to block them too using,
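Assuming the trackback URL is the post permalink followed by “/trackback/”, the rules might look like this:

```text
User-agent: *
Disallow: /trackback/
Disallow: /*/trackback/
```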
A similar issue is caused by some print plugins in WordPress, which create a print-friendly version of your posts and pages. That is a very good feature in one way, but every good thing has a bad side too.
These print-friendly versions of your original posts and pages are nothing but a classic example of duplicate content. You must restrict them in the Robots.txt too, using:
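The exact pattern depends on the plugin. The sketch below assumes a plugin that appends “/print/” to the permalink of each post; check the URLs your own plugin generates before using it.

```text
User-agent: *
# Hypothetical print-view pattern: permalink followed by /print/
Disallow: /*/print/
```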
Now, armed with the essential knowledge about Robots.txt, you should be able to create a well-optimized Robots.txt file for your own site. If you have any queries related to this, feel free to ask me in the comments section.