Configuring Your Website for Speed, Performance and Maximum Visibility – Lesson 2: Robots.txt
Chinmoy Kanjilal on April 22, 2011 in General | No Comments
Why Is Robots.txt Important to Think About?
We have already done a piece on configuring Htaccess for your website and managing how users access its content. The Htaccess file is one of the most powerful server-level customizations you can have on your website. It can do a multitude of things, from protecting files to speeding up your website. Want to know how? Check our compilation of Htaccess tricks and basic concepts that set you up to write a killer Htaccess file for your website.
The Htaccess file controls the content served by your website, which is good for the users who already reach your content. However, getting users to your content in the first place means gaining visibility. This visibility can come from search engines, from other referring websites or from social media. For visibility from social media, all you need is a killer title. For website referrals, you need a very good network and a niche that appeals to a wide variety of websites. However, the most profitable and reputable traffic you can have is from search engines. This is called “Organic Traffic”.
Understanding Organic Traffic and Indexing
Organic traffic is defined by Wikipedia as,
Web traffic which comes from unpaid listing at search engines or directories is commonly known as “organic” traffic. Organic traffic can be generated or increased by including the web site in directories, search engines, guides (such as yellow pages and restaurant guides), and award sites.
Clearly, this definition is only the tip of the iceberg. The web has changed a lot since Google came into the picture and began dictating the terms of search technology. Today, search is based on well-defined algorithms that are continuously refined. On top of this, search depends on one very important factor: indexing.
Indexing is the process of going through your content and deciding on its usefulness and relevance. This helps Google and other search engines decide whether your content deserves a higher search ranking. Indexing by Google and the other search giants is extremely important, and so is indexing by smaller search engines. Thus, it is important to control which content is indexed by search engines while keeping the content unchanged for your regular visitors.
What is Robots.txt and What is The Robots.txt Protocol?
Public pages on your website are always viewable by anyone. However, you may want to regulate how they appear in search. Enter Robots.txt, the manager for all robots crawling your site.
A word of caution here: Robots.txt is a method of preventing co-operative search bots and crawlers from indexing content wrongly, but the key here is that the bot has to be co-operative. You cannot control a rogue bot with Robots.txt.
Robots.txt is a completely voluntary mechanism for controlling the indexing of your web content, and it ends up in better archiving and categorizing of your website. It can also help you end up with some pretty sitelinks.
Your robots.txt file always resides in the web root of your website. So, place it under www, public_html or whichever folder is the root for your website. This web root is what opens with your domain, like www.theaggressive.com for this domain. Thus, you can see the robots.txt file for this website at www.theaggressive.com/robots.txt. This is possible because the robots.txt file is always public.
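As a minimal sketch of the rule above, the robots.txt location for a site can be derived from any page URL by keeping the scheme and host and replacing the path. Python is used here only for illustration; the helper name robots_url and the sample page path are hypothetical and not part of the original article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, force the path to /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path on the domain mentioned above.
example = robots_url("http://www.theaggressive.com/some/post.html")
print(example)
```

Whatever page you start from, the result always points at the web root, which is the only place crawlers look for the file.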
Robots.txt Rules
Robots.txt understands one basic kind of rule: exclusions. You can suggest that crawlers leave some pages out of their index, but those pages will still be accessible to regular and direct visitors.
A Robots.txt consists of a list of user-agents and a list of files and directories that are either included or excluded. Apart from this, you can also specify the sitemaps of your website in Robots.txt.
Specifying a Sitemap
To specify a sitemap path in your robots.txt file, use the following code.
sitemap: http://yourwebsite.com/sitemap.xml
This helps the search engine find the sitemap, which is a robots inclusion protocol that helps search engines keep track of your website content. You can specify multiple sitemaps in your robots.txt file, though all of them must have the correct access permissions (public) to be viewed by search engines.
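To see how a crawler might pick up these entries, here is a small sketch that uses Python's standard-library urllib.robotparser module (an assumption of this example, not something the article itself uses). It parses an in-memory robots.txt with two sitemap lines; site_maps() requires Python 3.8 or newer, and the second sitemap URL is only illustrative.

```python
from urllib.robotparser import RobotFileParser

# In-memory robots.txt listing two sitemaps (illustrative URLs).
ROBOTS_TXT = """\
sitemap: http://yourwebsite.com/sitemap.xml
sitemap: http://yourwebsite.com/products/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the sitemap URLs in file order,
# or None when the file lists no sitemap at all.
sitemaps = parser.site_maps()
print(sitemaps)
```

Real search engines use their own parsers, so treat this only as a way to check that your file is readable.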
Adding Comments
Some people are in the habit of adding comments. Well, robots.txt allows for comments too. A comment begins with a ‘#’ symbol and runs to the end of the line. The example below shows how you can use comments in robots.txt.
# Sitemap for products site
sitemap: http://yourwebsite.com/products/sitemap.xml
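A quick way to convince yourself that comments are harmless is to run a commented file through a parser. The sketch below assumes Python's standard-library urllib.robotparser (Python 3.8+ for site_maps()); the trailing comment on the sitemap line is added here for illustration only.

```python
from urllib.robotparser import RobotFileParser

# A full-line comment plus a trailing comment after a directive.
ROBOTS_TXT = """\
# Sitemap for products site
sitemap: http://yourwebsite.com/products/sitemap.xml  # trailing comment
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Everything from '#' to the end of each line is dropped before parsing,
# so only the sitemap URL itself survives.
sitemaps = parser.site_maps()
print(sitemaps)
```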
Basic Robots.txt structure
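Both of these rule sets have the same effect: every robot may crawl the whole site. The first one is the safer choice, since an empty Disallow belongs to the original standard, while Allow is a later extension that some older crawlers do not understand.
User-agent: *
Disallow:

User-agent: *
Allow: /
This is the basis for all rules defined in the robots.txt file. From here on, we will build upon it.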
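The equivalence of the two allow-everything forms can be checked with a short sketch. It assumes Python's standard-library urllib.robotparser, writes the second form as Allow: /, and uses a hypothetical crawler name (ExampleBot) and helper (allowed).

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"  # nothing is excluded
ALLOW_ROOT = "User-agent: *\nAllow: /\n"       # everything is allowed

URL = "http://yourwebsite.com/products/product_list.html"
first = allowed(EMPTY_DISALLOW, "ExampleBot", URL)
second = allowed(ALLOW_ROOT, "ExampleBot", URL)
print(first, second)  # True True
```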
Exclude ALL Robots From Crawling
To exclude all robots from crawling any content within a path on your site, use the rule given below. The first example keeps any robot out of your entire website and the second example keeps any robot out of your products directory on the site.
User-agent: *
Disallow: /

User-agent: *
Disallow: /products/
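The difference between the two exclusion rules can be sketched with Python's standard-library urllib.robotparser (an assumption of this example). The crawler name ExampleBot and the about.html page are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_PRODUCTS = "User-agent: *\nDisallow: /products/\n"

HOME = "http://yourwebsite.com/"
PRODUCT = "http://yourwebsite.com/products/product_list.html"
ABOUT = "http://yourwebsite.com/about.html"

site_blocked = allowed(BLOCK_ALL, "ExampleBot", HOME)               # False
product_blocked = allowed(BLOCK_PRODUCTS, "ExampleBot", PRODUCT)    # False
about_open = allowed(BLOCK_PRODUCTS, "ExampleBot", ABOUT)           # True
print(site_blocked, product_blocked, about_open)
```

Rules match by path prefix, so Disallow: /products/ covers every URL under that directory and nothing outside it.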
Regulating Crawler Activity on Server
Sometimes, it is wise to regulate how frequently crawlers visit your site. You can set this rule to make crawlers wait between requests so that they do not overload your server. However, you should use it wisely, as a delay applied to every robot also slows down the crawlers you care about. A better option is to name the specific crawler you want to delay. Note that Crawl-delay is a non-standard directive and some crawlers, including Googlebot, ignore it.
User-agent: *
Crawl-delay: 10
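Here is a hedged sketch of the per-crawler variant suggested above, checked with Python's standard-library urllib.robotparser (crawl_delay() needs Python 3.6+). Picking MSNBot as the delayed crawler is only an illustration.

```python
from urllib.robotparser import RobotFileParser

# Delay one named crawler; leave all others unrestricted.
ROBOTS_TXT = """\
User-agent: MSNBot
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

msn_delay = parser.crawl_delay("MSNBot")       # 10
other_delay = parser.crawl_delay("Googlebot")  # None: no delay is set
print(msn_delay, other_delay)
```

A polite crawler written in Python could read this value before each request and wait accordingly, but whether a given search engine honors it is up to that engine.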
Disallowing Only Part of The Site
Say you have five pages in a folder and want to show only one of them to the crawler. This is a piece of cake. You must remember to set an allow rule for that page first, then disallow the rest of the folder, since some crawlers apply the first rule that matches.
The rules should look as below.
User-agent: *
allow: /products/product_list.html
disallow: /products/
This will effectively allow robots to index a sneak peek of what you are offering, while visitors have to come to your website for any further information. This is exactly what your strategy should be.
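Why the order of the two rules matters can be shown with a parser that applies the first matching rule. Python's standard-library urllib.robotparser happens to work this way, so it is used here as an assumption; crawlers that pick the most specific rule instead may treat both orders alike. ExampleBot, other.html and the allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_FIRST = """\
User-agent: *
allow: /products/product_list.html
disallow: /products/
"""

DISALLOW_FIRST = """\
User-agent: *
disallow: /products/
allow: /products/product_list.html
"""

LIST_PAGE = "http://yourwebsite.com/products/product_list.html"
OTHER_PAGE = "http://yourwebsite.com/products/other.html"

list_ok = allowed(ALLOW_FIRST, "ExampleBot", LIST_PAGE)          # True
other_ok = allowed(ALLOW_FIRST, "ExampleBot", OTHER_PAGE)        # False
list_reversed = allowed(DISALLOW_FIRST, "ExampleBot", LIST_PAGE)  # False
print(list_ok, other_ok, list_reversed)
```

With the disallow rule first, a first-match parser never reaches the allow line, so the product list is hidden along with everything else in the folder.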
A List of Search Bots
To apply rules to specific search bots, you need to know their names. The most popular global search bots are:
- Googlebot
- Bingbot
- Lycos
- Monster
- MSNBot
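Once you know a bot's name, you can give it its own rule group. The sketch below, again assuming Python's standard-library urllib.robotparser, keeps Bingbot out of the products directory while every other crawler is unrestricted; the choice of bot and directory is purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# One group for a named bot, one catch-all group for everyone else.
ROBOTS_TXT = """\
User-agent: Bingbot
Disallow: /products/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://yourwebsite.com/products/product_list.html"
bing_ok = parser.can_fetch("Bingbot", URL)      # False: its own group applies
google_ok = parser.can_fetch("Googlebot", URL)  # True: falls back to the * group
print(bing_ok, google_ok)
```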
The Verdict!
Finally, I would like to mention a word of warning here. The rules we specify are only suggestions, and a rogue bot will not follow them at all. For co-operative crawlers, though, applying these rules should help your site appear better in search, with properly indexed content pages.
That was all about the Robots exclusion protocol. Next, we will look into a Robots inclusion protocol, the Sitemap, which may well be the most important thing a search crawler sees on your site. Keep checking back for more, and do not forget to check out our earlier post in this series on Configuring the .htaccess file.
The Complete Series List:
- Configuring Your Website for Speed, Performance and Maximum Visibility – Lesson 1: Htaccess
- Configuring Your Website for Speed, Performance and Maximum Visibility – Lesson 2: Robots.txt

