Wednesday, July 29, 2009

Hey do you know about robots.txt?!!!!

What are Robots.txt files?
A robots.txt file is a plain text file (yes, the kind you make in Notepad) that sits on your web server and tells search engine robots which parts of your website they may visit, whatever platform the site is built on. It is only a few lines of text, but it is powerful: it can influence whether your website shows up on Google at all, and which parts of it are shown to search engines such as Google, Yahoo and MSN.

What is a Robots.txt file technically?

To understand what robots.txt files are, you first have to understand what a robot (the web kind) is.

A robot is a program run by a search engine such as Google, Yahoo or MSN. It is sent out across the internet to find new websites, index them and gather information about them. Robots are sometimes called "spiders", "crawlers" or simply "bots".

Where do the Robots come from?

Robots are commonly sent out by search engines such as Google, Yahoo, MSN, Altavista, Ask.com and others. They are programs running on the search engines' servers, constantly on the lookout for information on the internet. They gather that information (which ultimately goes into the search engine's index) by visiting new websites, following links and analyzing what they find.

What do Robots do?

Robots mainly perform the following types of tasks.


* Site indexing - Taking a copy of each new website the robot finds and storing it on the search engine's servers. This is accomplished by scanning the documents on a website and mirroring them to temporary servers.

* Code validation - Some robots also check the website's code against W3C standards and grade it for accuracy.

* Link checks - Tracing all possible links (incoming and outgoing) of indexed websites and calculating the site's grading factors, such as authority and relevance.
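
To make the link-following part a little more concrete, here is a minimal Python sketch of how a robot might pull the links out of a page. It is an illustration only, not how any particular search engine does it: the class name LinkCollector, the helper extract_links and the example.com page are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return all outgoing links found in the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only for illustration.
sample_html = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://www.example.org/">Partner</a></p>'
)
links = extract_links("http://www.example.com/index.html", sample_html)
print(links)
```

A real robot would then queue each of those links for a later visit, which is how it moves from one website to the next.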

What does a Robots.txt file do?

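A robots.txt file gives instructions to the robots visiting a website, to help them index it and collect the relevant information about it. Well-behaved robots typically read the file before they crawl anything else, but following it is voluntary.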

Think of it as the help desk at an event, which gives visitors information and guidance: how to reach the venue, where the important places are, the schedule, a map and so on.

The commands in the robots.txt file are completely configurable by the webmaster.

Using the right commands, a webmaster can decide which search engines are allowed into the website, which documents are available to them and which are off limits. The file can even pass along information such as how often pages are added and how often the robots should visit, although not every search engine supports those extra hints.
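
As a rough illustration of those commands, the Python sketch below feeds a small set of rules to the standard library's urllib.robotparser module and asks what two different robots may do. The rules, the robot name "BadBot" and the example.com URLs are assumptions invented for this example. Crawl-delay is a non-standard extension, so treat the last line as a hint that only some robots respect.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; "BadBot" and the paths are made up.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# BadBot is shut out of the whole site.
print(parser.can_fetch("BadBot", "http://www.example.com/index.html"))

# Every other robot may read public pages but not /private/.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))
print(parser.can_fetch("Googlebot", "http://www.example.com/private/a.html"))

# Requested wait between visits, in seconds (needs Python 3.6 or later).
print(parser.crawl_delay("Googlebot"))
```

The blank line between the two groups matters: each "User-agent" line starts a new group of rules, and a robot follows the group that names it, falling back to the "*" group otherwise.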

Where to spot the Robots.txt file?

The robots.txt file is located in the root folder of your website. This is most often the public_html or the httpdocs folder. The root folder is the topmost directory of the website that is accessible to the public.
It is critical to place the robots.txt file in the root folder: robots only look for it there, so a file placed anywhere else is ignored.
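
Seen from the robot's side, "root folder" simply means the file is always requested at /robots.txt on the site's host, whatever page the robot happens to be looking at. A small Python sketch of that rule follows; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the URL where a robot would look for robots.txt for this page."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/blog/2009/07/post.html?x=1"))
# prints http://www.example.com/robots.txt
```

Note that nothing from the page's own path survives, which is why a robots.txt saved inside a subfolder such as /blog/ never gets read.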

Why are Robots.txt files and Robots important to a webmaster?

Well, robots.txt should matter to a webmaster because it helps search engines index the website properly: robots spend their visit on the pages that matter, which can in turn help the site's search engine rankings.

With a well-written robots.txt file, the webmaster can decide how the website is crawled and, to a large extent, what gets indexed by the search engines. So it gives them complete (well, almost) control over how a search engine "sees" the website, which is crucial.

What does a Robots.txt file look like?
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Allow: /news?btcid=
Disallow: /news?btcid=*&
Allow: /news?btaid=
Disallow: /news?btaid=*&
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
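
To see how a robot might read rules like these, the Python sketch below loads the first few lines of the listing into the standard library's urllib.robotparser and checks some paths. "ExampleBot", www.example.com and the helper allowed are placeholders invented for this example. Python's parser uses the first matching rule in file order and does not understand the * wildcards inside paths, so the /news lines are left out, and real search engines may resolve conflicts between rules differently.

```python
from urllib.robotparser import RobotFileParser

# The first few rules from the listing above.
EXCERPT = [
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
    "Disallow: /groups",
    "Disallow: /images",
]

# Placeholder host and robot name, not taken from the listing.
HOST = "http://www.example.com"

rules = RobotFileParser()
rules.parse(EXCERPT)


def allowed(path, agent="ExampleBot"):
    """Ask the parsed rules whether this robot may fetch the given path."""
    return rules.can_fetch(agent, HOST + path)


print(allowed("/searchhistory/"))   # True: the Allow line matches first
print(allowed("/search"))           # False
print(allowed("/images/logo.gif"))  # False: everything under /images is blocked
print(allowed("/about.html"))       # True: no rule mentions it
```

Each Disallow line is a prefix, so "Disallow: /images" covers every address that starts with /images, not just that one page.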

If you would like to see more robots.txt files, type the domain name of any website you like into your browser, followed by /robots.txt. If the site uses a robots.txt file, it will show up. (Ex: www.google.com/robots.txt, www.yahoo.com/robots.txt)