Note: This documentation is for an old version of Webinator. The latest documentation is available from Thunderstone.

Page Exclusion and Robots.txt

If there are any HTML trees that you don't want indexed, you need to set up a robots.txt file or use the -x option. For example, if you had a "text only" version of your web server that duplicated the content of your normal server, you would not want to index it. (On the other hand, if most of your meaningful text is contained in graphics, you may want to walk the text tree instead of the normal one, since graphics are not searchable.)

Suppose your "text only" pages were all under a directory called /text. The simplest way to prevent traversal of that tree would be to use the -x option, as in:

gw -xhttp://www.mysite.com/text/ http://www.mysite.com

That will prevent retrieval of any pages under the /text tree. However, entering the same exclusion every time you index your site would be tedious and error-prone, and it does not prevent other Web robots from retrieving the /text tree. To set up a permanent, global exclusion list, create a file called robots.txt in your document root directory, so that it can be fetched as http://www.mysite.com/robots.txt. The format of that file is as follows:

User-agent: *
Disallow: /text

The User-agent line names the robot that the following Disallow lines apply to. * matches any robot not specifically named (in this case all robots, since no others are named). You could instead give a particular robot's name; for Webinator it would be Webinator. You may specify several Disallow lines for any given robot.

You may also specify different Disallow sets for different robots. Simply insert a blank line and add another User-agent line followed by its Disallow lines.

Here's a larger example:

User-agent: *
Disallow: /text
Disallow: /junk

User-agent: Webinator
Disallow: /text
Disallow: /webinator

User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big

The Scooter robot will be blocked from accessing any pages under the /text, /junk, and /big trees. Webinator will be blocked from accessing any pages under /text and /webinator. All other robots will be blocked from accessing pages under /text and /junk.
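
If you want to double-check how a robots.txt file of this form will be read, one convenient cross-check (entirely outside of Webinator) is Python's standard urllib.robotparser module. The sketch below feeds it the example file above and asks which robots may fetch which URLs; the host www.mysite.com is just the placeholder used earlier on this page, and gw's own robots.txt handling may differ in minor details.

# Cross-check the example robots.txt with Python's standard library parser.
# Illustration only; Webinator performs its own robots.txt processing.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /text
Disallow: /junk

User-agent: Webinator
Disallow: /text
Disallow: /webinator

User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Scooter's own rule set disallows /big.
print(rp.can_fetch("Scooter", "http://www.mysite.com/big/index.html"))       # False

# Once a robot matches its own User-agent block, the * block no longer
# applies to it, so Webinator may fetch /junk even though * disallows it.
print(rp.can_fetch("Webinator", "http://www.mysite.com/junk/page.html"))     # True

# A robot with no block of its own falls back to the * rules.
print(rp.can_fetch("SomeOtherBot", "http://www.mysite.com/junk/page.html"))  # False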

Use of robots.txt is not enforced in any way; robots may or may not honor it. By default, gw always looks for robots.txt and uses it if present. This may be disabled with the -r option. When using -r you may still use -x for manual exclusion, as shown below.
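
For example, assuming the same site used earlier on this page, a command along the following lines should ignore robots.txt while still excluding the /text tree by hand (the option combination is as described above; consult your version's gw documentation to confirm the exact syntax):

gw -r -xhttp://www.mysite.com/text/ http://www.mysite.com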


Copyright © Thunderstone Software     Last updated: Tue Nov 6 10:58:47 EST 2007