If there are any HTML trees that you don't want indexed, you need to set up a robots.txt file or use the -x option.
For example, if you had a ``text only'' version of your web server that duplicated the content of your normal server, you would not want to index it. (On the other hand, if most of your meaningful text is contained in graphics, you may want to walk the text tree instead of the normal one, since graphics are not searchable.)
Suppose your ``text only'' pages were all under a directory called /text. The simplest way to prevent traversal of that tree would be to use the -x option, as in:
gw -xhttp://www.mysite.com/text/ http://www.mysite.com
That will prevent retrieval of any pages under the /text tree.
It would get tedious and error-prone to enter the same thing every time you indexed your site, and it does not prevent other Web robots from retrieving the /text tree. To set up a permanent global exclusion list, create a file called robots.txt in your document root directory. The format of that file is as follows:
User-agent: *
Disallow: /text
Here * names the robot to block: * means any robot not specifically named (in this case all robots, since no others are named). You could instead specify the name of a particular robot; for Webinator it would be Webinator.
You may specify several ``Disallow'' lines for any given robot. You may also specify different ``Disallow'' sets for different robots: simply insert a blank line and add another ``User-agent'' line followed by its ``Disallow'' lines.
Here's a larger example:
User-agent: *
Disallow: /text
Disallow: /junk
User-agent: Webinator
Disallow: /text
Disallow: /webinator
User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big
The Scooter robot will be blocked from accessing any pages under the /text, /junk, and /big trees. Webinator will be blocked from accessing any pages under /text and /webinator. All other robots will be blocked from accessing pages under /text and /junk.
Use of robots.txt is not enforced in any way; robots may or may not honor it. gw will, by default, always look for it and use it if present. This may be disabled with the -r option. When using -r you may still use -x for manual exclusion.
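The interaction between the two mechanisms can be sketched as follows. This is not gw's actual implementation, just an illustration of the logic: robots.txt rules apply only when they are consulted (i.e., -r was not given), while -x style exclusions always apply. The helper name and URLs are hypothetical:

```python
import urllib.robotparser

def allowed(url, robots=None, excludes=(), agent="Webinator"):
    """Decide whether a crawler may fetch url.

    robots:   a parsed RobotFileParser, or None when robots.txt
              checking is disabled (the -r case).
    excludes: URL prefixes excluded manually (the -x case); these
              apply whether or not robots.txt is consulted.
    """
    if any(url.startswith(prefix) for prefix in excludes):
        return False
    if robots is not None and not robots.can_fetch(agent, url):
        return False
    return True

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /text"])

# robots.txt honored: /text is blocked.
print(allowed("http://www.mysite.com/text/page.html", robots=rp))
# -r given (robots=None), but -x still excludes the /text tree.
print(allowed("http://www.mysite.com/text/page.html", robots=None,
              excludes=("http://www.mysite.com/text/",)))
# -r given and no -x: nothing blocks the page.
print(allowed("http://www.mysite.com/text/page.html", robots=None))
```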