You may make many changes to
Webinator's
walk behavior by using
Walk Settings
from the administrative interface main menu.
But you are not limited to these features. You may change any and all
aspects of the walker's behavior by modifying
the supplied dowalk
and/or webinatoradmin
script.
For details on programming with Texis Web Script (Vortex), see the
manual at the Thunderstone web site, http://www.thunderstone.com.
The following describes some important points about the internals of the dowalk script that comes with Webinator. The dowalk script is fairly heavily commented to aid in finding your way around within it.
The dowalk script actually consists of two Vortex script files. dowalk
contains the walker/indexer and settings-reading code. It includes the second
file, a Vortex source module called webinatoradmin, which must
be in the same directory as dowalk. The webinatoradmin module provides
the management interface that is used from a web browser.
Dowalk is not compatible with old-style gw databases. It can, however, be
made compatible. There are comments throughout the script containing the
word "COMPATIBILITY" that indicate where and what kind of changes to make.
The most significant differences are the addition of several fields to the
html table and the keeping of the leading http:// on URLs in the
database.
The dispatch function is the primary external entry point for
performing a new walk. It loads settings, sets up logging and databases,
then invokes other processes in parallel (according to the maximum servers
setting). When all of the walking is complete it removes commonality
from pages (if that option is set), creates the indices needed for
searching the database, then makes the new database live and deletes
the old database.
The refresh function is the entry point for a refresh walk
(as opposed to a new walk). It sets up logging and such, then calls
dorefresh to do the work. dorefresh loops over all URLs
in the database and asks the web server whether there is a newer version
to download. Any new hyperlinks found on downloaded pages are also
downloaded and added to the database. Any pages that return an error will
be deleted from the database.
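The refresh logic described above can be sketched in Python. This is only an illustration of the control flow, not the Vortex code in dowalk: the `db` dictionary, the `fetch_if_newer` callback, and the status strings are hypothetical stand-ins for the html table and the conditional request to the web server.

```python
def refresh(db, fetch_if_newer):
    """Refresh every URL stored in `db` (url -> {'modified': ..., 'links': [...]}).

    `fetch_if_newer(url, since)` is a hypothetical callback returning one of
    ("unchanged", None), ("error", None), or ("updated", new_page).
    """
    pending = list(db)            # start with every URL already in the database
    seen = set(pending)
    while pending:
        url = pending.pop(0)
        since = db[url]["modified"] if url in db else None
        status, page = fetch_if_newer(url, since)
        if status == "error":
            db.pop(url, None)     # pages that return an error are deleted
        elif status == "updated":
            db[url] = page
            for link in page["links"]:
                if link not in seen:      # new hyperlinks are fetched as well
                    seen.add(link)
                    pending.append(link)
    return db
```

Passing the fetch step in as a callback keeps the sketch runnable without network access; a real walker would issue the HTTP requests itself.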
The stop function is an external entry point that is used to
signal a walk that is in progress (using <loguser>) that it
should stop. The walkers check for this signal (using <userstats>)
at various points and will quit when it is detected.
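The cooperative stop pattern described above can be illustrated with a short Python sketch. It uses a `threading.Event` inside a single process as a stand-in for the signal; dowalk itself passes the signal between processes with <loguser> and <userstats>, which this sketch does not model. The `walker` function and its arguments are hypothetical.

```python
import threading

def walker(urls, stop_requested, processed):
    """Process URLs one at a time, checking the stop flag before each page."""
    for url in urls:
        if stop_requested.is_set():   # analogous to polling for the stop signal
            return "stopped"
        processed.append(url)         # placeholder for fetching and indexing
    return "finished"
```

The walker is never interrupted mid-page; it only quits at the points where it chooses to check the flag.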
The reindex
function is an external entry point that is used
to drop and recreate the Metamorph index on the html table. This is
needed after changing the word definition expressions.
The remakeindex function is an external entry point that is used
to drop and recreate all indices on the database. It is only for use if
one or more non-Metamorph indices get corrupted by disk errors or the like.
The recat
function is an external entry point that is used to
recategorize the html table based on the current (presumably changed)
categories.
The ifmodified
function is an external entry point that is used to
tell the dispatcher to run only if chkneedwalk indicates a walk is needed.
The usage function is called when you invoke dowalk incorrectly
and prints a terse summary of correct usage options.
The doplugin function handles files that are not HTML or text,
such as PDF and MSWord. It determines the correct options for anytotx
based on the fetched page's MIME type or extension. It then calls the
dofilt function, which actually runs anytotx to perform
the conversion to text and the extraction of meta information such as
Title. It will make up a title for the document if none is returned
by anytotx.
The settings
function calls the defaults, readsettings,
and applysettings functions, in order. This function is called by
most entry points to get default and current settings for a given
profile before proceeding with any work.
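The defaults, read, apply ordering can be sketched in Python. The source only states the order of the calls, so what each step does here is an assumption: the function names, the setting names and values, and the `stored` dictionary are all hypothetical.

```python
def load_defaults():
    # Hypothetical built-in values; the real defaults are defined in dowalk.
    return {"depth": 5, "threads": 4}

def read_profile_settings(stored, profile):
    # Saved values for one profile; an empty dict if nothing has been saved.
    return dict(stored.get(profile, {}))

def apply_settings(base, overrides):
    # Overlay the saved values on a copy of the defaults.
    merged = dict(base)
    merged.update(overrides)
    return merged

def settings(stored, profile):
    # Same order as described above: defaults, then read, then apply.
    return apply_settings(load_defaults(), read_profile_settings(stored, profile))
```

Under these assumptions, a profile that has saved only some values still ends up with a complete set of settings.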
The updatemmindex
function is called (sometime after having
called settings) to create or update the Metamorph index on the html table.
The maketables
function is called (sometime after having
called settings) to create all of the Webinator tables. This function does
nothing for Webinator-only licenses. For Webinator-only licenses the
tables are created automatically by Texis when the database is created.
The schema may not be changed. If you want to modify dowalk to work with
gw style databases, you will need to create the database and tables with
gw before running dowalk.
The walk function is the core which walks all desired URLs on a
single site. It always processes breadth first (i.e., it gets all URLs at
a given depth before proceeding to the next level down). Any desired
URLs that reside on a different site are placed into the database's
todo table for processing by the dispatcher.
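A breadth-first, single-site walk of this kind can be sketched in Python. This is a minimal illustration, not the Vortex implementation: the `links` mapping stands in for fetching pages and extracting hyperlinks, the `todo` list stands in for the todo table, and the `max_depth` parameter is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit

def walk(start_url, links, max_depth):
    """Breadth-first walk of one site.

    `links` maps a URL to the URLs found on that page (a stand-in for fetching).
    Returns (visited_in_order, todo), where `todo` holds off-site URLs.
    """
    site = urlsplit(start_url).netloc
    visited, todo = [], []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()      # FIFO order gives breadth-first traversal
        visited.append(url)
        if depth == max_depth:
            continue                      # do not follow links below the depth limit
        for link in links.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if urlsplit(link).netloc != site:
                todo.append(link)         # different site: leave it for the dispatcher
            else:
                queue.append((link, depth + 1))
    return visited, todo
```

Because the queue is first-in, first-out, every URL at one depth is processed before any URL at the next depth.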
The fetchset
function is used in various places to fetch one
or more URLs (using the maximum threads setting) simultaneously.
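Fetching a set of URLs with a bounded number of threads can be sketched in Python with the standard library's thread pool. The `fetch` callback and the `max_threads` parameter are hypothetical names for illustration; this is not how the Vortex fetchset function is written.

```python
from concurrent.futures import ThreadPoolExecutor

def fetchset(urls, fetch, max_threads):
    """Fetch `urls` concurrently with at most `max_threads` worker threads.

    `fetch` is a caller-supplied function that takes a URL and returns its
    content. Results are returned as a dict keyed by URL.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(fetch, urls)))
```

The thread limit caps how many requests are in flight at once, regardless of how many URLs are passed in.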
The manglepage
function is called before extracting text and hyperlinks
from an HTML page. It allows the page to be modified before processing.
This is where the ignore/keep tags are handled.
The getrobotstxt function fetches the robots.txt file from a
given site and checks for any exclusions for Webinator. These exclusions
are later added to the list of URL rejection patterns.
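Extracting the exclusions that apply to one robot from a robots.txt file can be sketched in Python. This is a simplified parser written for illustration, not the logic dowalk uses: it collects Disallow paths from records naming the given agent and from the wildcard record, whereas a stricter parser would prefer the most specific matching record. The agent name "webinator" is an assumption.

```python
def robots_exclusions(robots_txt, agent="webinator"):
    """Return Disallow paths that apply to `agent` or to all robots ("*")."""
    exclusions = []
    agents = []          # user-agents named by the current record
    in_rules = False     # True once the current record's rule lines have begun
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                       # a new record starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            continue
        in_rules = True
        applies = any(a in ("*", agent.lower()) for a in agents)
        if field == "disallow" and value and applies:
            exclusions.append(value)           # an empty Disallow excludes nothing
    return exclusions
```

The returned paths could then be appended to whatever list of URL rejection patterns the walker already maintains.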
The chkneedwalk function is called
to check if a rewalk is required. It fetches the page
to see if the modification date has changed or, if the web server does
not provide a modification date, compares the content to what it was
previously. It sets an internal flag if a rewalk is needed.
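The decision described above (modification date first, content comparison as the fallback) can be sketched in Python. The `previous` record and the use of a SHA-256 digest to compare content are assumptions made for this illustration; the source does not say how dowalk stores or compares the earlier content.

```python
import hashlib

def need_walk(previous, last_modified, body):
    """Decide whether a rewalk is needed.

    `previous` is a dict remembered from the last check with keys
    'last_modified' (str or None) and 'digest' (hex string).
    `last_modified` is the server's current value, or None if not provided.
    `body` is the fetched page content as bytes.
    """
    if last_modified is not None and previous.get("last_modified") is not None:
        return last_modified != previous["last_modified"]
    # No usable modification date: fall back to comparing the content itself.
    return hashlib.sha256(body).hexdigest() != previous.get("digest")
```

Storing a digest rather than the full page keeps the remembered state small while still detecting any change in the content.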
The putmsg
function intercepts error messages to provide special
handling for some, and recording of most.
The go
function is an external entry point used by the dispatcher
when it starts up child processes to walk a specific site or set of URLs.
The singles function is an external entry point that is used to
fetch all of the single-page URLs. It is called by the dispatcher as
the first parallel process. Therefore single pages will generally be
fetched earliest in a new walk.
The rmlocks
function is used to remove any stale locks and monitor
processes on a database and dismantle the locking structure. This is
done before physically removing a database from the system.
The geturl
function is a utility function that may be used to find
out what the walker will think about a given URL using the current walk
settings. It is invoked as follows:
texis profile=PROFILE top=THEURL dowalk/geturl.txt
This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.
texis profile=PROFILE top=THEURL dowalk/geturl.txt >SOMEFILE.txt
The getrobots
function is a utility function that may be used to find
out what the walker will think about a given robots.txt using the current walk
settings. It is invoked as follows:
texis profile=PROFILE top=THEURL dowalk/getrobots.txt
This can generate a lot of output for a robots.txt file of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.
texis profile=PROFILE top=THEURL dowalk/getrobots.txt >SOMEFILE.txt