Note: This documentation is for an old version of Webinator. The latest documentation is here.

Customizing the Walker

 

You may make many changes to Webinator's walk behavior by using Walk Settings from the administrative interface main menu. But you are not limited to these features. You may change any and all aspects of the walker's behavior by modifying the supplied dowalk and/or webinatoradmin script.

For details on programming with Texis Web Script (Vortex), see the manual at the Thunderstone web site, http://www.thunderstone.com.

The following describes some important points about the internals of the dowalk script that comes with Webinator. The dowalk script is fairly heavily commented to aid in finding your way around within it.

The dowalk script actually consists of two Vortex script files. dowalk contains the walker/indexer and the settings-reading code. It includes the second file, a Vortex source module called webinatoradmin, which must be in the same directory as dowalk. The webinatoradmin module provides the management interface that is used from a web browser.

Dowalk is not compatible with old-style gw databases. It can, however, be made compatible. There are comments throughout the script containing the word "COMPATIBILITY" that indicate where and what kind of changes to make. The most significant differences are the addition of several fields to the html table and the keeping of the leading http:// on URLs in the database.

The dispatch function is the primary external entry point for performing a new walk. It loads settings, sets up logging and databases, then invokes other processes in parallel (according to the maximum servers setting). When all of the walking is complete, it removes commonality from pages (if that option is set), creates the indices needed for searching the database, then makes the new database live and deletes the old database.
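The order of operations described for dispatch can be outlined in Python rather than Vortex. This is only an illustrative sketch, not the dowalk code: the names dispatch, walk_site, max_servers and remove_common are hypothetical, threads stand in for the child processes, and the setup and post-walk steps are represented by log entries.

```python
from concurrent.futures import ThreadPoolExecutor

def walk_site(site):
    # Placeholder for a child walker; a real walker fetches and stores pages.
    return f"walked {site}"

def dispatch(sites, max_servers, remove_common=False):
    """Hypothetical outline of the steps of a new walk, in order."""
    log = ["load settings", "set up logging and databases"]
    # Walk sites in parallel, never more than max_servers at a time.
    # (Threads are used here for brevity; dowalk starts child processes.)
    with ThreadPoolExecutor(max_workers=max_servers) as pool:
        log.extend(pool.map(walk_site, sites))
    if remove_common:
        log.append("remove commonality")
    log += ["create indices", "make new database live", "delete old database"]
    return log
```

The point of the sketch is the ordering: nothing after the parallel section runs until every walker has finished.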

The refresh function is the entry point for a refresh walk (as opposed to a new walk). It sets up logging and such, then calls dorefresh to do the work. dorefresh loops over all URLs in the database and asks the web server whether there is a newer version to download. Any new hyperlinks found on downloaded pages are also downloaded and added to the database. Any pages that return an error are deleted from the database.
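The refresh loop can be pictured with the following Python sketch. It is an assumption-laden illustration, not the dorefresh implementation: db is a hypothetical dict of URL to last-seen modification stamp, and fetch is a hypothetical callable that returns a (status, stamp, links) tuple, where status is "modified", "unchanged" or "error".

```python
def refresh(db, fetch):
    """Re-check every stored URL; follow links found on changed pages."""
    queue = list(db)              # start from all URLs already in the database
    seen = set(queue)
    while queue:
        url = queue.pop(0)
        status, stamp, links = fetch(url, db.get(url))
        if status == "error":
            db.pop(url, None)     # pages that return an error are deleted
            continue
        if status == "unchanged":
            continue              # the server has nothing newer
        db[url] = stamp           # store the newer version
        for link in links:        # new hyperlinks are fetched and added too
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return db
```

Passing the fetcher in as a parameter keeps the sketch runnable without network access.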

The stop function is an external entry point that is used to signal (using <loguser>) a walk that is in progress that it should stop. The walkers check for this signal (using <userstats>) at various points and will quit when it is detected.
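The cooperative stop described above follows a common pattern: one party raises a flag and the worker polls it between units of work. A minimal Python sketch of that pattern is shown below. It does not use <loguser> or <userstats>; a threading.Event stands in for the signal, and walk_until_stopped and process are hypothetical names.

```python
import threading

def walk_until_stopped(urls, stop_signal, process):
    """Process URLs in order, checking the stop signal before each one."""
    done = []
    for url in urls:
        if stop_signal.is_set():  # analogous to the walker's periodic check
            break
        process(url)
        done.append(url)
    return done
```

Because the flag is only checked between URLs, a stop request takes effect after the page currently being processed, not immediately.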

The reindex function is an external entry point that is used to drop and recreate the Metamorph index on the html table. This is needed after changing the word definition expressions.

The remakeindex function is an external entry point that is used to drop and recreate all indices on the database. It is only for use if one or more non-Metamorph indices get corrupted by disk errors or such.

The recat function is an external entry point that is used to recategorize the html table based on the current (presumably changed) categories.

The ifmodified function is an external entry point that is used to tell the dispatcher to run only if chkneedwalk indicates a walk is needed.

The usage function is called when you invoke dowalk incorrectly and prints a terse summary of correct usage options.

The doplugin function handles files that are not HTML or text, such as PDF and MSWord. It determines the correct options for anytotx based on the fetched page's mime type or extension. It then calls the dofilt function which actually runs anytotx to perform the conversion to text and the extraction of meta information such as Title. It will make up a title for the document if none is returned by anytotx.

The settings function calls the defaults, readsettings, and applysettings functions, in order. This function is called by most entry points to get default and current settings for a given profile before proceeding with any work.
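The three-step sequence can be illustrated with a small Python sketch. The source only states the call order; the assumption made here is that saved profile values override the built-in defaults. The setting names and the stored argument are hypothetical.

```python
def defaults():
    # Hypothetical built-in values; the real setting names differ.
    return {"max_servers": 1, "max_threads": 1, "depth": 5}

def readsettings(stored):
    # Stand-in for reading a profile's saved settings.
    return dict(stored)

def applysettings(base, overrides):
    # Assumed behavior: saved values replace the defaults they name.
    merged = dict(base)
    merged.update(overrides)
    return merged

def settings(stored):
    """Call defaults, readsettings and applysettings, in that order."""
    base = defaults()
    saved = readsettings(stored)
    return applysettings(base, saved)
```

Under this assumption, a profile that saves only one value still ends up with a complete set of settings.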

The updatemmindex function is called (sometime after having called settings) to create or update the Metamorph index on the html table.

The maketables function is called (sometime after having called settings) to create all of the Webinator tables. This function does nothing for Webinator-only licenses. For Webinator-only licenses the tables are created automatically by Texis when the database is created. The schema may not be changed. If you want to modify dowalk to work with gw style databases, you will need to create the database and tables with gw before running dowalk.

The walk function is the core which walks all desired URLs on a single site. It always processes breadth first (i.e., it gets all URLs at a given depth before proceeding to the next level down). Any desired URLs that reside on a different site are placed into the database's todo table for processing by the dispatcher.
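A breadth-first walk of this kind can be sketched in Python as follows. This is an illustration of the traversal order only, not the dowalk code: links_of is a hypothetical callable returning the hyperlinks on a page, max_depth is a hypothetical limit, and a plain list stands in for the todo table.

```python
from urllib.parse import urlsplit

def walk(start, links_of, max_depth):
    """Breadth-first walk of one site; off-site URLs are collected in todo."""
    site = urlsplit(start).netloc
    seen, todo = {start}, []
    level, order = [start], []
    for _ in range(max_depth + 1):
        next_level = []
        for url in level:                  # finish this depth completely...
            order.append(url)
            for link in links_of(url):
                if link in seen:
                    continue
                seen.add(link)
                if urlsplit(link).netloc == site:
                    next_level.append(link)
                else:
                    todo.append(link)      # other sites go to the dispatcher
        level = next_level                 # ...before going one level down
        if not level:
            break
    return order, todo
```

Keeping a separate list per depth makes the "all URLs at a given depth first" guarantee explicit.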

The fetchset function is used in various places to fetch one or more URLs (using the maximum threads setting) simultaneously.
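The idea of fetching a set of URLs simultaneously under a thread limit can be sketched in Python with a thread pool. The function below only borrows the name fetchset for illustration; fetch_one is a hypothetical callable that retrieves a single URL, and max_threads must be at least 1.

```python
from concurrent.futures import ThreadPoolExecutor

def fetchset(urls, fetch_one, max_threads):
    """Fetch several URLs at once, using at most max_threads workers.

    Returns a dict mapping each URL to the result of fetch_one(url).
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(fetch_one, urls)))
```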

The manglepage function is called before extracting text and hyperlinks from an HTML page. It allows the page to be modified before processing. This is where the ignore/keep tags are handled.

The getrobotstxt function fetches the robots.txt file from a given site and checks for any exclusions for webinator. These exclusions are later added to the list of url rejection patterns.
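Extracting exclusions from a robots.txt file can be sketched in Python as below. This is a simplified illustration, not the getrobotstxt code: it collects Disallow paths from records addressed to all robots (*) as well as records naming the agent, and the default agent name "webinator" is taken from the text above. How dowalk matches agent names or combines records is not stated in the source.

```python
def robot_exclusions(robots_txt, agent="webinator"):
    """Return Disallow paths that apply to `agent` or to all robots (*)."""
    exclusions, applies, in_rules = [], False, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new record starts here
                applies, in_rules = False, False
            if value == "*" or value.lower() == agent.lower():
                applies = True
        elif field == "disallow":
            in_rules = True
            if applies and value:             # an empty Disallow excludes nothing
                exclusions.append(value)
    return exclusions
```

The returned paths could then be appended to a list of URL rejection patterns, as the text describes.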

The chkneedwalk function is called to check if a rewalk is required. It fetches the page to see if the modification date has changed or, if the web server does not provide a modification date, compares the content to what it was previously. It sets an internal flag if a rewalk is needed.
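The two-stage check can be sketched in Python as follows. This is an illustration under stated assumptions, not the chkneedwalk code: previous is a hypothetical dict holding what was recorded at the last check, and comparing SHA-256 digests is one possible way to compare content, not necessarily what dowalk does.

```python
import hashlib

def need_walk(previous, last_modified, body):
    """Decide whether a rewalk is needed.

    previous: dict with optional 'last_modified' and 'digest' keys
              saved from the last check (hypothetical storage format).
    last_modified: date string from the server, or None if not provided.
    body: page content as bytes.
    """
    if last_modified is not None:
        return last_modified != previous.get("last_modified")
    # No date from the server: fall back to comparing content.
    digest = hashlib.sha256(body).hexdigest()
    return digest != previous.get("digest")
```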

The putmsg function intercepts error messages to provide special handling for some, and recording of most.

The go function is an external entry point used by the dispatcher when it starts up child processes to walk a specific site or set of URLs.

The singles function is an external entry point that is used to fetch all of the single page URLs. It is called by the dispatcher as the first parallel process. Therefore single pages will generally be fetched earliest in a new walk.

The rmlocks function is used to remove any stale locks and monitor processes on a database and dismantle the locking structure. This is done before physically removing a database from the system.

The geturl function is a utility function that may be used to find out what the walker will think about a given URL using the current walk settings. It is invoked as follows:

   texis profile=PROFILE top=THEURL dowalk/geturl.txt

This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.

   texis profile=PROFILE top=THEURL dowalk/geturl.txt >SOMEFILE.txt

The getrobots function is a utility function that may be used to find out what the walker will think about a given robots.txt using the current walk settings. It is invoked as follows:

   texis profile=PROFILE top=THEURL dowalk/getrobots.txt

This can generate a lot of output for a robots.txt file of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.

   texis profile=PROFILE top=THEURL dowalk/getrobots.txt >SOMEFILE.txt


Copyright © Thunderstone Software     Last updated: Tue Nov 6 10:58:37 EST 2007