|
The default rewalk type, Refresh, updates the existing database
and downloads only files that have been modified or created since the
last walk. Pages that are no longer present on the server are removed
from the database.
Keep the following in mind when using Refresh. A page that was
referenced but missing during the initial walk (the walk prior to the
Refresh) and was added to the server afterward will be missed by
Refresh if its parent page has not been modified. If you change your
settings to be more inclusive (for example, adding extensions, ignoring
robots, or adding domains), you should do a New walk once, because a
Refresh is unlikely to find the newly allowed data unless all of the
pages leading to it have been modified.
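The reason a Refresh can miss late-added pages can be illustrated with a toy model. The sketch below is not the product's actual implementation. It assumes, purely for illustration, that a refresh re-reads links only from pages that were modified or newly found. The function name refresh_walk and its arguments are hypothetical.

```python
def refresh_walk(known, site, modified):
    """Toy model of a refresh-style rewalk (an assumption, not the real crawler).

    known    -- set of URLs stored by the previous walk
    site     -- dict mapping each URL now on the server to the links it contains
    modified -- set of URLs whose content changed since the previous walk
    """
    # Pages that are no longer on the server are dropped from the database.
    db = {url for url in known if url in site}
    # Only modified pages are fetched again, so only their links are re-read.
    queue = [url for url in db if url in modified]
    while queue:
        url = queue.pop()
        for link in site[url]:
            if link in site and link not in db:
                db.add(link)        # newly discovered page
                queue.append(link)  # follow its links as well
    return db


# "/b" was linked from "/" but missing during the initial walk, then added later.
site = {"/": ["/a", "/b"], "/a": [], "/b": []}
known = {"/", "/a"}

missed = refresh_walk(known, site, modified=set())   # parent "/" unchanged
found = refresh_walk(known, site, modified={"/"})    # parent "/" modified
```

In this model "/b" is absent from missed and present in found, which matches the behavior described above: the page is picked up only when the page linking to it has changed.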
If more than 30% to 50% of your site changes between walks, you may be
better off using a New walk instead of Refresh. Also,
many dynamic content generators do not supply modification dates, which
causes every page to be rewalked. In that case you should use
New instead of Refresh.
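The guidance above can be summarized as a small decision helper. This is only a sketch: the function choose_rewalk_type, its parameters, and the 0.3 default threshold (the low end of the 30% to 50% range) are illustrative assumptions, not part of the product.

```python
def choose_rewalk_type(changed_fraction, settings_broadened=False,
                       has_modified_dates=True, threshold=0.3):
    """Suggest "New" or "Refresh" from the rules of thumb in the text.

    changed_fraction   -- estimated share of the site that changed (0.0 to 1.0)
    settings_broadened -- True if the walk settings were made more inclusive
    has_modified_dates -- False if the server does not report modification dates
    threshold          -- assumed cutoff; the text gives a range of 30% to 50%
    """
    if settings_broadened:
        return "New"      # a Refresh is unlikely to reach newly allowed data
    if not has_modified_dates:
        return "New"      # every page would be rewalked anyway
    if changed_fraction >= threshold:
        return "New"      # too much of the site changed for Refresh to pay off
    return "Refresh"


suggestion = choose_rewalk_type(0.05)
```

Here suggestion is "Refresh", since only a small part of the site changed, the settings were not broadened, and modification dates are available.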
Copyright © Thunderstone Software Last updated: Thu Dec 22 14:38:01 EST 2011
|