THUNDERSTONE NEWS
Texis Version 5.0
Texis version 5.0 contains many improvements. Most of the new
features are in the crawler utilities ("dowalk" and "search" scripts)
that play a core role in both Webinator and the Search Appliance
products. The scripts also are provided for optional use or
modification with full Texis distributions. Separately, Texis and
Vortex have a variety of optimizations and extensions.
All products have these enhancements:
- Adaptive Indexing. Much faster refresh crawls, with
reduced server load.
- Spelling checker. Suggests alternate searches,
automatically customized to your site.
- Pause/resume walks. Stop an in-progress walk, resume
it later.
- Unicode/UTF-8 support. Handles international
characters.
- Resource limits. Pauses walk if memory usage etc.
exceeds settable limits.
- Multiple user/passwords. Different base URLs with
different logins, for multiple users.
- Exclude-by field. Flexible exclusion based on text,
meta-data, keywords etc.
Adaptive Indexing is especially significant. It keeps site and
enterprise indexes fresher than ever before. The crawler revisits each
page or document on its own schedule -- the more often the page
changes, the more frequently the crawler visits it. The software
estimates how often each page or document needs to be re-crawled based
on its change history. Pages may be reindexed as often as every minute!
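To make the idea concrete, here is a minimal sketch of one way a per-page revisit interval could be derived from change history. It is not the Texis algorithm, which this article does not describe. The halve-or-double rule, the function names, the one-day starting interval and the 30-day ceiling are all assumptions for illustration; only the one-minute floor comes from the text above.

```python
from datetime import timedelta

# Floor taken from the article: pages may be reindexed as often as every minute.
MIN_INTERVAL = timedelta(minutes=1)
# Assumed ceiling, chosen only for this illustration.
MAX_INTERVAL = timedelta(days=30)


def next_interval(current, changed):
    """Hypothetical rule: halve the interval if the page changed since the
    last visit, double it otherwise, and clamp to the allowed range."""
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


def interval_from_history(history, start=timedelta(days=1)):
    """Replay a page's change history (True = changed on that visit)
    and return the revisit interval the rule settles on."""
    interval = start
    for changed in history:
        interval = next_interval(interval, changed)
    return interval


if __name__ == "__main__":
    busy_page = [True, True, True]       # changed on every visit
    quiet_page = [False, False]          # never changed
    print("busy page :", interval_from_history(busy_page))
    print("quiet page:", interval_from_history(quiet_page))
```

Under this rule a page that keeps changing is visited more and more often until it reaches the one-minute floor, while a static page drifts toward the ceiling, which is where the reduced server load on refresh crawls would come from.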
Note that the spelling-checker lexicon is not based on a dictionary.
Rather, it is made up of the words contained in each database, i.e., in
the index of a document or web page collection (called a profile in the
crawler admin module). If you search for a word not in the index, the
spell checker suggests the closest matches existing in the index. For
this reason the spelling checker works automatically with site-specific
terminology, or even in languages other than English!
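The principle can be sketched in a few lines of Python. This is only an illustration of an index-derived lexicon, not the Texis implementation: the function names are hypothetical, and the standard-library difflib similarity measure stands in for whatever matching the product actually uses.

```python
import difflib
import re


def build_lexicon(documents):
    """Collect every word that occurs in the indexed documents.
    The regex matches runs of Unicode letters, so non-English text works too."""
    words = set()
    for text in documents:
        words.update(re.findall(r"[^\W\d_]+", text.lower()))
    return words


def suggest(word, lexicon, limit=3):
    """Return the closest indexed words, or nothing if the word is indexed."""
    word = word.lower()
    if word in lexicon:
        return []
    # difflib stands in for the product's own (undocumented here) matching.
    return difflib.get_close_matches(word, sorted(lexicon), n=limit, cutoff=0.6)


if __name__ == "__main__":
    docs = ["Webinator crawls the site", "Vortex scripts run searches"]
    lexicon = build_lexicon(docs)
    print(suggest("webinatr", lexicon))
```

Because the lexicon is built from the documents themselves, a product name such as "Webinator" can be suggested even though no ordinary dictionary contains it.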
Texis maintenance customers will receive version 5 automatically
according to the semi-annual update schedule, but can request a
release sooner if needed. Webinator upgrade pricing is available at the
Webinator product page. Search Appliance customers should use the
Maintenance - Check for Updates feature periodically to be kept up to
date.
Search Appliance Update Note
After updating the Search Appliance to version 5, launch each
scheduled refresh crawl manually once so that it does a complete crawl.
It will then return to its regularly scheduled refresh. Otherwise a
scheduled refresh may do a complete new crawl unexpectedly.
Note that if you have selected "Automatically Discover, Download,
and Install Updates" on the Maintenance - Update Preferences page, your
appliance should be updating itself to v5 on Monday, August 23. Please
follow the procedure above on that day.
"Best Bets" Feature
Customers sometimes ask how to make certain records "always come to
the top" in search results. Thunderstone's new Best Bets feature makes
that easy to accomplish.
Best Bets is analogous to the "sponsored listings" at the top of
search results on Yahoo and its competitors. You may be familiar with
how sponsored listings work: the advertiser stores a list of phrases,
each associated with a web page. When one of those phrases matches a
user's query, the corresponding page link is displayed first, or set
apart in a special area.
If you administer a Thunderstone search application for your own
organization or web site, Best Bets allows you to become an
"advertiser" on your own site. (And without paying advertising
fees!)
This tends to come in handy for larger collections. Suppose you
determine, for example, that users should see a certain document first
whenever they search for the term "repair service". However, many
documents in your collection may discuss repair service, and thus match
such a query. Just go to the Best Bet settings page, and store that
phrase along with the desired document link. Then you are assured that
it will appear as the top search result. You also can set highlighting
color, page placement, and various other characteristics to make it
stand out.
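In outline, the mechanism is a lookup from stored phrases to links that runs ahead of the normal ranking. The Python sketch below shows that idea only. It is not how Best Bets is implemented or configured, and the function names and the sample link are invented for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so phrase matching is forgiving."""
    return " ".join(text.lower().split())


def best_bets_for(query, best_bets):
    """Return the links whose stored phrase occurs, on word boundaries,
    in the user's query."""
    padded_query = f" {normalize(query)} "
    return [link for phrase, link in best_bets.items()
            if f" {normalize(phrase)} " in padded_query]


def rank_results(query, organic_links, best_bets):
    """Put matching Best Bet links first, then the ordinary results,
    without listing a promoted link twice."""
    promoted = best_bets_for(query, best_bets)
    rest = [link for link in organic_links if link not in promoted]
    return promoted + rest


if __name__ == "__main__":
    # Hypothetical phrase-to-link table, as an administrator might store it.
    bets = {"repair service": "/support/repairs.html"}
    organic = ["/news/2004.html", "/support/repairs.html", "/faq.html"]
    print(rank_results("Repair service hours", organic, bets))
```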
Best Bets is included in the Search Appliance and with full
Texis distributions, and is available as an extra-cost option with
some Webinator versions.
Tech Corner: Grouped Search Results
If you're a Texis developer, how often do you use the SQL
GROUP BY clause? It is a very useful feature in text
searches, but it tends to be neglected!
If you have developed SQL applications in the past, you probably
used GROUP BY in connection with numeric data. But GROUP BY can be
very helpful in many cases even with no numeric values. The typical
case we see involves tables containing a text field along with one or
more category or "metadata" fields.
For example, a customer of ours that operates an industrial
laboratory has a database of research reports. At first their Texis Web
Script search application used a simple query of this form:
SELECT Title, Author, Date FROM Reports
WHERE Text LIKE $query
However, when hundreds or even thousands of document titles are
returned, users need a more refined query to narrow the search. Our
customer observed that each researcher tends to create reports that are
similar or are about related topics. That made it useful to first see a
list of the researchers who had published anything matching the
query. They accomplish this with:
SELECT Author, COUNT(Author) FROM Reports
WHERE Text LIKE $query GROUP BY Author
The results would be formatted to look something like this:
| Author            | count(Author) |
| Anderson, William | 12            |
| Johnson, Mary     | 22            |
| Smith, Robert     | 8             |
Where a query has hundreds of matching documents or more, this query
tends to return a more manageable list. When the user clicks on an
author link, the application displays the records by that author that
contain the search term:
SELECT Title, Date FROM Reports
WHERE Text LIKE $query AND Author = $Author
The user can examine those documents and, if they are similar,
go back to see what the next author published.
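For readers who want to try the two-step pattern without a Texis installation, here is a small runnable sketch using Python's built-in sqlite3 module. SQLite is only a stand-in: its LIKE is a plain pattern match rather than a Texis text search, so the query term is wrapped in % wildcards. The sample rows and function names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reports (Title TEXT, Author TEXT, Date TEXT, Text TEXT)")
# Made-up sample rows; the real table would hold the laboratory's reports.
conn.executemany("INSERT INTO Reports VALUES (?, ?, ?, ?)", [
    ("Corrosion in steel alloys", "Anderson, William", "2004-01-12",
     "corrosion test of alloy samples"),
    ("Alloy fatigue limits", "Anderson, William", "2004-03-02",
     "fatigue limits of alloy beams"),
    ("Polymer aging", "Johnson, Mary", "2004-02-20",
     "aging of polymer seals"),
    ("Alloy weld inspection", "Smith, Robert", "2004-05-07",
     "inspection of alloy welds"),
])


def authors_matching(conn, query):
    """Step 1: one row per author, with the number of matching reports."""
    return conn.execute(
        "SELECT Author, COUNT(Author) FROM Reports "
        "WHERE Text LIKE ? GROUP BY Author ORDER BY Author",
        (f"%{query}%",),
    ).fetchall()


def reports_by_author(conn, query, author):
    """Step 2: the matching reports for the author the user clicked on."""
    return conn.execute(
        "SELECT Title, Date FROM Reports "
        "WHERE Text LIKE ? AND Author = ? ORDER BY Date",
        (f"%{query}%", author),
    ).fetchall()


if __name__ == "__main__":
    for author, count in authors_matching(conn, "alloy"):
        print(author, count)
    print(reports_by_author(conn, "alloy", "Smith, Robert"))
```

Passing the query and author as bound parameters plays the same role as the $query and $Author variables in the statements above.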
Feedback, suggestions and questions are welcome to