THUNDERSTONE NEWS
New Asian Language Partnership
When creating applications that deal with Asian languages such as
Chinese or Japanese, there is more to consider than just being able to
store and search Unicode. Other issues include optimizing the search
syntax, the user interface, and even the marketing -- in other words,
globalization of the entire application.
Thunderstone has partnered with a leading company providing
localization and globalization services. The partner has valuable
experience in integrating Texis. If you need to use Texis in Asian
language applications, we encourage you to
contact us to learn more.
AOL Selects Texis for Classifieds
Thunderstone is proud that America Online has selected
Thunderstone's Texis software to power a new online classified
advertising system. The Texis-based service already is used for storing
and searching the Monster job listings on the AOL service. Other
classified categories are being market tested.
Texis was selected for its efficient and versatile handling of
complex search criteria. Each potential classified category has a
different structure -- for example, "Real Estate" needs
fields for ZIP code and number of bedrooms. Texis allows developers to
easily apply either text-search or database-query logic, or both
together, to any set of information. Texis's real-time updating and
SQL-oriented application development tools also influenced the
choice.
File Format Filter Enhancements
The File Format Plug-in (anytotx) handling of Microsoft Office
documents has been updated in a variety of ways. For example, it now
extracts document titles and other meta information, and it does a
better job finding text in non-Word files (PowerPoint, Excel, etc.).
By the way, we often are asked whether our products can index
this-or-that unusual file format. For example, customers recently asked
about indexing engineering drawings and news photographs. The answer
is very often yes, even though we may never have dealt with the
particular format. That's because the anytotx program is designed to
examine even file types it doesn't know, and extract any ASCII text
it finds, skipping over the non-ASCII (binary) data. Comments, photo
captions, etc. are commonly stored in ASCII within these types of
files.
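To illustrate the idea of pulling ASCII text out of an unknown binary format, here is a minimal Python sketch. It is not the anytotx implementation; the function name, the minimum run length of 4, and the sample bytes are all assumptions made for the example.

```python
import re


def ascii_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII found in arbitrary binary data.

    A "run" here is min_len or more consecutive bytes in the range
    space through tilde, plus tab. Everything else is skipped.
    """
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical file contents: binary data with two embedded text fields.
sample = b"\x89\x00\x01Caption: Bridge at dusk\x00\xff\xfeab\x00Photo by staff\x02"
print(ascii_runs(sample))
```

The minimum run length keeps short accidental matches (such as the two-byte "ab" in the sample) out of the result. A real filter would likely need a more careful rule.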
Even if the pertinent information is not ASCII, another enhancement
may be of help. We have made it easier to add an external filter to
anytotx. This can be a third-party program not necessarily designed
for use with Texis or Webinator. For more information about how to
accomplish this, please ask Thunderstone
Tech Support.
The anytotx enhancements are part of the Webinator and Texis
distributions beginning with version 4.3.8. Texis maintenance customers
or those with Webinator paid versions 4.0+ may request an update that
includes the new plug-in from
Tech Support.
Other customers may obtain the new plug-in by upgrading Webinator or
joining Texis Maintenance. Search appliance customers should go to the
"Maintenance" menu on the admin page and select "Check
for updates."
Thunderstone Wins Award
We're happy to mention that Thunderstone's Webinator won the
Cleveland Area Knowledge Industry Best Software Product award. By the
way, our headquarters city of Cleveland, Ohio, USA, and the nearby area
are becoming something of a high-tech hotbed, so we feel we were ahead of the
times in setting up shop here 22 years ago! As in other tech hot spots,
first-rate research universities seed the region with talent. (If
you're a programmer with a computer science degree seeking a position in
Northeast Ohio, please
contact us!)
Tech Corner: Date Sorting Anomalies in Web Site Searching
Although Thunderstone's Webinator search script offers users a
"sort by date" option, its sorting is only as good as the
dates it gets from each web site it indexes. We see many sites where
the dates are spurious, resulting in incorrect search result ordering.
In the web walker script (dowalk), the date stored for a web page is
the one returned by the visited web server in the HTTP Last-Modified
header. The visited web server in turn usually gets that
information from the date stamp of the HTML file, as reported by its
file system.
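As a rough illustration of how a file's date stamp becomes a Last-Modified value, the following Python sketch reads a file's modification time and formats it as an HTTP date. It only mimics what a typical web server does; the function name and the example timestamp are assumptions.

```python
import os
import tempfile
from email.utils import formatdate


def last_modified_header(path: str) -> str:
    """Format a file's modification time as an HTTP date string."""
    mtime = os.stat(path).st_mtime
    # usegmt=True produces the "GMT" suffix that HTTP dates use.
    return formatdate(mtime, usegmt=True)


with tempfile.TemporaryDirectory() as tmp:
    page = os.path.join(tmp, "index.html")
    with open(page, "w") as f:
        f.write("<html><body>Hello</body></html>")
    # Back-date the file to 2004-01-01 00:00:00 UTC (arbitrary example).
    os.utime(page, (1072915200, 1072915200))
    print("Last-Modified:", last_modified_header(page))
```

Whatever the file system reports is what the crawler ends up storing, so a wrong date stamp flows straight through to the search results.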
But there are a variety of ways that file system date stamps can be
misleading. For example, when you copy files to your web server by FTP,
the operating system probably will assign the current date to each
file. If your intention was to just copy an older file, you lose the
original date stamp, and the page will be reported as new instead.
We have seen news sites where almost every page contained an
identical Last-Modified date, although the content of those pages
consisted of news articles spanning months or years, which had not
changed since they were published. Over time, the files were moved from
machine to machine, and the original document dates were lost.
Incorrect dates also commonly are seen on dynamically generated
pages. Such pages usually are based on content extracted from a
database. The database might hold correct dates, but if the site's
programmer neglected to incorporate logic for using them, a default
value gets assigned instead.
If your web pages are dated correctly, pay attention to keeping them
that way! If you need to move or copy files, there are procedures that
will preserve the date, such as backup and restore.
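Besides backup and restore, many copy tools can keep the date stamp if asked. As one example, this Python sketch contrasts a plain copy with a metadata-preserving copy using the standard shutil module. The file names and the back-dated timestamp are made up for the demonstration.

```python
import os
import shutil
import tempfile

OLD_STAMP = 1072915200  # 2004-01-01 00:00:00 UTC, an arbitrary example date


def make_old_file(directory: str) -> str:
    """Create a small file and back-date its modification time."""
    path = os.path.join(directory, "article.html")
    with open(path, "w") as f:
        f.write("<html><body>Old news article</body></html>")
    os.utime(path, (OLD_STAMP, OLD_STAMP))
    return path


with tempfile.TemporaryDirectory() as tmp:
    src = make_old_file(tmp)
    # shutil.copy copies the content but not the time stamps.
    plain = shutil.copy(src, os.path.join(tmp, "plain.html"))
    # shutil.copy2 also tries to preserve metadata such as the mtime.
    kept = shutil.copy2(src, os.path.join(tmp, "kept.html"))
    print("original:", int(os.stat(src).st_mtime))
    print("copy    :", int(os.stat(plain).st_mtime))
    print("copy2   :", int(os.stat(kept).st_mtime))
```

The plain copy shows the time of the copy, which is the situation described above; the second copy keeps the original date.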
If the web pages or documents you want to index already have
incorrect dates, you have two options for fixing them. First, if correct
dates are available within the documents themselves (perhaps as a meta
value), you can extract those and substitute them for the HTTP default
value during a walk. That requires customizing the dowalk script. Second,
with full Texis, you can directly change the dates
stored in the Webinator database, either manually or with a new script
containing whatever logic you want to apply.
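The logic of preferring a date found inside the document over the HTTP value can be sketched in Python as follows. This is not dowalk code (dowalk is a Webinator script and would be customized in its own language); the meta name "date", the YYYY-MM-DD format, and all function and class names are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Collect the content of the first <meta name="date"> tag, if any."""

    def __init__(self, meta_name: str = "date"):
        super().__init__()
        self.meta_name = meta_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == self.meta_name:
            self.value = attr.get("content")


def page_date(html: str, last_modified: str) -> datetime:
    """Use the meta date when it parses, else fall back to Last-Modified."""
    parser = MetaDateParser()
    parser.feed(html)
    if parser.value:
        try:
            parsed = datetime.strptime(parser.value.strip(), "%Y-%m-%d")
            return parsed.replace(tzinfo=timezone.utc)
        except ValueError:
            pass  # unusable meta value: fall through to the HTTP date
    return parsedate_to_datetime(last_modified)


html = '<html><head><meta name="date" content="2003-05-17"></head></html>'
print(page_date(html, "Thu, 01 Jan 2004 00:00:00 GMT"))
print(page_date("<p>no meta here</p>", "Thu, 01 Jan 2004 00:00:00 GMT"))
```

Falling back to the HTTP value keeps pages without a usable meta date sortable, though their order is then only as reliable as the server's date stamps.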
Feedback, suggestions and questions are welcome to