Thunderstone Texis Categorizer
Overview
Thunderstone's Texis categorizer (also known as a classifier)
automatically attaches categories, subject codes, metadata, and the
like to documents or text records.
Categories add value to full-text search systems in several ways:
- Sorting: Categories provide keys for sorting or grouping search results.
- Menus: They provide a "controlled vocabulary" that users can select from, instead of - or in addition to - trial-and-error searching.
- Browsing: They provide a finite set of hyperlinks that may be "navigated" as a means to browse through data in an organized fashion.
Customers generally begin with a "taxonomy," meaning a predetermined
set of categories. For each category, one must provide "training" data,
consisting of example documents. However, the system is quite dynamic,
in that authorized users may create new categories as needed for new
types of content.
Each automatic category "recommendation" receives a statistical
confidence score. Operation may be manual, automatic, or mixed.
In manual mode, an operator accepts or rejects each recommendation. In
automatic mode, categories are applied without user intervention. In
mixed operation, one designates a confidence score threshold, such that
recommendations above the threshold are accepted automatically, but
those below are held for human review.
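
The mixed mode described above can be sketched as a simple triage step. This is only an illustration, not the categorizer's own interface: the `Recommendation` class, the `triage` function, and the 0.0 to 1.0 score range are assumptions made for the example. A score exactly at the threshold is treated here as held for review, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    doc_id: str
    category: str
    confidence: float  # assumed range 0.0 to 1.0 (hypothetical)


def triage(recommendations, threshold):
    """Split recommendations into auto-accepted and held-for-review lists."""
    accepted, held = [], []
    for rec in recommendations:
        # Strictly above the threshold: accept automatically.
        # At or below the threshold: hold for a human operator.
        if rec.confidence > threshold:
            accepted.append(rec)
        else:
            held.append(rec)
    return accepted, held


recs = [
    Recommendation("doc-1", "energy", 0.92),
    Recommendation("doc-2", "agriculture", 0.41),
    Recommendation("doc-3", "energy", 0.75),
]
accepted, held = triage(recs, threshold=0.75)
```

In this sketch, a threshold of 0.0 approximates fully automatic operation for any positive score, while a threshold of 1.0 holds everything and so corresponds to fully manual operation.
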
Accuracy is a function of both the quantity and quality of the
examples. Categorization results approved or corrected by an operator
are fed back into the training base, so the categorizer's results
become more accurate over time.
Hierarchical category schemes are easily accommodated. The
categorizer handles most European languages.
Architecture and Integration
The categorizer is an application of Thunderstone's Texis and Texis
Web Script products. The Texis underpinnings provide a broad range of
"hooks" for using the categorizer and tying it into other computer
applications. The system may be controlled through a web interface by
authorized individuals from any location. It may be interconnected to a
wide range of other information sources or repositories.
Interconnection may be accomplished by standard data interchange
mechanisms including FTP; HTTP with or without XML; ODBC; JDBC; or Perl
DBI/DBD. A "web services" application development model is supported. A
feature-rich C-callable API is also available.
The categorizer uses both the database and the search-engine
features of Texis. The documents are managed in a Texis SQL table. The
categories are added as updates to records in the table. The
Thunderstone relevance ranking algorithms are used to determine
"recommended" categories for each document.
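
To make the idea of ranking-based recommendation concrete, here is a rough sketch in which a new document is scored against the training examples of each category and the categories are returned best first. It deliberately substitutes a plain bag-of-words cosine similarity for Thunderstone's relevance ranking, whose details are not described here. The function names and the sample training data are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term count (whitespace tokens only)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(document, training):
    """Return (category, score) pairs, best first.

    A category's score is the similarity of its best-matching example.
    This is a stand-in scoring rule, not the product's algorithm.
    """
    doc_vec = vectorize(document)
    scores = {}
    for category, examples in training.items():
        scores[category] = max(
            (cosine(doc_vec, vectorize(e)) for e in examples), default=0.0
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical training base: category -> example documents.
TRAINING = {
    "energy": ["oil prices rise as pipeline output falls", "solar power plant opens"],
    "agriculture": ["wheat harvest exceeds forecasts", "farm subsidies debated"],
}

ranked = recommend("oil pipeline output rises", TRAINING)
```

The score attached to each category plays the role of the confidence value: the top entry of `ranked` is the first recommendation, and lower entries are weaker candidates.
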
From a database point of view, a category is a value in a column of
a SQL table. A categorization scheme may have multiple columns, and a
record may contain multiple values in a column. For example, a database
of business news may contain a column "industry" (with possible values
"energy"; "agriculture"; etc.) and a column "event" (with possible
values "merger"; "ipo"; etc.). In HTML, such categories are represented
by the tagging convention <meta name=... content=... >.
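
As an illustration of the column-and-value model, the sketch below holds one record's categories as a mapping from column name to a list of values and renders them as HTML meta tags. The `meta_tags` function is hypothetical, and emitting one tag per value is an assumption; the text above does not say how multiple values in one column are encoded.

```python
from html import escape


def meta_tags(record):
    """Render each category value of a record as one HTML <meta> tag."""
    lines = []
    for column, values in record.items():
        for value in values:
            # Assumption: one <meta> tag per value of a multi-valued column.
            name = escape(column, quote=True)
            content = escape(value, quote=True)
            lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


# A business-news record with two category columns; "event" has two values.
record = {"industry": ["energy"], "event": ["merger", "ipo"]}
print(meta_tags(record))
```
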
Categories can be used in conjunction with the Texis thesaurus to
enhance text searching. For example, if your data contain documents
labeled with a "finance" category, you might designate "banking" as a
synonym of finance. Then one could set up a text-search application
such that a free-text query on "banking" would automatically return
records in the finance category, even if they don't mention
banking.
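
The banking example can be sketched as a query that matches either the document text or a category reached through a synonym. This is a simplified stand-in and does not use the Texis thesaurus or its query syntax; the `search` function, the synonym table, and the sample documents are all hypothetical.

```python
# Hypothetical thesaurus: query term -> category it is a synonym of.
SYNONYM_TO_CATEGORY = {"banking": "finance"}

# Hypothetical records: free text plus one assigned category each.
DOCUMENTS = [
    {"id": 1, "text": "Quarterly bond yields rose sharply.", "category": "finance"},
    {"id": 2, "text": "New banking rules take effect.", "category": "regulation"},
    {"id": 3, "text": "Wheat harvest exceeds forecasts.", "category": "agriculture"},
]


def search(query, documents, thesaurus):
    """Return ids of documents matching the query text or its synonym category."""
    term = query.lower()
    category = thesaurus.get(term)  # None when the term has no category synonym
    hits = []
    for doc in documents:
        text_match = term in doc["text"].lower()
        category_match = category is not None and doc["category"] == category
        if text_match or category_match:
            hits.append(doc["id"])
    return hits


result = search("banking", DOCUMENTS, SYNONYM_TO_CATEGORY)
```

Here document 1 is returned because it carries the finance category even though its text never mentions banking, and document 2 is returned by the ordinary text match.
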
The categorizer is highly scalable. A typical server machine can
classify tens of thousands of documents daily. It may be set up to perform
categorization on new documents as soon as they are available
(real-time operation).
Contact us for more information.