Categorize documents or text records to add value to full-text search systems

Texis Categorizer—also known as a classifier—automatically attaches categories, subject codes, metadata and more to documents or text records. The categorizer is an application of the Texis platform and Texis Web Script (Vortex) product. The Texis underpinnings provide a broad range of "hooks" for using the categorizer and tying it into other computer applications. For added flexibility, the categorizer handles most European languages.

Manual, Automatic or Mixed Operation

Each automatic category "recommendation" receives a statistical confidence score. Operation may be manual, automatic or mixed. In manual mode, an operator accepts or rejects each recommendation. In automatic mode, categories are applied without user intervention. In mixed operation, one designates a confidence score threshold so that recommendations above the threshold are accepted automatically and those below are held for human review.
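Mixed operation amounts to a simple routing rule on the confidence score. The following is a minimal sketch in Python, not the product's actual interface: the names `Recommendation` and `route`, the 0-to-1 score scale, and the choice to send a score exactly equal to the threshold to review are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    doc_id: str
    category: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def route(recs, threshold):
    """Split recommendations into (auto-accepted, held-for-review)."""
    accepted, review = [], []
    for rec in recs:
        # Strictly above the threshold: accept without an operator.
        # Everything else (including ties) waits for human review.
        if rec.confidence > threshold:
            accepted.append(rec)
        else:
            review.append(rec)
    return accepted, review


recs = [
    Recommendation("doc-1", "energy", 0.92),
    Recommendation("doc-2", "merger", 0.41),
]
accepted, review = route(recs, threshold=0.75)
```

Under these assumptions, a threshold of 1.0 sends every recommendation to review (manual mode), while a negative threshold accepts everything (automatic mode).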

Enhance Text Searching

The benefits that categories bring to full-text search systems include:

  • Sorting: keys for sorting or grouping search results.
  • Menus: a "controlled vocabulary" that users can select from instead of, or in addition to, trial-and-error searching.
  • Browsing: a finite set of hyperlinks can be "navigated" as a means to browse through data in an organized fashion.

The classification results can be stored in the database or drive further processing. Customers generally begin with a taxonomy, or pre-determined set of categories, but authorized users can create new categories as needed through the dynamic system.

A Highly Scalable Classifier

Texis Categorizer is highly scalable: one typical server can classify tens of thousands of documents daily. It can also operate in real time, categorizing new documents as soon as they become available.

More Accurate Search

Accuracy is a function of both the quantity and quality of the training examples. Categorization results approved or corrected by an operator are fed back into the training base, so the categorizer's results become more accurate over time. In addition, hierarchical category schemes are easily accommodated.

Architecture and Integration

Texis Categorizer can be controlled through a web interface from any location, and it can be connected to a range of other information sources or repositories by standard data interchange mechanisms, including:

  • FTP
  • HTTP with or without XML
  • ODBC
  • JDBC
  • Perl DBI/DBD

A "web services" application development model is supported. A feature-rich C-callable API is also available.

The Texis Categorizer uses both the database and the search-engine features of Texis, with documents managed in a Texis SQL table. Categories are added as updates to records in the table, and relevance-ranking algorithms determine the recommended categories for each document.

From a database point of view, a category is a value in a column of a SQL table. A categorization scheme may have multiple columns and a record may contain multiple values in a column. For example, a database of business news may contain a column "industry" (with possible values "energy"; "agriculture"; etc.) and a column "event" (with possible values "merger"; "ipo"; etc.). These are represented in HTML by the tagging convention .
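The business-news example above can be modeled as records whose category columns each hold a list of values. The sketch below uses plain Python dictionaries as a stand-in for rows of a SQL table; it does not use Texis itself, and the record contents and the helper names `with_category` and `group_by` are hypothetical.

```python
from collections import defaultdict

# Each record has two category columns, "industry" and "event".
# A column may carry more than one value for the same record.
articles = [
    {"id": 1, "industry": ["energy"], "event": ["merger"]},
    {"id": 2, "industry": ["energy", "agriculture"], "event": ["ipo"]},
    {"id": 3, "industry": ["agriculture"], "event": ["merger"]},
]


def with_category(records, column, value):
    """Return the ids of records whose category column contains the value."""
    return [rec["id"] for rec in records if value in rec[column]]


def group_by(records, column):
    """Group record ids under every value found in the category column."""
    groups = defaultdict(list)
    for rec in records:
        for value in rec[column]:
            groups[value].append(rec["id"])
    return dict(groups)


energy_ids = with_category(articles, "industry", "energy")
by_event = group_by(articles, "event")
```

Filtering on one column supports menu-style selection, and grouping on a column gives the finite set of links used for browsing.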

Categories can be used in conjunction with the Texis thesaurus to enhance text searching. For example, if your data contain documents labeled with a "finance" category, you might designate "banking" as a synonym of finance. Then, set up a text-search application such that a free-text query on "banking" would automatically return records in the finance category, even if they don't mention banking.
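The "banking" example can be illustrated with a small sketch of synonym-to-category query expansion. This is not the Texis thesaurus API: the `THESAURUS` mapping, the `search` function, and the sample records are hypothetical, and plain substring matching stands in for a real full-text search.

```python
# Hypothetical thesaurus: a query term mapped to the category it stands for.
THESAURUS = {"banking": "finance"}

records = [
    {"id": 1, "text": "Quarterly earnings rose on strong lending.",
     "categories": ["finance"]},
    {"id": 2, "text": "New banking rules take effect.",
     "categories": ["regulation"]},
    {"id": 3, "text": "Wheat harvest forecasts improved.",
     "categories": ["agriculture"]},
]


def search(recs, query):
    """Return ids of records matching the query text or its mapped category."""
    term = query.lower()
    category = THESAURUS.get(term)  # None when the term has no mapping
    hits = []
    for rec in recs:
        text_match = term in rec["text"].lower()
        category_match = category is not None and category in rec["categories"]
        if text_match or category_match:
            hits.append(rec["id"])
    return hits


banking_hits = search(records, "banking")
```

Here record 1 is returned for "banking" only because it carries the finance category, even though its text never mentions banking; record 2 matches on its text alone.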