Tokenization and Inverted Files

Two leading techniques in text retrieval have been file inversion and tokenization. The dominant problem with both is that the original data must be modified before the retrieval tool can search it. The second problem, which has deeper ramifications, is that to perform file inversion or tokenization, such programs must decide in advance which lexical items can later be identified.

How good such programs are depends in great part on their ability to identify and then locate specified lexical items. In most cases the set of lexical items identified by the inversion or tokenization routines is simply insufficient to guarantee the retrieval of everything one might want to search for. But even where the set of identifiable lexical items is reasonably good, a basic limitation remains: one can never locate a superset of a lexical item listed in the lookup table, that is, anything larger than a single stored entry.

It is for these reasons that when we make use of indexing techniques, we supplement them with a final linear text read where required. Many content-oriented searches that must be definitive need a linear read of the context to determine relevance accurately.

In systems where a lookup table exists containing either file pointers or tokens, context is missing. You cannot search for something next to something else, as the lookup table holds no adequate record of where items sit relative to one another. You may yet find what you are looking for, but you may have to convert the file to its original form before you can do so.
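The two-stage approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not how Texis is implemented: the tokenizer, the document-level lookup table, and the helper names (build_index, candidates, within) are all assumptions made for the example. The lookup table reports that both documents contain both words; only a linear read of the text can tell which one has the words next to each other.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters or digits, punctuation dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Hypothetical lookup table: token -> set of document ids (no positions kept).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index

def candidates(index, words):
    # Documents that contain every query word, according to the table alone.
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

def within(text, first, second, max_gap):
    # Linear read: does `second` occur at most max_gap words after `first`?
    toks = tokenize(text)
    for i, tok in enumerate(toks):
        if tok == first and second in toks[i + 1 : i + 1 + max_gap]:
            return True
    return False

docs = {
    1: "The treaty was signed in Paris after long talks.",
    2: "Paris was quiet. Months later a treaty was finally signed elsewhere.",
}
index = build_index(docs)

# Stage 1: the index narrows the field but cannot judge context.
cands = candidates(index, ["signed", "paris"])

# Stage 2: a linear read of each candidate qualifies the hit.
hits = {d for d in cands if within(docs[d], "signed", "paris", 2)}
```

Here the index alone returns both documents, while the linear pass keeps only the first. The index still earns its place by limiting how much text has to be read.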

It will be hard to find a program which stores any, let alone all, possible combinations of words making up phrases, idiomatic expressions, and acronyms in the lookup table. While you can look for "Ming", or you can look for "Ling", you cannot directly look for "Ming-Ling". Another tricky category is that of regular expressions involving combinations of lexical items. If you are searching the Bible by verse, you want to find a pattern of "digit, digit, colon, digit". This cannot be done when occurrences of digits are stored separately from the occurrences of colons.
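A small Python sketch may make the point concrete. The tokenizer here is an assumption chosen for illustration (it keeps runs of letters or digits and discards punctuation), and the sample sentence is invented; real indexing programs differ in what they keep. The table knows "Ming" and "Ling" but has no entry for the hyphenated name or for the colon, while a linear scan of the original text finds both patterns directly.

```python
import re

def tokenize(text):
    # Assumed tokenizer: runs of letters or digits; hyphens and colons are dropped.
    return re.findall(r"[A-Za-z0-9]+", text)

text = "Ming-Ling read John 12:3, then Ming phoned Ling."

# Hypothetical lookup table: just the set of stored tokens.
table = set(tokenize(text))

has_ming = "Ming" in table            # stored as its own token
has_compound = "Ming-Ling" in table   # never stored: the hyphen was discarded
has_colon = ":" in table              # never stored: colons are not tokens

# A linear scan of the original text can match across token boundaries.
compound_hits = re.findall(r"Ming-Ling", text)
verse_hits = re.findall(r"\d\d:\d", text)   # digit, digit, colon, digit
```

The digits "12" and "3" are both in the table, but nothing records that a colon stood between them, so the verse pattern can only be recovered from the text itself.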

Our own database tool Texis has a modifiable lexical analyzer, which goes further than other indexing programs to attend to this problem. However, a linear free text scan still gives the maximum flexibility for looking for any type of pattern in relation to another pattern, and therefore has been included in Texis as part of the search technique used to qualify hits.

Metamorph is the search engine inside of Texis which contains maximum capability for identification of a diverse variety of lexical items. No other program has such an extended capability to recognize these items. We can look for special expressions by themselves or in proximity to sets of concepts. Logical set operators 'and', 'or', and 'not' are applied to whole sets of lexical items, rather than just single, specified lexical items. Because Metamorph is so fast, benchmarked even in its very early years of development as searching up to 4.5 megabytes per second in some Unix environments, we can read text live where required and get extremely impressive results.
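To illustrate what it means to apply 'and', 'or', and 'not' to whole sets rather than to single items, here is a rough Python sketch. It is not Metamorph's query syntax or API; the concept sets are hand-written for the example, and the function names (words_in, matches, search) are hypothetical. A set counts as matched when any one of its members appears in the unit of text being read.

```python
import re

def words_in(text):
    # Lowercased words found in one unit of text (here, one sentence).
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(text, concept):
    # A set matches when any one of its members appears in the text.
    return bool(words_in(text) & concept)

def search(sentences, all_of=(), any_of=(), none_of=()):
    # 'and' over all_of, 'or' over any_of, 'not' over none_of, each applied to sets.
    hits = []
    for s in sentences:
        if not all(matches(s, c) for c in all_of):
            continue
        if any_of and not any(matches(s, c) for c in any_of):
            continue
        if any(matches(s, c) for c in none_of):
            continue
        hits.append(s)
    return hits

# Hypothetical hand-written concept sets, for illustration only.
power = {"power", "strength", "force"}
struggle = {"struggle", "conflict", "fight"}
sport = {"football", "boxing"}

sentences = [
    "The conflict was a test of strength between the two parties.",
    "The boxing fight showed raw force.",
    "Power was restored by noon.",
]
result = search(sentences, all_of=[power, struggle], none_of=[sport])
```

Only the first sentence is returned: it contains a member of both required sets and no member of the excluded one, even though neither "power" nor "struggle" appears in it literally.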

Where stream text is involved, as in a message analysis system receiving large amounts of information at a steady rate, or a news alert profiling system, you could not, practically speaking, tokenize all the data as it came in before reading it to find out whether anything in it was worth attending to. Using a tokenized system to search incoming messages at a first pass would be very unwieldy, as well as inefficient and lacking in discretion. Indexes are more appropriately useful when searching large amounts of archived data. Texis makes use of Metamorph search engines internally where required to read the stream text in all of its rich context, without losing speed or efficiency.
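As a rough sketch of the stream case, the following Python reads each incoming message once, as it arrives, and tests it against a few stored profiles without building any index first. The profile names, patterns, and sample messages are invented for illustration, and scan_stream is a hypothetical helper, not a Texis function.

```python
import re

# Hypothetical alert profiles: a name mapped to a compiled pattern.
PROFILES = {
    "earthquake": re.compile(r"\b(earthquake|tremor|aftershock)s?\b", re.IGNORECASE),
    "recall": re.compile(r"\brecall(ed|s)?\b", re.IGNORECASE),
}

def scan_stream(messages, profiles=PROFILES):
    # Read each message once as it arrives; yield it with the profiles it matched.
    for msg in messages:
        matched = [name for name, pat in profiles.items() if pat.search(msg)]
        if matched:
            yield msg, matched

# Any iterable works as the stream; an iterator stands in for a live feed here.
incoming = iter([
    "Markets closed higher on Tuesday.",
    "A strong tremor was felt near the coast.",
    "The maker recalled 4,000 heaters.",
])
alerts = list(scan_stream(incoming))
```

Messages that match no profile are simply passed over, so nothing has to be tokenized or stored before the decision is made.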

Where Texis sends out a Metamorph query, it is fast and thorough in its search and its retrieval. A parser picks out phrases without your having to mark them as such, along with any known associations. Search questions are automatically expanded to sets of concepts without your having to specify those sets or knowledge engineer connections in the files. Regular expressions can be located which other systems might miss. Numeric quantities entered as text, misspellings, and typos can be found. Intersections of all the permutations of the defined sets can be located within a unit of text defined by the user. No other search program is capable of this.

Even were you to find a comparable Database Management System with which to manage your text (which we challenge you to do!), at the bottom line, you could not find all the specific things that the Metamorph search engine would let you find. In Texis we now have a completely robust body; inside, it yet retains the heart and soul of Metamorph.


Copyright © Thunderstone Software     Last updated: Oct 5 2023