Two leading techniques in text retrieval have been file inversion and
tokenization. The dominant problem with both techniques is that they
require modification of the original data before it can be made
accessible to the retrieval tool. The second problem, which has deeper
ramifications, is that in order to perform file inversion or
tokenization, such programs make certain predetermined decisions about
the lexical items that can later be identified.
How good such programs are depends in great part upon their ability to
identify and then locate specified lexical items. In most cases the
set of lexical items identified by the inversion or tokenization
routines is simply insufficient to guarantee the retrieval of
everything one might want to search for. But even where the set of
identifiable lexical items is reasonably good, one basic limitation
always remains: one can never locate anything beyond the set of
lexical items listed in the lookup table.
It is for these reasons that when we make use of indexing techniques,
we supplement them with a final linear text read where required. Many
content-oriented, definitive searches require a linear read of the
surrounding context to make accurate determinations of relevance.
In systems where a lookup table contains either file pointers or
tokens, context is missing. You cannot search for something next to
something else, because the lookup table keeps no adequate record of
the relative locations of items. You may yet find what you are looking
for, but you may have to convert the file back to its original form
before you can do so.
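The point can be illustrated with a minimal sketch (hypothetical code,
not Texis internals): a bare lookup table records only which tokens
occur, so an adjacency query is unanswerable until positional
information is restored.

```python
# Hypothetical illustration of why a plain token lookup table cannot
# answer "X next to Y" queries, and how recording positions fixes it.

text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# A bare lookup table: which tokens occur, but not where.
lookup = set(tokens)
# "brown" and "fox" are both present, yet adjacency is unknowable here.

# A positional index keeps token offsets, restoring the lost context.
positions = {}
for i, tok in enumerate(tokens):
    positions.setdefault(tok, []).append(i)

def adjacent(a, b):
    """True if token a appears immediately before token b."""
    return any(i + 1 in positions.get(b, []) for i in positions.get(a, []))

print(adjacent("brown", "fox"))   # True: positions make the query answerable
print(adjacent("fox", "brown"))   # False
```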
It will be hard to find a program that stores any, let alone all,
possible combinations of words making up phrases, idiomatic
expressions, and acronyms in the lookup table. While you can look for
"Ming", or you can look for "Ling", you cannot directly look for
"Ming-Ling". Another tricky category is that of regular expressions
involving combinations of lexical items. If you are searching the
Bible by verse, you want to find a pattern such as "digit, digit,
colon, digit". This cannot be done when occurrences of digits are
stored separately from occurrences of colons.
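A linear scan has no such difficulty, because a pattern can span token
boundaries. The following sketch (sample text and pattern are our own,
not from Texis) matches verse references directly with a regular
expression:

```python
import re

# Hypothetical sample text; a linear regex scan finds the whole
# "digits, colon, digits" pattern in one pass, something a lookup
# table of separately stored digits and colons cannot do.
text = "For the sermon, see John 3:16 and compare Romans 12:2."

verse = re.compile(r"\d+:\d+")   # one or more digits, a colon, more digits
print(verse.findall(text))       # ['3:16', '12:2']
```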
Our own database tool, Texis, has a modifiable lexical analyzer, which
goes further than other indexing programs in addressing this problem.
However, a linear free-text scan still offers the maximum flexibility
for finding any type of pattern in relation to another pattern, and it
has therefore been included in Texis as part of the search technique
used to qualify hits.
Metamorph is the search engine inside Texis, and it has the broadest
capability for identifying a diverse variety of lexical items. No
other program has such an extended capability to recognize these
items. We can look for special expressions by themselves or in
proximity to sets of concepts. The logical set operators 'and', 'or',
and 'not' are applied to whole sets of lexical items, rather than to
single, specified lexical items. Because Metamorph is so fast,
benchmarked even in its early years of development at up to 4.5
megabytes per second in some Unix environments, we can read text live
where required and get extremely impressive results.
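The idea of applying set operators to whole sets of lexical items can
be sketched as follows. This is a toy illustration under our own
assumptions (the concept sets and matching rule are invented for the
example, and this is not Metamorph's actual algorithm): a sentence
qualifies when every 'and' set is represented by at least one member
and no member of the 'not' set appears.

```python
# Toy sketch of logical set operators over sets of lexical items.
# The concept sets below are hypothetical, invented for illustration.

sentences = [
    "The doctor prescribed a new medicine for the illness.",
    "The physician went sailing on vacation.",
    "A new drug was tested by the veterinarian.",
]

AND_SETS = [{"doctor", "physician", "veterinarian"},
            {"medicine", "drug", "prescription"}]
NOT_SET = {"vacation"}

def qualifies(sentence):
    words = set(sentence.lower().replace(".", "").split())
    return (all(words & s for s in AND_SETS)   # every 'and' set is represented
            and not (words & NOT_SET))         # no 'not'-set word appears

hits = [s for s in sentences if qualifies(s)]
print(hits)   # first and third sentences qualify
```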
Where stream text is involved, such as in a message analysis system
where large amounts of information arrive at a steady rate, or in a
news alert profiling system, you could not, practically speaking,
tokenize all the data as it came in before reading it to find out
whether it contained anything worth attending to. Using a tokenized
system to search incoming messages on a first pass would be unwieldy,
inefficient, and lacking in discretion. Indexes are more appropriately
used when searching large amounts of archived data. Texis uses
Metamorph search engines internally where required to read stream text
in all of its rich context, without losing speed or efficiency.
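A first-pass filter over stream text can be sketched like this (a
minimal illustration with invented sample messages, not Texis code):
each incoming line is scanned live for a pattern of interest, and only
qualifying lines are kept for deeper processing. Nothing is tokenized
or indexed up front.

```python
import io
import re

# Simulated message stream (hypothetical content for illustration).
stream = io.StringIO(
    "routine status report\n"
    "ALERT: power failure in sector 7\n"
    "weather update: clear skies\n"
    "ALERT: intrusion detected\n"
)

# Scan each line as it arrives; keep only lines worth attending to.
alert = re.compile(r"\bALERT\b")
worth_attention = [line.rstrip() for line in stream if alert.search(line)]
print(worth_attention)
```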
When Texis sends out a Metamorph query, the search and retrieval are
fast and thorough. A parser picks out phrases, along with any known
associations, without their having to be marked as such. Search
questions are automatically expanded to sets of concepts without your
having to specify what those sets are, or having to knowledge-engineer
connections in the files. Regular expressions can be located that
other systems might miss. Numeric quantities entered as text, as well
as misspellings and typos, can be found. Intersections of all the
permutations of the defined sets can be located within a unit of text
defined by the user. No other search program is capable of this.
Even if you were to find a comparable database management system with
which to manage your text (and we challenge you to do so!), at the
bottom line you could not find all the specific things that the
Metamorph search engine lets you find. In Texis we now have a
completely robust body; inside, it still retains the heart and soul of
Metamorph.
Copyright © Thunderstone Software Last updated: Sun Mar 17 21:14:49 EDT 2013