Rex Regular EXpression Pattern Matcher

Q: What does it do?
A: It finds things that nothing else can!

Like:

People stick many things in text that you cannot locate with the simple wildcard matching that most other packages are limited to. The Rex matcher is one of those features that is not called upon very often, but when it's needed it will save a tremendous amount of time and aggravation! Using a text retrieval package without this ability is like driving a car that doesn't have a spare tire; eventually you'll be stranded.

Below is the rather lengthy description of Rex's syntax. Fortunately you don't need to learn it because there are macro's and easy-to-use expression builders within Metamorph to help guide you through.


REX Expression Syntax

Expressions are composed of characters and operators. Operators are characters with special meaning to Rex. The following characters have special meaning: "\+*{},[]^$.-!" and must be escaped with a '\' if they are meant to be taken literally. The string ">>" is also special and if it is to be matched, it should be written "\>>".

Repetition Operators

A regular expression may be followed by a repetition operator in order to indicate the number of times it may be repeated.

Caveats and Commentary

REX is a highly optimized pattern recognition tool that has been modeled after the UNIX family of tools: GREP, EGREP, FGREP, and LEX. Wherever possible REX's syntax has been held consistent with these tools, but there are several major departures that may bite those who are used to using the GREP family.

REX uses a combination of techniques that allow it to surpass the speed of anything similar to it by a very wide margin.

The technique that provides the largest advantage is called "state-anticipation or state-skipping" which works as follows:

if we were looking for the pattern:

                       ABCDE
in the text:

                       AAAAABCDEAAAAAAA

a normal pattern matcher would do the following:

                       ABCDE
                        ABCDE
                         ABCDE
                          ABCDE
                           ABCDE
                       AAAAABCDEAAAAAAA

The state-anticipation scheme would do the following:

                       ABCDE
                           ABCDE
                       AAAAABCDEAAAAAAA
The normal algorithm moves one character at time through the text, comparing the leading character of the pattern to the current text character of text, and if they match, it compares the leading pattern character +1 to the current text character +1 , and so on...

The state anticipation pattern matcher is aware of the length of the pattern to be matched, and compares the last character of the pattern to the corresponding text character. If the two are not equal, it moves over by an amount that would allow it to match the next potential hit.

If one were to count the number of comparison cycles for each pattern matching scheme using the example above, the normal pattern matcher would have to perform 13 compare operations before locating the first occurrence VS. 6 compare operations for the state-anticipation pattern matcher.

One concept to grasp here is that: "The longer the pattern to be found, the faster the state-anticipation pattern matcher will be." While a normal pattern matcher will slow down as the pattern gets longer.

Herein lies the first major syntax departure: REX always applies repetition operators to the longest preceding expression. It does this so that it can maximize the benefits of using the state-skipping pattern matcher.

If you were to give GREP the expression : ab*de+
It would interpret it as:
 an "a" then 0 or more "b"'s then a "d" then 1 or more "e"'s.

REX will interpret this as:
 0 or more occurrences of "ab" followed by 1 or more occurrences of "de".
The second technique that provides REX with a speed advantage is ability to locate patterns both forwards and backwards indiscriminately.

Given the expression: "abc*def", the pattern matcher is looking for "Zero to N occurrences of 'abc' followed by a 'def'".

The following text examples would be matched by this expression:

     abcabcabcabcdef
     def
     abcdef
But consider these patterns if they were embedded within a body of text:

     My country 'tis of abcabcabcabcdef sweet land of def, abcdef.

A normal pattern matching scheme would begin looking for 'abc*' . Since 'abc*' is matched by every position within the text, the normal pattern matcher would plod along checking for 'abc*' and then whether it's there or not it would try to match "def". REX examines the expression in search of the the most efficient fixed length sub-pattern and uses it as the root of search rather than the first sub-expression. So, in the example above, REX would not begin searching for "abc*" until it has located a "def".

There are many other techniques used in REX to improve the rate at which it searches for patterns, but these should have no effect on the way in which you specify an expression.

The three rules that will cause the most problems to experienced GREP users are:

Rule 1 example:

abc=def*  Means one "abc" followed by 0 or more "def"'s .

Rule 2 example:

abc*def*  CAN NOT be located because it matches every position within the text.

Rule 3 example:

a+ab  Is idiosyncratic because "a+" is a subpart of "ab".
back up




Copyright © 1996 Thunderstone Software