|
REX is a highly optimized pattern recognition tool that has been
modeled after the UNIX family of tools: GREP, EGREP,
FGREP, and LEX. Wherever possible REX's syntax
has been held consistent with these tools, but there are several major
departures that may bite those who are used to using the GREP
family.
REX uses a combination of techniques that allow it to surpass the
speed of anything similar to it by a very wide margin.
The technique that provides the largest advantage is called "state
anticipation or state skipping" which works as follows ...
If we were looking for the pattern:
ABCDE
in the text:
AAAAABCDEAAAAAAA
a normal pattern matcher would do the following:
ABCDE
ABCDE
ABCDE
ABCDE
ABCDE
AAAAABCDEAAAAAAA
The state anticipation scheme would do the following:
ABCDE
ABCDE
AAAAABCDEAAAAAAA
The normal algorithm moves one character at a time through the text,
comparing the leading character of the pattern to the current text
character of text, and if they match, it compares the leading pattern
character +1 to the current text character +1, and so on ...
The state anticipation pattern matcher is aware of the length of the
pattern to be matched, and compares the last character of the pattern
to the corresponding text character. If the two are not equal, it
moves over by an amount that would allow it to match the next
potential hit.
If one were to count the number of comparison cycles for each pattern
matching scheme using the example above, the normal pattern matcher
would have to perform 13 compare operations before locating the first
occurrence; versus 6 compare operations for the state anticipation
pattern matcher.
One concept to grasp here is: "The longer the pattern to be
found, the faster the state anticipation pattern matcher will be;"
while a normal pattern matcher will slow down as the pattern gets
longer.
Herein lies the first major syntax departure: REX always applies
repetition operators to the longest preceding expression. It does
this so that it can maximize the benefits of using the state skipping
pattern matcher.
If you were to give GREP the expression: ab*de+
It would interpret it as: an `a' then 0 or more `b's'
then a `d' then 1 or more `e's'.
REX will interpret this as: 0 or more occurrences of
`ab' followed by 1 or more occurrences of `de'.
The second technique that provides REX with a speed advantage is
ability to locate patterns both forwards and backwards
indiscriminately.
Given the expression: "abc*def", the pattern matcher is
looking for "Zero to N occurrences of `abc'
followed by a `def"'.
The following text examples would be matched by this expression:
abcabcabcabcdef
def
abcdef
But consider these patterns if they were embedded within a body of text:
My country 'tis of abcabcabcabcdef sweet land of def, abcdef.
A normal pattern matching scheme would begin looking for
`abc*'. Since `abc*' is matched by every position
within the text, the normal pattern matcher would plod along checking
for `abc*' and then whether it's there or not it would try to
match `def'. REX examines the expression in search of the the
most efficient fixed length subpattern and uses it as the root of
search rather than the first subexpression. So, in the example above,
REX would not begin searching for `abc*' until it has located a
`def'.
There are many other techniques used in REX to improve the rate at
which it searches for patterns, but these should have no effect on the
way in which you specify an expression.
The three rules that will cause the most problems to experienced
GREP users are:
- Repetition operators are always applied to the longest
expression.
- There must be at least one subexpression that has one or more
repetitions.
- No matched subexpression will be located as part of another.
- Rule 1 Example
- :
abc=def* Means one `abc' followed by 0 or more `def's' .
- Rule 2 Example
- :
abc*def* CANNOT be located because it matches every position within the text.
- Rule 3 Example
- :
a+ab Is idiosyncratic because `a+' is a subpart of `ab'.
Copyright © Thunderstone Software Last updated: Sun Mar 17 21:14:49 EDT 2013
|