One of the most common types of record-oriented text has a few
header lines preceding a portion of narrative text. This whole pattern
is repeated throughout the file, so that there are many records per
file. You want to capture each header in its own field, and also
capture the full text of the record in a separate field. The sample
file timport.sch provides an example of this.
The individual fields might be defined as separate expressions, or
they might be defined as subexpressions of one large expression
defining the whole record. Where an expression is defined for an
entire record, it is assigned to the keyword recexpr (record
expression).
Where a recexpr is used, each field is defined with a number
indicating which portion or range of the overall expression captures
the data for that field. Where recexpr is not used, each field has
its own REX expression.
The expression for a field is referred to as its tag. You can use
the default expressions or construct your own complete REX
expression. In the example that follows, the fields are easily
tagged as From, Subject, Number, and Date.
The text of the whole record is stored in the field called
Text.
The first portion of the file timport.sch is the schema. The
last portion is sample text to import, which looks like this:
From: multiple record file
Subject: First multiple record
Number: 1
Date: 1995-04-19 11:31:00
This is my message; this is my file.
^L
From: multiple record file
Subject: Second multiple record
Number: 2
Date: 1995-04-19 11:32:00
This is another message.
^L
From: multiple record file
Subject: Third multiple record
Number: 3
Date: 1995-04-19 11:33:00
This is getting tedious!
I'm going to stop now.
Where multiple records occur in a single file, they are separated
by some repeating textual pattern. In this example, the form feed
character \x0c, which appears as ^L, separates the three records.
The keyword for this is recdelim, for record delimiter. Where a
recdelim is defined in a schema file, it implies that there are
multiple records.
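The effect of a record delimiter combined with per-field tags can be sketched in Python. This is only an analogy, not how timport is implemented: it uses Python's re module rather than REX, and the names SAMPLE, FIELD_TAGS, split_records, and parse_record are hypothetical, introduced for illustration.

```python
import re

# Sample text modeled on the excerpt above; "\x0c" is the form feed
# that the listing shows as ^L.
SAMPLE = (
    "From: multiple record file\n"
    "Subject: First multiple record\n"
    "Number: 1\n"
    "Date: 1995-04-19 11:31:00\n"
    "This is my message; this is my file.\n"
    "\x0c\n"
    "From: multiple record file\n"
    "Subject: Second multiple record\n"
    "Number: 2\n"
    "Date: 1995-04-19 11:32:00\n"
    "This is another message.\n"
    "\x0c\n"
    "From: multiple record file\n"
    "Subject: Third multiple record\n"
    "Number: 3\n"
    "Date: 1995-04-19 11:33:00\n"
    "This is getting tedious!\n"
    "I'm going to stop now.\n"
)

# Hypothetical stand-ins for field tags: one Python regex per header
# field (these are not REX expressions).
FIELD_TAGS = {
    name: re.compile(r"^" + name + r":[ \t]*(.*)$", re.MULTILINE)
    for name in ("From", "Subject", "Number", "Date")
}


def split_records(text, recdelim="\x0c"):
    """Split the input on the record delimiter, dropping empty pieces."""
    return [chunk.strip("\n") for chunk in text.split(recdelim) if chunk.strip()]


def parse_record(record):
    """Apply each field's own expression, then keep the whole record as Text."""
    fields = {}
    for name, tag in FIELD_TAGS.items():
        match = tag.search(record)
        fields[name] = match.group(1).strip() if match else None
    fields["Text"] = record
    return fields


records = [parse_record(r) for r in split_records(SAMPLE)]
```

Each entry in records is a dictionary holding the four header fields plus the complete record text, mirroring the From, Subject, Number, Date, and Text fields described in this section.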
Sometimes the definition of the fields within the records defines an
overall pattern which does not require a separate record delimiter.
In this case, use the keyword multiple instead. With a clear
recdelim, as in this example, the keyword multiple is not required.
Specifically, the schema rules are:
- recdelim separates records in an input file that contains
multiple records. It implies "multiple".
- multiple indicates that there may be more than one record
per input file.
- recexpr is an expression that matches an entire record.
Field tags are then numbers indicating the subexpression range for the
field. It suits records that are not well delimited (such as
columnar data).
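The recexpr idea, one expression for the whole record with fields identified by subexpression number, can be sketched in Python as follows. This is an illustration under stated assumptions rather than timport's actual behavior: Python regex groups stand in for REX subexpressions, and RECEXPR, FIELD_GROUPS, SAMPLE, and rows are hypothetical names.

```python
import re

# Two records modeled on the sample text; "\x0c" is the form feed (^L).
SAMPLE = (
    "From: multiple record file\n"
    "Subject: First multiple record\n"
    "Number: 1\n"
    "Date: 1995-04-19 11:31:00\n"
    "This is my message; this is my file.\n"
    "\x0c\n"
    "From: multiple record file\n"
    "Subject: Second multiple record\n"
    "Number: 2\n"
    "Date: 1995-04-19 11:32:00\n"
    "This is another message.\n"
)

# Hypothetical record expression: a single Python regex that matches
# one entire record, with one capture group per header field.
RECEXPR = re.compile(
    r"From:[ \t]*(.*)\n"
    r"Subject:[ \t]*(.*)\n"
    r"Number:[ \t]*(\d+)\n"
    r"Date:[ \t]*(.*)\n"
    r"[^\x0c]*"  # narrative text, up to the next form feed or end of input
)

# Each field is a number selecting part of the overall match.
# Group 0 is the whole match, used here for the Text field.
FIELD_GROUPS = {"From": 1, "Subject": 2, "Number": 3, "Date": 4, "Text": 0}

# finditer walks the input and yields every record the expression matches.
rows = [
    {name: match.group(index) for name, index in FIELD_GROUPS.items()}
    for match in RECEXPR.finditer(SAMPLE)
]
```

Because the expression itself describes where a record starts and ends, this sketch never splits the input on a delimiter first; it simply collects every match.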
Note that this schema file uses a recdelim, so it does not also
need the keyword multiple. It defines the record through individual
field expressions rather than one expression for the entire record,
so there is no recexpr defined.
Copyright © Thunderstone Software Last updated: Sun Mar 17 21:14:49 EDT 2013