|
The formats rule file (typically conf/formats.rule in the
install dir; see --rule-file option) tells anytotx how
to identify and translate various file formats. Its syntax is loosely
based on the Unix magic utility's config format, with
extensions, and was added in version 4.02.1045857437 (Feb 21 2003).
Each line specifies a content test, a MIME type, and a translator.
Each test is run in order; the first successful test indicates the
input is identified as the corresponding MIME type, and the translator
is run to translate the input to text. (If the MIME type is specified
on the command line via -f or --content-type, the
appropriate translator is searched for by MIME type instead of content
tests.)
If a line begins with a greater-than sign (">"), it is a
sub-test, specifying an additional content test for that MIME type.
The test level is indicated by the number of such leading
greater-than signs; most tests have none and are thus top-level or
level 1. A level N test's children are all tests at level
N+1 that follow it, up to (but not including) the next level N
test. If a test at level N succeeds, its children (if any) are run
recursively in config file order. This process continues until a
successful test with a non-empty command line and no successful
children is found. In this way, complex input types can be identified
that require more than one part of the data to be examined.
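The selection procedure described above can be sketched in Python. This is only an illustration, not anytotx's implementation: the names `Rule`, `build_tree` and `select` are hypothetical, and each content test is stood in for by a plain callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    level: int                     # 1 = top level (no leading ">")
    test: Callable[[bytes], bool]  # stand-in for the offset/datatype/value test
    mimetype: str
    command: str                   # "" means no translator: sub-tests must decide
    children: List["Rule"] = field(default_factory=list)


def build_tree(rules: List[Rule]) -> List[Rule]:
    """Nest a flat rule list (in file order) according to each rule's level."""
    roots: List[Rule] = []
    stack: List[Rule] = []         # chain of currently open ancestors
    for rule in rules:
        while stack and stack[-1].level >= rule.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(rule)
        stack.append(rule)
    return roots


def select(rules: List[Rule], data: bytes) -> Optional[Rule]:
    """Return the deepest successful rule that has a non-empty command."""
    for rule in rules:
        if not rule.test(data):
            continue
        deeper = select(rule.children, data)
        if deeper is not None:
            return deeper
        if rule.command:
            return rule
    return None


rules = build_tree([
    Rule(1, lambda d: d.startswith(b"PK\x03\x04"), "application/zip", "unzip"),
    Rule(1, lambda d: len(d) >= 100, "application/octet-stream", ""),
    Rule(2, lambda d: d[:2] == b"\xd0\xcf", "application/msword", "anytotx"),
])
hit = select(rules, b"\xd0\xcf" + bytes(98))
print(hit.mimetype if hit else None)   # application/msword
```

A parent with an empty command never wins on its own: if none of its children succeed, the search moves on to the next test at the parent's level.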
Once a translator is identified and run, its output is examined. If
it is identified as a known non-text type, another translator may be
run to convert it again. For example, an RTF input file may be
identified and translated to HTML via one translator. That output is
identified as HTML, and is translated again (via a built-in
translator) to text. Because of this multiple-pass feature,
translators can be used that do not output text, but output a type
that another translator can handle.
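The multiple-pass behaviour can be pictured as a loop that keeps translating until text is reached. The sketch below is a simplification under stated assumptions: each toy translator reports the MIME type of its own output (anytotx instead re-identifies the output), and `translate_to_text`, the translator table and the `max_passes` limit are all hypothetical names.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical translator: takes input bytes, returns (output MIME type, output bytes).
Translator = Callable[[bytes], Tuple[str, bytes]]


def translate_to_text(mimetype: str, data: bytes,
                      translators: Dict[str, Translator],
                      max_passes: int = 8) -> bytes:
    """Apply translators repeatedly until the data is plain text."""
    passes = 0
    while mimetype != "text/plain":
        if passes == max_passes:
            raise RuntimeError("too many translation passes")
        step = translators.get(mimetype)
        if step is None:
            raise LookupError(f"no translator for {mimetype}")
        mimetype, data = step(data)
        passes += 1
    return data


# Toy stand-ins for an RTF-to-HTML translator and a built-in HTML-to-text one.
toy = {
    "application/rtf": lambda d: ("text/html", b"<p>" + d + b"</p>"),
    "text/html": lambda d: ("text/plain", re.sub(rb"<[^>]+>", b"", d)),
}
print(translate_to_text("application/rtf", b"hello", toy))   # b'hello'
```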
Some translators may not produce any output at all, but produce a
series of files, in a new directory. These translators have
%DIROUTPUT% (or %TMP%, see below) in their command line
in the formats rule file. After such a translator is run, any files
it created are recursively processed by anytotx, and the
resulting output will typically be multi-part MIME. In this way,
archive formats such as ZIP and tar can be processed.
Other translators may not take a file as input, but instead take a
directory tree, usually one unpacked from a ZIP or tar archive by a
previous %DIROUTPUT% rule. These rules have the
%DIRINPUT% option set (below), and have no content test; the
offset, datatype and value fields are each a single dash
("-"). These rules are for archive dirs that are actually
a single monolithic document (e.g. an Open Document Format file),
not a group of distinct files.
Each line of the formats rule file is of the following form (blank
lines and pound-sign/semicolon comment lines are ignored):
[>...]offset datatype value mimetype commandline
The offset, datatype and value fields specify
the content test. (For rules with no content test,
e.g. %DIRINPUT% rules, each of these fields is a single dash.)
The MIME type is given by mimetype, and the corresponding
translator's space-separated command line follows. Each field has a
particular syntax, as explained below:
- offset
-
Specifies the integer file offset to look at in the input. May be
decimal, hexadecimal or octal. The test data is read at this
offset. (If the rule has no content test, e.g. a
%DIRINPUT%
rule, the offset, data type and value may each be a single dash.)
If the offset is in parentheses, it is indirect. An indirect
offset is of the form:
(X[b|s|l|B|S|L[8|16|32|64]][+|-Y]). A
value is read at offset X, which is in turn used as the offset
for the test data. The value type and size read is determined by
an optional suffix after the indirect offset:
- b A byte
- s A little-endian short
- l (lower-case el) A little-endian long
- B A byte
- S A big-endian short
- L A big-endian long
After the indirect suffix, an optional bit size may appear. This
overrides the size indicated by the suffix, whose size may vary by
platform. The bit size must be 8, 16, 32 or (on platforms that
support it) 64.
After the indirect suffix and/or bit size, an optional sub-offset Y
may appear. This positive or negative integer is added to the
offset value read to compute the offset for the test.
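As an illustration of how an indirect offset might be resolved, here is a small Python sketch. It is not anytotx's parser. The default sizes (8, 16 and 32 bits), the choice of a little-endian long when no suffix is given, and the unsigned read are all assumptions, and `parse_int` and `resolve_offset` are hypothetical names.

```python
import re

_INT = r"0[xX][0-9a-fA-F]+|\d+"
_INDIRECT = re.compile(
    rf"^\(({_INT})(?:([bslBSL])(8|16|32|64)?)?([+-](?:{_INT}))?\)$")
_DEFAULT_BITS = {"b": 8, "s": 16, "l": 32}   # assumed; sizes vary by platform


def parse_int(text: str) -> int:
    """Parse a decimal, hexadecimal (0x...) or octal (leading 0) integer."""
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("+-")
    if digits.lower().startswith("0x"):
        return sign * int(digits, 16)
    if len(digits) > 1 and digits.startswith("0"):
        return sign * int(digits, 8)
    return sign * int(digits, 10)


def resolve_offset(spec: str, data: bytes) -> int:
    """Return the offset of the test data for a direct or indirect spec."""
    m = _INDIRECT.match(spec)
    if m is None:
        return parse_int(spec)               # plain (direct) offset
    base, suffix, bits, sub = m.groups()
    suffix = suffix or "l"                   # assumption: default is a long
    size = (int(bits) if bits else _DEFAULT_BITS[suffix.lower()]) // 8
    order = "little" if suffix.islower() else "big"
    start = parse_int(base)
    raw = data[start:start + size]
    if len(raw) < size:
        raise ValueError("indirect offset points past the end of the input")
    # Value read at X (unsigned here), plus the optional sub-offset Y.
    return int.from_bytes(raw, order) + (parse_int(sub) if sub else 0)


data = bytes(60) + (128).to_bytes(4, "little")
print(resolve_offset("(60l+4)", data))       # 132
```

Note that a hexadecimal X ending in `b` is ambiguous with the byte suffix; this sketch always reads it as part of the number.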
- datatype
-
The type and size of data to read at the offset. One of the
following values:
- byte A single byte
- short A short value
- long A long value
- string A string value (size determined by value)
- date A Unix time_t date
An integer (non-string) type may optionally be prefixed by
u to indicate an unsigned compare is to be made, be
to indicate a big-endian value, and/or le to indicate
little-endian. An optional bit-size suffix of 8, 16, 32 or (if
supported) 64 may also be appended to override the size indicated
by the type (which is platform-dependent).
A string type may be prefixed by i to indicate a
case-insensitive compare.
After the data type and optional suffix, an optional value mask
may appear for integer types. This is indicated by an ampersand
("&") and integer value (decimal, octal or hex). This
value mask will be bit-wise ANDed to the input value before
comparing for the test.
- value
-
The specified value to compare with the input value for the test.
It is an optional operator character followed by a value. The
possible operator characters are:
- = Input value must equal specified value (default)
- < Input value must be less than specified value
- > Input value must be greater than specified value
- & Input value must have all bits set that are set in
the specified value, i.e. input value bit-wise ANDed with
specified value must equal specified value
- ^ Input value must have at least one bit clear that is
set in the specified value, i.e. input value bit-wise ANDed with
specified value must not equal specified value
- x No-op: any value will match
The value must be an integer (decimal, octal or hex) for integer
types, or a string for string or date types. String values will
have C-style escapes translated. A date value must be a Texis-parseable
date value. For string types, only the operators "=",
"<" and ">" are valid.
- mimetype
-
The MIME type associated with this test. Multiple tests can have
the same MIME type, e.g. if there are multiple ways to identify it.
In version 6 and later, the MIME type may also contain asterisk
("*") wildcards, to match a group of MIME types.
- commandline
-
The translator, i.e. the space-separated command line with
arguments to run to translate input of this MIME type, preferably
to text. May be empty (i.e. "" or '') to indicate there is no
translator; this means that the MIME type is not fully identified
by this test and sub-tests must be run.
The command line may contain certain special variables, enclosed
in percent-signs ("%"). These variables will be replaced
with certain values in the command line, or indicate certain
options. Options will be removed from the command line, and
should occur first, i.e. before the program name.
-
-
%IGNORE%
Option: There is no translator; this MIME type contains no
text and is to be ignored. Useful for identifying non-text
types like images; otherwise, the fallback -fOTHER mode
may be used, which would print garbage. Should be used alone.
-
-
%DIROUTPUT%
Replaced with a unique, empty temporary directory, which is
created and chdir()'d to before running the translator.
This also indicates that the translator is expected to create
multiple output files in this temporary dir, e.g. unpack a
multi-file ZIP archive. The resulting anytotx output
will be multi-part MIME, and each unpacked file will be
recursively processed further by anytotx.
If the %DIROUTPUT% variable occurs as one of the first
items in the command line (i.e. before the program), then it
is an option and is removed from the command line, but all
other behavior is the same. This is useful for un-archiving
translators that do not take a target dir argument, but
nonetheless unpack an archive to the current directory.
In version 5 and earlier, this variable was %TMP%,
which is still supported but is deprecated.
-
-
%DIRINPUT%
Option: this rule takes a directory (e.g. containing multiple
associated files) as input, not a file. Useful for
translators that work on an unpacked archive to produce a
single output. For example, the Open Document Format
translator odf takes the unzipped document directory
tree as input (instead of the original .odt file), and
outputs the document text. Since there is no file input with
%DIRINPUT% rules, there cannot be a content test, so
the offset datatype value fields must each be a single
dash to indicate no test. Added in version 6. See also the
archivemimefile setting, which is typically how these
rules are recognized (instead of by content test).
-
-
%8.3%
Option: Use MSDOS-style 8.3 filenames where possible. Useful
for older MSDOS executable translators that can't handle long
filenames. No effect on non-Windows platforms.
-
-
%MIME%
Option: The translator produces MIME output, i.e. headers,
which will be read and stripped from the output. Headers that are
significant include: Content-Type, X-Input-Content-Type,
Content-Transfer-Encoding and X-Translator-Status.
Some of these headers are used by translators to further identify
the input, and tell anytotx how to proceed.
-
-
%IGNORESTDOUT%
Option: Ignore the standard output of the translator, instead
of parsing and/or reporting it. Typically used with some
un-archiving translators that produce unwanted standard-out
messages in addition to unpacking files. Note: If output goes
to a file instead of standard-out, but should be
reported, use %OUT% instead. Added in version
5.01.1202350000 (20080206).
-
-
%IGNORESTDERR%
Option: Ignore the standard-error output of the translator,
instead of reporting it as an error.
-
-
%ANYTOTX%
Replaced with the path to the running anytotx
executable. Used in conjunction with %ANYTOTXFLAGS%
to use anytotx to translate a known built-in MIME type.
Should only be used for anytotx built-in MIME types.
-
-
%ANYTOTXFLAGS%
Replaced with the command-line flags passed to the running
anytotx executable. Any --max-depth argument is
stripped. An appropriate (decremented) --max-depth
argument and a --content-type argument are added.
Thus, the called anytotx will already know the MIME
type and won't try to identify it, and will also know the
current flags like -g. Should only be used for
anytotx built-in MIME types (otherwise a loop occurs
and the data isn't translated).
-
-
%IN% or %IN.ext%
Replaced with the anytotx input file name. This must
be given for translators that expect an explicit input
filename on their command line. The standard input for the
translator will also be redirected from /dev/null (the
default is to redirect from the anytotx input file).
If the anytotx input is not a file but is standard
input, a temporary file will be created and the input copied
to it.
The second version (%IN.ext%) is useful where a
translator expects its input file to have a certain extension.
The input file name that replaces %IN.ext% on the
command line will have the extension .ext. If the
actual input file name does not, or comes from standard in, an
appropriate temporary file will be created and the input
copied to it.
-
-
%OUT% or %OUT.ext%
Replaced with an output file name. This must be given for
translators that expect an explicit output filename on their
command line. If given, the standard output for the
translator is ignored and this file will be read afterwards;
if not given, the standard output from the translator will be
used afterwards. A unique empty temporary file is created.
The second version (%OUT.ext%) is useful where a
translator expects its output file to have a certain
extension. The output file name that replaces
%OUT.ext% on the command line will have the extension
.ext.
-
-
%INSTALLDIR%
Replaced with the Texis install dir.
-
-
%BINDIR%
Replaced with the Texis binary dir (same as install dir for
Windows, install dir plus "/bin" for Unix).
-
-
%EXEDIR%
Replaced with the directory of the currently running
executable. Added in version 5.01.1214185000 (20080622). In
version 7 and later, if the executable dir is not
determinable, the Texis binary dir will be used.
-
-
%%
Replaced with a single percent-sign ("%").
Command line arguments may be quoted (single or double). Under Unix,
the enclosed values become a single argument and the quotes are
stripped. Under Windows the quotes are untouched and it is up
to the translator to parse its command line accordingly. When
a special variable is replaced with its value in the command line,
the value (and its adjacent non-whitespace characters, if any)
will automatically be quoted if it contains spaces
and is not already explicitly quoted. For %ANYTOTXFLAGS%,
the quoting is applied on a per-argument basis.
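The substitution and automatic quoting rules can be approximated as follows. This is a rough sketch rather than the real expansion code: it splits the command line on whitespace only (so explicitly quoted arguments containing spaces are not handled), it does not model %DIROUTPUT% used as an option or the per-argument quoting of %ANYTOTXFLAGS%, and the function name `expand` and the sample paths are hypothetical.

```python
import re

_VAR = re.compile(r"%([A-Za-z0-9_.]*)%")
# Option variables named in the text; they are removed from the command line.
_OPTIONS = {"%IGNORE%", "%DIRINPUT%", "%8.3%", "%MIME%",
            "%IGNORESTDOUT%", "%IGNORESTDERR%"}


def expand(command, values):
    """Return the argument list for `command` with %VAR% items replaced."""
    args = []
    for token in command.split():      # simplification: whitespace split only
        if token in _OPTIONS:
            continue
        spaced = []                    # substituted values that contain a space

        def substitute(match):
            name = match.group(1)
            if name == "":
                return "%"             # "%%" is a literal percent sign
            value = values[name]
            if " " in value:
                spaced.append(value)
            return value

        expanded = _VAR.sub(substitute, token)
        # Quote the whole argument (value plus adjacent characters) if a
        # substituted value has spaces and the argument is not already quoted.
        if spaced and expanded[:1] not in ("'", '"'):
            expanded = '"' + expanded + '"'
        args.append(expanded)
    return args


print(expand("%MIME% %BINDIR%/unzip -d %DIROUTPUT% %IN%",
             {"BINDIR": "/opt/texis/bin", "DIROUTPUT": "/tmp/any 1",
              "IN": "in.zip"}))
# ['/opt/texis/bin/unzip', '-d', '"/tmp/any 1"', 'in.zip']
```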
In version 6 and later, the formats.rule file may also contain
the following setting:
-
archivemimefile file
Certain file types are actually archives (ZIP files) that describe
a single document, not multiple. Some of these archives contain a
file that describes the MIME type of the document. These MIME
type files can be recognized with the archivemimefile
setting. The named file, if seen after an archive
(%DIROUTPUT% rule) is unpacked (or a directory is given as
input to anytotx instead of a file), is checked for
%DIRINPUT% rules' MIME types. If a matching MIME type is
found, the %DIRINPUT% rule is then run, instead of the
normal recursive processing of the individual files in the
directory.
For example, Open Document Format files are really ZIP archives,
and contain a file called mimetype that contains the MIME
type of the document
(e.g. "application/vnd.oasis.opendocument.text"). Thus,
after the ZIP rule unpacks an Open Document Format file (like any
other ZIP), an archivemimefile mimetype setting would tell
anytotx to look for the mimetype file: if it is
found, and contains a MIME type that matches a %DIRINPUT%
rule, that translator is run. Otherwise, the ZIP file's contents
would be processed individually.
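The decision made after unpacking can be sketched as below. The function `pick_dirinput_rule`, the rule table and the `odf` command string are hypothetical, and `fnmatch` is used only as an approximation of the asterisk wildcards (it also accepts `?` and `[...]`, which the rule file may not).

```python
import fnmatch
import tempfile
from pathlib import Path


def pick_dirinput_rule(unpacked_dir, mime_file_name, dirinput_rules):
    """Return the command of the %DIRINPUT% rule whose MIME type matches the
    contents of the archive's MIME file, or None to process files one by one."""
    candidate = Path(unpacked_dir) / mime_file_name
    if not candidate.is_file():
        return None
    declared = candidate.read_text(encoding="ascii", errors="replace").strip()
    for pattern, command in dirinput_rules:
        if fnmatch.fnmatchcase(declared, pattern):
            return command
    return None


# Hypothetical rule table: (MIME type, translator command).
rules = [("application/vnd.oasis.opendocument.*", "odf")]

with tempfile.TemporaryDirectory() as tmp:
    # Simulate an unpacked Open Document Format archive.
    (Path(tmp) / "mimetype").write_text("application/vnd.oasis.opendocument.text")
    print(pick_dirinput_rule(tmp, "mimetype", rules))   # odf
```

When the function returns None, the unpacked files would fall back to the normal recursive, per-file processing.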
EXAMPLE
An example formats.rule file might be:
0 string =PK\003\004 application/zip %BINDIR%/unzip -d %DIROUTPUT% %IN%
99 byte x0 application/octet-stream ''
>0 beshort16 =0xd0cf application/msword %ANYTOTX% %ANYTOTXFLAGS%
>0 beshort16 =0xdba5 application/msword %ANYTOTX% %ANYTOTXFLAGS%
The first line's test is to check for the string PK followed
by ASCII char 3 and ASCII char 4, at offset 0 in the input. If the
string matches, the MIME type is application/zip, and the
program unzip in the Texis binary dir is run, with the input
file as the last argument. Multiple output files are expected to be
written to the unique %DIROUTPUT% dir by unzip and will be
recursively processed by anytotx.
The next 3 lines are all related, because the last 2 have a
greater-than sign indicating they are sub-tests of the one above. The
first test matches any byte at offset 99 in the input file. In
effect, it verifies the input is at least 100 bytes long. But there
is no translator specified ("''"), so the input isn't
identified yet. The sub-tests are run: each looks for a different
16-bit big-endian short integer at offset 0. The MIME type and
translator are the same for both, and indicate that anytotx
should be run to process the file. Since %ANYTOTXFLAGS% will
have a --content-type argument appended, the sub-process
anytotx will know the type and run its built-in translator
directly.
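To make the content tests in this example concrete, the following Python sketch evaluates each of the four tests against a 100-byte sample. It is an approximation under stated assumptions, not the real matcher: default sizes are taken as 8, 16 and 32 bits, the default byte order as little-endian, integers are always compared unsigned, only the "=" operator is handled for strings, and values are parsed with Python's `int(text, 0)`, so C-style octal is not accepted. The names `run_test` and `compare` are hypothetical.

```python
import re

_TYPE = re.compile(r"^(u)?(be|le)?(byte|short|long)(8|16|32|64)?(?:&(\w+))?$")
_BITS = {"byte": 8, "short": 16, "long": 32}   # assumed; really platform-dependent


def compare(op, got, want):
    """Apply one of the value operators to two unsigned integers."""
    if op == "=":
        return got == want
    if op == "<":
        return got < want
    if op == ">":
        return got > want
    if op == "&":
        return (got & want) == want
    if op == "^":
        return (got & want) != want
    raise ValueError(f"unknown operator {op!r}")


def run_test(offset, datatype, value, data):
    """Evaluate one offset/datatype/value content test against the input."""
    if datatype == "string":
        text = value[1:] if value.startswith("=") else value
        # Translate C-style escapes such as \003 into raw bytes.
        want = text.encode("latin-1").decode("unicode_escape").encode("latin-1")
        return data[offset:offset + len(want)] == want
    m = _TYPE.match(datatype)
    if m is None:
        raise ValueError(f"unsupported datatype {datatype!r}")
    _unsigned, order, base, bits, mask = m.groups()
    width = int(bits) if bits else _BITS[base]
    raw = data[offset:offset + width // 8]
    if len(raw) < width // 8:
        return False                           # input too short for this test
    if value[:1] in tuple("=<>&^x"):
        op, operand = value[0], value[1:]
    else:
        op, operand = "=", value               # "=" is the default operator
    if op == "x":
        return True                            # no-op: any value matches
    got = int.from_bytes(raw, "big" if order == "be" else "little")
    if mask:
        got &= int(mask, 0)                    # value mask from the datatype
    return compare(op, got, int(operand, 0) & ((1 << width) - 1))


EXAMPLE = [
    (0, "string", r"=PK\003\004", "application/zip"),
    (99, "byte", "x0", "application/octet-stream"),
    (0, "beshort16", "=0xd0cf", "application/msword"),
    (0, "beshort16", "=0xdba5", "application/msword"),
]
sample = b"\xd0\xcf\x11\xe0" + bytes(96)       # 100 bytes starting with d0 cf
print([run_test(o, t, v, sample) for o, t, v, _ in EXAMPLE])
# [False, True, True, False]
```

The sample fails the ZIP test, passes the length check at offset 99, and then matches the first of the two sub-tests, so it would be identified as application/msword.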
CAVEATS
The anytotx plugin's availability is license-dependent.
Contact Thunderstone for details.
Versions of anytotx before 4.02.1045857437 (Feb 21 2003)
may not print any headers in the output, e.g. if no metadata is
requested.
Copyright © Thunderstone Software Last updated: Sun Mar 17 21:14:49 EDT 2013
|