SYNOPSIS
anytotx [options] [inputfile]
DESCRIPTION
The anytotx program attempts to identify and translate its
input file to ASCII text. This can be used when crawling non-text
file formats (such as PDF and MS-Word), to obtain the plain text for
searching. (The SQL function totext() calls this program
internally.) There is built-in support for many common file formats,
and any new file format can be added by modifying the formats rule
config file.
The input file is given last on the command line, after any options;
if not present, standard input is assumed. The output is the text
version of the document, written to standard output. In version
4.02.1047588542 Mar 13 2003 and later, the output is always MIME, and
may be multipart/mixed to support multi-file archives such as ZIP
files.
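Since the output is MIME, a caller can split it back into per-file text with any MIME parser. The following is a minimal sketch in Python, assuming the output is a well-formed MIME message with text parts. The function name extract_text_parts and the sample message are hypothetical and only illustrate the idea; real anytotx headers may differ.

```python
from email import message_from_bytes
from email.policy import default


def extract_text_parts(mime_output: bytes) -> list[str]:
    """Return the decoded body of every text/* leaf part, in document order."""
    msg = message_from_bytes(mime_output, policy=default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container only; walk() visits its children separately
        if part.get_content_maintype() == "text":
            parts.append(part.get_content())
    return parts


# Hypothetical output for a two-file archive (an assumption, not captured
# from a real anytotx run).
SAMPLE = (
    b'Content-Type: multipart/mixed; boundary="xyz"\r\n'
    b"\r\n"
    b"--xyz\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"first file\r\n"
    b"--xyz\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"second file\r\n"
    b"--xyz--\r\n"
)

texts = [p.strip() for p in extract_text_parts(SAMPLE)]
```

A single-part (non-archive) message goes through the same function and yields a one-element list.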
The following options are supported:
- -h
-
Print synopsis of options.
- -p
-
Select alternate text ordering for PDF conversion. By default,
the text output for PDFs is done linearly, so that hit markup with
pdfxml is done properly. However, this may output text in
a less desirable ordering for text searching, especially with
tables and multi-column pages. The -p option selects
non-linear text output mode.
- -pp
-
Select "pretty-print" mode for PDF conversion.
- -s
-
Keep short lines (3 characters or fewer) when converting in
-fOTHER mode. By default, short lines are suppressed as
they are often garbage.
- -Ppass
-
Use pass as the password to access protected
files (e.g. certain PDFs).
- -l
- (lower-case el)
Extract hyperlinks from document, where supported.
Each link is printed as a
Link: header in the MIME output.
- -mNAME
-
Extract meta data field NAME from document, where
supported. Common meta fields are
Title, Subject
and Keywords. Each meta field is printed as a header in
the MIME output.
- -M
-
Extract all known meta data. Varies by input type:
- HTML:
-
title
- Flash:
-
version, framesize,
framerate, framecount
- PDF:
-
Author, CreationDate,
ModDate, Creator, Producer, Title,
Subject, Keywords, X-Print, X-Change,
X-Copy, X-Addnotes, X-Linear, X-Encrypted,
X-Pages, X-PDF-Version, X-Tagged,
X-Filter-Version
- MSW,XLS,MSO:
-
Title, Subject, Author, Keywords,
Comments, Template, Last-Author,
Revision, Edit-Time, Printed, Created,
Saved, Pages, Words, Chars,
Thumbnail, Creator, Security, Category,
Target, Bytes, Lines, Paragraphs,
Slides, Notes, Hidden-Slides, MM-Clips,
Scale-Crop, Heading-Pairs, Titles,
Manager, Company, Links-Up-To-Date,
X-Filter-Version
- TIFF:
-
ImageWidth, ImageLength,
DocumentName, ImageDescription, Make,
Model, PageName, PageNumber, Software,
DateTime, Artist, HostComputer,
InkNames, TargetPrinter, Copyright,
- -fCODE
-
Assume input file is one of the built-in formats indicated by
CODE, which is one of:
- PDF
- for Adobe Acrobat PDF; MIME type
application/pdf
- HTML
- for HyperText Markup Language; MIME type
text/html
- MSW
- for Microsoft Word; MIME type
application/msword
- XLS
- for Microsoft Excel
- MSO
- for other Microsoft formats (e.g. PowerPoint)
- SWF
- for Shockwave-Flash; MIME type
application/x-shockwave-flash
- GIF
- for Graphics Interchange Format; MIME type
image/gif.
Added in version 4.02.1046193282 Feb 25 2003.
- TIFF
- for Tag Image File Format; MIME type
image/tiff.
Added in version 5.00.1084000000 May 8 2004.
- TNEF
- for Microsoft Transport-Neutral Encoding Format; MIME type
application/tnef. Added in version 4.02.1047588542 Mar 13 2003.
- AUTO
- to auto-detect the format (the default)
- OTHER
- for an unknown format; MIME type
application/octet-stream
Codes are case-insensitive. The default is to automatically
detect the input file type (i.e. -fAUTO). Note that there
may be more file formats supported (via the formats rule file) than
are listed here. It is not usually necessary to specify the input
type; most are detected properly. See also the
--content-type option, which supersedes this.
- -g
-
Print additional information in headers, such as input
file type, translator arguments, etc.
- -G
-
Same as
-g, but quit: don't attempt actual translation.
- -v
-
Enable verbose output.
- -Dnnn
-
Enable debugging output, level nnn. Default is 0.
Optional nnn added in version 5.01.1110400000 Mar 9 2005.
- -uURL
-
Use URL as the URL of the input file (for informational
purposes only; nothing is fetched).
- --install-dir=DIR
-
Set the Texis install dir to use. Default is as installed, or
typically /usr/local/morph3 under Unix. Added in version
4.03.1051600000 Apr 29 2003.
- --rule-file=FILE
-
Use formats rule file FILE. The default is the file
specified by
Rule File in the [Anytotx] section of
the conf/texis.ini config file, or if that is not set,
conf/formats.rule in the Texis install dir. If the formats
rule file cannot be found or read, a default internal version is
used. The formats rule file tells anytotx how to identify
and translate file formats; see below for syntax. Added in
version 4.02.1045857437 Feb 21 2003.
- --types-config=FILE
-
Use MIME types config file FILE. The default is the file
specified by
Types Config in the [Anytotx] section
of the conf/texis.ini config file, or if that is not set,
conf/mime.types in the Texis install dir. This file maps
MIME types to file extensions, as a fall back for identifying
files (a formats rule file entry is still usually needed). It is
the same format as Apache mime.types files, i.e. each line
is a MIME type followed by zero or more space-separated file
extensions (no dot). Added in version 4.02.1045857437 Feb 21
2003.
- --max-depth=N
-
Maximum depth to recurse when processing a file. Multiple
translators may need to be run to translate a file to text
(e.g. RTF to HTML to text). Keeping this setting low can prevent
an infinite loop if the content ever "bounces" between types.
The default is 5, which may need to be raised if complex,
multi-level translators are used. Added in version
4.02.1045857437 Feb 21 2003.
- --tmp=DIR
-
Use directory DIR for temporary files during translation.
The default is the dir specified by the environment variables
TMP, TMPDIR, TEMP or TEMPDIR. If no
environment variable is set, the dir C:\ (Windows) or
/tmp (Unix) is used, or the tmp subdirectory of
the Texis installation directory.
- --timeout=NNN
-
Timeout in seconds; default is 30. Use -1 for no timeout.
Added in version 4.03.1051675200 Apr 30 2003.
- --content-type=TYPE
-
Assume input is MIME type TYPE. The default is to
automatically detect the type. If specified, the MIME type should
be one that has a translator in the formats rule file, or a
built-in type such as
application/octet-stream. Added in
version 4.02.1045857437 Feb 21 2003.
- --error-log=FILE
-
Log errors to FILE. The default is standard error. Added
in version 4.02.1045857437 Feb 21 2003.
- --save-files
-
Save temporary output files in the temp dir. The default is to
delete them. This can be used to see the raw files unpacked from
an archive, before text translation (e.g. for TNEF or ZIP
archives).
- --output-enc=CHARSET
-
Use output encoding CHARSET where possible, e.g. UTF-8.
Added in version 5.01.1110258000 Mar 8 2005.
- --expand-ligatures=MIME
-
Expand single-character Unicode ligatures (e.g. "ffi" character)
into multiple characters for input MIME type MIME. The default is
application/pdf; set to none to turn off. Increases
document searchability, but may affect browser PDF plugin
highlighting. Added in version 5.01.1110258000 Mar 8 2005.
- --name=STR
-
Set name of object being processed, for error messages.
Added in version 4.03.1055995200 Jun 19 2003.
- --fix-mode=YN
-
Whether to fix attribute mode of unpacked files to non-hidden/readable.
Default for YN is
y.
Added in version 4.03.1059364800 Jul 28 2003.
- --trace-pipe=NNN
-
Debugging: set pipe trace level NNN.
Added in version 4.04.1071637200 Dec 17 2003.
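The options above can be combined on one command line, with the input file given last. The following Python sketch shows one way a caller might assemble such a command line. The helper name build_anytotx_argv and its keyword parameters are hypothetical, and it assumes the long options are spelled with a double dash, as in --content-type. It only builds the argument list and does not run anything.

```python
def build_anytotx_argv(input_file=None, *, links=False, meta_fields=(),
                       content_type=None, timeout=None, output_enc=None):
    """Build an argument list of the form: anytotx [options] [inputfile]."""
    argv = ["anytotx"]
    if links:
        argv.append("-l")                      # extract hyperlinks
    for name in meta_fields:
        argv.append(f"-m{name}")               # one -mNAME per meta field
    if content_type is not None:
        argv.append(f"--content-type={content_type}")
    if timeout is not None:
        argv.append(f"--timeout={timeout}")    # seconds; -1 means no timeout
    if output_enc is not None:
        argv.append(f"--output-enc={output_enc}")
    if input_file is not None:
        argv.append(input_file)                # input file goes last
    return argv


argv = build_anytotx_argv(
    "doc.pdf", links=True, meta_fields=["Title", "Keywords"], timeout=60
)
# The list could then be passed to subprocess.run(argv, capture_output=True)
# on a system where anytotx is installed; omitting input_file leaves the
# program reading standard input.
```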
Copyright © Thunderstone Software Last updated: Sun Mar 17 21:14:49 EDT 2013