Vortex Manual
 

Page Fetching

 

The following urlcp settings control whether and how pages and related URLs (such as frames and iframes) are fetched:

  • encodings [add|del|set] [$encodings ...]   Sets the list of allowed content/transfer encodings for pages fetched. The $encodings argument(s) are zero or more of the values identity, chunked, gzip, deflate or compress. The chunked encoding only applies to transfer encodings; the remainder apply to both content and transfer encodings. If the first value of the first argument is add, the given encoding(s) will be added to the allowed list; if del, deleted from it; if set, the list is cleared and set to $encodings (this is the default action if no add/del/set action is given). The keyword all may be used to refer to all encodings, and default may be used (with set) to re-set the default (which is identity, chunked, gzip, deflate and compress).

    The Vortex fetch library will declare the list of encodings it allows in Accept-Encoding and TE request headers, if httpversion is set to 1.1 (these are 1.1 headers, and some servers do not handle them as expected in a 1.0 request; httpversion is 1.1 by default in version 6 and later). It is up to the remote server to then choose encoding(s) from the declared list(s). The content encoding(s) (if any) of the returned document should be declared by the server in the Content-Encoding header, and transfer encoding(s) in the Transfer-Encoding header. Both types of encodings will be decoded before the document is returned from <fetch> or <urlinfo rawdoc>. If an encoding that is not allowed is encountered, a "Disallowed Content- or Transfer-Encoding" error is generated.

    Added in version 5.01.1249073000 20090731. Returns previous list of allowed encodings. See also the maxpgsize and maxdownloadsize settings for how they interact with encodings.
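The add/del/set handling described above (and shared by the filetypes, linkprotocols, methods and protocols settings) can be sketched as a small Python model. This is only an illustration of the documented semantics, not Vortex's implementation: the function name apply_encodings and the two constants are hypothetical, and accepting the default keyword with any action (not just set) is a simplification.

```python
# Hypothetical model of <urlcp encodings [add|del|set] ...>; not Vortex code.
ALL_ENCODINGS = ["identity", "chunked", "gzip", "deflate", "compress"]
DEFAULT_ENCODINGS = list(ALL_ENCODINGS)  # the documented default list


def apply_encodings(current, args):
    """Return (new_list, previous_list) after applying the arguments."""
    previous = list(current)
    values = list(args)
    action = "set"  # default action when no add/del/set is given
    if values and values[0] in ("add", "del", "set"):
        action = values.pop(0)

    # Expand the "all" and "default" keywords into concrete encodings.
    expanded = []
    for value in values:
        if value == "all":
            expanded.extend(ALL_ENCODINGS)
        elif value == "default":
            expanded.extend(DEFAULT_ENCODINGS)
        elif value in ALL_ENCODINGS:
            expanded.append(value)
        else:
            raise ValueError("unknown encoding: " + value)

    # "set" starts from an empty list; "add" and "del" modify the current one.
    new = [] if action == "set" else list(previous)
    for value in expanded:
        if action == "del":
            if value in new:
                new.remove(value)
        elif value not in new:
            new.append(value)
    return new, previous


new, previous = apply_encodings(DEFAULT_ENCODINGS, ["del", "compress"])
print(new)       # allowed list without compress
print(previous)  # the list as it was before the call
```

Returning the previous list alongside the new one mirrors the "Returns previous list" behavior noted for the setting.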

  • fileexclude (list) List of file trees to exclude (disallow) when fetching a local file:// URL. The default is none (no restrictions) for Windows, and "/dev/", "/proc/" and "/debug/" for Unix. After fileroot is applied, if the resulting local file path from a file:// URL has one of these paths as a prefix, the URL will not be fetched. This can be used to protect certain unsafe or private directories on a local filesystem from being inadvertently walked. Does not apply to FTP-mapped non-localhost file:// URLs. Added in version 4.02.1048785087 20030327. Aka fileexcludes. Returns previous setting.

  • fileinclude (list) List of file trees to include (require) when fetching a local file:// URL. The default is none (no restrictions). After fileroot is applied, if the resulting local file path from a file:// URL does not have one of these paths as a prefix, the URL will not be fetched. This can be used to keep a local filesystem walk within certain directories. Does not apply to FTP-mapped non-localhost file:// URLs. Added in version 4.02.1048785087 20030327. Aka fileincludes. Returns previous setting.
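The prefix checks performed by fileexclude and fileinclude can be modeled roughly as follows. This is a sketch under stated assumptions: the function name local_file_allowed is hypothetical, the path is assumed to be the local path after fileroot has been applied, and a plain string-prefix comparison is assumed (the actual matching rules may differ).

```python
# Hypothetical model of the fileexclude/fileinclude checks; not Vortex code.
UNIX_DEFAULT_FILEEXCLUDE = ["/dev/", "/proc/", "/debug/"]


def local_file_allowed(path, fileexclude=(), fileinclude=()):
    """Return True if a local file path passes both prefix checks."""
    # Any excluded tree as a prefix disallows the fetch.
    if any(path.startswith(prefix) for prefix in fileexclude):
        return False
    # If an include list is set, at least one tree must be a prefix.
    if fileinclude and not any(path.startswith(p) for p in fileinclude):
        return False
    return True


print(local_file_allowed("/dev/zero", UNIX_DEFAULT_FILEEXCLUDE))
print(local_file_allowed("/home/docs/a.txt", UNIX_DEFAULT_FILEEXCLUDE,
                         fileinclude=["/home/docs/"]))
```

An empty fileinclude list is treated as "no restrictions", matching the documented default.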

  • filenonlocal (string)

    How to handle non-localhost file:// URLs, i.e. ones with a specific host other than empty string or "localhost". The value can be one of:

    • off Default: do not allow non-localhost file:// URLs. This ensures that no FTP or UNC paths are used.

    • unc Map non-localhost file:// URLs to their UNC paths and attempt to open as a local file. E.g. the URL "file://myhost/mydir/myfile" would map to the file "\\myhost\mydir\myfile" under Windows and "//myhost/mydir/myfile" under Unix (but see modifications under fileroot below). This allows the behavior of web browsers that support UNC paths to be emulated on operating systems that support UNC, for consistency with browser views.

    • ftp Map non-localhost file:// URLs to FTP. E.g. the URL "file://myhost/mydir/myfile" would map to the URL "ftp://myhost/mydir/myfile" and be fetched as such. This allows the behavior of some browsers/operating systems that do not support UNC paths to be emulated.

    Added in version 4.02.1048785087 20030327. Returns previous setting.

  • fileroot (string)

    Sets the root directory to prepend to local file:// URL paths; default none. E.g. with fileroot set to "/docs", the URL "file://localhost/dir/file.txt" would be read from the file "/docs/dir/file.txt". Also applies to non-localhost URLs when filenonlocal is set to unc, in which case the host name becomes a directory under the root: e.g. the URL "file://myhost/mydir/myfile" is read from the file "/docs/myhost/mydir/myfile". This allows both localhost and non-localhost file:// URLs to be mapped to a single directory hierarchy, perhaps where network filesystems corresponding to individual host(s) are mounted. Added in version 4.02.1048785087 20030327. Returns previous setting.
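The non-localhost mappings given as examples for filenonlocal and fileroot can be sketched in Python as below. The function name map_nonlocal_file_url and its parameters are hypothetical; only non-localhost URLs are modeled, and the fileroot case is shown with Unix-style paths only, since those are the examples the text gives.

```python
# Hypothetical model of filenonlocal/fileroot mapping; not Vortex code.
from urllib.parse import urlsplit


def map_nonlocal_file_url(url, filenonlocal="off", fileroot="", windows=False):
    """Map a non-localhost file:// URL to a path or URL, or None if disallowed."""
    parts = urlsplit(url)
    if parts.scheme != "file" or parts.netloc in ("", "localhost"):
        raise ValueError("expected a non-localhost file:// URL")
    if filenonlocal == "off":
        return None  # default: non-localhost file:// URLs are not allowed
    if filenonlocal == "ftp":
        return "ftp://" + parts.netloc + parts.path
    if filenonlocal == "unc":
        if fileroot:
            # With fileroot set, the host becomes a directory under the root.
            return fileroot.rstrip("/") + "/" + parts.netloc + parts.path
        unc = "//" + parts.netloc + parts.path
        return unc.replace("/", "\\") if windows else unc
    raise ValueError("unknown filenonlocal value: " + filenonlocal)


url = "file://myhost/mydir/myfile"
print(map_nonlocal_file_url(url, "unc"))                    # //myhost/mydir/myfile
print(map_nonlocal_file_url(url, "unc", fileroot="/docs"))  # /docs/myhost/mydir/myfile
print(map_nonlocal_file_url(url, "ftp"))                    # ftp://myhost/mydir/myfile
```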

  • filetypes [add|del|set] [file|dir|device|symlink|other ...] Sets the list of allowed file types for local file:// URLs. The possible values are file for ordinary files, dir for directories, device for devices, symlink for symbolic links (if supported by operating system), and other for other types (sockets etc.). If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). The default list is file, dir and symlink. If the file derived from a local file:// URL is not one of these types, it is disallowed. This prevents links to URLs like "file://localhost/dev/zero" from hampering a walk. Added in version 4.02.1048785087 20030327. Returns previous setting.

  • ftpactivepassivefallback (boolean)

    If on (the default), FTP passive mode fetches will fall back to active mode on failure, and vice-versa. This may help resolve a fetch to an FTP server that does not support the current mode (or is firewalled), i.e. in cases where ftppassive is not set properly for the given situation. Only failures of the PORT or PASV command, or a temporary (4nn) error response to the main (RETR/STOR/etc.) command will trigger the mode switch. Added in version 6.00.1304040000 20110428. Returns previous setting.

    Note that if the correct mode (active or passive) is already known in advance, it is preferable to set it from the outset via the ftppassive setting, to avoid potential delays and/or errors from relying on this fallback switchover.

  • ftppassive (boolean)

    If on (the default), FTP passive mode is used first for FTP protocol fetches. If off, FTP active mode is used first. Passive mode can be useful in situations where a firewall on the client (Vortex) side of the network prevents an FTP transfer (e.g. timeout). This is due to the nature of active-mode data transfers, where the remote (server) side is required to initiate a separate socket connection back to the client (even though the client initiates the original control connection). Many firewalls will block such incoming connections, causing the transfer to time out. Passive mode allows the client to initiate both the control and data connections, which is often permitted by the client's firewall. Added in version 5.01.1121350905 20050714. Note: Prior to version 6.00.1304040000 20110428 this setting was off by default. Returns previous setting.

    Note that if ftpactivepassivefallback is on (the default), the alternate mode may be used if the first mode (set by this setting) fails.
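How ftppassive and ftpactivepassivefallback combine can be summarized with a small Python sketch. The function name ftp_modes_to_try is hypothetical, and it only lists the candidate order of modes; whether the second mode is actually attempted depends on the kind of failure, as described under ftpactivepassivefallback.

```python
# Hypothetical model of FTP mode ordering; not Vortex code.
def ftp_modes_to_try(ftppassive=True, ftpactivepassivefallback=True):
    """Return the FTP data-connection modes in the order they may be tried."""
    first = "passive" if ftppassive else "active"
    if not ftpactivepassivefallback:
        return [first]  # no fallback: only the configured mode is used
    other = "active" if first == "passive" else "passive"
    return [first, other]


print(ftp_modes_to_try())                  # ['passive', 'active']
print(ftp_modes_to_try(ftppassive=False))  # ['active', 'passive']
```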

  • ftprelativepaths (boolean)  

    If on (the default), FTP paths are assumed to be login-dir-relative, so the URL "ftp://host/dir/file.txt" would be fetched with "RETR home/dir/file.txt" instead of "RETR /dir/file.txt" (where home is the FTP user's login directory). For most (i.e. anonymous) FTP URLs this makes no difference, as the FTP login dir is typically at the root of the FTP-accessible tree. However, for many FTP URLs that require a true login, the FTP login dir is not the root dir, but the user's home directory. Thus, with ftprelativepaths on, the above URL would fetch "dir/file.txt" from the user's home directory - not "/dir/file.txt" from the root dir, where it may not exist. With ftprelativepaths off, the user's home directory - which may be unknown or vary from user to user - would have to be specified in the FTP URL in order to get back to the FTP login dir.

    Dirs outside the FTP login dir may still be accessed when ftprelativepaths is on, however, by encoding an extra slash in the URL, e.g. "ftp://host/%2Fdir/file.txt". Added in version 6. In previous versions the setting was effectively off.

  • ftpsendrelativepathsasabsolute (boolean)

    If on (the default), relative FTP paths (i.e. due to ftprelativepaths) are changed to absolute paths when sent to the server, by prefixing the login directory (obtained with a PWD after login). This avoids the occasional need for a no-argument "CWD" command to go back to the login directory (which some servers do not support), while still supporting the functionality of ftprelativepaths (no home dir needed in URLs). If off, the login directory is not prefixed; i.e. the URL "ftp://host/dir/file.txt" is fetched with "RETR dir/file.txt". This setting has no effect if ftprelativepaths is off. Added in version 6.00.1301360000 20110328.
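The interaction of ftprelativepaths and ftpsendrelativepathsasabsolute can be illustrated with the following Python sketch, which computes the path argument that would accompany a command such as RETR. The function name ftp_command_path and the login directory "/home/user" are hypothetical; in the real library the login directory is obtained with a PWD after login.

```python
# Hypothetical model of FTP path construction; not Vortex code.
from urllib.parse import unquote, urlsplit


def ftp_command_path(url, login_dir, ftprelativepaths=True,
                     ftpsendrelativepathsasabsolute=True):
    """Return the path that would be sent to the FTP server for this URL."""
    raw = urlsplit(url).path  # still percent-encoded, e.g. "/%2Fdir/file.txt"
    if not ftprelativepaths:
        return unquote(raw)   # URL path is used as an absolute path
    relative = unquote(raw[1:])  # drop the slash that separates host and path
    if relative.startswith("/"):
        return relative       # an encoded %2F escapes the login directory
    if ftpsendrelativepathsasabsolute:
        # Prefix the login directory so no argument-less CWD is needed.
        return login_dir.rstrip("/") + "/" + relative
    return relative


url = "ftp://host/dir/file.txt"
print(ftp_command_path(url, "/home/user"))
print(ftp_command_path(url, "/home/user", ftpsendrelativepathsasabsolute=False))
print(ftp_command_path("ftp://host/%2Fdir/file.txt", "/home/user"))
```

With both settings on, the same URL resolves under whichever login directory the server reports, so the URL itself never needs to contain the home directory.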

  • getframes (boolean) If on, text frames are fetched for framed documents. The raw HTML returned will remain the same (the original document), but the formatted text from <urlinfo text> will be replaced and instead contain each frame in sequence. The links returned by <urlinfo links> will be the list of all the frames' links. The default is off, i.e. frames are not fetched. Only applies to HTTP URLs.

  • getiframes (boolean) If on, inline <IFRAME> documents are fetched. The raw HTML returned will remain the same (the original document). The formatted text from <urlinfo text> will also remain, except that <IFRAME> blocks will be replaced with their referenced document text in-line. The <IFRAME>s are removed from the iframes and links lists returned by the urlinfo function. Only applies to HTTP URLs. The default is false. Added in version 3.01.963000000 20000707.

  • getscripts (boolean) If on, and javascript is on, <SCRIPT SRC=...></SCRIPT> URLs on a page will also be fetched and run if they refer to JavaScript. If off (default), such URLs are not fetched, and only inline <SCRIPT>...</SCRIPT> scripts are run (if javascript is on). Returns previous value. Added in version 4.01.1023800000 20020611.

  • httpversion $version   Sets the HTTP version to use for requests. The $version argument is one of 0.9, 1.0 or 1.1. HTTP/1.0 is the default for Texis/Webinator version 5 and earlier; HTTP/1.1 is the default for Texis/Webinator version 6 and later (and is only conditionally supported). It may be necessary to set 1.1 to fully utilize some features, e.g. content/transfer encodings (see the encodings setting). Added in version 5.01.1249039000 20090731. Returns previous version.

  • ignoreanchorframes (boolean)   Whether to ignore frames and IFRAMEs that are just anchors, e.g. src="#". These usually just contain JavaScript, and fetching them just doubles up the content, links etc. of the parent URL. On by default. Added in version 6.

  • inputfileroot (string)

    If set, all set/non-empty <input type="file"> values must be within this local directory tree (and not contain "../" components to get out of it), when <urlcp domvalue "...submitContent"> or variants are called. Value(s) that are outside this setting will cause an error such as "Will not add form input `...' file `...' to submit content: Not in inputfileroot directory or contains `../'", and will be treated as empty (i.e. sent as empty value with no file). This is for security, to ensure all to-be-uploaded files are from a known directory. Added in version 6.00.1335222312 20120423. Default is unset (i.e. no check is performed). Returns 1 on success, 0 on error.
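The inputfileroot check can be approximated by the Python sketch below. The function name within_inputfileroot is hypothetical, Unix-style paths are assumed, and None stands in for the unset default; the library's actual path handling may be stricter or differ in detail.

```python
# Hypothetical model of the inputfileroot check; not Vortex code.
def within_inputfileroot(file_path, inputfileroot=None):
    """Return True if file_path may be added to the submit content."""
    if inputfileroot is None:
        return True  # unset (the default): no check is performed
    # Reject any ".." component that could climb out of the tree.
    if ".." in file_path.split("/"):
        return False
    root = inputfileroot.rstrip("/") + "/"
    return file_path.startswith(root)


print(within_inputfileroot("/uploads/report.pdf", "/uploads"))
print(within_inputfileroot("/uploads/../etc/passwd", "/uploads"))
```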

  • linkprotocols [add|del|set] [$protocols|allowed ...] Sets the list of protocols allowed to be returned in links from a page (i.e. the links value of the urlinfo function). Note that this setting does not control what can be fetched, only the list of links returned from a page. It can be used as a filter to remove invalid-protocol links returned by a page. The $protocols argument(s) are a list of zero or more values, each of which is either a recognized protocol (see protocols below), the value unknown for unknown protocols, or the value allowed for just protocols permitted by the protocols setting. The default is all protocols plus unknown. If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Returns previous setting. Added in version 4.01.1029180431 20020812.
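As an illustration of linkprotocols acting purely as a filter on returned links, consider this Python sketch. The function name filter_links is hypothetical, links are assumed to be absolute URLs, and the set of recognized protocols is assumed to be the one listed under the protocols setting.

```python
# Hypothetical model of linkprotocols filtering; not Vortex code.
from urllib.parse import urlsplit

# Assumed to match the values accepted by the protocols setting.
RECOGNIZED_PROTOCOLS = {"http", "ftp", "gopher", "javascript", "https", "file"}


def filter_links(links, linkprotocols):
    """Keep only links whose protocol (or 'unknown') is in linkprotocols."""
    allowed = set(linkprotocols)
    kept = []
    for link in links:
        scheme = urlsplit(link).scheme.lower()
        # Anything not recognized is classified under the value "unknown".
        key = scheme if scheme in RECOGNIZED_PROTOCOLS else "unknown"
        if key in allowed:
            kept.append(link)
    return kept


links = ["http://example.com/", "mailto:user@example.com", "ftp://host/file"]
print(filter_links(links, ["http", "https"]))
print(filter_links(links, ["http", "unknown"]))
```

The filter only changes what is reported in the links list; whether a link can actually be fetched is governed separately by the protocols setting.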

  • methods [add|del|set] [$methods ...]   Sets the list of request methods allowed for page fetching (default all). The $methods argument(s) are zero or more of the values OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, MKDIR, RENAME, SCHEDULE, COMPILE or RUN. Not all methods are supported by all protocols; e.g. MKDIR is only supported by FTP. If the first value of the first argument is add, the given method list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Alternately, the default methods may be restored with set default. Returns previous setting. Added in version 5.01.1232696000 20090123.

  • netmode (string) Sets the routines to use for page fetching. The default is int, which uses Texis' internal routines. For Windows versions, netmode may be set to sys, which uses the system routines. This may allow certain authenticated sites to be accessed, if the internal routines' NTLM authentication is not sufficient for example. However, parallelization and certain other features are disabled. Added in version 4.04.1068000000 20031104.

  • offsiteok (boolean) If on (default), documents that are off-site from the original URL (e.g. redirects) will be fetched if needed. If off, such redirects will not be fetched.

  • protocols [add|del|set] [$protocols ...] Sets the list of URL protocols allowed to be fetched. $protocols is a list of zero or more of the values http, ftp, gopher, javascript, https or file. The default list is http, ftp, gopher, javascript and https. (Note that <urlcp javascript> must also be on if JavaScript URLs are to work.) If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Returns the previous list of allowed protocols. Added in version 4.01.1024300000 20020617. file support added in version 4.02.1048785087 20030327.

  • proxy (string) Takes a URL as an argument. This URL will be used as the proxy server to fetch documents. All future page fetches will go through this server, instead of being fetched directly. Must be an HTTP or HTTPS (in version 4.02.1048785087 20030327 and later) URL. In version 4.04.1077500000 20040222 and later, an empty string value will clear the proxy, i.e. turn it off.

Copyright © Thunderstone Software     Last updated: Mon Feb 18 10:28:15 EST 2013
 