Page Fetching

The following urlcp settings control whether and how pages and related URLs (such as frames and iframes) are fetched:

  • encodings [add|del|set] [$encodings ...]

    Sets the list of allowed content/transfer encodings for pages fetched. The $encodings argument(s) are zero or more of the values 7bit, 8bit, binary, identity, chunked, gzip, deflate or compress. The chunked encoding only applies to transfer encodings; the remainder apply to both content and transfer encodings. If the first value of the first argument is add, the given encoding(s) will be added to the allowed list; if del, deleted from it; if set, the list is cleared and set to $encodings (this is the default action if no add/del/set action is given). The keyword all may be used to refer to all encodings, and default may be used (with set) to re-set the default (which is identity, chunked, gzip, deflate and compress).

    The Vortex fetch library will declare the list of encodings it allows in Accept-Encoding and TE request headers, if httpversion is set to 1.1 (these are 1.1 headers, and some servers do not handle them as expected in a 1.0 request; httpversion is 1.1 by default in version 6 and later). It is up to the remote server to then choose encoding(s) from the declared list(s). The content encoding(s) (if any) of the returned document should be declared by the server in the Content-Encoding header, and transfer encoding(s) in the Transfer-Encoding header. Both types of encodings will be decoded before the document is returned from <fetch> or <urlinfo rawdoc>. If an encoding that is not allowed is encountered, a "Disallowed Content- or Transfer-Encoding" error is generated.

    Added in version 5.01.1249073000 20090731. Returns previous list of allowed encodings. See also the maxpgsize and maxdownloadsize settings for how they interact with encodings.

    7bit, 8bit and binary were added in version 7.03.1430243000 20150428. These are MIME Content-Transfer-Encoding values; some web servers (Apache) are known to use them as HTTP Content-Encoding values however.
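
The add/del/set behavior described above can be modeled in a few lines of Python. This is only a sketch of the documented semantics, not the fetch library's implementation: the function name apply_list_action is hypothetical, and rejecting unknown values (and rejecting default with add or del) is an assumption.

```python
ALL_ENCODINGS = ["7bit", "8bit", "binary", "identity",
                 "chunked", "gzip", "deflate", "compress"]
DEFAULT_ENCODINGS = ["identity", "chunked", "gzip", "deflate", "compress"]


def apply_list_action(current, args):
    """Model of `encodings [add|del|set] [$encodings ...]`.

    Returns (new_list, previous_list); the setting itself is documented
    as returning the previous list.
    """
    previous = list(current)
    action, values = "set", list(args)      # "set" is the default action
    if values and values[0] in ("add", "del", "set"):
        action, values = values[0], values[1:]

    expanded = []
    for value in values:
        if value == "all":
            expanded.extend(ALL_ENCODINGS)
        elif value == "default" and action == "set":
            expanded.extend(DEFAULT_ENCODINGS)
        elif value in ALL_ENCODINGS:
            expanded.append(value)
        else:
            # Assumption: anything else is treated as an error here.
            raise ValueError("unsupported value: %r" % value)

    new = [] if action == "set" else list(current)
    if action == "del":
        new = [e for e in new if e not in expanded]
    else:
        for e in expanded:
            if e not in new:
                new.append(e)
    return new, previous
```

For example, starting from the default list, the arguments del compress leave identity, chunked, gzip and deflate allowed, and set default restores the original five.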

  • fileexclude (list) List of file trees to exclude (disallow) when fetching a local file:// URL. The default is none (no restrictions) for Windows, and "/dev/", "/proc/" and "/debug/" for Unix. After fileroot is applied, if the resulting local file path from a file:// URL has one of these paths as a prefix, the URL will not be fetched. This can be used to protect certain unsafe or private directories on a local filesystem from being inadvertently walked. Does not apply to FTP-mapped non-localhost file:// URLs. Added in version 4.02.1048785087 20030327. Aka fileexcludes. Returns previous setting.

  • fileinclude (list) List of file trees to include (require) when fetching a local file:// URL. The default is none (no restrictions). After fileroot is applied, if the resulting local file path from a file:// URL does not have one of these paths as a prefix, the URL will not be fetched. This can be used to keep a local filesystem walk within certain directories. Does not apply to FTP-mapped non-localhost file:// URLs. Added in version 4.02.1048785087 20030327. Aka fileincludes. Returns previous setting.
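
The combined effect of fileexclude and fileinclude can be pictured as two prefix tests on the local path. The Python sketch below is illustrative only: the function name local_file_allowed is hypothetical, and it assumes a plain string-prefix comparison with the exclude test applied first; the actual order and any path normalization are not stated above.

```python
def local_file_allowed(path, fileinclude=(),
                       fileexclude=("/dev/", "/proc/", "/debug/")):
    """Return True if `path` may be fetched.

    `path` is the local file path after fileroot has been applied.
    The default fileexclude shown is the documented Unix default.
    """
    # Excluded trees: any matching prefix blocks the fetch.
    if any(path.startswith(prefix) for prefix in fileexclude):
        return False
    # Included trees: if the list is non-empty, one prefix must match.
    if fileinclude and not any(path.startswith(p) for p in fileinclude):
        return False
    return True
```

With the Unix defaults, "/dev/zero" is rejected and "/home/user/a.txt" is accepted; adding "/docs/" to fileinclude then rejects everything outside that tree.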

  • filenonlocal (string)

    How to handle non-localhost file:// URLs, i.e. ones with a specific host other than empty string or "localhost". The value can be one of:

    • off Default: do not allow non-localhost file:// URLs. This ensures that no FTP or UNC paths are used.

    • unc Map non-localhost file:// URLs to their UNC paths and attempt to open as a local file. E.g. the URL "file://myhost/mydir/myfile" would map to the file "\\myhost\mydir\myfile" under Windows and "//myhost/mydir/myfile" under Unix (but see modifications under fileroot below). This allows the behavior of web browsers that support UNC paths to be emulated on operating systems that support UNC, for consistency with browser views.

    • ftp Map non-localhost file:// URLs to FTP. E.g. the URL "file://myhost/mydir/myfile" would map to the URL "ftp://myhost/mydir/myfile" and be fetched as such. This allows the behavior of some browsers/operating systems that do not support UNC paths to be emulated.

    Added in version 4.02.1048785087 20030327. Returns previous setting. Note that filenonlocal only applies when a proxy is not set; when a proxy is active, all file:// URLs are passed to the proxy.
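
The three filenonlocal values can be summarized with a small Python model of the URL mapping. It is a sketch under stated assumptions, not the library's code: the function name map_nonlocal_file_url and its return format are hypothetical, and fileroot, proxies and percent-decoding are deliberately ignored.

```python
from urllib.parse import urlsplit


def map_nonlocal_file_url(url, filenonlocal="off", windows=False):
    """Return (kind, target) for a file:// URL.

    kind is "local", "disallowed", "unc" or "ftp".
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file:// URL: %r" % url)
    host = parts.netloc
    if host.lower() in ("", "localhost"):
        return ("local", parts.path)        # not affected by filenonlocal
    if filenonlocal == "off":
        return ("disallowed", None)         # default: no FTP or UNC paths
    if filenonlocal == "unc":
        unc = "//" + host + parts.path
        if windows:
            unc = unc.replace("/", "\\")    # \\myhost\mydir\myfile
        return ("unc", unc)
    if filenonlocal == "ftp":
        return ("ftp", "ftp://" + host + parts.path)
    raise ValueError("unknown filenonlocal value: %r" % filenonlocal)
```

Calling it with "file://myhost/mydir/myfile" reproduces the examples above: the URL is disallowed under off, maps to "//myhost/mydir/myfile" under unc on Unix, and to "ftp://myhost/mydir/myfile" under ftp.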

  • fileroot (string)

    Sets the root directory to prepend to local file:// URL paths; default none. E.g. with fileroot set to "/docs", the URL "file://localhost/dir/file.txt" would be read from the file "/docs/dir/file.txt". Also applies to non-localhost URLs when filenonlocal is set to unc, e.g. the URL "file://myhost/mydir/myfile" is read from the file "/docs/myhost/mydir/myfile". This allows both localhost and non-localhost file:// URLs to be mapped to a single directory hierarchy, perhaps where network filesystems corresponding to individual host(s) are mounted. Added in version 4.02.1048785087 20030327. Returns previous setting.

  • filetypes [add|del|set] [file|dir|device|symlink|other ...] Sets the list of allowed file types for local file:// URLs. The possible values are file for ordinary files, dir for directories, device for devices, symlink for symbolic links (if supported by operating system), and other for other types (sockets etc.). If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). The default list is file, dir and symlink. If the file derived from a local file:// URL is not one of these types, it is disallowed. This prevents links to URLs like "file://localhost/dev/zero" from hampering a walk. Added in version 4.02.1048785087 20030327. Returns previous setting.
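
The five filetypes categories correspond closely to what a stat call reports. The Python sketch below shows one way such a classification could be done; it is not the library's implementation. The function names are hypothetical, and the use of lstat (which does not follow symbolic links) is an assumption, since the text above does not say whether links are resolved before the check.

```python
import os
import stat


def classify_file_type(path):
    """Classify `path` as file, dir, device, symlink or other."""
    mode = os.lstat(path).st_mode           # lstat: do not follow symlinks
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return "device"
    return "other"                          # sockets, FIFOs etc.


def type_allowed(path, filetypes=("file", "dir", "symlink")):
    """Check `path` against an allowed list (documented default shown)."""
    return classify_file_type(path) in filetypes
```

Under this model a character device such as /dev/zero classifies as device, which is not in the default list and would therefore be disallowed.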

  • followpermanentredirects (boolean)

    Whether to follow (fetch) permanent (301) redirects and their equivalents (e.g. file:// directory trailing-slash redirects). The default is on, which follows them. Turning this off results in a fetch error when such redirects are encountered - the <urlinfo errtoken> NotFollowingPermanentRedirect. See <urlinfo permanenturl>, canonicalurl, actualurl, redirs for how they are affected when followpermanentredirects is off.

    Stopping at permanent redirects allows a script to take other action when they are encountered (such as updating a stored URL) before re-fetching the redirect. Added in version 8.01.1689976778 20230721; previous versions behaved as if this setting were always on.

  • ftpactivepassivefallback (boolean)

    If on (the default), FTP passive mode fetches will fall back to active mode on failure, and vice-versa. This may help resolve a fetch to an FTP server that does not support the current mode (or is firewalled), i.e. in cases where ftppassive is not set properly for the given situation. Only failures of the PORT or PASV command, or a temporary (4nn) error response to the main (RETR/STOR/etc.) command will trigger the mode switch. Added in version 6.00.1304040000 20110428. Returns previous setting.

    Note that if the correct mode (active or passive) is already known in advance, it is preferable to set it from the outset via the ftppassive setting, to avoid potential delays and/or errors from relying on this fallback switchover.

  • ftppassive (boolean)

    If on (the default), FTP passive mode is used first for FTP protocol fetches. If off, FTP active mode is used first. Passive mode can be useful in situations where a firewall on the client (Vortex) side of the network prevents an FTP transfer (e.g. timeout). This is due to the nature of active-mode data transfers, where the remote (server) side is required to initiate a separate socket connection back to the client (even though the client initiates the original control connection). Many firewalls will block such incoming connections, causing the transfer to time out. Passive mode allows the client to initiate both the control and data connections, which is often permitted by the client's firewall. Added in version 5.01.1121350905 20050714. Note: Prior to version 6.00.1304040000 20110428 this setting was off by default. Returns previous setting.

    Note that if ftpactivepassivefallback is on (the default), the alternate mode may be used if the first mode (set by this setting) fails.

  • ftprelativepaths (boolean)

    If on (the default), FTP paths are assumed to be login-dir-relative, so the URL "ftp://host/dir/file.txt" would be fetched with "RETR home/dir/file.txt" instead of "RETR /dir/file.txt" (where home is the FTP user's login directory). For most (i.e. anonymous) FTP URLs this makes no difference, as the FTP login dir is typically at the root of the FTP-accessible tree. However, for many FTP URLs that require a true login, the FTP login dir is not the root dir, but the user's home directory. Thus, with ftprelativepaths on, the above URL would fetch "dir/file.txt" from the user's home directory - not "/dir/file.txt" from the root dir, where it may not exist. With ftprelativepaths off, the user's home directory - which may be unknown or vary from user to user - would have to be specified in the FTP URL in order to get back to the FTP login dir.

    Dirs outside the FTP login dir may still be accessed when ftprelativepaths is on, however, by encoding an extra slash in the URL, e.g. "ftp://host/%2Fdir/file.txt". Added in version 6. In previous versions the setting was effectively off.

  • ftpsendrelativepathsasabsolute (boolean)

    If on (the default), relative FTP paths (i.e. due to ftprelativepaths) are changed to absolute paths when sent to the server, by prefixing the login directory (obtained with a PWD after login). This avoids the occasional need for a no-argument "CWD" command to go back to the login directory (which some servers do not support), while still supporting the functionality of ftprelativepaths (no home dir needed in URLs). If off, the login directory is not prefixed; i.e. the URL "ftp://host/dir/file.txt" is fetched with "RETR dir/file.txt". This setting has no effect if ftprelativepaths is off. Added in version 6.00.1301360000 20110328.
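
How ftprelativepaths and ftpsendrelativepathsasabsolute combine can be illustrated with a Python sketch that derives the path sent with RETR. This is a model of the behavior described above, not the library's code: the function name ftp_command_path is hypothetical, and login_dir stands for the directory the server reports via PWD after login.

```python
from urllib.parse import urlsplit, unquote


def ftp_command_path(url, login_dir, ftprelativepaths=True,
                     ftpsendrelativepathsasabsolute=True):
    """Return the path argument sent with RETR for an ftp:// URL."""
    raw = urlsplit(url).path        # "/dir/file.txt" or "/%2Fdir/file.txt"
    if not ftprelativepaths:
        return unquote(raw)         # path is taken from the root dir
    # Drop the slash that merely separates host from path, then decode.
    rel = unquote(raw[1:] if raw.startswith("/") else raw)
    if rel.startswith("/"):
        return rel                  # "%2F" encoded an extra slash: absolute
    if ftpsendrelativepathsasabsolute:
        # Prefix the login directory instead of relying on a bare CWD.
        return login_dir.rstrip("/") + "/" + rel
    return rel
```

With a login directory of "/home/user", the URL "ftp://host/dir/file.txt" yields "/home/user/dir/file.txt" by default, "dir/file.txt" with ftpsendrelativepathsasabsolute off, and "/dir/file.txt" with ftprelativepaths off; "ftp://host/%2Fdir/file.txt" yields "/dir/file.txt" while ftprelativepaths is on.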

  • getframes (boolean)

    If on, the <frame> objects of documents are fetched. The raw HTML returned will remain the same (the original document), but the formatted text from <urlinfo text> will be replaced and instead contain each frame in sequence. The links returned by <urlinfo links> or allrefs will be the list of all the frames' links (e.g. the original URL - frame parent - will not have its links nor frames etc. included). The default is false, i.e. frames are not fetched. Note that only one level of frames is fetched, i.e. a <frame> link inside a <frame> link will not be fetched.

  • getiframes (boolean)

    If on, inline <iframe> documents are fetched. The raw HTML returned will remain the same (the original document). The formatted text from <urlinfo text> will also remain, except that <iframe> blocks will be replaced with their referenced document text in-line. Note that like frames, only one level of <iframe>s is fetched. The <iframe> links are removed from the iframes, links, and allrefs lists returned by the urlinfo function. The default is false. Added in version 3.01.963000000 20000707.

  • getscripts (boolean)

    If on, and javascript is on, <SCRIPT SRC=...></SCRIPT> URLs on a page will also be fetched and run if they refer to JavaScript (and their URLs removed from the urlinfo links and allrefs lists). If off (default), such URLs are not fetched, and only inline <SCRIPT>...</SCRIPT> scripts are run (if javascript is on). Returns previous value. Added in version 4.01.1023800000 20020611.

  • httpversion $version Sets the HTTP version to use for requests. The $version argument is one of 0.9, 1.0 or 1.1. HTTP/1.0 is the default for Texis/Webinator version 5 and earlier; HTTP/1.1 is the default for Texis/Webinator version 6 and later (and is only conditionally supported). It may be necessary to set 1.1 to fully utilize some features, e.g. content/transfer encodings (see the encodings setting, here). Added in version 5.01.1249039000 20090731. Returns previous version.

  • ignoreanchorframes (boolean) Whether to ignore frames and IFRAMEs that are just anchors, e.g. src="#". These usually just contain JavaScript, and fetching them just doubles up the content, links etc. of the parent URL. On by default. Added in version 6.

  • inputfileroot (string)

    If set, all set/non-empty <input type="file"> values must be within this local directory tree (and not contain "../" components to get out of it), when <urlcp domvalue "...submitContent"> or variants are called. Value(s) that are outside this setting will cause an error such as "Will not add form input `...' file `...' to submit content: Not in inputfileroot directory or contains `../'", and will be treated as empty (i.e. sent as empty value with no file). This is for security, to ensure all to-be-uploaded files are from a known directory. Added in version 6.00.1335222312 20120423. Default is unset (i.e. no check is performed). Returns 1 on success, 0 on error.
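
The inputfileroot check amounts to a containment test on the file path. The Python sketch below illustrates the idea; it is not the library's implementation. The function name within_input_file_root is hypothetical, and the sketch assumes absolute POSIX-style paths and rejects any ".." component.

```python
import posixpath


def within_input_file_root(file_path, inputfileroot):
    """Return True if `file_path` may be added to the submit content."""
    if not inputfileroot:
        return True                         # unset: no check is performed
    # Reject any ".." component outright, even one that would
    # normalize back into the tree.
    if ".." in file_path.split("/"):
        return False
    root = posixpath.normpath(inputfileroot).rstrip("/") + "/"
    path = posixpath.normpath(file_path)
    # The trailing "/" on root stops "/uploadsX" matching "/uploads".
    return path.startswith(root)
```

With inputfileroot set to "/uploads", the path "/uploads/a.txt" passes, while "/uploads/../etc/passwd" and "/etc/passwd" are rejected.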

  • ipprotocols [add|del|set] [$protocol ...]

    Sets the list of IP protocols to allow for page fetches. One or more of IPv4 and/or IPv6. If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). The default list, settable with set default, is currently IPv4 IPv6. Returns nonzero on success, zero on error. Added in version 8.

    Note that for DNS, the IP protocol used over-the-wire for lookup is not affected by this setting - as opposed to the DNS query type (A vs. AAAA), which is. The DNS lookup over-the-wire protocol is determined solely by the IP family of the nameservers' IP address(es) (here). Thus it is possible to use an IPv4 nameserver to look up and connect with IPv6 hosts, or vice-versa.

    Note that allowing both IPv4 and IPv6 may result in undesired behavior on occasion, i.e. if network/DNS/etc. configuration is inconsistent. For example, an IPv6 address may be found for a hostname, but fail to connect if the server only responds to its IPv4 address.

    Note that not all protocols are available (supported) on all systems; see <urlinfo ipprotocolsavailable> (here). See also <urlinfo ipprotocols> (here) for a list of allowed protocols, i.e. this setting's current value.
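
The distinction between the DNS query type and the DNS transport protocol can be made concrete with a short Python sketch. It only models the rule stated above; the function name dns_plan is hypothetical and no lookup is actually performed.

```python
import ipaddress


def dns_plan(ipprotocols, nameserver_ip):
    """Return (query_types, transport) for a hostname lookup.

    query_types follows the ipprotocols setting; transport follows
    only the address family of the nameserver's IP address.
    """
    query_types = []
    if "IPv4" in ipprotocols:
        query_types.append("A")
    if "IPv6" in ipprotocols:
        query_types.append("AAAA")
    transport = "IPv%d" % ipaddress.ip_address(nameserver_ip).version
    return query_types, transport
```

For instance, with ipprotocols set to IPv6 and a nameserver at the IPv4 address 192.0.2.53, the model issues AAAA queries over IPv4.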

  • ipv6scopeidinhostheader (boolean)

    Whether to print the scope id (e.g. %eth0 part) of an IPv6 link-local address in the HTTP Host header, if such an address is used in the URL. Some web servers (e.g. Apache) do not accept scope ids (i.e. due to a strict interpretation of RFC 4007 11.2), and will fail web requests containing them. Turning off ipv6scopeidinhostheader (the default) causes scope ids not to be printed in Host headers, to conform with such servers. Added in version 8. Returns previous value of setting.
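
The effect of ipv6scopeidinhostheader on the Host header can be sketched in Python as follows. This is an illustrative model only: the function name host_header_value is hypothetical, the host is passed without brackets, and how the library encodes a scope id when it is kept is not specified above (it is passed through unchanged here).

```python
def host_header_value(host, port=None, default_port=80,
                      ipv6scopeidinhostheader=False):
    """Build a Host header value from a URL's host and port."""
    if ":" in host:                         # IPv6 literal
        if not ipv6scopeidinhostheader:
            host = host.split("%", 1)[0]    # drop the "%eth0" scope id
        host = "[" + host + "]"
    if port is not None and port != default_port:
        host += ":" + str(port)
    return host
```

With the setting off, a link-local address such as fe80::1%eth0 on port 8080 produces "[fe80::1]:8080"; with it on, the scope id is retained.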

  • linkprotocols [add|del|set] [$protocols|allowed ...] Sets the list of protocols allowed to be returned in links from a page (i.e. the links value of the urlinfo function, here). Note that this setting does not control what can be fetched, only the list of links returned from a page. It can be used as a filter to remove invalid-protocol links returned by a page. The $protocols argument(s) are a list of zero or more values, each of which is either a recognized protocol (see protocols below), the value unknown for unknown protocols, or the value allowed for just protocols permitted by the protocols setting. The default is all protocols plus unknown. If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Returns previous setting. Added in version 4.01.1029180431 20020812.

  • methods [add|del|set] [$methods ...] Sets the list of request methods allowed for page fetching (default all). The $methods argument(s) are zero or more of the values OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, MKDIR, RENAME, SCHEDULE, COMPILE or RUN. Not all methods are supported by all protocols; e.g. MKDIR is only supported by FTP. If the first value of the first argument is add, the given method list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Alternately, the default methods may be restored with set default. Returns previous setting. Added in version 5.01.1232696000 20090123.

  • netmode (string) Sets the routines to use for page fetching. The default is int, which uses Texis' internal routines. For Windows versions, netmode may be set to sys, which uses the system routines. This may allow certain authenticated sites to be accessed, if the internal routines' NTLM authentication is not sufficient for example. However, parallelization and many other settings and features are unavailable in sys mode. Added in version 4.04.1068000000 20031104.

  • offsiteok (boolean)

    If on (default), URLs that are off-site from the original URL will be fetched if needed. If off, such URLs will not be fetched. This includes redirects, components such as frames, iframes and scripts, and FTP data sockets. This setting does not affect the original (given) URL.

  • protocols [add|del|set] [$protocols ...]

    Sets the list of URL protocols allowed to be fetched. $protocols is a list of zero or more of the supported protocols http, ftp, gopher, javascript, https or file. The default list is http, ftp, gopher, javascript and https. (Note that javascript must also be on if JavaScript URLs are to work.) If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Returns the previous list of allowed protocols. Added in version 4.01.1024300000 20020617. file support added in version 4.02.1048785087 20030327. Note that changing these protocols may affect what links are returned by <urlinfo links> (here), if linkprotocols (here) is set to allowed.

  • proxy [type] $proxyUrl

    Takes a URL as argument. The protocol, host and port of $proxyUrl will be used as the proxy or tunnel to fetch all future URL requests (except for javascript: URLs, which are always evaluated internally as they are source code, not resource locations). The $proxyUrl protocol must be HTTP or (in version 4.02.1048785087 20030327 and later) HTTPS. In version 4.04.1077500000 20040222 and later, an empty string value will clear the proxy, i.e. turn it off and resume direct connections.

    In version 7.05 and later, a type may be given; one of:

    • pacurl

      The URL given is a proxy auto-config URL, i.e. the URL of a JavaScript proxy.pac file containing a FindProxyForURL(url, host) function that returns the proxy(ies) to use for a given URL. The URL MIME type should be application/x-ns-proxy-autoconfig. The PAC script will be automatically fetched at the next <fetch> or <submit> statement, as needed.

    • pacscript

      The URL argument is instead the PAC script itself.

    If proxy auto-config is enabled via either of these types, the script's FindProxyForURL() function will be called for every URL. This function returns a list of one or more proxies to use for the given URL (or DIRECT to not use a proxy). Thus, proxy auto-config allows URL-by-URL customization of proxies, and/or organization-wide proxy configuration. The PAC script is run in a restricted environment; see findproxyforurl.com for details on the JavaScript functions available to PAC scripts.

    Failure to fetch the pacurl script will cause the current <fetch> to fail with PacError, as well as all subsequent <fetch>es until pacfetchretrydelay expires (here). Proxies that are deemed "bad" (e.g. unresponsive) will have a lower priority for proxyretrydelay seconds (here).

    An optional proxy mode argument (same as proxymode) may also be given after the URL, if no type is specified. This syntax is for back-compatibility and is deprecated in favor of the proxymode setting; it may be removed in future releases.
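
A PAC script's FindProxyForURL() conventionally returns a semicolon-separated string such as "PROXY proxy1.example.com:8080; DIRECT". The Python sketch below parses that conventional format into an ordered list; it is not the fetch library's parser, the function name parse_pac_result is hypothetical, and which proxy keywords Vortex accepts is not stated above.

```python
def parse_pac_result(result):
    """Split a FindProxyForURL() return string into (kind, target) pairs.

    Entries are kept in order, since earlier entries are tried first.
    DIRECT has no target, so its target is None.
    """
    entries = []
    for item in result.split(";"):
        fields = item.split()
        if not fields:
            continue                        # tolerate empty segments
        kind = fields[0].upper()
        target = fields[1] if len(fields) > 1 else None
        entries.append((kind, target))
    return entries
```

The example string above parses to a PROXY entry for proxy1.example.com:8080 followed by a DIRECT entry, i.e. try the proxy first and connect directly if it cannot be used.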

  • proxymode $mode

    Determines how to use a proxy (if set). $mode is one of:

    • auto

      Automatically select proxy or tunnel mode depending on the requested URL: tunnel for HTTPS, otherwise proxy. This is the default if $mode is not specified.

    • proxy

      Always use proxy mode, i.e. tell proxy to GET (or whatever method was requested) the requested absolute URL.

    • tunnel

      Always use tunnel mode, i.e. tell proxy to CONNECT to the requested host and port, then proceed with request as if directly connected to the requested server. If the request URL's protocol cannot be tunneled (e.g. file: or FTP), the request fails.

    Added in version 7.05. In versions prior to 7.00.1363052000 20130311, the mode was always proxy, even for HTTPS URLs. In version 7.00.1363052000 20130311 through 7.04, the default mode was auto.

    Note that an HTTPS tunnel to an HTTPS origin server is not currently supported by the fetch library (an HTTPS proxy to an HTTPS server is supported, however). Many tunnels do not support an HTTPS tunnel to an HTTP origin server, and some proxies do not support HTTPS origin servers (since the proxy would then have to provide a certificate etc.).
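
The difference between the three modes shows up in the first request line sent to the proxy. The Python sketch below models that choice; it is an illustration of the description above rather than the library's code. The function name proxy_request_line is hypothetical, and treating only http and https URLs as tunnelable is an assumption based on the file:/FTP exclusion mentioned above.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def proxy_request_line(url, proxymode="auto", method="GET",
                       httpversion="1.1"):
    """Return the first request line sent to the proxy for `url`."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    mode = proxymode
    if mode == "auto":
        # auto: tunnel for HTTPS, otherwise proxy
        mode = "tunnel" if scheme == "https" else "proxy"
    if mode == "proxy":
        # Ask the proxy to fetch the absolute URL itself.
        return "%s %s HTTP/%s" % (method, url, httpversion)
    if mode == "tunnel":
        if scheme not in DEFAULT_PORTS:
            raise ValueError("cannot tunnel %s: URLs" % scheme)
        port = parts.port or DEFAULT_PORTS[scheme]
        # Ask the proxy to open a raw connection to the origin server.
        return "CONNECT %s:%d HTTP/%s" % (parts.hostname, port, httpversion)
    raise ValueError("unknown proxymode: %r" % proxymode)
```

In auto mode an https:// URL produces a CONNECT line for host and port 443, while an http:// URL produces a GET of the absolute URL; forcing tunnel mode for an ftp:// URL fails, as described for that mode.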


Copyright © Thunderstone Software     Last updated: Apr 15 2024