Thunderstone Software Document Retrieval and Management
Vortex Manual

Page Fetching

 

The following urlcp settings control whether and how pages and related URLs (such as frames and iframes) are fetched:

  • embedsecurity (string)
    Sets security level for embedded URLs (eg. frames, scripts) in a page. This controls whether or not to fetch such embedded objects based on the relative security (https or not) of the main URL and the embedded object URL. Possible values are:

    • off
      Default: fetch any embedded object URL as requested.

    • nodecrease
      Do not fetch non-https embedded objects if the main page is https.

    • noincrease
      Do not fetch https embedded objects if the main page is non-https.

    • sameprotocol
      Do not fetch embedded objects unless they are the same protocol (http, https, ftp etc.) as the main page.
    Returns previous setting. Added in version 4.01.1031348302 20020906.
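
The four embedsecurity levels can be read as a small decision rule over the two URL schemes. The following is a minimal Python sketch of that rule as documented above, not Vortex code and not Thunderstone's implementation; the function name may_fetch_embedded is hypothetical.

```python
from urllib.parse import urlsplit


def may_fetch_embedded(level: str, main_url: str, embedded_url: str) -> bool:
    """Model of the documented embedsecurity levels (illustrative only)."""
    main = urlsplit(main_url).scheme.lower()
    emb = urlsplit(embedded_url).scheme.lower()
    if level == "off":
        # Default: fetch any embedded object URL as requested.
        return True
    if level == "nodecrease":
        # Refuse non-https objects embedded in an https page.
        return not (main == "https" and emb != "https")
    if level == "noincrease":
        # Refuse https objects embedded in a non-https page.
        return not (main != "https" and emb == "https")
    if level == "sameprotocol":
        # Refuse anything whose protocol differs from the main page.
        return main == emb
    raise ValueError(f"unknown embedsecurity level: {level!r}")
```

For example, under nodecrease an http script referenced from an https page is refused, while an https script referenced from an http page is still fetched.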

  • fileexclude (list)
    List of file trees to exclude (disallow) when fetching a local file:// URL. The default is none (no restrictions) for Windows, and "/dev/", "/proc/" and "/debug/" for Unix. After fileroot is applied, if the resulting local file path from a file:// URL has one of these paths as a prefix, the URL will not be fetched. This can be used to protect certain unsafe or private directories on a local filesystem from being inadvertently walked. Does not apply to FTP-mapped non-localhost file:// URLs. Added in version 4.02.1048785087 20030327. Aka fileexcludes. Returns previous setting.

  • fileinclude (list)
    List of file trees to include (require) when fetching a local file:// URL. The default is none (no restrictions). After fileroot is applied, if the resulting local file path from a file:// URL does not have one of these paths as a prefix, the URL will not be fetched. This can be used to keep a local filesystem walk within certain directories. Does not apply to FTP-mapped non-localhost file:// URLs. Added in version 4.02.1048785087 20030327. Aka fileincludes. Returns previous setting.
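
The fileexclude and fileinclude checks described above both reduce to prefix tests on the local path obtained after fileroot is applied. Below is a minimal Python sketch, assuming a literal string-prefix comparison and the documented Unix defaults; path_allowed is a hypothetical name and the code is a model of the rule, not Vortex itself.

```python
def path_allowed(path, include=(), exclude=("/dev/", "/proc/", "/debug/")):
    """Return True if a resolved local path passes both prefix checks.

    Assumption: prefixes are compared as plain strings, with no
    normalization of "..", symlinks, or letter case.
    """
    # fileexclude: any matching prefix disallows the path.
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    # fileinclude: an empty list means "no restrictions"; otherwise
    # at least one prefix must match.
    if include and not any(path.startswith(prefix) for prefix in include):
        return False
    return True
```

With include set to ["/docs/"], a walk stays inside /docs/, and adding "/docs/private/" to exclude carves a subtree back out of it.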

  • filenonlocal (string)
    How to handle non-localhost file:// URLs, ie. ones with a specific host other than the empty string or "localhost". The value can be one of:

    • off
      Default: do not allow non-localhost file:// URLs. This ensures that no FTP or UNC paths are used.

    • unc
      Map non-localhost file:// URLs to their UNC paths and attempt to open as a local file. Eg. the URL "file://myhost/mydir/myfile" would map to the file "\\myhost\mydir\myfile" under Windows and "//myhost/mydir/myfile" under Unix (but see modifications under fileroot below). This allows the behavior of web browsers that support UNC paths to be emulated on operating systems that support UNC, for consistency with browser views.

    • ftp
      Map non-localhost file:// URLs to FTP. Eg. the URL "file://myhost/mydir/myfile" would map to the URL "ftp://myhost/mydir/myfile" and be fetched as such. This allows the behavior of some browsers/operating systems that do not support UNC paths to be emulated.
    Added in version 4.02.1048785087 20030327. Returns previous setting.
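
The three filenonlocal values amount to a mapping from a file:// URL to either a refusal, a local (UNC) path, or an ftp:// URL. The Python sketch below models that mapping using the examples from the text; the function name map_file_url, the windows flag, and the (kind, value) return shape are all assumptions made for illustration, and fileroot is deliberately left out.

```python
from urllib.parse import urlsplit


def map_file_url(url, filenonlocal="off", windows=False):
    """Model of how a file:// URL is routed (illustrative only).

    Returns a (kind, value) pair: ("local", path), ("url", ftp_url)
    or ("refused", None).
    """
    parts = urlsplit(url)
    if parts.scheme.lower() != "file":
        raise ValueError("not a file:// URL")
    host = parts.netloc
    if host.lower() in ("", "localhost"):
        # Local URL: the filenonlocal setting does not apply.
        return ("local", parts.path)
    if filenonlocal == "off":
        return ("refused", None)
    if filenonlocal == "unc":
        unc = "//" + host + parts.path
        if windows:
            unc = unc.replace("/", "\\")
        return ("local", unc)
    if filenonlocal == "ftp":
        return ("url", "ftp://" + host + parts.path)
    raise ValueError(f"unknown filenonlocal value: {filenonlocal!r}")
```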

  • fileroot (string)
    Sets the root directory to prepend to local file:// URL paths; default none. Eg. with fileroot set to "/docs", the URL "file://localhost/dir/file.txt" would be read from the file "/docs/dir/file.txt". For non-localhost URLs when filenonlocal is set to unc, the leading slash is removed before fileroot is prepended. Eg. for the URL "file://myhost/mydir/myfile" the file "/docs/myhost/mydir/myfile" is read. This allows both local and non-localhost file:// URLs to be mapped to a single directory hierarchy (perhaps where NFS filesystems corresponding to "myhost" are mounted). Added in version 4.02.1048785087 20030327. Returns previous setting.
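
The two fileroot examples above can be reproduced with a few lines of string handling. This is a Unix-style Python sketch of the documented behavior only; apply_fileroot is a hypothetical helper, and how Vortex treats trailing slashes or Windows paths is not stated in the text, so that part is an assumption.

```python
def apply_fileroot(fileroot, path):
    """Prepend fileroot to a local path or a Unix-style UNC path.

    `path` is either a local path ("/dir/file.txt") or a UNC-style
    path ("//myhost/mydir/myfile") as produced when filenonlocal is unc.
    """
    if not fileroot:
        # Default: no root directory, path is used as-is.
        return path
    if path.startswith("//"):
        # UNC path: drop one leading slash before prepending.
        path = path[1:]
    # Assumption: a trailing slash on fileroot is ignored.
    return fileroot.rstrip("/") + path
```

Both kinds of URL then land in one hierarchy: "/dir/file.txt" becomes "/docs/dir/file.txt" and "//myhost/mydir/myfile" becomes "/docs/myhost/mydir/myfile".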

  • filetypes [add|del|set] [file|dir|device|symlink|other ...]
    Sets the list of allowed file types for local file:// URLs. The possible values are file for ordinary files, dir for directories, device for devices, symlink for symbolic links (if supported by the operating system), and other for other types (sockets etc.). If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). The default list is file, dir and symlink. If the file derived from a local file:// URL is not one of these types, it is disallowed. This prevents links to URLs like "file://localhost/dev/zero" from hampering a walk. Added in version 4.02.1048785087 20030327. Returns previous setting.
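
To make the five filetypes categories concrete, the Python sketch below classifies a path with the standard os and stat modules and checks it against an allowed set. It is a model only: the names file_type and type_allowed are hypothetical, and the choice to classify a symbolic link itself (lstat) rather than its target is an assumption, since the text does not say which Vortex does.

```python
import os
import stat

# Default allowed list as documented: file, dir and symlink.
DEFAULT_ALLOWED = frozenset({"file", "dir", "symlink"})


def file_type(path):
    """Classify a path into one of the five documented categories."""
    mode = os.lstat(path).st_mode  # assumption: do not follow symlinks
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return "device"
    return "other"  # sockets, FIFOs, etc.


def type_allowed(path, allowed=DEFAULT_ALLOWED):
    """Return True if the path's type is in the allowed list."""
    return file_type(path) in allowed
```

On a typical Unix system a character device such as /dev/zero would classify as device and be rejected by the default list, which is the case the text warns about.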

  • ftppassive (boolean)
    If off (the default), FTP active mode is used for FTP protocol fetches. If on, FTP passive mode is used. Passive mode can be useful in situations where a firewall on the client (Vortex) side of the network prevents an FTP transfer (eg. timeout). This is due to the nature of active-mode data transfers, where the remote (server) side is required to initiate a separate socket connection back to the client (even though the client initiates the original control connection). Many firewalls will block such incoming connections, causing the transfer to time out. Passive mode allows the client to initiate both the control and data connections, which is often permitted by the client's firewall. Added in version 5.01.1121350905 20050714 (default off in previous versions). Returns previous setting.
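
For readers who want to see the same active/passive switch outside Vortex, Python's standard ftplib exposes it as FTP.set_pasv(). The sketch below is only an analogy to ftppassive, not how Vortex implements it; make_ftp_client is a hypothetical helper, and note that ftplib defaults to passive mode, the opposite of the default documented here.

```python
from ftplib import FTP


def make_ftp_client(ftppassive: bool) -> FTP:
    """Create an unconnected FTP client with the requested transfer mode."""
    ftp = FTP()  # no host given, so no connection is made yet
    # True: client opens the data connection (passive mode).
    # False: server connects back to the client (active mode).
    ftp.set_pasv(ftppassive)
    return ftp
```

A caller would then use ftp.connect() and ftp.login() as usual; the mode chosen here only matters once a data transfer starts.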

  • getframes (boolean)
    If on, text frames are fetched for framed documents. The raw HTML returned will remain the same (the original document), but the formatted text from urltext will be replaced and instead contain each frame in sequence. The links returned by urllinks will be the list of all the frames' links. The default is off, ie. frames are not fetched. Only applies to HTTP URLs.

  • getiframes (boolean)
    If on, inline <IFRAME> documents are fetched. The raw HTML returned will remain the same (the original document). The formatted text from urltext will also remain, except that <IFRAME> blocks will be replaced with their referenced document text in-line. The <IFRAME>s are removed from the iframes and links lists returned by the urlinfo function. Only applies to HTTP URLs. The default is off. Added in version 3.01.963000000 20000707.

  • getscripts (boolean)
    If on, and javascript is on, <SCRIPT SRC=...></SCRIPT> URLs on a page will also be fetched and run if they refer to JavaScript. If off (default), such URLs are not fetched, and only inline <SCRIPT>...</SCRIPT> scripts are run (if javascript is on). Returns previous value. Added in version 4.01.1023800000 20020611.

  • linkprotocols [add|del|set] [$protocols|allowed ...]
    Sets the list of protocols allowed to be returned in links from a page (ie. the links value of the urlinfo function). Note that this setting does not control what can be fetched, only the list of links returned from a page. It can be used as a filter to remove invalid-protocol links returned by a page. The $protocols argument(s) are a list of zero or more values, each of which is either a recognized protocol (see protocols below), the value unknown for unknown protocols, or the value allowed for just protocols permitted by the protocols setting. The default is all protocols plus unknown. If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Returns previous setting. Added in version 4.01.1029180431 20020812.
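
The interplay of recognized protocols, unknown and allowed in linkprotocols is easier to follow as a filter over a list of links. The Python sketch below models it under two assumptions: links are absolute URLs, and the recognized protocols are the six named under protocols. The names KNOWN_PROTOCOLS and filter_links are hypothetical, and this is not Vortex code.

```python
from urllib.parse import urlsplit

# Assumption: the "recognized" protocols are those listed under protocols.
KNOWN_PROTOCOLS = {"http", "ftp", "gopher", "javascript", "https", "file"}


def filter_links(links, linkprotocols, protocols):
    """Keep only links whose protocol passes the linkprotocols list.

    `linkprotocols` may contain protocol names, "unknown", or "allowed";
    "allowed" expands to the current `protocols` (fetchable) list.
    """
    permitted = set()
    for item in linkprotocols:
        if item == "allowed":
            permitted |= set(protocols)
        else:
            permitted.add(item)
    kept = []
    for link in links:
        scheme = urlsplit(link).scheme.lower()
        key = scheme if scheme in KNOWN_PROTOCOLS else "unknown"
        if key in permitted:
            kept.append(link)
    return kept
```

Note that this only trims the returned link list; as the text says, whether a URL can actually be fetched is governed separately by protocols.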

  • netmode (string)
    Sets the routines to use for page fetching. The default is int, which uses Texis' internal routines. For Windows versions, netmode may be set to sys, which uses the system routines. This allows the user/pass settings to be forwarded to NTLM-authenticated sites. However, parallelization and certain other features are disabled. Added in version 4.04.1068000000 20031104.

  • offsiteok (boolean)
    If on (default), documents that are off-site from the original URL (eg. redirects) will be fetched if needed. If off, such redirects will not be fetched.

  • protocols [add|del|set] [$protocols ...]
    Sets the list of protocols allowed to be fetched. $protocols is a list of zero or more of the values http, ftp, gopher, javascript, https or file. The default list is http, ftp, gopher, javascript and https. (Note that <urlcp javascript> must also be on if JavaScript URLs are to work.) If the first value of the first argument is add, the given list will be added to the allowed list; if del, deleted from; if set, cleared and set (the default). Returns the previous list of allowed protocols. Added in version 4.01.1024300000 20020617. file support added in version 4.02.1048785087 20030327.
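
The add, del and set keywords shared by filetypes, linkprotocols and protocols follow one pattern: an optional leading action, then a list of values, with the previous list returned. The Python sketch below models that pattern on the documented default protocols list; update_list is a hypothetical name, and details such as ordering and duplicate handling are assumptions.

```python
DEFAULT_PROTOCOLS = ["http", "ftp", "gopher", "javascript", "https"]


def update_list(current, args):
    """Apply an [add|del|set] update; return (new_list, previous_list)."""
    action, values = "set", list(args)
    # The action keyword is optional; "set" is the default.
    if values and values[0] in ("add", "del", "set"):
        action, values = values[0], values[1:]
    previous = list(current)
    if action == "add":
        new = previous + [v for v in values if v not in previous]
    elif action == "del":
        new = [v for v in previous if v not in values]
    else:
        # set: clear the list and replace it with the given values.
        new = values
    return new, previous
```

For instance, update_list(DEFAULT_PROTOCOLS, ["add", "file"]) yields the default list with file appended, which corresponds to enabling file:// fetches in addition to the defaults.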

  • proxy (string)
    Takes a URL as an argument. This URL will be used as the proxy server to fetch documents. All future page fetches will go through this server, instead of being fetched directly. Must be an HTTP or HTTPS (in version 4.02.1048785087 20030327 and later) URL. In version 4.04.1077500000 20040222 and later, an empty string value will clear the proxy, ie. turn it off.

Copyright © Thunderstone Software     Last updated: Wed Sep 10 11:16:28 EDT 2008
 
Copyright © 2008 Thunderstone Software LLC. All rights reserved.