SYNOPSIS
<fetch [METHOD=method] $theurl [$rawdoc]>
or
<fetch PARALLEL[=n] [METHOD=method] [URLS=]$urls [$loopvar ...]>
...
</fetch>
DESCRIPTION
The fetch function retrieves URLs, using the HTTP GET
method (or its equivalent for other protocols), and returns the
(unformatted) documents in $ret. The METHOD option,
added in version 3.01.971300000 20001011, specifies an alternate
method; it may be one of OPTIONS, GET, HEAD,
POST, PUT, DELETE, TRACE, MKDIR, or
RENAME. Not all methods are supported by all protocols. Some
methods are mapped to an equivalent method when used with non-HTTP
protocols.
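For instance, a HEAD request retrieves only the response headers for a
page. A minimal sketch (the URL is illustrative):
<$theurl = "http://www.example.com/">
<fetch METHOD=HEAD $theurl>     <!-- headers only; no document body -->
<urlinfo actualurl>             <!-- the URL that was actually fetched -->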
With the first syntax, a single URL (the first value of
$theurl) is retrieved. The urlinfo function
can then be called to obtain more information
about the page just fetched. If a second argument ($rawdoc) is
given, it is used as the raw document source, instead of actually
fetching the URL. This provides a way to obtain the text, links,
etc. of arbitrary script-generated HTML that isn't present on an
actual web page. (To just HTML-decode an
arbitrary string without tag removal, word-wrapping or link parsing,
see the %H code to strfmt with the ! flag.)
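For example, the following sketch (variable names are illustrative, and
$html is assumed to already hold script-generated HTML from earlier in
the script) parses that HTML without any network access; the URL argument
only supplies a base for resolving relative links:
<$base = "http://www.example.com/">    <!-- illustrative base URL -->
<fetch $base $html>                    <!-- parse $html; $base is not actually fetched -->
<urlinfo links>                        <!-- assumed urlinfo setting: links parsed from $html -->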
With the second (loop) syntax, all of the URLs in $urls are
fetched in parallel, that is, simultaneously. Once the first
(i.e. quickest) URL is completed, the fetch loop is entered,
with $ret set to the raw document source. As subsequent URLs
are completed, the loop is iterated again, once for each member of
$urls. Inside the loop, the urlinfo function can be used
to retrieve further information about the URL just completed.
It is important to note with the loop syntax that URLs are returned
fastest-first, which might not be the order in which they appear in
$urls. For example, suppose two URLs are being fetched where
the first URL takes 10 seconds to download and the other 3 seconds.
With the parallel loop syntax, the second will probably be returned
first, after 3 seconds; then 7 seconds later the first will be
completed. A URL that refers to an unresponsive web server will not
hold up other URLs; it is merely returned last, when it times out.
As an aid in tracking which URL was returned in each iteration, the
$urls variable and any subsequent $loopvar variables are
also looped over inside the fetch, in the same order as the returned
URLs. Thus, inside the loop, $urls is set to the URL just
retrieved.
The special variables $loop and $next are set and
incremented inside the loop as well: $loop starts at 0,
$next at 1.
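A minimal sketch of the loop form (the URLs and the $names list are
illustrative), showing an extra loop variable alongside $urls:
<$urls = "http://www.example.com/a" "http://www.example.com/b">
<$names = "Site A" "Site B">
<fetch PARALLEL $urls $names>
<!-- fastest-first: $urls and $names hold the values for the URL just
     completed, and $ret holds its raw document source -->
Finished $names ($urls) on iteration $loop
</fetch>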
If an argument to PARALLEL is given, only that many URLs
will be fetched simultaneously; the remaining ones are started only as
the first ones complete. The default (no argument to PARALLEL)
is to start fetching all URLs initially (in version 4 and earlier)
or only 10 at a time (version 5 and later).
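For example, with the $urls list above, this sketch keeps at most three
transfers in progress at any one time:
<fetch PARALLEL=3 $urls>
<!-- at most 3 URLs are in progress; the rest start as these complete -->
<send $ret>
</fetch>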
DIAGNOSTICS
fetch returns the raw document just fetched (after
content/transfer encodings are decoded), or the value of $rawdoc if it was given.
EXAMPLE
This example uses the loop syntax to search multiple search engines
simultaneously. First, the user's query is placed into each URL with
sandr. Then the resulting URLs are fetched; because of the
PARALLEL flag, the fastest engine will return first. Each page
is then post-processed to remove HTML outside the
<BODY>/</BODY> tags (since there will be multiple
pages concatenated together) and displayed following a <BASE>
tag so that the user's browser knows the proper path for the links:
<urlcp timeout 10>
<$rooturls =
"http://www.searchsite1.com/cgi-bin/search?q=@@@"
"http://www.searchsite2.com/findit?query=@@@"
"http://www.searchsite3.com/cgi-bin/whereis?q=@@@&cmd=search"
>
<strfmt "%U" $query> <!-- URL-escape query -->
<sandr "[\?\#\{\}\+\\]" "\\\1" $ret> <!-- make sandr-safer -->
<sandr "@@@" $ret $rooturls> <!-- and insert into URLs -->
<$urls = $ret>
<BODY BGCOLOR=white>
<fetch PARALLEL $urls>
<sandr ".*>><body=[^>]+>=" "" $ret> <!-- strip non-BODY -->
<sandr "</body>=.*" "" $ret>
<$html = $ret>
<urlinfo actualurl>
<BASE HREF="$ret"> <!-- tell browser the base -->
<send $html> <!-- print site results -->
</fetch>
</BODY>
CAVEATS
The PARALLEL syntax to fetch was added in version
2.1.902500000 19980807. Support for FTP was added in June 1998,
Gopher in version 2.6.938200000 19990924, HTTPS and javascript:
URLs on June 17, 2002, and file:// URLs in version 4.02.1048785541
20030327. Protected file:// URLs (requiring user/pass) were
supported for Windows starting in version 5.01.1123012366 20050802.
All URLs are returned, even those that cause an error (an empty string
is returned for them). The urlinfo function can then be used to obtain the
error code.
In versions prior to 3.0.949000000 20000127, or if
<urlcp dnsmode sys> is set, domain name resolution cannot be
parallelized due to C lib constraints. Thus, an unresponsive name
server (as opposed to a web server) may hold up other URLs, or even
exceed the <urlcp timeout> setting. In some versions, parallel
FTP retrieval is not supported.
Note that $loop and $next are merely incremented
inside the loop: they do not necessarily correspond to the array index
of the currently returned URL.
As little work as possible should occur inside a fetch loop,
as any time-consuming commands could cause in-progress fetches to
time out.
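One way to keep the loop light (a sketch; $pages is an illustrative
variable) is to collect the returned documents and defer heavy processing
until after the loop:
<$pages = >                    <!-- assumed to reset $pages to an empty list -->
<fetch PARALLEL $urls>
<$pages = $pages $ret>         <!-- just collect; avoid slow work here -->
</fetch>
<!-- time-consuming processing of $pages can safely happen here -->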
SEE ALSO
submit, urlinfo, urlcp
Copyright © Thunderstone Software Last updated: Mon Feb 18 10:28:15 EST 2013