SYNOPSIS
<urlinfo $name [$which]>
DESCRIPTION
The urlinfo function returns information about the last page
retrieved with fetch or submit. Inside a fetch
loop (e.g. with the PARALLEL flag), this is the page just
returned for the current loop iteration. The $name argument selects
what to return; some values take a second $which argument (as
noted below). Possible values for $name and what they return
are:
-
actualurl (string)
The last URL retrieved. It may differ from the argument to
fetch or submit, e.g. if redirects were followed. -
authparams (list)
The list of names of parsed authentication parameters sent by the
server. The value for a particular parameter name can be obtained
with authparam $param. Added in version 5.1. Note that
authentication parameters may not be available even if authentication
is used, if the server does not send them. For example, the
second and later requests on a connection may not need parameters,
if credentials are sent with the initial request and thus the server
does not need to challenge in the response. -
authparam $param (string)
The authentication parameter $param from the server.
$param may be realm for the Basic authentication
realm, target for the NTLM target (i.e. domain), or
serverchallenge for the NTLM server challenge nonce.
Added in version 5.1. Authentication parameter names are
case-insensitive. -
authscheme (string)
The authentication scheme used. Returns one of the scheme tokens
used by <urlcp authschemes>.
Added in version 5.01.1239140000 20090407. -
authschemehighest (string)
The highest (most secure) authentication scheme used during the
entire transaction, i.e. across redirects (if any). E.g. if a
Basic authentication protected page was fetched, which then
redirected to an anonymous-access page, authschemehighest
would return Basic, even though authscheme would
return anonymous (from the last page). Returns one of the
scheme tokens used by <urlcp authschemes>.
Added in version 5.01.1239140000 20090407. -
charsetconfigtotext (string)
The current charset configuration, in the format used by
<urlcp charsetconfigfromfile>. Added in version 6. -
charsetdetected (string)
The charset of the source page, as detected by scanning the document,
without parsing explicit charset labels. Added in version 5. -
charsetexplicit (string)
The charset of the source page, as explicitly set in a header or
<META HTTP-EQUIV> label. Returns Unknown if
unknown or not set. Added in version 5. -
charsetsrc or charsetsource (string)
The charset of the source page, as interpreted by the parser.
This is taken from the first available source, in descending
priority: the charset as set by <urlcp charsetsrc>; the
charset explicitly set in the page (header or meta); the charset
detected by scanning the document; or the <urlcp
charsetsrcdefault> charset. Added in version 5. -
charsettxt or charsettext (string)
The charset of the formatted text (as returned by <urlinfo text>).
Added in version 5. -
contenttype (string)
The MIME content type of the page (without any parameters).
This may have been derived from the Content-Type header,
a <META HTTP-EQUIV> tag, or the URL extension, depending
on what is available. -
contenttypeparams (list)
The names of parameters in the MIME content type, if any. -
contenttypeparam (list, 2 args)
The value(s) of the content type parameter(s) named $which.
Multiple values may be given in $which. Parameter names
are case-insensitive. -
contenttypesrc (string)
Returns the source of contenttype and related data, i.e.
how it was determined. One of "generated", "header",
"doctype", "metaheader", "urlpath",
"contentscan" or "unknown". Added in
version 5.01.1116341784 20050517. Also known as contenttypesource. -
cookiejar [all] [netscape4x]
The contents of the "cookie jar" (Vortex's internal cache of
cookies received or set). Returned as a Netscape-cookie-file
format text buffer. By default, only persistent (non-session)
cookies are returned, i.e. the ones to be preserved across browser
invocations. If the argument all is given, all cookies,
including session cookies, are returned. Added in version
4.01.1022000000 20020521.
In version 5.01.1244880000 20090613 and later, a new fifth column
was inserted in the output, containing the IsHttpOnly boolean
value. To obtain the Netscape-4.x-compatible format of prior
versions, set the netscape4x flag. <urlcp cookiejar>
will accept input in either format. -
domvalue $dom
Gets the value of the DOM item indicated by $dom. Note
that this is not the JavaScript DOM, but the near-parallel page
DOM. This can be used to get the submit URL and content for a
form on the page just fetched, e.g. document.forms.myForm.submitUrl
and document.forms.myForm.submitContent, after optionally
setting form input values via <urlcp domvalue>. Added in
version 5. -
downloaddoc (string or varbyte)
The network-transferred downloaded document body. This is the
same as rawdoc if the document had no content/transfer
encodings. If it did have encodings, this is the
chunked/compressed/etc. document, before decompression into
rawdoc. The downloaded document is normally discarded if
different from rawdoc, to save memory; thus it may be empty
for documents with encodings. Set <urlcp savedownloaddoc on>
(normally off) to preserve the downloaded document (at potential
cost in memory). Added in version 5.01.1249203000 20090802. See
also rawdoc, which is usually more useful. -
encodings (list)
The list of content/transfer encodings of the response document,
in the order they were applied by the server. Known encodings
(e.g. gzip) are canonicalized and lowercase. Note that
known and enabled encodings are already decoded (in reverse order)
in the <fetch> or <urlinfo rawdoc> returned document.
Added in version 5.01.1249203000 20090802. -
frames (list)
The list of frame URLs in the document. If the urlcp
setting getframes is true, the list is empty since the
frames have been fetched and appended to the document. -
iframes (list)
The list of <IFRAME> URLs in the document. If the
urlcp setting getiframes is true, the list is empty
since the iframes have been fetched and inserted into the
document. -
headers (list)
The names of HTTP headers received with the document. -
header (list, 2 args)
The full value(s) of the HTTP header(s) listed in $which.
Header names are case-insensitive. -
headervalue $hdrName
The leading value of the header(s) named $hdrName,
where the header is in semicolon-parameterized format, i.e.:
value; param1=val1; param2="val 2"; ...
Added in version 6.00.1287436000 20101018. -
headerparams $hdrName
The parameter name(s) from the semicolon-parameterized header(s)
named $hdrName. Added in version 6.00.1287436000 20101018. -
headerparam $hdrName $paramName
The parameter value(s) of the parameter(s) named $paramName
from the parallel semicolon-parameterized header(s) named
$hdrName. Added in version 6.00.1287436000 20101018. -
errnum (integer)
The Vortex fetch error code (not the HTTP or other protocol
code), indicating a problem with the fetch. This can be non-zero
even for a partially successful fetch, e.g. 15 if the page is too
big. 0 indicates a completely successful fetch. See the table of
errnum, errtoken and errmsg values below for a list of errnum codes
and what they mean. -
errtoken (string)
A string token representing the numeric errnum code, e.g.
DocNotFound for error 24 (Document not found). This can be
used in scripts as a more readable and self-documenting value than
errnum integer values, and more constant than errmsg
values (which may change in future releases). See the table of
errnum, errtoken and errmsg values below for a list of tokens and
corresponding numbers and meanings. Added in version 5.01.1246963000 20090707. -
errmsg (string)
A human-readable string description of the errnum code.
See the table of errnum, errtoken and errmsg values below for a list
of possible error messages and numbers. -
httpcode (integer)
The value of the protocol response code, if any (for HTTP or FTP).
Note that this varies depending on the fetched URL protocol; the
errnum value is more consistent. Codes for other (non-HTTP)
protocols, e.g. FTP, will differ. Typical HTTP codes and what they
mean are listed below. This is not an exhaustive list, as the
protocol code is created and sent by the web server, not Vortex:
- 200 Ok (all 2NN codes)
- 201 Created
- 202 Accepted
- 204 No Content
- 300 Redirect (all 3NN codes)
- 301 Moved permanently
- 302 Moved temporarily
- 303 See Other
- 304 Not modified
- 400 Bad client request (all 4NN codes)
- 401 Unauthorized
- 403 Forbidden
- 404 Not found
- 405 Method not allowed
- 406 Not acceptable
- 407 Proxy access unauthorized
- 408 Request timed out
- 413 Request entity too large
- 414 Request URI too large
- 500 Internal server error (all 5NN codes)
- 501 Not implemented by server
- 502 Bad gateway
- 503 Service unavailable
-
httpmsg (string)
The protocol response string, if any (HTTP or FTP). Varies by
protocol and server; check errmsg instead for more
portable (platform-independent) messages. -
images (list)
The list of image URLs in the document, e.g. <IMG> tags,
background images, etc.
-
links (list)
The list of non-image link URLs in the document, e.g.
<A HREF> tags, <FORM> tags, etc. Same as the return
value of the obsolescent urllinks function. Note that frames will
be listed as links if the urlcp setting getframes
is false, and iframes will be listed as links if getiframes
is false. -
metaheaders (list)
The names of <META HTTP-EQUIV> tags in the document. -
metaheader (list, 2 args)
The entire value(s) of the <META HTTP-EQUIV> tag(s) listed in
$which. Header names are case-insensitive. -
metaheadervalue $hdrName
The leading value of the meta header(s) named $hdrName,
where the header is in semicolon-parameterized format.
Added in version 6.00.1287436000 20101018. -
metaheaderparams $hdrName
The parameter name(s) from the semicolon-parameterized meta header(s)
named $hdrName. Added in version 6.00.1287436000 20101018. -
metaheaderparam $hdrName $paramName
The parameter value(s) of the parameter(s) named $paramName
from the parallel semicolon-parameterized meta header(s) named
$hdrName. Added in version 6.00.1287436000 20101018. -
metanames (list)
The names of <META NAME> tags in the document. -
metaname (list, 2 args)
The entire value(s) of the <META NAME> tag(s) listed in
$which. Names are case-insensitive. -
metanamevalue $hdrName
The leading value of the meta name tag(s) named
$hdrName, where the tag is in semicolon-parameterized
format. Added in version 6.00.1287436000 20101018. -
metanameparams $hdrName
The parameter name(s) from the semicolon-parameterized meta name
tag(s) named $hdrName. Added in version 6.00.1287436000
20101018. -
metanameparam $hdrName $paramName
The parameter value(s) of the parameter(s) named $paramName
from the parallel semicolon-parameterized meta name tag(s) named
$hdrName. Added in version 6.00.1287436000 20101018. -
originalurl (string)
The original URL retrieved (i.e. the one given to fetch or
submit). It may differ from the actual last URL retrieved, e.g.
if redirects were followed. Added in version 5.01.1205285000 20080311.
-
prngdpid (integer, 2 args)
The process ID of the prngd daemon (entropy gatherer)
running on Unix file pipe path $which, or 0 if none
detected. If an empty string is given, all standard paths
("/var/run/egd-pool", "/dev/egd-pool",
"/etc/egd-pool", "/etc/entropy") and the
configured path ([Texis] Entropy Pipe value in
conf/texis.ini) are checked. The prngd daemon is used
on certain Unix platforms (those without /dev/random)
to provide entropy to seed the random number generator for
the SSL/HTTPS plugin. The prngdpid value provides a way
to check if the daemon is running. Note that not all platforms
require an entropy daemon. Added in version 4.01.1031761163 20020911.
See also the entropypipe setting of urlcp. -
putmsgs (list)
The fetch-related putmsgs since the most recent
<fetch> or <submit>. When called inside a
<fetch parallel> loop, only the messages from the just-completed
fetch are returned, making disambiguation much easier than with
the standard <putmsg> function callback mechanism. If
<urlcp putmsg save> is off, no messages
will be saved or returned. The message buffer is cleared at the
start of each <fetch> or <submit>. If parsing these
messages, it may be helpful to turn off <urlcp putmsg pass>,
so that the same messages need not be seen and parsed by the
script-wide <putmsg> function callback. Added in version 6. -
processedchunks (strings or varbyte values)
The ordered list of HTML document chunks that were actually
processed during HTML parsing, if different from rawdoc.
The concatenation of these may differ from rawdoc if
JavaScript was run and modified the document; e.g. some of these
chunks may be the output of document.write() statements,
whereas rawdoc is always the static original document. May
be zero-length/empty if no HTML processing was done, e.g. for an
image. Added in version 6. -
processedchunksbufnums (list of integers)
The ordered list of buffer numbers that the corresponding
processedchunks values come from. During HTML and
JavaScript processing, a document will end up with one or more
buffers, the first of which (buffer 0) is the original static
document source itself. JavaScript processing may create further
buffers (e.g. the output of document.write()). A buffer
may end up split into multiple chunks for HTML formatting if
such JavaScript output occurs mid-buffer. For example, a
document.write() in the middle of an HTML page may result
in 3 chunks: the first part of buffer 0 (static doc), buffer 1
(generated by JavaScript), and the latter part of buffer 0 (rest
of static doc). Added in version 6. -
rawdoc (string or varbyte)
The document source (after any content/transfer encodings are decoded, if
enabled). Same as the return value of the original fetch
or submit. See also downloaddoc. -
redirs (integer)
The number of redirects encountered. -
secure (list)
Which parts of the transaction were conducted securely (via SSL).
One or more of the following values:
-
request - The final URL request to the server was
secure. -
response - The final response from the server was
secure. -
ancestors - All previous requests and responses
that led to the final fetch (i.e. earlier redirects) were
secure. -
descendants - All requests and responses made to
sub-objects on the final page (e.g. frames, scripts) were
secure. -
all - All requests and responses for the entire
transaction - ancestors (if any), final page, and descendants
(if any) - were secure.
Added in version 5.01.1184803500 20070718. Note that the
definition of "secure" for this option only applies to the
first-hop network connection (Vortex); if a proxy is used, the
transaction(s) from the proxy to the URLs may or may not be
secure. See also the insecure option. -
insecure (list)
Which parts of the transaction were insecure, i.e. not
conducted securely via SSL.
One or more of the following values:
-
request - The final URL request to the server was
insecure. -
response - The final response from the server was
insecure. -
ancestors - One or more previous requests or responses
that led to the final fetch (i.e. earlier redirects) were
insecure. -
descendants - One or more requests or responses made to
sub-objects on the final page (e.g. frames, scripts) were
insecure. -
all - The request and response for the final page
were insecure, one or more ancestors (if any) were insecure,
and one or more descendants (if any) were insecure.
Added in version 5.01.1184803500 20070718. Note that the
definition of "insecure" for this option only applies to the
first-hop network connection (Vortex); if a proxy is used, the
transaction(s) from the proxy to the URLs may or may not be
secure. See also the secure option. -
sslservercertificate (PEM string)
Returns the SSL certificate obtained from the server, in PEM
format, or empty if none (e.g. no HTTPS/SSL server contacted). If
the server is an Apache or Texis Monitor web server, this
certificate is typically from the server's
SSLCertificateFile setting. The urlutil action
sslcertificate may be
used to decode the certificate into a human-readable string
format. Note that a server certificate may sometimes be
obtainable from an HTTPS/SSL server even if the connection fails
(e.g. due to verification problems). Added in version
6.00.1320460000 20111104. -
sslclientcalist (list)
Returns the list of CA (certificate authority) certificate names
that the HTTPS/SSL server requested as acceptable issuers of the
client's certificate. (If the server is an Apache or Texis
Monitor web server, this list is typically from the server's
SSLCADNRequestFile or SSLCACertificateFile setting.)
This is a list of certificate issuers that the server indicates it
will accept as signers of the client's (Vortex fetch lib's)
certificate. In other words, the certificate set with <urlcp
sslcertificatefile> should
have been signed by one of these issuers, or the server might
reject the connection with a "Cannot complete SSL handshake:
... alert bad certificate" or similar error.
If an HTTPS/SSL server was not contacted, or the server did not
request a client (Vortex) certificate for verification, this list
may be empty. Added in version 6.00.1320460000 20111104. -
sslverifyservererrtoken (string)
The string token that identifies the reason for the <urlcp
sslverifyserver> error, i.e. the token for the reason part of
the "Cannot verify certificate from
host:port: reason at depth N"
message. If no server-certificate verification was performed
(e.g. sslverifyserver is off, or no SSL server was
contacted), the token is empty or "unknown". If
verification was performed successfully (no errors), "Ok"
is returned.
To continue to verify SSL server certificates - but ignore this
particular sub-type of verification error - this error can be
disabled by adding the token prepended with a "-" (minus
sign) to the <urlcp sslverifyserver> setting. Added in version
6.00.1320460000 20111104. The list of possible tokens is detailed
in the SSL Client/Server Certificate Verification appendix.
Note that disabling
individual sslverifyserver errors should be done with
caution, as it can weaken the security provided by those checks. -
strlinks (list)
The list of JavaScript string links. These may be unreliable or
require further processing, so they are not returned as part of
the normal links list. See also
<urlcp scriptstrlinks>. Added in version
5.00.1086804521 20040609. -
strbaseurls (list)
The list of JavaScript base URLs corresponding to strlinks.
If <urlcp scriptstrlinksabs> is off, this enables the
strlinks list to be made absolute, perhaps after some
post-processing. Added in version 5.00.1086804521 20040609. -
text (string)
The formatted text of the document. Same as the return value of the
obsolescent urltext function. -
textformatter (string)
A token describing what formatter was used to produce the
<urlinfo text> value; one of the following:
-
unknown Formatter is unknown. -
rawdoc No formatting: text is the raw document source. -
text Plain-text document formatter. -
gopher Gopher menu formatter. -
html HTML document formatter. -
frame Framed document formatter/aggregator.
Added in version 5.01.1257475000 20091105. -
title (string)
The formatted title text of the document. -
time or totaltime (double)
The total time in seconds (including fraction) to retrieve the
page. This includes DNS resolution plus content transfer time.
Added in version 3.01.966019604 20000811. -
dnstime (double)
The time in seconds (including fraction) to resolve the
hostname(s) via DNS. Added in version 3.01.966019604 20000811. -
transfertime (double)
The time in seconds (including fraction) to transfer content
to/from the web server. This is a more accurate measure of
web server throughput because it does not include the time to
resolve the hostname(s). Added in version 3.01.966019604 20000811.
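The headervalue, headerparams and headerparam entries above (and their metaheader and metaname counterparts) all rely on the semicolon-parameterized format `value; param1=val1; param2="val 2"`. The following Python sketch models how such a string splits into a leading value and named parameters. It is an illustration of the format only, not Vortex's parser: the function name `split_parameterized` is hypothetical, lowercasing of parameter names is an assumption, and backslash escapes inside quoted values are not handled.

```python
def split_parameterized(header):
    """Split 'value; p1=v1; p2="v 2"' into (value, {param_name: param_value})."""
    parts, buf, in_quotes = [], [], False
    for ch in header:
        if ch == '"':
            in_quotes = not in_quotes      # track whether we are inside a quoted value
            buf.append(ch)
        elif ch == ';' and not in_quotes:
            parts.append(''.join(buf))     # an unquoted ';' ends the current field
            buf = []
        else:
            buf.append(ch)
    parts.append(''.join(buf))             # final field has no trailing ';'

    value = parts[0].strip()               # the "leading value" of the header
    params = {}
    for part in parts[1:]:
        name, _, val = part.partition('=')
        name = name.strip().lower()        # assumption: names compared in lowercase
        if not name:
            continue                       # skip empty fields such as a trailing ';'
        val = val.strip()
        if len(val) >= 2 and val[0] == '"' and val[-1] == '"':
            val = val[1:-1]                # drop the surrounding double quotes
        params[name] = val
    return value, params


value, params = split_parameterized('text/html; charset=UTF-8; note="a; b"')
print(value)           # leading value, as headervalue would describe it
print(list(params))    # parameter names, as headerparams would describe them
print(params["note"])  # one parameter value, as headerparam would describe it
```

In this model the leading value corresponds to what headervalue describes, the dictionary keys to headerparams, and a single lookup to headerparam.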
The possible errnum, errtoken and errmsg values are:
| errnum | errtoken | errmsg |
| 0 | Ok | Ok |
| 1 | ClientErr | Unknown client error |
| 2 | ServerErr | Server error |
| 3 | UnkResponseCode | Unrecognized response code |
| 4 | UnkProtocolVersion | Unrecognized protocol version |
| 5 | ConnTimeout | Connection timeout |
| 6 | UnkHost | Unknown host |
| 7 | CannotConn | Cannot connect to host |
| 8 | NotConn | Not connected |
| 9 | CannotCloseConn | Cannot close connection |
| 10 | CannotWriteConn | Cannot write to connection |
| 11 | CannotReadConn | Cannot read from connection |
| 12 | CannotWriteFile | Cannot write to file |
| 13 | OutOfMem | Out of memory |
| 14 | PageTrunc | Page not expected size, possibly truncated |
| 15 | MaxPageSizeExceeded | Max page size exceeded, truncated |
| 16 | TooManyRedirs | Too many redirects |
| 17 | OffsiteRef | Off-site or unapproved redirect or frame |
| 18 | UnkProtocol | Unknown/unimplemented access method |
| 19 | BadParam | Bad parameter |
| 20 | UnkErr | Unknown error |
| 21 | BadRedir | Bad redirect |
| 22 | DocUnauth | Document access unauthorized |
| 23 | DocForbidden | Document access forbidden |
| 24 | DocNotFound | Document not found |
| 25 | ServerNotImplemented | Server did not recognize request (unimplemented) |
| 26 | ServiceUnavailable | Service unavailable |
| 27 | UnkMethod | Unknown request method |
| 28 | CannotReadFile | Cannot read from file |
| 29 | CannotLoadLib | Cannot load dynamic library |
| 30 | ScriptErr | Script error |
| 31 | ScriptTimeout | Script timeout |
| 32 | ScriptMemExceeded | Script memory limit exceeded |
| 33 | DisallowedProtocol | Disallowed protocol |
| 34 | SslErr | SSL error |
| 35 | ProxyUnauth | Proxy access unauthorized |
| 36 | EmbeddedSecurityChange | Embedded object security change |
| 37 | DisallowedFilePrefix | Disallowed file prefix |
| 38 | DisallowedFileType | Disallowed file type |
| 39 | DisallowedNonlocalFileUrl | Disallowed non-local file URL |
| 40 | CannotConvertCharset | Cannot convert character set |
| 41 | DisallowedAuthScheme | Disallowed authentication scheme |
| 42 | SecureTransNotPossible | Secure transaction not possible |
| 43 | UnexpectedResponseCode | Unexpected server response |
| 44 | DisallowedMethod | Disallowed request method |
| 45 | ConnUpgradeToSslRequired | Connection upgrade to SSL required |
| 46 | FetchNotPermittedByLicense | Fetch not permitted by license |
| 47 | UnknownContentEncoding | Unknown Content- or Transfer-Encoding |
| 48 | DisallowedContentEncoding | Disallowed Content- or Transfer-Encoding |
| 49 | CannotDecodeContentEncoding | Cannot decode Content- or Transfer-Encoding |
| 50 | NotAcceptable | Client-acceptable version not found |
| 51 | CannotVerifyServerCertificate | Cannot verify server certificate |
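Because errtoken values are described as more readable and more stable than errmsg strings, result handling is often easier to follow when written against tokens. The Python sketch below models that idea with a small subset of the table above. The `classify` function and the three status labels are hypothetical, and treating only MaxPageSizeExceeded as a partial success is an assumption based on the errnum entry, which names code 15 as an example of a partially successful fetch.

```python
# Subset of the errnum -> errtoken table above (copied from it, not exhaustive).
ERRTOKENS = {
    0: "Ok",
    5: "ConnTimeout",
    6: "UnkHost",
    15: "MaxPageSizeExceeded",
    16: "TooManyRedirs",
    24: "DocNotFound",
    34: "SslErr",
}

# Assumption: only the code the text names as "partially successful" is listed here.
PARTIAL = {"MaxPageSizeExceeded"}


def classify(errnum):
    """Return (errtoken, status), where status is 'ok', 'partial' or 'failed'."""
    # Placeholder string for codes that this sketch's subset does not include.
    token = ERRTOKENS.get(errnum, "(not in subset)")
    if errnum == 0:
        return token, "ok"          # 0 indicates a completely successful fetch
    if token in PARTIAL:
        return token, "partial"     # content was returned, but truncated
    return token, "failed"


for code in (0, 15, 24):
    print(code, *classify(code))
```

Comparing against token names keeps the checks self-documenting, at the cost of maintaining the lookup subset by hand.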
DIAGNOSTICS
urlinfo returns the requested value(s).
EXAMPLE
<fetch "http://www.somesite.com/mypage.html">
<urlinfo "metanames">
<$names = $ret>
Meta data:
<LOOP $names>
  <urlinfo "metaname" $names>
  $names = <LOOP $ret> "$ret" </LOOP>
</LOOP>
CAVEATS
The urlinfo function was added in version 2.1.884800000 19980114.
If submit is used with TOFILE, then content and
content-derived items such as links are unavailable in
urlinfo, because the content was not held in memory for
processing.
SEE ALSO
fetch, submit, urlcp
Copyright © Thunderstone Software Last updated: Mon Feb 18 10:28:15 EST 2013