|
Data is submitted to Webinator with an HTTP POST request sent to
a URL similar to that of the admin interface (e.g. http://.../dowalk),
but with /recvdata.xml appended. For example:
http://www.mysite.com/texis/webinator/dowalk/recvdata.xml
The following POST variables must be set in the request. Be sure to
URL-encode the values:
- profile
  Set to the name of the receiving profile.
- data
  Set to an XML document containing the data, and what to do with it
  (insert/delete/etc.). See below for details.
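The request described above can be sketched in Python with the standard library. This is a minimal illustration, not a definitive client: the function name build_dataload_request and the profile name myprofile are hypothetical, and the data document is a placeholder. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_dataload_request(dowalk_url: str, profile: str, xml_document: str) -> Request:
    """Build (but do not send) the POST request for a data submission."""
    # Append /recvdata.xml to the admin-interface URL.
    url = dowalk_url.rstrip("/") + "/recvdata.xml"
    # urlencode() URL-encodes the values of both POST variables.
    body = urlencode({"profile": profile, "data": xml_document}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_dataload_request(
    "http://www.mysite.com/texis/webinator/dowalk",
    "myprofile",  # hypothetical profile name
    "<ThunderstoneReplication>...</ThunderstoneReplication>",  # placeholder document
)
# To actually submit it: urllib.request.urlopen(request)
```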
Specifying all fields manually
Below is an example data document where all fields are
specified. Be sure to HTML-encode values.
<?xml version="1.0" encoding="UTF-8"?>
<ThunderstoneReplication
  xmlns:dt="urn:schemas-microsoft-com:datatypes"
>
  <Item>
    <Type>I</Type>
    <Size>150369</Size>
    <Visited>2005-10-25 15:25:18</Visited>
    <Dlsecs>0</Dlsecs>
    <Depth>0</Depth>
    <Url>http://www.mysite.com/dir/page.html</Url>
    <Title>Sprocket Specifications</Title>
    <Body>...</Body>
    <Keywords>sprockets, gears, hubs</Keywords>
    <Description>Sprocket details</Description>
    <Meta></Meta>
    <Category>Mechanical</Category>
    <Modified>2005-10-25 11:21:07</Modified>
    <NextCheck>2005-10-25 16:25:18</NextCheck>
    <Views>0</Views>
    <Clicks>0</Clicks>
    <CTR>0.000000</CTR>
    <Pop>0</Pop>
    <MimeType>text/html</MimeType>
    <Charset>UTF-8</Charset>
    <Refs dt:dt="bin.base64">...</Refs>
    <Errors dt:dt="bin.base64">...</Errors>
    <RawData dt:dt="bin.base64"></RawData>
  </Item>
</ThunderstoneReplication>
Any element whose text data might not be XML-safe (e.g. binary characters
in the <Body>) should be base64-encoded, with the attribute
dt:dt="bin.base64" set in its tag. For example, the text data of the
<Refs> and <Errors> elements is always base64-encoded. Note
that the XML namespace prefix dt should then also be set to
urn:schemas-microsoft-com:datatypes in the root
<ThunderstoneReplication> element.
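As a rough sketch of the encoding rule above, the following Python uses xml.etree.ElementTree to emit a base64-encoded element with the dt:dt attribute. The helper name add_base64_element and the sample bytes are illustrative assumptions. ElementTree writes the xmlns:dt declaration on the root element when the tree is serialized.

```python
import base64
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"
# Ask ElementTree to serialize this namespace with the "dt" prefix.
ET.register_namespace("dt", DT_NS)


def add_base64_element(parent: ET.Element, tag: str, raw: bytes) -> ET.Element:
    """Append <tag dt:dt="bin.base64"> holding the base64 form of raw."""
    child = ET.SubElement(parent, tag)
    child.set(f"{{{DT_NS}}}dt", "bin.base64")
    child.text = base64.b64encode(raw).decode("ascii")
    return child


root = ET.Element("ThunderstoneReplication")
item = ET.SubElement(root, "Item")
# Sample bytes that would not be safe as plain XML text.
add_base64_element(item, "Body", b"binary \x00\x01 chars")
document = ET.tostring(root, encoding="unicode")
```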
The elements are:
- <Type>
  The action to take with this data. Text value may be one of:
  - I   Insert the data (overwrite previous data for URL if any)
  - D   Delete the URL
  - DP  Delete the URL as a pattern (e.g. http://www.mysite.com/dir/*)
  - UI  Update search indexes (call after a batch of inserts/deletes)
- <Size>
  The integer size of the original document.
- <Visited>
  When the document was fetched, in YYYY-MM-DD HH:MM:SS format.
- <Dlsecs>
  Number of seconds taken to download the document.
- <Depth>
  Depth of URL from a Base URL, e.g. 0 is a Base URL, 1 is one click away,
  etc.
- <Url>
  The URL of the document.
- <Title>
  The title of the document.
- <Body>
  The formatted body of the document.
- <Keywords>
  Any keywords for the document.
- <Description>
  The description of the document.
- <Meta>
  Any meta data for the document.
- <Category>
  The category the document is in, if any. Must be a category name
  from the profile's Categories.
- <Modified>
  The Last-Modified date of the document,
  in YYYY-MM-DD HH:MM:SS format.
- <NextCheck>
  When the document should be refreshed,
  in YYYY-MM-DD HH:MM:SS format.
- <Views>
  Number of views of the document: how many times it's been shown in
  search results.
- <Clicks>
  Number of clicks of the document: how many times it's been clicked on
  in search results.
- <CTR>
  Click-through-ratio: floating-point number ratio of clicks to views.
- <Pop>
  Document popularity: number of references (links) to it.
- <MimeType>
  The MIME type of the content served at the URL, or provided in RawData.
- <Charset>
  Character set of <Body> data. Should correspond with the
  Storage Charset profile setting.
  If a charset other than the Storage Charset is used, it
  should be a standard IANA charset that Webinator can convert
  to the Storage Charset.
- <Refs>
  Optional element with references (child links) of the document.
- <Errors>
  Optional element with errors of the document.
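To illustrate the value formats listed above, here is a small Python sketch that fills a few of the elements of an insert <Item>. It is an assumption-laden example: the function name build_insert_item is hypothetical, only a subset of the elements is shown, and the six-decimal CTR formatting simply mirrors the sample document rather than a stated requirement. ElementTree escapes special characters in text values when serializing.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS


def build_insert_item(url, title, visited, modified, views=0, clicks=0):
    """Build an <Item> with Type I and a handful of formatted fields."""
    # CTR is the ratio of clicks to views; avoid dividing by zero.
    ctr = clicks / views if views else 0.0
    fields = [
        ("Type", "I"),
        ("Visited", visited.strftime(TIMESTAMP_FORMAT)),
        ("Url", url),
        ("Title", title),
        ("Modified", modified.strftime(TIMESTAMP_FORMAT)),
        ("Views", str(views)),
        ("Clicks", str(clicks)),
        ("CTR", f"{ctr:.6f}"),  # assumption: six decimals, as in the sample
    ]
    item = ET.Element("Item")
    for tag, text in fields:
        ET.SubElement(item, tag).text = text
    return item


item = build_insert_item(
    url="http://www.mysite.com/dir/page.html",
    title="Sprockets & Gears",  # "&" is escaped on serialization
    visited=datetime(2005, 10, 25, 15, 25, 18),
    modified=datetime(2005, 10, 25, 11, 21, 7),
    views=8,
    clicks=2,
)
xml_text = ET.tostring(item, encoding="unicode")
```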
Uploading a binary file
If you have a binary file, such as a PDF or an Office document, you
can send it with the dataload API and let Webinator extract the
text from it.
<?xml version="1.0" encoding="UTF-8"?>
<ThunderstoneReplication
  xmlns:dt="urn:schemas-microsoft-com:datatypes"
>
  <Item>
    <Type>I</Type>
    <Url>http://www.example.com/dataload.pdf</Url>
    <RawData dt:dt="bin.base64">0M8R4KGxGu....</RawData>
  </Item>
</ThunderstoneReplication>
The elements are:
- <Type>
  The action to take with this data. Text value may be one of:
  - I   Insert the data (overwrite previous data for URL if any)
- <Url>
  The URL of the document.
- <RawData>
  Element with the base64 encoding of the raw document. It must include
  the dt:dt="bin.base64" attribute.
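A binary upload document like the one above could be produced along these lines. This is a sketch under assumptions: build_binary_upload is a hypothetical helper, and the eight sample bytes (the signature at the start of a legacy Office file) stand in for a real document that would normally be read from disk.

```python
import base64
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"
ET.register_namespace("dt", DT_NS)


def build_binary_upload(url: str, raw: bytes) -> bytes:
    """Return a UTF-8 XML document with Type I, Url and base64 RawData."""
    root = ET.Element("ThunderstoneReplication")
    item = ET.SubElement(root, "Item")
    ET.SubElement(item, "Type").text = "I"
    ET.SubElement(item, "Url").text = url
    raw_data = ET.SubElement(item, "RawData")
    raw_data.set(f"{{{DT_NS}}}dt", "bin.base64")  # required attribute
    raw_data.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


# In practice: raw = open("dataload.pdf", "rb").read()
sample_bytes = bytes.fromhex("d0cf11e0a1b11ae1")
upload_document = build_binary_upload("http://www.example.com/dataload.pdf", sample_bytes)
```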
Combining the two: binary files with custom fields
It is possible to specify both a <RawData> document and
fields such as <Title>, <Description>, etc. The binary
document will be processed, and any other fields provided will
override the values that came from the document.
This can be useful when a Content Management System (CMS) holds
metadata about a document that does not actually occur anywhere in
the document itself. A custom dataload can then push in both the
document and the custom Title, Description, etc.
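One way to attach such CMS metadata is to add the extra fields to an <Item> that already carries <RawData>. The sketch below assumes a hypothetical helper apply_overrides and uses sample values; the RawData text is the base64 of a few sample bytes, not a real document.

```python
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"
ET.register_namespace("dt", DT_NS)


def apply_overrides(item: ET.Element, overrides: dict) -> None:
    """Set each override field on the item, reusing an existing element if present."""
    for tag, value in overrides.items():
        element = item.find(tag)
        if element is None:
            element = ET.SubElement(item, tag)
        element.text = value


item = ET.Element("Item")
ET.SubElement(item, "Type").text = "I"
ET.SubElement(item, "Url").text = "http://www.example.com/dataload.pdf"
raw_data = ET.SubElement(item, "RawData")
raw_data.set(f"{{{DT_NS}}}dt", "bin.base64")
raw_data.text = "0M8R4KGxGuE="  # sample bytes only, not a real document

# Metadata held by the CMS rather than found inside the file.
apply_overrides(item, {"Title": "Sprocket Specifications",
                       "Description": "Sprocket details"})
```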
Additional Fields
Each profile-specific Additional Field may optionally be sent in a
single element named after the field, with the XML namespace prefix
u. The value of the field is the content of the XML element.
Note that the u XML namespace prefix should be declared in the
root <ThunderstoneReplication> element, in the same way as the dt
prefix in the examples above.
For example, an Integer field Quantity and a
Text field State may be given as:
<u:Quantity>57</u:Quantity>
<u:State>NY</u:State>
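The element form shown above can be generated with a small formatter. This is only a sketch: additional_field_xml is a hypothetical helper, and the resulting fragments are valid only inside a document whose root declares the u prefix (the namespace URI for u is not given in this section, so none is assumed here).

```python
from xml.sax.saxutils import escape


def additional_field_xml(name: str, value) -> str:
    """Format one Additional Field as a <u:Name> element, escaping its value."""
    return f"<u:{name}>{escape(str(value))}</u:{name}>"


fragments = [
    additional_field_xml("Quantity", 57),
    additional_field_xml("State", "NY"),
]
```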
Other Details
The optional <Refs> element lists the links (references) from
the given document, for parent-child linking. Its text value is a
base64-encoded XML document with the following format when decoded:
<results xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <result>
    <Url>http://www.mysite.com/dir/page.html</Url>
    <Ref>http://www.mysite.com/dir/otherpage.html/</Ref>
  </result>
  ...
</results>
Each <Url> should be the same as the <Url> in the above
<Item> block. The <Ref> is a single link from the page.
Only one <Ref> may be listed per <result>; additional
links should be sent with additional <result> elements.
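The <Refs> value described above might be assembled as follows. This is a sketch with a hypothetical helper name, encode_refs, and sample link URLs: it writes one <result> per link, repeats the page URL in each, and base64-encodes the serialized document.

```python
import base64
import xml.etree.ElementTree as ET


def encode_refs(page_url: str, links: list) -> str:
    """Return the base64 text for a <Refs> element: one <result> per link."""
    results = ET.Element("results")
    results.set("xmlns:dt", "urn:schemas-microsoft-com:datatypes")
    for link in links:
        result = ET.SubElement(results, "result")
        ET.SubElement(result, "Url").text = page_url  # same as the Item's <Url>
        ET.SubElement(result, "Ref").text = link      # exactly one link per result
    return base64.b64encode(ET.tostring(results, encoding="utf-8")).decode("ascii")


page = "http://www.mysite.com/dir/page.html"
links = ["http://www.mysite.com/dir/a.html", "http://www.mysite.com/dir/b.html"]
refs_text = encode_refs(page, links)
```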
The optional <Errors> element contains any errors to be logged
for the document. Note that this may be empty or not present if no
errors are to be logged. Its text value is a base64-encoded XML
document with the following format when decoded:
<results xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <result>
    <Url>http://www.mysite.com/dir/page.html</Url>
    <Reason>Document not found: 404 (Not Found)</Reason>
  </result>
  ...
</results>
As with the <Refs> element, the <Url> must correspond
with the original <Item> <Url>, and multiple errors
must be listed in separate <result> elements.
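Going the other way, a sender can sanity-check an <Errors> value before submitting it. The sketch below decodes the base64 text and verifies that every <Url> matches the item's URL; decode_errors is a hypothetical helper, and treating an empty value as "no errors" follows the note above that the element may be empty.

```python
import base64
import xml.etree.ElementTree as ET


def decode_errors(encoded: str, item_url: str) -> list:
    """Return the <Reason> texts, checking each <Url> against the item's URL."""
    if not encoded or not encoded.strip():
        return []  # empty <Errors>: nothing to log
    results = ET.fromstring(base64.b64decode(encoded))
    reasons = []
    for result in results.findall("result"):
        if result.findtext("Url") != item_url:
            raise ValueError("error <Url> does not match the item's <Url>")
        reasons.append(result.findtext("Reason", default=""))
    return reasons


sample_errors = base64.b64encode(
    b'<results xmlns:dt="urn:schemas-microsoft-com:datatypes">'
    b"<result><Url>http://www.mysite.com/dir/page.html</Url>"
    b"<Reason>Document not found: 404 (Not Found)</Reason></result>"
    b"</results>"
).decode("ascii")
reasons = decode_errors(sample_errors, "http://www.mysite.com/dir/page.html")
```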
Copyright © Thunderstone Software Last updated: Thu Dec 22 14:38:01 EST 2011
|