Submission Format

Note: This documentation is for an old version of Webinator. The latest documentation is here.

Data is submitted to Webinator with an HTTP POST request. The request URL is the same as the admin interface URL (eg. http://.../dowalk), but with /recvdata.xml appended. Eg.:

http://www.mysite.com/texis/webinator/dowalk/recvdata.xml

The following POST variables must be set in the request. Be sure to URL-encode the values:

profile Set to the name of the receiving profile.
data Set to an XML document containing the data, and what to do with it (insert/delete/etc.). See below for details.
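
As a minimal sketch of the request described above, the following Python builds (but does not send) the POST using only the standard library. The profile name myprofile and the empty placeholder document are assumptions for illustration; the endpoint is the example URL from this page.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_submission(endpoint: str, profile: str, xml_document: str) -> Request:
    """Build (but do not send) the POST request for a data submission."""
    # urlencode() URL-encodes both values, as the text above requires.
    body = urlencode({"profile": profile, "data": xml_document}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_submission(
    "http://www.mysite.com/texis/webinator/dowalk/recvdata.xml",
    "myprofile",  # hypothetical profile name
    # Placeholder document; a real one contains <Item> elements.
    '<?xml version="1.0" encoding="UTF-8"?><ThunderstoneReplication/>',
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the encoded body before anything reaches the server.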

Specifying all fields manually

Below is an example data document where all fields are specified. Be sure to HTML-encode the values, ie. escape characters such as & and < as entities.

<?xml version="1.0" encoding="UTF-8"?>
<ThunderstoneReplication
      xmlns:dt="urn:schemas-microsoft-com:datatypes"
>
  <Item>
    <Type>I</Type>
    <Size>150369</Size>
    <Visited>2005-10-25 15:25:18</Visited>
    <Dlsecs>0</Dlsecs>
    <Depth>0</Depth>
    <Url>http://www.mysite.com/dir/page.html</Url>
    <Title>Sprocket Specifications</Title>
    <Body>...</Body>
    <Keywords>sprockets, gears, hubs</Keywords>
    <Description>Sprocket details</Description>
    <Meta></Meta>
    <Category>Mechanical</Category>
    <Modified>2005-10-25 11:21:07</Modified>
    <NextCheck>2005-10-25 16:25:18</NextCheck>
    <Views>0</Views>
    <Clicks>0</Clicks>
    <CTR>0.000000</CTR>
    <Pop>0</Pop>
    <MimeType>text/html</MimeType>
    <Charset>UTF-8</Charset>
    <Refs dt:dt="bin.base64">...</Refs>
    <Errors dt:dt="bin.base64">...</Errors>
    <RawData dt:dt="bin.base64"></RawData>
  </Item>
</ThunderstoneReplication>

Any element whose text data might not be XML-safe (eg. binary chars in the <Body>) should be base64-encoded, with the attribute dt:dt="bin.base64" set in the tag. Eg. the text data of the <Refs> and <Errors> elements is always base64-encoded. Note that the XML namespace prefix dt must then also be bound to urn:schemas-microsoft-com:datatypes in the root <ThunderstoneReplication> element.
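
The base64 rule above can be sketched in Python with the standard xml.etree.ElementTree module. The helper name add_base64_child is an assumption made for this illustration, not part of Webinator.

```python
import base64
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"
# Make ElementTree serialize this namespace with the "dt" prefix.
ET.register_namespace("dt", DT_NS)


def add_base64_child(parent: ET.Element, tag: str, raw: bytes) -> ET.Element:
    """Append <tag dt:dt="bin.base64"> holding the base64 form of raw."""
    child = ET.SubElement(parent, tag)
    child.set(f"{{{DT_NS}}}dt", "bin.base64")
    child.text = base64.b64encode(raw).decode("ascii")
    return child


root = ET.Element("ThunderstoneReplication")
item = ET.SubElement(root, "Item")
add_base64_child(item, "Body", b"text with a stray \x00 control byte")

# ElementTree declares xmlns:dt on the root element when serializing.
document = ET.tostring(root, encoding="unicode")
```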

The elements are:

<Type> The action to take with this data. Text value may be one of:
- I Insert the data (overwrite previous data for URL if any)
- D Delete the URL
- DP Delete the URL as a pattern (eg. http://www.mysite.com/dir/*)
- UI Update search indexes (call after a batch of inserts/deletes)
<Size> The integer size of the original document.
<Visited> When the document was fetched, in YYYY-MM-DD HH:MM:SS format.
<Dlsecs> Number of seconds taken to download the document.
<Depth> Depth of URL from a Base URL, eg. 0 is a Base URL, 1 is one click away, etc.
<Url> The URL of the document.
<Title> The title of the document.
<Body> The formatted body of the document.
<Keywords> Any keywords for the document.
<Description> The description of the document.
<Meta> Any meta data for the document.
<Category> The category the document is in, if any. Must be a category name from the profile's Categories.
<Modified> The Last-Modified date of the document, in YYYY-MM-DD HH:MM:SS format.
<NextCheck> When the document should be refreshed, in YYYY-MM-DD HH:MM:SS format.
<Views> Number of views of the document: how many times it's been shown in search results.
<Clicks> Number of clicks of the document: how many times it's been clicked on in search results.
<CTR> Click-through-ratio: floating-point number ratio of clicks to views.
<Pop> Document popularity: number of references (links) to it.
<MimeType> The MIME type of the content served at the URL, or provided in RawData.
<Charset> Character set of <Body> data. Should correspond to the profile's Storage Charset setting. If a charset other than the Storage Charset is used, it should be a standard IANA charset that Webinator can convert to the Storage Charset.
<Refs> Optional element with references (child links) of the document.
<Errors> Optional element with errors of the document.
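
The date and ratio fields above can be produced as in the following sketch. The helper name tracking_fields is hypothetical, and the six-decimal CTR formatting simply mirrors the example document; this page does not state that it is required.

```python
from datetime import datetime, timedelta

# YYYY-MM-DD HH:MM:SS, as used by <Visited>, <Modified> and <NextCheck>.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def tracking_fields(visited: datetime, refresh_after: timedelta,
                    views: int, clicks: int) -> dict:
    """Return text values for the date and click-tracking elements."""
    # CTR is the ratio of clicks to views; avoid dividing by zero.
    ctr = clicks / views if views else 0.0
    return {
        "Visited": visited.strftime(TIMESTAMP_FORMAT),
        "NextCheck": (visited + refresh_after).strftime(TIMESTAMP_FORMAT),
        "Views": str(views),
        "Clicks": str(clicks),
        "CTR": f"{ctr:.6f}",
    }


fields = tracking_fields(
    datetime(2005, 10, 25, 15, 25, 18),
    timedelta(hours=1),
    views=0,
    clicks=0,
)
```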

Uploading a binary file

If you have a binary file, such as a PDF or an Office document, you can send it with the dataload API and let Webinator extract the text from it.

<?xml version="1.0" encoding="UTF-8"?>
<ThunderstoneReplication
      xmlns:dt="urn:schemas-microsoft-com:datatypes"
>
    <Item>
        <Type>I</Type>
        <Url>http://www.example.com/dataload.pdf</Url>
        <RawData dt:dt="bin.base64">0M8R4KGxGu....</RawData>
    </Item>
</ThunderstoneReplication>

The elements are:

<Type> The action to take with this data. Text value may be one of:
- I Insert the data (overwrite previous data for URL if any)
<Url> The URL of the document.
<RawData> The base64 encoding of the raw document. The element must include the dt:dt="bin.base64" attribute.
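
A binary upload document like the one above could be generated as follows. This is a sketch under assumptions: build_binary_item is a hypothetical helper, and the byte string stands in for a real file's contents (eg. the result of pathlib.Path(...).read_bytes()).

```python
import base64
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"
ET.register_namespace("dt", DT_NS)


def build_binary_item(url: str, raw: bytes) -> str:
    """Return a data document that inserts one binary file."""
    root = ET.Element("ThunderstoneReplication")
    item = ET.SubElement(root, "Item")
    ET.SubElement(item, "Type").text = "I"  # insert
    ET.SubElement(item, "Url").text = url
    raw_data = ET.SubElement(item, "RawData")
    raw_data.set(f"{{{DT_NS}}}dt", "bin.base64")
    raw_data.text = base64.b64encode(raw).decode("ascii")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


document = build_binary_item(
    "http://www.example.com/dataload.pdf",
    b"%PDF-1.4 placeholder bytes",  # stand-in for real file contents
)
```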

Combining the two: binary files with custom fields

It is possible to specify both a <RawData> document, and fields such as <Title>, <Description>, etc. The binary document will be processed, and any other fields provided will override the values that came from the document.

This can be useful in situations where you have a Content Management System (CMS) that contains metadata about a document that doesn't actually occur anywhere in the document. You can do a custom dataload that pushes in the document, and the custom Title/Description/etc.
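
As an illustration of the CMS case, the sketch below adds override fields to an existing binary-only document. The helper name add_overrides and the sample values are assumptions; the short base64 string is a placeholder, not a real PDF.

```python
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"
# Keep the "dt" prefix when the document is parsed and re-serialized.
ET.register_namespace("dt", DT_NS)


def add_overrides(document: str, overrides: dict) -> str:
    """Add fields such as Title or Description to the document's <Item>."""
    root = ET.fromstring(document)
    item = root.find("Item")
    for tag, value in overrides.items():
        # ElementTree escapes &, < and > in the text automatically.
        ET.SubElement(item, tag).text = value
    return ET.tostring(root, encoding="unicode")


binary_only = (
    '<ThunderstoneReplication xmlns:dt="urn:schemas-microsoft-com:datatypes">'
    "<Item><Type>I</Type>"
    "<Url>http://www.example.com/dataload.pdf</Url>"
    '<RawData dt:dt="bin.base64">JVBERi0xLjQ=</RawData>'
    "</Item></ThunderstoneReplication>"
)
combined = add_overrides(
    binary_only,
    {"Title": "Sprocket Specifications", "Description": "Sprocket details"},
)
```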

Additional Fields

Each profile-specific Additional Field is optionally sent in a single element named after the field, with the XML namespace prefix u. The value of the field is the content of the XML element. Note that the u XML namespace prefix must be declared in the root <ThunderstoneReplication> element, in the same way as the dt prefix.

For example, an Integer field Quantity and a Text field State may be given as:

<u:Quantity>57</u:Quantity>
<u:State>NY</u:State>
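
The following sketch generates such elements with the standard library. Note the assumption: this section does not give the namespace URI for the u prefix, so U_NS below is only a placeholder and must be replaced with the URI your Webinator version expects. The helper name add_additional_fields is also hypothetical.

```python
import xml.etree.ElementTree as ET


def add_additional_fields(item: ET.Element, u_namespace: str, fields: dict) -> None:
    """Append one namespaced element per Additional Field."""
    for name, value in fields.items():
        ET.SubElement(item, f"{{{u_namespace}}}{name}").text = str(value)


# PLACEHOLDER URI: the real value is not stated in this section.
U_NS = "urn:example:additional-fields"
ET.register_namespace("u", U_NS)

root = ET.Element("ThunderstoneReplication")
item = ET.SubElement(root, "Item")
add_additional_fields(item, U_NS, {"Quantity": 57, "State": "NY"})

# The xmlns:u declaration is emitted on the root element.
document = ET.tostring(root, encoding="unicode")
```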

Other Details

The optional <Refs> element lists the links (references) from the given document, for parent-child linking. Its text value is a base64-encoded XML document with the following format when decoded:

<results xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <result>
    <Url>http://www.mysite.com/dir/page.html</Url>
    <Ref>http://www.mysite.com/dir/otherpage.html/</Ref>
  </result>
  ...
</results>

Each <Url> should be the same as the <Url> in the above <Item> block. The <Ref> is a single link from the page. Only one <Ref> may be listed per <result>; additional links should be sent with additional <result> elements.
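
One way to produce the <Refs> text value is sketched below: one <result> per link, each repeating the page URL, with the whole document then base64-encoded. The function name encode_refs and the second link are assumptions for illustration.

```python
import base64
from xml.sax.saxutils import escape


def encode_refs(page_url: str, links: list) -> str:
    """Return the base64 text value for a <Refs> element."""
    # One <result> per link; each repeats the page's own URL.
    rows = "".join(
        f"<result><Url>{escape(page_url)}</Url><Ref>{escape(link)}</Ref></result>"
        for link in links
    )
    xml = f'<results xmlns:dt="urn:schemas-microsoft-com:datatypes">{rows}</results>'
    return base64.b64encode(xml.encode("utf-8")).decode("ascii")


refs_value = encode_refs(
    "http://www.mysite.com/dir/page.html",
    [
        "http://www.mysite.com/dir/otherpage.html/",
        "http://www.mysite.com/dir/third.html?a=1&b=2",  # hypothetical link
    ],
)
```

The returned string is what goes inside <Refs dt:dt="bin.base64">...</Refs>.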

The optional <Errors> element contains any errors to be logged for the document. Note that this may be empty or not present if no errors are to be logged. Its text value is a base64-encoded XML document with the following format when decoded:

<results xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <result>
    <Url>http://www.mysite.com/dir/page.html</Url>
    <Reason>Document not found: 404 (Not Found)</Reason>
  </result>
  ...
</results>

As with the <Refs> element, the <Url> must correspond with the original <Item> <Url>, and multiple errors must be listed in separate <result> elements.
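
To check an <Errors> value before submitting it, the payload can be decoded and each <Url> compared against the item's URL. This is a sketch only; decode_errors is a hypothetical helper, not part of Webinator.

```python
import base64
import xml.etree.ElementTree as ET


def decode_errors(encoded: str, item_url: str) -> list:
    """Decode an <Errors> text value and return its reasons."""
    # An empty value means there are no errors to log.
    if not encoded.strip():
        return []
    results = ET.fromstring(base64.b64decode(encoded))
    reasons = []
    for result in results.findall("result"):
        if result.findtext("Url") != item_url:
            raise ValueError("error <Url> does not match the <Item> <Url>")
        reasons.append(result.findtext("Reason"))
    return reasons


payload = base64.b64encode(
    b'<results xmlns:dt="urn:schemas-microsoft-com:datatypes">'
    b"<result><Url>http://www.mysite.com/dir/page.html</Url>"
    b"<Reason>Document not found: 404 (Not Found)</Reason></result>"
    b"</results>"
).decode("ascii")

reasons = decode_errors(payload, "http://www.mysite.com/dir/page.html")
```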