August 2009 - Archive
- Upcoming Events
- Customer Quote of the Month
- Tech Tip: Directing your crawler with your content – using robots.txt and meta-robots
- Subscription/Unsubscription and Contacts
LATEST SEARCH APPLIANCES FEATURED IN KMWORLD
A front-page article by ArnoldIT.com's Stephen E. Arnold highlights the newest versions of the Thunderstone Search Appliance and the Google Search Appliance in the printed July/August edition of KMWorld, a publication of Information Today, Inc.
The author says Thunderstone's high-performance, flexible appliances give administrators and developers excellent control over the system – with strong document-level security, feature-rich tuning controls, and the ability to schedule, stop, pause or configure database crawls just as they can for file servers, Web servers, intranet servers, etc.
You can read the KMWorld article, entitled "Making room for appliances," online here.
THUNDERSTONE ADDS A RESELLER IN AUSTRALIA
We welcome the following organization to our growing Thunderstone Channel Partner Program:
(For Search Appliances and Webinator)
1300 737 078
Development work continues on 2009 Thunderstone Software releases of:
- TEXIS (Version 6)
- Webinator (Version 6)
- Thunderstone's Texis Catalog (eCommerce search engine for online catalogs)
"What we have in this particular case is a Native American user group thesaurus language. It's been developed, and it can be added to. The more that it's used – and you put that feedback loop back into this thesaurus – the smarter it becomes. And it starts to create, with this new millennium, a written mind that parallels the thesaurus user group's community. This is something that TEXIS is equipped to deal with that the other stuff out there is not equipped to deal with. It's part of its strength."
Chief Technology Officer
Mnemotrix Systems, Inc.
The robots exclusion convention is a way for a website to tell crawlers which parts of the site it would like them to stay out of. It is not specific to Thunderstone software; other well-behaved web crawlers honor it too. There are two ways of accomplishing this:
- With a "robots.txt" file at the root of the webserver, such as http://www.thunderstone.com/robots.txt. This can be used by site maintainers to tell crawlers to stay out of entire sections. For details of the syntax, please see http://www.robotstxt.org/.
- Within an individual HTML page, via a meta tag such as <meta name="robots" content="noindex,nofollow">. This allows page authors to control whether an individual page is indexed and whether its links are followed, without affecting the pages around it.
Note that these are guidelines – they do not create technical restrictions that prevent a crawler from descending into a directory or following links, and they should not be used for security purposes.
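As a concrete illustration, a minimal robots.txt that asks all crawlers to stay out of a hypothetical /private/ directory (the path is just an example, not one from thunderstone.com) would look like:

```
User-agent: *
Disallow: /private/
```

A crawler reading this file should skip any URL on the site whose path begins with /private/, while remaining free to fetch everything else.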
All Thunderstone products obey robots.txt and meta robots by default. Sometimes, though, you need to index content you don't control that has a robots exclusion on it. If you'd like to ignore the robots recommendations and index the content, there are Walk Settings that allow this:
- Robots.txt (Y or N) – whether or not to obey the /robots.txt on a site (if it exists). Defaults to Y.
- Meta Robots (Y or N) – whether or not to obey any meta robot headers found within an individual page. Defaults to Y.
- Robots Placeholder (Y or N) – When a page is excluded from the crawl via robots, this controls whether a "placeholder" record is kept in the crawl data. The placeholder keeps the page from being visited unnecessarily, but it can cause the page to appear in searches that match its URL (when the URL is included in Index Fields). Defaults to Y.
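To see how a crawler applies the robots.txt rules described above, here is a short sketch using Python's standard-library robots.txt parser. The crawler name "MyCrawler" and the /private/ rule are hypothetical examples, not Thunderstone settings:

```python
from urllib.robotparser import RobotFileParser

# Parse a small robots.txt; rp.read() would instead fetch
# the file from the URL set via rp.set_url().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The crawler checks each URL against the rules before fetching it.
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/report.html"))  # False
print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))           # True
```

Turning the Robots.txt walk setting off is the equivalent of skipping this check entirely and fetching every URL regardless of the rules.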
Feedback, suggestions and questions are welcome. Send your email to