
Webinator as a customizable way to add vertical search engines to multiple industry web portals

October 3, 2007

When implementing an optimal solution for the heavy search demands of multiple online properties, a website administrator needs a practical way to easily create and provide a high quality retrieval interface to collections of HTML documents. In this article we review how Trade Press Publishing successfully added powerful and flexible vertical search engines to its popular web portals with the help of Thunderstone's Webinator web index and retrieval system. Webinator serves as an example of the type of applications that can be built around Thunderstone's Texis RDBMS and Web Script.

Trade Press Publishing Corporation (http://www.tradepress.com) is a privately held company based in Milwaukee and a leading provider of market intelligence to the facilities management, building service contractor, housekeeping, cleaning supplies distribution and railroad industries. In addition to publishing business-to-business magazines and eNewsletters, it produces trade shows and conferences and offers a variety of related educational and marketing opportunities.

Jesus Carrillo, Director of Information Technology, joined the company's Pre-Press Division more than 16 years ago. According to him, “I started in the Desktop Publishing Department at an entry-level position that was my first job out of college. And I've been at the same place ever since. The company grew. About ten years ago they dissolved the pre-press part of the business to focus on educational media products and business-to-business publishing. They wanted someone to lead their technology efforts, and they asked me to do that. So, I stayed around and have continued to search out technology applications in the b-to-b publishing space.”

Special Requirements to Index and Search Industry-Focused Web Content

Trade Press Publishing Corporation uses Webinator on four “vertical portal” web sites: two in the facility management space and two in the sanitary distribution/cleaning space. The main site is FacilityZone.com (http://www.facilityzone.com). Carrillo said the main reason he selected Webinator as the indexing and searching tool for Trade Press was its open-ended customizability.

According to Carrillo, “Probably the single identifying characteristic of the Webinator software, for us, was the ability to get to the source code. And that allowed us the flexibility to put it to work to do the things that we wanted to accomplish with its back-end. For example, we were indexing over six thousand web sites, which is quite a bit of data. And the first results that came up were kind of cool. We could see how, out of these millions of pages, you do a search, and there's some logic in there that says 'these are the ten best ones' out of the millions of pages I've got.

“Taking a closer look at them, we felt that to really bring it to a marketplace and have it be as meaningful as possible to our end users -- we needed to go in and play with a lot of the settings to get the search engine to produce the particular kind of results we believed that our users would typically want to find. The straight-out-of-the-box algorithm for searching didn't have an immediate correlation to precisely what we thought our users would be looking for.

“We spent some pretty significant time working with Thunderstone's tech support, doing tests and evaluations and changes and modifications, trial and error, to get things to the point where now it seems on a regular basis the terms we're punching in are getting the types of results that we know will make our users happy,” Carrillo explained.

Webinator's user features include:

  • simple navigation
  • intelligent query capabilities with natural language processes, special pattern matchers (regular expressions, numeric quantities, fuzzy patterns), document similarity searches, in-context result listings, link reference reports, proximity controls and set logic

Carrillo continued, “The setup and deployment of Webinator is extremely easy and straightforward. All the core functionality is there plus the ability to access the source code and be as creative and as customized as your capabilities will let you be. In other words, Thunderstone doesn't hold you back. Thunderstone lets you take the product to whatever level you're ready, willing and able to take it. For that reason we've stuck with it, we've used it, and it's been great in that regard. That's not something you're going to get from the Googles of the world.”

‘Locked Box’ Approach of Others Inadequate

“We took a look at the Google appliance. It was brand new at the time. And the reason we didn't go with the Google appliance was we had no control over it. No flexibility. No ability to customize. Basically it was a ‘locked box’ sitting in our office, you know? And that's really not the way we wanted to go about it. We've got technical expertise on staff. We can go in. We can study and learn the scripts. We can make our modifications.

“For instance, when you execute a search on facilityzone.com -- it executes a search first off of a SQL Server database that we've got on our end. Then it goes and executes it against the Webinator database and combines the two sets of results. So, we've got results that are built into a page that kind of fall on top of the results that come out of the search engine. There's no way that you're going to be able to do that with the Google Search Appliance. You just won't be able to.

“The access to the source code and the flexibility of Webinator were definitely both something of value to us. Basically, we could not have done what we did without it. Working with the tech support team at Thunderstone, we have access to people who will actually call you back and work with you on some crazy questions.”
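The two-stage search Carrillo describes, with in-house results placed on top of the search engine's results, can be pictured roughly as follows. This is only a minimal sketch in Python and not Trade Press's code: the function name `combine_results`, the dictionary fields and the sample hits are all hypothetical, and the two input lists merely stand in for rows from the in-house SQL Server database and for hits returned by Webinator.

```python
def combine_results(curated_hits, crawled_hits, limit=10):
    """Return curated hits first, then crawled hits, dropping repeated URLs."""
    seen_urls = set()
    combined = []
    for hit in list(curated_hits) + list(crawled_hits):
        if hit["url"] in seen_urls:
            continue  # keep only the first (higher-priority) copy of a URL
        seen_urls.add(hit["url"])
        combined.append(hit)
    return combined[:limit]


# Hypothetical stand-ins for the two sources named in the article.
curated_hits = [
    {"url": "http://example.com/hvac-guide", "source": "in-house database"},
]
crawled_hits = [
    {"url": "http://example.com/hvac-guide", "source": "web index"},
    {"url": "http://example.org/boiler-maintenance", "source": "web index"},
]

page = combine_results(curated_hits, crawled_hits)
for hit in page:
    print(hit["source"], hit["url"])
```

Putting the curated list first is one simple way to make those entries "fall on top" of the crawled results; a real deployment might instead interleave or re-rank the two sets.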

“We're hoping that Thunderstone will continue to be a leader and help pave the way for how search technology is going to evolve. We'd like to take advantage of the new applications that Thunderstone develops and apply them to our industries and to our users,” Carrillo said.

Web portal administrators looking for a web walking and indexing package to help them add vertical search engines to multiple online properties will appreciate the fact that Thunderstone's Webinator:

  • indexes multiple sites into one common index
  • offers administrators detailed verification and logging of document linkages
  • can index/update documents while the database is in use
  • permits multiple databases at a site
  • features a simple browser interface
  • is written in Texis Web Script for complete flexibility
  • provides an SQL query interface to the database for maintenance and reports
  • allows remote sites to be copied to the local file system
  • lets multiple index engines run concurrently against a common database

Change In Daylight Saving Rules

September 28, 2007

Are Thunderstone Software's products affected?

Texis and Webinator do not require any patches to accommodate the upcoming changes to daylight saving rules, because they store dates in UTC and use the configured timezone only when importing and outputting dates. Daylight saving tracking is provided by the operating system, and Texis respects the OS configuration. However, your OS may need patching if it is not already configured to handle the new rules.

Thunderstone will be issuing patches for the Search Appliance on March 1 (possibly earlier). Use your appliance's "Maintenance->Check for updates" feature if you haven't configured it for automatic updates. Select the package called "timezone-1.0.0".

Are the dates in my database correct?

If you have used, or are using, Texis on an unpatched OS, and you import or convert string dates whose values fall between the new and old DST changeover dates (e.g. Sun Mar 11 02:00:00 2007 and Sun Apr 01 02:00:00 2007 local time for the US) in any year after 2006, the imported value will be based on the prior rules because the OS parsed it incorrectly, and it will be output differently after the patch. You will need to re-import or re-convert those dates after patching your OS.

For example the string "2007-03-20 16:30:00" (4:30pm on March 20, 2007) imported to a date field on an unpatched operating system configured for a US timezone will print as "2007-03-20 17:30:00" (5:30pm on March 20, 2007) after patching.
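The one-hour shift in this example can be reproduced with a small Python sketch. It does not use Texis; it only mimics the sequence described above, on the assumption of a US Eastern timezone. The two fixed offsets are illustrative stand-ins for what an unpatched OS (still on standard time on March 20, 2007) and a patched OS (already on daylight time) would apply on that date.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"

# Assumed offsets for US Eastern on 2007-03-20 (illustrative stand-ins):
EST = timezone(timedelta(hours=-5), "EST")  # unpatched OS: old rules, no DST yet
EDT = timezone(timedelta(hours=-4), "EDT")  # patched OS: new rules, DST in effect

text = "2007-03-20 16:30:00"

# Import on the unpatched system: the string is read as standard time
# and stored internally as UTC.
stored_utc = datetime.strptime(text, FMT).replace(tzinfo=EST).astimezone(timezone.utc)

# Output on the patched system: the same stored UTC value is rendered
# with the daylight-time offset.
printed = stored_utc.astimezone(EDT).strftime(FMT)

print(stored_utc.strftime(FMT))  # 2007-03-20 21:30:00 (UTC)
print(printed)                   # 2007-03-20 17:30:00
```

The stored UTC value never changes; only the offset used to interpret and display it does, which is why re-importing the original strings after patching corrects the data.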

Updating your operating system

Unix and Linux

Your best bet would be to check with your OS distribution vendor for timezone or tzdata updates. Here are links to a few popular ones.
For Solaris see Sun Alert ID 102775.
For RedHat see knowledge base article 7909.
For SUSE see document 3853518.
For do-it-yourselfers, check out tzdata.tar.gz at ftp://elsie.nci.nih.gov/pub/ for new timezone data. Be sure to link or copy your timezone file to /etc/localtime or whatever's appropriate for your system.

Microsoft Windows

Visit windowsupdate.com to download Update KB928388 in the Optional category. Note that this fix is NOT included in automatic updates. For full details from Microsoft visit http://www.microsoft.com/windows/timezone/dst2007.mspx.

Testing compliance for Texis and Webinator

Note: The following tests assume a US timezone that used and will use the conventional DST rules. Some localities and other countries follow different rules.
The lines here may wrap to fit the page. Enter each command on one long line.

Open a shell or MS-DOS window and cd to the Texis install directory. Then run the following command (on Windows, use "texis" instead of "bin/texis" in the examples below; see note 1):

bin/texis -h -d texis/testdb -s "select convert('2007-03-11 03:01:00','date')-convert('2007-03-11 01:59:00','date')"

This test compares one minute before and one minute after the new transition time.
Unpatched, you should get "3720" (62 minutes, in seconds). Patched, you should get "120" (2 minutes).

Then run

bin/texis -h -d texis/testdb -s "select convert('2007-04-01 03:01:00','date')-convert('2007-04-01 01:59:00','date')"

This test compares one minute before and one minute after the old transition time.
Unpatched, you should get "120" (2 minutes, in seconds). Patched, you should get "3720" (62 minutes).
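To see where the values 120 and 3720 come from, the arithmetic behind both tests can be sketched in Python without any timezone database. This is an illustration under stated assumptions, not a description of how Texis works internally: it assumes US Eastern time, models only the spring changeover (second Sunday in March under the new rules, first Sunday in April under the old ones), and all function names are made up for the example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def nth_sunday_2am(year, month, n):
    """Local 02:00 on the n-th Sunday of the given month."""
    first = datetime(year, month, 1, 2, 0)
    days_to_sunday = (6 - first.weekday()) % 7  # Monday is 0, Sunday is 6
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))


def dst_start(year, patched):
    """New rule: second Sunday in March. Old rule: first Sunday in April."""
    return nth_sunday_2am(year, 3, 2) if patched else nth_sunday_2am(year, 4, 1)


def to_utc(text, patched):
    """Convert a US Eastern local time string to UTC (spring change only)."""
    local = datetime.strptime(text, FMT)
    offset_hours = -4 if local >= dst_start(local.year, patched) else -5
    return local - timedelta(hours=offset_hours)


def gap_seconds(later, earlier, patched):
    """Seconds between two local times after converting both to UTC."""
    return int((to_utc(later, patched) - to_utc(earlier, patched)).total_seconds())


for label, later, earlier in [
    ("new transition", "2007-03-11 03:01:00", "2007-03-11 01:59:00"),
    ("old transition", "2007-04-01 03:01:00", "2007-04-01 01:59:00"),
]:
    unpatched = gap_seconds(later, earlier, patched=False)
    patched = gap_seconds(later, earlier, patched=True)
    print(label, "unpatched:", unpatched, "patched:", patched)
# new transition unpatched: 3720 patched: 120
# old transition unpatched: 120 patched: 3720
```

When a changeover falls between the two times, the clock skips an hour, so 62 minutes of wall-clock difference is only 2 minutes of real time (120 seconds); otherwise the full 62 minutes (3720 seconds) elapse.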

 


1. Particularly old installations of Texis may not have texis.exe in the installation directory but only in the web server's CGI directory. In that case use the full path to texis.exe to run it.

Confirming updates on the Search Appliance

After installing the timezone-1.0.0 package via "Check For Updates" confirm the installation by going to "Maintenance->Manage Logs" and clicking on "messages". You should see something similar to the following, but with your machine name and timezone (note that the lines are in reverse chronological order).

Feb 27 14:12:58 host logger: timezone finished
Feb 27 14:12:58 host logger: patch ok
Feb 27 14:12:57 host logger: Your timezone is America/New_York
Feb 27 14:12:57 host logger: Installing updated timezone info
Feb 27 14:12:57 host logger: timezone-1.0.0-1
Feb 27 14:12:57 host logger: Preparing packages for installation...

If necessary you can adjust your timezone setting via "Maintenance->Webmin Interface->Change Time Zone".

Further questions

Please contact Thunderstone Support if you have questions.

Webinator as an indexing and retrieval tool for creating vertical search portals on network hubs

September 21, 2007

Ecological Internet (EI) maintains up-to-date climate, forests and environment portals that serve more than 35,000 visitors a day. By implementing Thunderstone's Webinator, EI enables its website users to search the indexed content of five million URLs and quickly retrieve the desired information.

Why Ecological Internet?

Having earned his B.A. degree in Political Science at Marquette University, Glen R. Barry joined the Peace Corps and went to Papua New Guinea, where he fell in love with the rainforests while witnessing the tragedy of their extensive destruction for the sake of making cardboard boxes and similar products.

According to him, “During my Peace Corps service in Papua New Guinea from about 1990 I became an early adopter of the Internet and began looking seriously at how networking technologies could be used to facilitate environmental conservation. In the early days of the Internet I was struck by the fact that communication between people anywhere in the world could be used to spread information that would lead to better resource management decisions and better conservation decisions.”

After returning from the Peace Corps he completed an M.S. degree in Conservation Biology and Sustainable Development, as well as a Ph.D. in Land Resources, both from the University of Wisconsin-Madison. His primary research revolved around the creation and maintenance of environmental web portals such as Forests.org -- which became one of the first 10,000 web sites on the Internet. Dr. Barry's Ph.D. dissertation was entitled Global Forests and the Internet: Assessing the Reach and Usefulness of the Forest Conservation Portal.

In 1999 he decided to add search capabilities to Forests.org, while also launching a climate site and an environmental sustainability site.

Customized Search Engine for Web Sites

Dr. Barry explained, “We wanted to be able to make our own customized search engine. We preferred an off-the-shelf solution that we could easily install to crawl, index, search and retrieve content from more than 4,000 reviewed scientific-content sites of interest to our target audience of conservation professionals. I remember searching on the Internet and finding a huge list of spidering and robot software that had about a hundred products on it. A lot of them were ‘open source,’ with little snippets of code. I was more concerned with having a fully implemented product that does what you need it to do. I wasn't interested in doing an open source sort of thing. Where do you go for technical support in those situations? Going through the list, most of them weren't fully implemented packages. Many of them were free, but the amount of time that a small organization would need to spend getting them operational would have offset any cost benefits. There were a few other options, but they were going to be much more expensive than Webinator.

“At that time our entire budget was like fifteen thousand dollars a year (even now it's only about seventy-five thousand dollars a year in mostly $25-$100 donations). So, we're a really small organization. We chose Webinator. I think our initial license with Thunderstone was eight thousand dollars, which was a major purchase for us. It was a big deal. We were trying to do something that hadn't been done before. We had a vision that we wanted to create a specialized search engine on forests content, on climate change content and on water conservation content. The whole purchasing and installation process was straightforward. And Webinator was very, very stable. It just ran. I'm using it on a Windows platform. My operating system is Windows.

“We wanted to walk about four thousand sites we were feeding, and then we also wanted to do off-site pages. Here's where I think customized search is so good. Not only are we getting the content of the reviewed four thousand sites that I as a scientist have identified, but also each of those sites has links to other sites that are included in our index. So, you have some synergy where you find unexpected things at other good sites. Webinator is a really well thought-out product that has a lot of different tools built into it. It's a full-functioning web indexing and retrieval package. You can even include or exclude specified external links. For instance, we don't want Greenpeace's online store and merchandise in our search engine.”

Network “Hubs” to Support Environmental Professionals

Ecological Internet (EI) does not directly focus on the general audience that's looking for fluffy pictures of panda bears. There are other web sites that do that very well. EI's target audience is primarily conservation professionals who need information retrieval tools and who seek useful data to factually support their own work. These people tend to be already highly motivated on the issues, and what they get from Ecological Internet are practical tools to do their work better.

Dr. Barry had been employed in the biology department at the University of Wisconsin as their ‘bioinformatics person’ until he left several years ago to run Denmark, Wisconsin-based Ecological Internet, Inc. (http://www.ecologicalinternet.org) on a full-time basis.

“There's a whole branch of science, network science, that over the last decade has studied how diseases spread or how the Internet's organized in a ‘hub’ design comprised of nodes with disproportionately high numbers of links to them. It's like the whole Kevin Bacon ‘six degrees of separation.’ We're all networked, and there are hubs. The Internet is a good demonstration of a lot of these networks. What we tried to do with Ecological Internet was to make a network hub on climate change, a network hub on forests, etc. where all of the best content is linked, indexed and made available in support of intelligent activities to protect the environment. Part of this is awareness, but it's awareness with a purpose to actually achieving something. There is reason to be hopeful. The forces of ignorance and corruption are ominous, but we have new tools - like Webinator - that we've never had before,” said Dr. Barry.

He continued, “I went up there to Thunderstone's headquarters in Cleveland, Ohio to participate in a Webinator training program two years ago. I had already been using the product for six years. During this whole time I think that the Thunderstone Software team has always been very responsive. I don't know of any other comparable product that brings full-text customized search to non-profits at a reasonable price. We wholeheartedly support Thunderstone and would recommend the Webinator search platform highly.”

Ecological Internet (EI) now maintains up-to-date climate, forests and environment portals that serve more than 35,000 visitors a day. By implementing Webinator, EI enables its website users to search the indexed content of five million URLs and quickly retrieve the desired information.

The nonprofit's conservation portals currently include:

EcoEarth.Info (http://www.ecoearth.info)
ClimateArk.org (http://www.climateark.org)
WaterConserve.org (http://www.waterconserve.org)
Forests.org (http://www.forests.org)
OceanConserve.org (http://www.oceanconserve.org)
My.EcoEarth.Info (http://my.ecoearth.info)

Ecological Internet Portals Tap Thunderstone To Search 5 Million URLs

August 24, 2007

Nonprofit's Conservation Websites Inform More than 35,000 Visitors a Day

CLEVELAND, OH -- August 24, 2007 -- Thunderstone Software LLC announced today that Denmark, Wisconsin-based Ecological Internet, Inc. has renewed its license of Thunderstone's Webinator Web Index & Retrieval System to continue offering industry-leading search capabilities on all its environment conservation websites, including the highly popular http://www.ecoearth.info and http://www.climateark.org.

Ecological Internet, Inc. is a non-profit organization specializing in the use of the Internet to achieve conservation outcomes. As part of its mission it seeks:

  • to empower the global movement for environmental sustainability by working to conserve climate, forest, ocean and water ecosystems
  • to commence the age of ecological sustainability and restoration
  • to provide forest/rainforest, climate, water, ocean and environment conservation websites—presented as a free service to the environmental community.

 

Ecological Internet maintains up-to-date climate, forests and environment portals that serve more than 35,000 visitors a day. It enables its website users to quickly search the indexed content of five million URLs and retrieve the desired information.

Thunderstone's Webinator Web Index & Retrieval System allows a website administrator to easily create and provide a high quality retrieval interface to collections of HTML documents. Webinator serves as an example of the type of applications that can be built around Thunderstone's Texis RDBMS and Web Script.

"Thunderstone's Webinator program has been the foundation of our efforts to build web sites that facilitate environmental conservation by providing accurate information for action," explains Ecological Internet President Dr. Glen Barry. "There is no other comparable product that brings full-text customized search to non-profits at a reasonable price. We wholeheartedly support Thunderstone and would recommend the Webinator search platform highly."

About Thunderstone
Thunderstone Software LLC pioneered simultaneous searching of both structured and unstructured data with the Texis relational database optimized for full text search. Since 1981 Thunderstone has continued to develop its global reputation as a provider of the world's most powerful, scalable and flexible enterprise search solutions.

Sales Contact: Mark Bacho

+1 216 820 2200 ext.105

Media Contact: Peter Thusat

+1 216 820 2200 ext.118

Thunderstone To Power Hollywood.com Search Engine

June 15, 2004

CLEVELAND, OH -- (MARKET WIRE) -- 06/15/2004 -- Thunderstone Software announced its sale of a search system for the Hollywood.com website. Thunderstone's Webinator product will provide up-to-date indexing of about one million movie reviews, entertainment listings, and related items appearing on the redesigned Hollywood.com web site.

Webinator is one of the original web-search software products, introduced in 1995 and now in major release 5. Webinator powers search features for many hundreds of web sites in many languages around the world.

Hollywood.com is one of the leading movie-related sites on the Internet, featuring movie reviews, showtimes listings, entertainment news, and an extensive multimedia library. Hollywood.com serves more than one billion web pages annually.

"We found Webinator the best technology for searching our content under our requirements," said Laurie S. Silvers, President of Hollywood.com. "Its many advanced features were crucial, such as its ability to reindex fast-changing subsets of the site quickly, and its ability to crawl through JavaScript links. Our site structure is somewhat complicated but Webinator handles it beautifully."

Thunderstone's general manager, John Turnbull, said: "We built many features into Webinator to optimize it for indexing large complex web sites. We're happy to see that effort put to good use on Hollywood.com."

Thunderstone's Webinator is available in a variety of configurations including a free version. For more information, see http://www.webinator.com.

ABOUT THUNDERSTONE

Thunderstone Software LLC is a pioneer of search engine technology, providing text retrieval products to industry, government, and educational institutions since 1981. Thunderstone's flagship Texis software is the most versatile platform for search application development. Texis is at the heart of the Webinator and Search Appliance products. Texis uniquely integrates natural language and relevance ranking functionality with structured SQL relational database indexing. Applications of Thunderstone technologies include: online publishing, product catalogs, classified advertising, document management, text mining, web searching, and intelligence. Thunderstone's products are used on thousands of web sites worldwide. Major customers include eBay, Corbis, QVC, About.com, ZDNet, and HotJobs. For more information, visit http://www.thunderstone.com or call +1 216-820-2200.
