Case Study: Thunderstone Helps JMI Manage its Large Collection of Medical Documents

April 3, 2018

Who We Helped

The Jeghers Medical Index (JMI) is a medical library in St. Elizabeth Youngstown Hospital (SEYH). The JMI maintains Dr. Harold Jeghers’ collection of approximately one million historical medical journal articles that date from the late 1800s to the 1990s.

Their Challenge

After decades of amassing medical articles, JMI’s collection had grown to roughly five million physical pages, which filled more than 165 cabinets. As JMI began to digitize its collection, it became apparent that it needed a flexible data management system specialized in storage and search. The professionals at JMI compared several data management vendors on the following criteria:

  • Cost
  • Relevance
  • User interface
  • Methodology
  • Location of corporate headquarters
  • Willingness to customize software

After weighing all these factors, they decided that Thunderstone and TEXIS, its fully integrated full-text search engine software platform, were the best fit for their project.

Thunderstone’s Solution

The TEXIS system is capable of hosting and managing a large collection of documents online. For JMI, TEXIS manages roughly one million articles.

With this much information available, Thunderstone wanted to provide several ways for users to find what they needed from JMI. The work included a custom search that matches OCR text from hand-annotated copies of articles in order to add key elements, such as PubMed XML citations and a table of anatomic terms that Jeghers created for TEXIS, so that users can find relevant articles from the original collection. Thunderstone also programmed TEXIS to offer JMI users three different search options to assist in article retrieval.

Boolean Logic

This option is patterned after the search and retrieval behavior of PubMed, an archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. Users can query by author, title, journal, abstract, and medical subject headings (MeSH).

Proximity Search

The proximity search option retrieves articles based on the occurrences of terms and location of words in sentences, paragraphs, pages, or documents. Proximity search can extract concepts contained in articles by using the relationships between words and the clustering of words with concepts to its advantage.
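
To make the idea concrete, here is a minimal Python sketch of proximity matching at the sentence level. It is an illustration only and not how TEXIS implements the feature: the sentence splitter is naive, and the function names and sample text are hypothetical.

```python
import re

def sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace (a naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def proximity_hits(text, terms):
    """Return the sentences that contain every term (case-insensitive, whole words)."""
    wanted = {t.lower() for t in terms}
    hits = []
    for sentence in sentences(text):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if wanted <= words:  # every search term occurs in this sentence
            hits.append(sentence)
    return hits

# Hypothetical sample text, not taken from the JMI collection.
sample = (
    "Jaundice was noted on admission. The liver was enlarged and tender. "
    "Jaundice with an enlarged liver suggested obstruction."
)
print(proximity_hits(sample, ["jaundice", "liver"]))
```

Only the last sentence is returned, because it is the only unit in which both terms occur together. Widening the unit to a paragraph or page would simply change how the text is split.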

Jeghers Indexing

A third search option is based on the indexing method developed by Dr. Jeghers himself. This search relies on concepts such as disease, organ system, physiology, and anatomy to retrieve folders containing articles close to the search subjects.

After users run a query, the results display offers further options for organizing the bibliographic data if they want to narrow their search. Users can list results by various facets, including year of publication, study type, or whether an abstract is available. Once a document or documents are found, the user may request a fair use copy from a JMI librarian.

JMI, a medical index powered by Thunderstone search engine software.

What Happened

As a collection of historical medical articles, the JMI allows medical professionals, professors, and students to glean more information about the past. Since the addition of TEXIS, JMI has noticed an increase in usage of the collection, with more than 2,000 hits or searches recorded over the course of a year. JMI searches are not restricted to SEYH staff, as the database is utilized by people across the world.

While it took hundreds of hours of work to gather and upload the collection, TEXIS has helped JMI become a compact, low-maintenance collection that is curated by a single full-time librarian. Since shifts in health care terms, meanings, and concepts can affect retrieval accuracy, Thunderstone can make ongoing changes to JMI’s TEXIS platform to improve retrieval results over time.

Work with Thunderstone Today

If your organization has a big project that requires accurate, effective search results, Thunderstone can help. Learn more about our search engine software and appliances and contact us today to talk to one of our experts about how Thunderstone can help you solve your search problems.

Texis Overview

January 9, 2013

Executive Summary:

  • TEXIS is the only fully integrated SQL RDBMS optimized for full-text search.
  • TEXIS offers high-performance, intelligent querying and management of databases containing natural language text, numeric values, standard data types, geographic information, images, video, audio and other payload data.
  • TEXIS powers real-time applications with zero-latency data insertion, providing immediate search availability of key data without waiting for scheduled index updates.
  • TEXIS efficiently sorts and groups search results by any field(s) in the data. It can quickly sort tens of thousands of hits or more.
  • TEXIS, the innovative development platform behind Thunderstone's entire line of enterprise search products, lets users and developers incorporate their own unique knowledge and expertise into customized search solutions that easily integrate with other applications.

What makes Texis different from other search engines and databases?

Thunderstone's Texis is the only search engine developed from the ground up as a fully integrated SQL RDBMS optimized for full-text search, and it's the only relational database that can store and search text documents of unlimited size within standard database tables.

Used by hundreds of thousands of database application developers around the world, Structured Query Language provides many advantages for satisfying complicated search requirements. SQL also holds great promise as a reliable, well-defined path for implementing unanticipated new search functionality in the future. All other search engines offer a much narrower range of possibilities based on proprietary interfaces.

While typical search engines do a nice job of searching unstructured text and traditional databases have an impressive ability to handle queries on fielded or structured data, text searching and relational database management each rely upon radically different paradigms for organizing and retrieving information. They both developed and matured over decades as completely separate technologies, and they don't “marry” easily.

Thunderstone is the only company that has accomplished the true marriage of a full-text search engine and a powerful SQL relational database in a single platform. Addressing this challenge, the simultaneous searching of structured and unstructured data, remains one of Thunderstone's core competencies.

Deep in the heart of the Texis RDBMS resides Thunderstone's Metamorph, a concept-based natural language search engine utilizing advanced lexical set logic.

Metamorph has often been classified as a form of Artificial Intelligence, since its functions fall into the categories of knowledge acquisition, natural language processing and intelligent text retrieval. The software attempts in its own way to understand your search queries, to represent its understanding to the data in the files and to come up with relevant responses as retrieved portions of full-text information which best correspond to your submitted queries.

Metamorph's starting vocabulary has 250,000+ word connections, constructed in a dense web of associations and equivalences. Search parameters can be adjusted to dynamically dictate surface and deep inference. The program's responses can be controlled so that they are direct or abstract in relation to user queries. Proximity of concept can be fine tuned so as to qualify degree of relevance, providing matches which are sometimes concrete, sometimes abstract, as desired.

Metamorph allows for editing word sets. This means that you may select which associations you would like in connection to any search. You can create your own concept sets permanently for future use. You can fine tune the search to use associations of only a certain part of speech. You can enter all known spelling variations of any particular search word in the same way. You can generally customize the program to include your own nomenclature and vocabulary, making it increasingly intelligent the longer it is in use. When you want to control exactly what associations are made with any or all of the words or expressions in your searches, you can do so by editing the equivalence set associated with any word already known by the Equivalence File or by creating associations for a new or created word not yet known.

You can call up the ApproXimate Pattern Matcher (XPM) and tell it to look for a certain percentage of proximity to an entered string, finding misspelled names and typos. You can also look for numeric quantities entered as text, thanks to the Numeric Pattern Matcher (NPM) which recognizes that “four score and seven” is the same amount as “87.”
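
As a rough illustration of percentage-based approximate matching, the following Python sketch scores candidate words against a target and keeps those above a threshold. It uses the standard library's `difflib.SequenceMatcher` similarity ratio, which is an assumption made for this example and is not XPM's actual scoring method. The function name and word list are hypothetical.

```python
from difflib import SequenceMatcher

def approx_matches(target, words, min_percent):
    """Return (word, score) pairs whose similarity to target is at least min_percent."""
    target = target.lower()
    scored = []
    for word in words:
        # ratio() is in [0, 1]; scale it to a percentage for the threshold test.
        score = 100 * SequenceMatcher(None, target, word.lower()).ratio()
        if score >= min_percent:
            scored.append((word, round(score)))
    # Best matches first.
    return sorted(scored, key=lambda pair: -pair[1])

# Hypothetical candidates: one exact spelling, two typos, one unrelated name.
candidates = ["Jeghers", "Jehgers", "Jegers", "Rogers"]
for word, score in approx_matches("Jeghers", candidates, 80):
    print(word, score)
```

Lowering the threshold admits looser matches; raising it toward 100 approaches an exact search.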

Metamorph allows users to search for intersections of sets of lexical items, while also performing prefix and suffix morpheme processing. Users can specify, right in their queries, the delimiters of choice: i.e., they can look for lexical intersections within a sentence, a paragraph, a page, a designated amount of text or some other defined textual unit such as a memo.
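
The sketch below illustrates the general idea of intersecting sets of lexical items within a chosen textual unit. It is written in Python for illustration and does not reflect Metamorph's internals: the equivalence sets, delimiter patterns and function name are all hypothetical, and prefix and suffix morpheme processing is omitted.

```python
import re

# Hypothetical equivalence sets; a real equivalence file would be far larger.
EQUIVALENTS = {
    "liver": {"liver", "hepatic"},
    "swelling": {"swelling", "enlargement", "edema"},
}

# Naive patterns for splitting text into units.
DELIMITERS = {
    "sentence": r"(?<=[.!?])\s+",
    "paragraph": r"\n\s*\n",
}

def intersections(text, terms, unit="sentence"):
    """Return the units in which every term's set has at least one member present."""
    sets = [EQUIVALENTS.get(term, {term}) for term in terms]
    found = []
    for chunk in re.split(DELIMITERS[unit], text):
        words = set(re.findall(r"[a-z]+", chunk.lower()))
        if chunk.strip() and all(s & words for s in sets):
            found.append(chunk.strip())
    return found

text = "The liver was palpable. Marked edema followed.\n\nThe spleen was normal."
print(intersections(text, ["liver", "swelling"], unit="sentence"))
print(intersections(text, ["liver", "swelling"], unit="paragraph"))
```

With the sentence delimiter nothing matches, because "liver" and "edema" fall in different sentences. With the paragraph delimiter the first paragraph matches, since "edema" belongs to the hypothetical set for "swelling".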

Texis, with Metamorph inside it, provides a modular set of tools to attack the formidable problem of how to get at and deal with a large volume of information when you don't really know precisely what you need or where to find it. Thunderstone's Texis gives you the power, speed and flexibility to rapidly implement a customized search solution that will accomplish your data access/retrieval objectives in the most dynamic, efficient and pragmatic way possible.

Thunderstone's Texis has a number of characteristics and built-in advantages that differentiate it from other search solutions:

  1. A fully integrated SQL database management system (DBMS) that follows the relational database model, Texis is optimized for addressing the inclusion of unlimited quantities of narrative full text. It provides a method for managing and manipulating an organization's shared data, where intelligent text retrieval is harnessed as a qualifying action for selecting the desired information. Texis simultaneously provides full-text, fielded and Boolean searching of both structured and unstructured content.
  2. Texis powers real-time applications with zero-latency data insertion, providing immediate search availability of new data without waiting for scheduled index updates. Unlike other search tools, Texis ensures that all information which has been added to any table can be searched immediately -- regardless of whether the table has been indexed and regardless of whether it has been suggested that an index be maintained on that table or not. Sequential table space scans and index-based scans are efficiently managed by Texis so that the database can always be searched in the most optimized manner with the most current information available to the user.
  3. Texis enables searchers to sort and group query results by any field(s) in the data. And Texis can quickly sort tens of thousands of hits or more. Other search tools either bog down sorting more than a few hundred items or else their sorting features are much more limited than the capabilities of Texis.
  4. Texis allows you to treat the concatenation of any number of text fields as a single “virtual field.” You can create an index on the virtual field, search it and perform any other operation allowable on a field.
  5. Texis offers high-performance, intelligent querying and management of databases containing natural language text, numeric values, standard data types, geographic information, images, video, audio and other payload data. While Texis excels at purposeful manipulation of textual information, it also performs useful mathematical operations on your data. You can construct queries that combine calculated values with text search.
  6. Texis lets you create an unlimited number of independent search collections -- each with their own unique data types, fields, attributes or parameters. It also empowers users to submit queries to multiple search engines and/or multiple collections and have the results displayed together or combined.
  7. Texis gives developers prototype-friendly customization tools, extreme flexibility, rapid deployments and a feature-rich API. It supports multiple Search User Interfaces that offer specially-defined views of query input and results for different audience types or even for each unique individual. Thunderstone's Texis imposes no user interface requirements. Texis Web Script (Vortex) maintains “neutrality” with regard to whatever HTML markup (or JavaScript or other user interface technology) is employed for the user results presentation.
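
The “virtual field” idea in item 4 can be pictured with a small Python sketch: several text fields of a record are concatenated and then searched as if they were one field. This is a conceptual illustration only, not Texis code. The record layout, the separator and the function names are assumptions made for the example.

```python
def virtual_field(record, field_names, sep=" "):
    """Concatenate the named fields of a record into a single searchable string."""
    return sep.join(str(record.get(name, "")) for name in field_names)

def search(records, field_names, term):
    """Return the Url of every record whose virtual field contains term."""
    term = term.lower()
    return [
        record["Url"]
        for record in records
        if term in virtual_field(record, field_names).lower()
    ]

# Hypothetical records; field names are chosen only for illustration.
pages = [
    {"Url": "/a", "Title": "Search basics", "Body": "An introduction to indexing."},
    {"Url": "/b", "Title": "Indexing tips", "Body": "How to keep data fresh."},
    {"Url": "/c", "Title": "Contact", "Body": "Phone and address."},
]
print(search(pages, ["Title", "Body"], "indexing"))
```

The caller never needs to know which underlying field held the term, which is the convenience a virtual field provides.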

Which enterprise search applications require the robustness and flexibility of Texis?

Texis is the premier solution when large-scale, mission-critical and/or complex information retrieval challenges call for full-text searching tightly integrated with traditional structured database querying. Businesses, governments, NGOs and educational institutions use Texis in a wide range of applications such as online catalogs, auctions, classifieds, automated categorization, litigation support, intelligence collection/analysis, risk assessment, quality control, CRM, knowledge discovery, document and multimedia management, internet publishing, vertical portals, real-time message handling, web searching and many others.

Thunderstone's Texis provides the ideal development platform for rapidly deployable, custom-designed applications that require both unstructured and structured types of searching:

  • Online catalogs contain unstructured text (product name, description, etc.) and structured content (style/size, price, in-stock availability, etc.) Users expect the ability to search by item description, to navigate by price range or to do both in combination.
  • Knowledge management systems demand very efficient and secure enterprise-wide information retrieval across multiple repositories that serve different types of users, who all want dynamic, context-sensitive views of defined content (structured data) with the ability to refine results through full-text searching (unstructured data).
  • A Thunderstone solution provider customer has deployed Texis in a "brute force" full-text search scenario for its DoD Intelligence Community customer, using Thunderstone's Texis to search the contents of a massive Oracle database in a counter-terrorism effort. Texis is being used as an adjunct to Oracle full-text search because of its ability to scale while still providing superior performance in both ingestion rate and search. Thunderstone's Texis enables this customer to search across a 20 terabyte index, ingesting 70-80 million new records per hour and returning typical search results in < 10 seconds.
  • A Fortune 20 customer is using Thunderstone's Texis as the search platform for what they describe as the "single largest knowledge management system currently deployed at any corporation in the world." The application encompasses knowledge, people and processes, and it is used globally within the organization to access more than 30 terabytes of data. Users access the application 20+ million times per day, retrieving and sharing information from across the global enterprise. The application is the most-used corporate I.T. resource after e-mail.

Thunderstone's Texis lets users and developers incorporate their own unique knowledge and expertise into customized search solutions that easily integrate with other applications. For additional information call +1 216 820 2200 or visit us online at http://www.thunderstone.com.

Texis' Metamorph Compound Index

January 9, 2013

The METAMORPH and METAMORPH INVERTED indexes in Texis are used to improve the performance of text searches using full-text queries with LIKE, LIKEP, and the rest of the LIKE family. Often the query involves other values, which are used to either sort the results, or further restrict the results returned.

One example is in the Webinator application, which provides the option to sort the results by date. Historically, the way to improve the performance of the ORDER BY was to use an INVERTED INDEX. If you also wanted to do date range restriction, then you could add a regular INDEX as well.

The Metamorph compound index provides better performance than three separate indexes (full-text, INVERTED and regular) because all the data is available from a single index, and it also requires less maintenance. For the query:

SELECT Url FROM html
 WHERE Title\Description\Keywords\Meta\Body LIKE $query
   AND Visited BETWEEN $first AND $last
 ORDER BY Visited DESC;

You could create the index as:

CREATE METAMORPH INVERTED INDEX xhtmlbodv ON HTML(Title\Description\Keywords\Meta\Body, Visited);

This is the CREATE INDEX statement you will find in the Webinator dowalk script.

If there are several fields that you might use in the query or ORDER BY, then you can specify all of them as additional fields. The order of the fields does not matter, and the engine may use any combination of them. If in Webinator you also wanted to allow searches and sorts based on the Depth field, you could add Depth to the index:

CREATE METAMORPH INVERTED INDEX xhtmlbodvd ON HTML(Title\Description\Keywords\Meta\Body, Visited, Depth);

Then, because Vortex can ignore parts of the query whose parameters are empty, you could write:

<switch $o>
    <case d><$orderby="ORDER BY Depth">
    <case v><$orderby="ORDER BY Visited DESC">
</switch>
<SQL ROW "SELECT Url FROM html
 WHERE Title\Description\Keywords\Meta\Body LIKE $query
    AND (Visited BETWEEN $first AND $last
    AND Depth BETWEEN $low AND $high) " $orderby>

That will allow efficient searching and ordering on any combination of Visited and Depth, as long as a query is specified for the LIKE.

The compound index can also be used for GROUP BY or other queries that can fully rely on the index data, e.g.:

SELECT Depth, count(*) from html
 WHERE Title\Description\Keywords\Meta\Body LIKE $query
 GROUP BY Depth;

 

Key facts

  • In a full-text index (any of the variations of METAMORPH INDEX) the first field specified must be the full-text field, and will be indexed accordingly.
  • The first field may be a virtual field, if you want to search across multiple database fields. In the above example we would search the Title, Description, Keywords, Meta and Body fields as if they were a single field.
  • The full-text index will only be used if the full-text field is being queried with a full-text query. In the above example, if there was no LIKE clause, or it was dropped by Vortex because it matches $null, then the METAMORPH INDEX would not be used.
  • The additional fields beyond the full-text field should be small, fixed size fields, most commonly dates and numbers.
  • Using too many additional fields can negate the performance benefits of having the index. Care should be taken to ensure that only those fields actually used in queries are represented in the index.
  • The total size of the additional fields should be small relative to the size of the record, and should not exceed a few hundred bytes per record.
  • The total size of additional indexed data (number of rows multiplied by size per row) should be no larger than 25% of physical memory on the server.
  • If you specify a VARCHAR(N) field as an additional field, you will get a warning message "Variable size warning". The index will still be created, and N bytes of the field will be indexed (where N is from the declaration of the field) for each row. If N is large, this will bloat the index, reducing performance.
  • Fixed-size fields, including the additional fields, can be updated without causing the index to go out of date. Updating the full-text field, or any variable-size field (e.g. VARCHAR, BLOB, INDIRECT), will still cause the index to require an update.
  • Parts of the WHERE clause that use the compound index should be grouped together with parentheses for maximum efficiency.
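
The sizing guidelines above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical, and the field sizes used in the example (8 bytes for Visited, 4 bytes for Depth) are assumptions for illustration, not documented Texis storage sizes.

```python
def additional_data_bytes(row_count, bytes_per_row):
    """Total size of the additional indexed data: rows multiplied by size per row."""
    return row_count * bytes_per_row

def fits_memory_guideline(row_count, bytes_per_row, physical_memory_bytes, fraction=0.25):
    """Check the guideline that additional data stays within 25% of physical memory."""
    return additional_data_bytes(row_count, bytes_per_row) <= fraction * physical_memory_bytes

# Assumed sizes: Visited as an 8-byte value plus Depth as a 4-byte integer.
rows = 50_000_000
per_row = 8 + 4
memory = 16 * 1024**3  # a server with 16 GiB of physical memory

print(additional_data_bytes(rows, per_row))          # bytes of additional data
print(fits_memory_guideline(rows, per_row, memory))  # within the 25% guideline?
```

Under these assumptions the additional data comes to 600,000,000 bytes, well under the 4 GiB budget. Indexing a 200-byte VARCHAR instead would raise the total to 10,000,000,000 bytes and break the guideline on the same server.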
