Last time we discussed exclusions and requirements for managing which pages your crawler gets, but one setting deserves a Tech Tip all its own: Exclude by Field. It gives you extra power over how you exclude and exactly what gets excluded.
- "Metamorph query" matching
Rather than a prefix or substring match, Exclude by Field uses a "Metamorph query", the same kind of query handled by the full-text matching engine behind our normal searches. You can simply type in words to match, or, if you begin with a slash (/), the rest is treated as a REX expression (our RegEx-like pattern matching language; see the "REX" section in the Vortex documentation on our website for more details).
- Multiple fields for exclusion
All previously discussed exclusion & requirement options operate only on the URL itself. Exclude by Field allows you to exclude based on a number of other areas:
- HTML — Matches against the raw HTML of the page. Useful if there's something in an HTML comment that you'd like to base the match on.
- Text — The formatted text of the page. This is the same text you'd see if you looked at the list/edit info of a page or at the "Match Info" in the search results. Useful if you want to match text but ignore any HTML markup that may or may not be present.
- All Meta — The contents of all available meta fields are put together and then matched against.
- Meta Field -> — Matches against the contents of a specific meta field, which you specify in the next column "From Meta Field".
- Keywords, Description, & Mime Type — Matches against the text of these common meta fields.
- URL — Matches against the URL, just like Exclusion REX. You may want to use this to get the extra Exclude options, listed below.
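As a rough illustration of the field choices above, here is a minimal Python sketch of how the text to match might be selected for each field. The `Page` record, the `field_text` function, the lowercase meta key names, and the newline join for All Meta are all assumptions made for illustration; they are not the appliance's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical record for a fetched page (illustration only)."""
    url: str
    html: str = ""
    text: str = ""
    meta: dict = field(default_factory=dict)


def field_text(page: Page, field_name: str, meta_name: str = "") -> str:
    """Return the text a rule would be matched against for the chosen field."""
    if field_name == "URL":
        return page.url
    if field_name == "HTML":
        return page.html
    if field_name == "Text":
        return page.text
    if field_name == "All Meta":
        # Assumption: the meta values are simply joined with newlines.
        return "\n".join(page.meta.values())
    if field_name == "Meta Field":
        # The specific field is named separately ("From Meta Field").
        return page.meta.get(meta_name, "")
    if field_name in ("Keywords", "Description", "Mime Type"):
        # Assumption: these map to lowercase meta keys of the same name.
        return page.meta.get(field_name.lower(), "")
    raise ValueError(f"unknown field: {field_name}")
```

For example, a rule on HTML would see an HTML comment such as `<!-- draft -->`, while a rule on Text for the same page would not.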
- What to exclude
Beyond more power in specifying what to match, Exclude by Field also gives you more control over what happens when you get a match.
- Pages and Links — This acts like any other exclusion rule. The page and its links are kept completely out of the walk data.
- Pages only — The content of the page is not included in the walk, but the links from the page ARE followed.
- Links only — The page is included in the walk, but the links from the page are not followed.
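The three options above boil down to two yes/no decisions for a page that matched a rule: is the page kept in the walk, and are its links followed? The Python sketch below models just that; the `Exclude` enum and `apply_exclude` function are hypothetical names used for illustration, not part of the product.

```python
from enum import Enum
from typing import Tuple


class Exclude(Enum):
    PAGES_AND_LINKS = "Pages and Links"
    PAGES_ONLY = "Pages only"
    LINKS_ONLY = "Links only"


def apply_exclude(action: Exclude) -> Tuple[bool, bool]:
    """Return (keep_page, follow_links) for a page that matched a rule."""
    # Only "Links only" leaves the page itself in the walk.
    keep_page = action is Exclude.LINKS_ONLY
    # Only "Pages only" still follows the links found on the page.
    follow_links = action is Exclude.PAGES_ONLY
    return keep_page, follow_links
```

A page that matches no rule is, of course, both kept and followed.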
- A word on efficiency
A disadvantage of Exclude by Field, when using any Field except URL, is that the page must be fully fetched before the rule can be applied.
With all other exclusion rules (and Exclude by Field on URL), the URL can be thrown out before the page is fetched and processed.
When performing Exclude by Field on the content of the page, though, the page must be downloaded and fully processed before we can know if it has HTML or a Body that matches the rules specified.
When possible, it's better to use other exclusion rules or the URL target for Exclude by Field, as this will allow you to prune URLs before they are fetched. Still, there are many things that Exclude by Field can do that the other settings simply can't (as mentioned below).
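To make the cost difference concrete, here is a small Python sketch of a crawl loop that checks a URL rule before fetching and a content rule after. It is only a model: `crawl` and `fetch` are hypothetical, and Python's `re` module stands in for Metamorph and REX, whose syntax differs.

```python
import re


def crawl(urls, url_rule, content_rule, fetch):
    """Apply the URL rule before fetching and the content rule after.

    Returns (kept, fetched) so the fetch cost of each rule type is visible.
    """
    kept, fetched = [], []
    for url in urls:
        if url_rule.search(url):
            continue  # pruned up front: no download, no processing
        body = fetch(url)  # the download/processing cost is paid here
        fetched.append(url)
        if content_rule.search(body):
            continue  # excluded, but only after a full fetch
        kept.append(url)
    return kept, fetched


# Hypothetical site: URL -> page content.
pages = {
    "http://example.com/a.html": "public notes",
    "http://example.com/b.html": "DRAFT do not index",
    "http://example.com/tmp/c.html": "scratch",
}
kept, fetched = crawl(
    list(pages), re.compile(r"/tmp/"), re.compile(r"DRAFT"), pages.__getitem__
)
```

In this sketch the `/tmp/` page is never fetched, while the draft page is fetched and then discarded, which is the overhead described above.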
- Example — Excluding directories from a file crawl
A perfect use for Exclude by Field is handling directories during a file crawl. We can't fully exclude directories, because they are what link to all the files, and without them we'd have nothing to crawl. Still, we might not want the directories themselves to show up in the search. Exclude by Field lets us do this with the following settings:
- Metamorph Query "//=>>=" (without the quotes) — This is a REX expression for "match anything that ends in a slash". Please see the REX section of the Vortex documentation if you'd like more details on REX syntax.
- Field - URL
- Exclude - Pages only — This will keep the contents of the directory "pages" out of the crawl but will still follow the links to get the actual files and use them in the search.
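The effect of these settings can be modeled with a short Python sketch. Everything here is an assumption for illustration: the `LINKS` file tree and `walk` function are hypothetical, and the Python regular expression `/$` stands in for the REX expression above (it is not REX syntax).

```python
import re

# Python regex stand-in for "URL ends in a slash"; this is not REX syntax.
ENDS_IN_SLASH = re.compile(r"/$")

# Hypothetical file tree: each directory URL maps to the URLs it links to.
LINKS = {
    "file:///docs/": ["file:///docs/a.pdf", "file:///docs/sub/"],
    "file:///docs/sub/": ["file:///docs/sub/b.txt"],
}


def walk(start):
    """Follow every link, but index only URLs that do not end in a slash."""
    indexed, queue, seen = [], [start], {start}
    while queue:
        url = queue.pop(0)
        if not ENDS_IN_SLASH.search(url):
            indexed.append(url)  # ordinary file: keep it in the search
        # "Pages only": links are followed whether or not the page is kept.
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed
```

Calling `walk("file:///docs/")` indexes the two files but neither directory listing, even though both directories were visited to discover them.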
If you have any questions about how to use Exclude by Field, please feel free to contact Thunderstone Support and we'll be glad to discuss it.
The February 2009 issue of CRN, a publication of Everything Channel and ChannelWeb.com, recognized the "top Channel Chiefs in the industry based upon their record of business innovation and dedication to the partner community." This annual list, which CRN calls "Our definitive guide to the movers and shakers of I.T. channel management," included Frederick A. Harmon (Thunderstone's Channel Director & CSO).
You can visit the CRN website (http://www.crn.com/crn/chiefs/2009cc.jhtml?chief=136) to view pertinent information about Fred Harmon in the 2009 Channel Chiefs list.
Thunderstone's John Turnbull (President and CEO) will present a workshop session entitled The Next Generation in Search: Today's Best Practices on Friday, April 17, 2009 (2:00 p.m. - 3:30 p.m.) during the DigitalNow 2009 Conference at Disney's Yacht and Beach Club Resorts in Lake Buena Vista, Florida.
Search has progressed from a complex tool used by librarians, through simple tools that let users perform a keyword search, to today's information access tools, which still give users a simple interface but draw on much of an association's collective knowledge. In this workshop, participants will learn what sorts of information can sit behind a search engine and how to make it more valuable to users. The session includes a case study from IEEE, the world's largest technical membership association, which significantly improved its business by focusing on its customers and helping them access content in new ways.
DigitalNow (http://www.fusionproductions.com/digitalnow/) is an annual conference that brings together senior-level executives and volunteer leaders from some of the most influential professional and trade associations in America. Produced by Fusion Productions and Disney Institute, two of the foremost authorities in adult educational design, with input from registered attendees and a conference advisory board, DigitalNow addresses the critical issues facing association leaders in the digital age.
The AIIM International Exposition + Conference, the yearly gathering for information management professionals across industries and lines of business, will take place Monday, March 30, through Thursday, April 2, 2009, at the Pennsylvania Convention Center in Philadelphia, PA. With 19 tracks, more than 135 conference sessions featuring more than 100 real-world case studies, and an Expo floor showcasing 200+ information management technology solution providers, the event aims to provide attendees with actionable insight they can use.
REGISTER TODAY FOR YOUR FREE EXPO FLOOR PASS
and get access to all keynotes, general sessions,
Expo floor education and the co-located ON DEMAND Expo!
To receive your free pass, use Registration Code: 615M
when you register at WWW.AIIMEXPO.COM
or call +1 888 824 3004.
Your FREE pass comes to you compliments of Thunderstone Software. Please stop by and visit Fred Harmon (Channel Director & CSO) and Peter Thusat (Communication Director & CMO) at Booth 1045.
Feedback, suggestions and questions are welcome. Send your email to