Real Story Group. Make Better Technology Decisions.

Adriaan Bloem

Search the X-Files: unknown entities

21-Sep-2007

Tags: Enterprise Search, Industry Standards, Information Architecture, Selecting Technology, Endeca Information Access Platform, ISYS Enterprise Access Suite

If you're in the market for search technology, you probably hear a lot about faceted browsing, guided navigation, refining, clustering, categorization, and so on. Many of today's search engines attempt to present more than just keyword search. That's fine if your content has high-quality structured metadata, but what if you throw in thousands of Word documents where the "author" is defined as John Doe? The truth may be out there -- but the answer is buried deep in the text.

Distilling things like people, email addresses, and company names from source content is known as "entity extraction." Vendors may tell you that yes, their search interface pivots off that kind of data (e.g., for guided navigation), but don't worry: they can extract the unknown entities even if you throw in large files nobody ever bothered to tag properly. Enterprise search will create and then reveal structure where once there was chaos.

Of course, this is not at all the black magic it is made out to be. Finding relevant entities is usually accomplished through a combination of pattern-matching and dictionaries. An email address will contain the "@" symbol, and it's pretty safe to say that if it's followed by a dotted domain name, you've got your address. If "John" is in your dictionary of first names, the next capitalized word will probably be the surname. This also means that entity extraction is language- and even country-specific. A representative of Fast Search & Transfer's professional services told me about the challenges the company faced finding a fail-safe way of distilling German street addresses, which have a very different and much less formal structure than those in, say, North America.
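As a rough illustration of the pattern-plus-dictionary approach described above, here is a minimal Python sketch. It is not any vendor's implementation: the tiny `FIRST_NAMES` set, the simplified email pattern, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny dictionary of first names.
FIRST_NAMES = {"John", "Tony", "Janus"}

# Simplified email pattern (an assumption, not a full RFC-compliant matcher):
# a local part, the "@" symbol, then a dotted domain name.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Dictionary rule: a known first name followed by one capitalized word,
# which is assumed to be the surname.
NAME_RE = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, FIRST_NAMES))) + r") ([A-Z][a-z]+)\b"
)


def extract_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


def extract_people(text):
    """Return 'First Last' strings whose first name is in the dictionary."""
    return [f"{first} {last}" for first, last in NAME_RE.findall(text)]
```

Run on a sentence such as "Tony Byrne wrote to john.doe@example.com. Read More about Theresa Regli and John Doe.", this sketch finds the email address plus Tony Byrne and John Doe, but misses Theresa Regli because "Theresa" is not in the dictionary. The capitalization rule is also tied to English-style names, which is exactly the language dependence noted above.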

Many vendors, of course, would rather you not be distracted by the details of their "automagical" ways of achieving this. Their method may be English- and US-specific, but hey, so what -- if your company is based in the US and content comes in English, you're fine. In reality, though, things are never that easy.

I was running a test of ISYS:web against the CMS Watch website, and was pleasantly surprised to see that the out-of-the-box installation correctly identified several countries and had no problem recognizing that Tony Byrne is an actual person. It even managed to extract Janus Boye's somewhat more exotic Danish name. Unsurprisingly, Apoorv Durga was a bit too outlandish, and my ego wasn't hurt when Adriaan Bloem wasn't ranked among the people. But you really don't want to provide Theresa Regli with cannon fodder by ignoring her (which it did), while on the other hand, I can't recall ever having met "Read More," international man of mystery, now a full-fledged person in my search engine.

This is not to say you should bash ISYS for this -- the company is the first to admit its methods aren't infallible, and many vendors at a much higher price point don't even offer similar technology, instead relying on third-party tools. What it does mean, however, is that you shouldn't take claims that "it's all taken care of" at face value. Investigate whether languages and countries relevant to you are supported, and better still, test against your own content. Then assume you are committing yourself to near-constant system training and tweaking.
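One way to make "test against your own content" concrete is to compare what the engine extracts with a hand-checked list of entities you know are in a sample of documents. The Python sketch below is only an assumption about how you might score such a test; the function name `score_extraction` and the precision/recall measures are illustrative choices, not part of any product mentioned here.

```python
def score_extraction(expected, extracted):
    """Compare extracted entities against a hand-checked list.

    precision: share of extracted entities that are really entities
    recall:    share of the known entities that were actually found
    """
    expected, extracted = set(expected), set(extracted)
    found = expected & extracted
    return {
        "precision": len(found) / len(extracted) if extracted else 0.0,
        "recall": len(found) / len(expected) if expected else 0.0,
        "missed": sorted(expected - extracted),    # real entities not found
        "spurious": sorted(extracted - expected),  # non-entities reported
    }


# Sample built only from the names in the anecdote above,
# not from the full test run.
known_people = ["Tony Byrne", "Janus Boye", "Apoorv Durga",
                "Adriaan Bloem", "Theresa Regli"]
engine_people = ["Tony Byrne", "Janus Boye", "Read More"]

report = score_extraction(known_people, engine_people)
print(report["missed"])
print(report["spurious"])
```

Rerunning a check like this after each round of tuning gives you a rough sense of whether the tweaking is actually paying off.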

Failing that, some search products will allow you to specify additional criteria (with ISYS, for instance, "Theresa" was easily added with a [pre] construct in a text-based configuration file). Others enable you to define completely new entities and patterns from scratch (such as FAST's processing in Python, or Endeca's XSL and Perl). Be very aware, though, that sending in a Mulder agent to investigate your X-Files might be a costly, ongoing adventure, lasting nine seasons of suspense.
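The two kinds of customization mentioned above, adding entries to an existing dictionary and defining a new entity pattern from scratch, can be sketched in generic Python. This is a toy under stated assumptions and does not reflect the configuration syntax of ISYS, FAST, or Endeca; the `EntityExtractor` class, its method names, and the `case_id` pattern are all hypothetical.

```python
import re

# A capitalized word that is directly followed by another capitalized word.
# The lookahead leaves the second word unconsumed, so every adjacent pair
# in the text is considered.
PAIR_RE = re.compile(r"\b([A-Z][a-z]+)(?= ([A-Z][a-z]+)\b)")


class EntityExtractor:
    """Toy extractor whose dictionary and patterns can be extended."""

    def __init__(self, first_names):
        self.first_names = set(first_names)
        self.patterns = {}  # entity type -> compiled regular expression

    def add_first_name(self, name):
        """Extend the existing first-name dictionary."""
        self.first_names.add(name)

    def add_pattern(self, entity_type, regex):
        """Define a completely new entity type by regular expression."""
        self.patterns[entity_type] = re.compile(regex)

    def extract(self, text):
        results = {"person": []}
        for match in PAIR_RE.finditer(text):
            if match.group(1) in self.first_names:
                results["person"].append(f"{match.group(1)} {match.group(2)}")
        for entity_type, regex in self.patterns.items():
            results[entity_type] = [m.group(0) for m in regex.finditer(text)]
        return results


extractor = EntityExtractor({"Tony"})
extractor.add_first_name("Theresa")                # dictionary addition
extractor.add_pattern("case_id", r"\bXF-\d{4}\b")  # hypothetical new entity
print(extractor.extract("Tony Byrne and Theresa Regli reviewed case XF-1013."))
```

Each addition is trivial on its own; the ongoing cost comes from discovering which names and patterns are missing in the first place.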

Copyright Real Story Group 2001 - 2012. All rights reserved.
