Search the X-Files: unknown entities

If you're in the market for search technology, you probably hear a lot about faceted browsing, guided navigation, refining, clustering, categorization, and so on. Many of today's search engines attempt to present more than just keyword search. That's fine if your content has high-quality structured metadata, but what if you throw in thousands of Word documents where the "author" is defined as John Doe? The truth may be out there -- but the answer is buried deep in the text.

Distilling things like people, email addresses, and company names from source content is known as "entity extraction." Vendors may tell you that yes, their search interface pivots off that kind of data (e.g., for guided navigation), but don't worry: they can extract the unknown entities even if you throw in large files nobody ever bothered to tag properly. Enterprise search will create and then reveal structure where once there was chaos.

Of course, this is not at all the black magic it is made out to be. Finding relevant entities is usually accomplished through a combination of pattern-matching and dictionaries. An email address will contain the "@" symbol, and it's pretty safe to say that if it's followed by a dotted domain name, you've got your address. If "John" is in your dictionary of first names, the next capitalized word will probably be the surname. This also means that entity extraction is language- and even country-specific. A representative of Fast Search & Transfer's professional services told me about the challenges the company faced finding a fail-safe way of distilling German street addresses, which have a very different and much less formal structure than those in, say, North America.
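To make the pattern-plus-dictionary idea concrete, here is a minimal sketch in Python. It is not how any particular vendor does it; the tiny `FIRST_NAMES` dictionary, the regular expressions, and the function names are all assumptions made up for illustration, and a real system would need far larger, locale-specific dictionaries and rules.

```python
import re

# Hypothetical, tiny dictionary of first names (illustration only).
# A real product would ship a much larger, language-specific list.
FIRST_NAMES = {"John", "Tony"}

# An "@" followed by a dotted domain name: a deliberately simple pattern,
# not a full validator for every legal email address.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def build_name_pattern(first_names):
    """Match a known first name followed by one capitalized word."""
    alternation = "|".join(re.escape(name) for name in sorted(first_names))
    return re.compile(rf"\b({alternation})\s+([A-Z][a-z]+)\b")


def extract_emails(text):
    return EMAIL_RE.findall(text)


def extract_people(text, first_names=FIRST_NAMES):
    pattern = build_name_pattern(first_names)
    return [f"{first} {last}" for first, last in pattern.findall(text)]


sample = (
    "Contact John Doe at john.doe@example.com. "
    "Tony Byrne and Adriaan Bloem wrote this. Read More"
)
print(extract_emails(sample))  # ['john.doe@example.com']
print(extract_people(sample))  # ['John Doe', 'Tony Byrne']
```

Note what the sketch misses: "Adriaan Bloem" is skipped simply because "Adriaan" is not in the dictionary, which is the kind of language- and country-specific gap described above.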

Many vendors, of course, would rather you not be distracted by the details of their "automagical" ways of achieving this. Their method may be English- and US-specific, but hey, so what -- if your company is based in the US and content comes in English, you're fine. In reality, though, things are never that easy.

I was running a test of ISYS:web against the CMS Watch website, and was pleasantly surprised to see the out-of-the-box installation correctly identified several countries, and had no problems finding out that Tony Byrne is an actual person. Unsurprisingly, Apoorv Durga was a bit too outlandish and my ego wasn't hurt when Adriaan Bloem wasn't ranked among the people. But you really don't want to provide Theresa Regli with cannon fodder by ignoring her (which it did), while on the other hand, I can't recall ever having met "Read More," international man of mystery, now a full-fledged person in my search engine.

This is not to say you should bash ISYS for this -- the company is the first to admit its methods aren't infallible, and many vendors at a much higher price point don't even offer similar technology, instead relying on third-party tools. What it does mean, however, is you shouldn't take claims that "it's all taken care of" at face value. Investigate whether languages and countries relevant to you are supported, and better still, test against your own content. Then assume you are committing yourself to near-constant system training and tweaking.

Failing that, some search products will allow you to specify additional criteria (with ISYS, for instance, "Theresa" was easily added with a [pre] construct in a text-based configuration file). Others enable you to define completely new entities and patterns from scratch (such as FAST's processing in Python, or Endeca's XSL and Perl). Be very aware, though, that sending in a Mulder agent to investigate your X-Files might be a costly, ongoing adventure, lasting nine seasons of suspense.
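As a rough illustration of what that tweaking can look like, the sketch below keeps the dictionary and the patterns in a small configuration object, so that adding a name or defining a new entity type is a one-line change. This is a generic Python sketch, not the ISYS configuration syntax or FAST's or Endeca's actual APIs; `ExtractorConfig`, `extract`, and the `case_id` pattern are hypothetical names invented for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractorConfig:
    # Hypothetical starting dictionary; "Theresa" is deliberately missing.
    first_names: set = field(default_factory=lambda: {"John", "Tony"})
    # Extra entity types: name -> compiled regex without capture groups.
    patterns: dict = field(default_factory=dict)


def extract(text, config):
    """Return a dict mapping each entity type to the matches found in text."""
    results = {"person": []}
    if config.first_names:
        names = "|".join(re.escape(n) for n in sorted(config.first_names))
        # A known first name followed by one capitalized word.
        person_re = re.compile(rf"\b(?:{names})\s+[A-Z][a-z]+\b")
        results["person"] = person_re.findall(text)
    for entity_type, pattern in config.patterns.items():
        results[entity_type] = pattern.findall(text)
    return results


config = ExtractorConfig()
text = "Theresa Regli and Tony Byrne reviewed case XF-1013."
before = extract(text, config)
print(before["person"])  # ['Tony Byrne']

# Tweak 1: add a missing first name to the dictionary.
config.first_names.add("Theresa")
# Tweak 2: define a brand-new entity type (a made-up case-number format).
config.patterns["case_id"] = re.compile(r"\b[A-Z]{2}-\d{4}\b")

after = extract(text, config)
print(after["person"])   # ['Theresa Regli', 'Tony Byrne']
print(after["case_id"])  # ['XF-1013']
```

Each fix is trivial on its own; the cost comes from having to keep making them as new names, languages, and document types show up in your content.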

