IBM, Lucene, and the future of search

  • 11-Nov-2009

I've been covering IBM's search technology (for our Search and Information Access Research) for two years now, and I confess that I've never fully understood the strategy (if there is one) behind IBM OmniFind Yahoo! Edition (OYE).

OYE is the free, Apache Lucene-based search application that IBM has offered since 2006. IBM does have customers who pay for commercial support for OYE, and according to Big Blue there have been over 50,000 downloads of OYE to date. But OYE isn't something IBM pushes heavily, and Google's search appliance business hasn't suffered appreciably in the face of competition from OYE.

One wonders, then: Why bother offering something like OYE at all? What's the point in putting the "IBM OmniFind" moniker on a technology that is really mostly Lucene on the back end and Yahoo on the front end? It seems (on the surface) like a rather quick-and-easy way to try to get some of the "cool factor" from Lucene to rub off on OYE -- a kind of coolness by association.

It now seems likely that OYE was (among other things) an IBM testbed project for Lucene development, ahead of the eventual, inevitable Lucenization of the entire OmniFind family of products. And in fact an IBM rep told me that Big Blue will indeed be moving OmniFind Enterprise Edition to a Lucene-based core architecture eventually. This is big news from a number of standpoints. It's a huge endorsement (if Lucene needed any, at this point) of the open-source search engine's maturity and soundness; and it can only solidify Lucene's position of dominance in the open-source search firmament. It also brings Lucene and UIMA (Unstructured Information Management Architecture) closer together, hinting at the emergence (though not right away) of an industry-standard text analytics architecture.

A lot is at stake for IBM, too: The key pieces of IBM's information-access strategy -- including InfoSphere Content Assessment (ICA), InfoSphere Content Collector (ICC), and InfoSphere Classification Module (ICM) -- all employ the OmniFind Enterprise Edition search infrastructure in various ways. With Lucene and UIMA occupying center stage, IBM is betting a lot on this technology.

What does it mean to you, the technology buyer? First, expect to see further significant investment in Lucene by the IT world -- and further blossoming of the technology ecosphere around Lucene -- as it becomes the key enabling technology underneath a variety of content-analytics applications. A year from now, Lucene won't simply mean "search"; it could underpin content-analytics apps of various kinds (including some that haven't even been envisioned yet).

Secondly, IBM's move may prompt the much-prophesied (but never realized) advent of a broad secondary ecosystem around UIMA: an ecosystem of parsers, annotators, and pluggable business rules.

Thirdly, we may see the emergence of a new wave of prospective standards around things like index formats, relevance, and tokenization.

And finally? Expect to see interesting arguments from the likes of Microsoft and Autonomy as to why their proprietary search solutions are better for you in the long run than more open architectures. It should make for an interesting discussion. Subscribers, stay tuned.

