Recommind productizes its categorization engine

Anyone who's been involved in a corporate-taxonomy project knows exactly how the terms "tedium," "tiresome," and "taxonomy" are related. Each derives from the others.

At some point, technology should remove the need for taxonomy projects, even if it hasn't -- yet.

Help is on the way, though -- assuming you have, say, $150K (plus or minus a Toyota) to spend. Today, San Francisco-based Recommind, Inc. (one of the vendors we cover in our Search & Information Access Report) is introducing MindServer Categorization, a software system that does just what its name implies: It analyzes content, discovers logical categories within the content, and auto-tags each content item according to category relatedness.

Although it's being introduced today as a standalone product, MindServer Categorization -- technically speaking -- is not new. The product has been sold in Germany for years, where major media companies have used it to auto-categorize news feeds. Today's release represents the first time MindServer Categorization has been localized into English and productized for a general market (i.e., not just media firms).

Recommind is not the only company with auto-categorization technology, of course. (Autonomy, often seen on shortlists next to Recommind, is a familiar source of such technology.) But unlike others, Recommind uses PLSA (Probabilistic Latent Semantic Analysis) as a basis for category discovery, which means, among other things, that Recommind's software requires no training: It doesn't need to be exposed to a "training set" (or sets), have access to a preexisting taxonomy, or know about keywords. In fact, MindServer Categorization is not only self-training but language-agnostic. In theory, the underlying algorithms can discriminate categories in any corpus, regardless of what language the corpus is in.
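For readers curious about what "no training set" means in practice, here is a minimal sketch of textbook PLSA fitted by expectation-maximization on a toy term-count matrix. It is not Recommind's implementation, which is not public; the function name `fit_plsa`, the six-word vocabulary, and the four tiny "documents" are all made up for illustration. The point is only that the algorithm sees raw word counts and a desired number of categories, with no labels, taxonomy, or keyword list.

```python
import random

def fit_plsa(counts, n_topics, n_iter=200, seed=0):
    """Fit asymmetric PLSA by EM.

    counts[d][w] is how often word w occurs in document d.
    Returns (p_z_d, p_w_z): P(topic | doc) and P(word | topic).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        # Tiny smoothing term keeps every probability strictly positive.
        row = [x + 1e-12 for x in row]
        total = sum(row)
        return [x / total for x in row]

    # Random starting point: no labels or keywords are supplied.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(topic | doc, word).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post)
                for z in range(n_topics):
                    share = n * post[z] / total
                    acc_z_d[d][z] += share
                    acc_w_z[z][w] += share
        # M-step: re-estimate both distributions from the expected counts.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z

# Hypothetical vocabulary: court, judge, lawsuit, goal, match, striker
docs = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
]
p_z_d, p_w_z = fit_plsa(docs, n_topics=2)
# Tag each document with its most probable discovered category.
labels = [max(range(2), key=lambda z: row[z]) for row in p_z_d]
print(labels)
```

Because the model works on co-occurrence counts alone, nothing in it depends on the language of the text, which is the basis for the language-agnostic claim. Which discovered category gets index 0 or 1 depends on the random start, so the categories still need a human to name them.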

Exactly how accurate the system is, you'll have to determine for yourself by testing it against a corpus or two of your own. The rate of false positives and false negatives will vary according to the characteristics of the corpus and the tuning parameters you specify. (You can relax or tighten the system's "strictness" through config settings and a C++ API.) Don't expect this -- or any other -- auto-categorization system to be perfect, or anything close to it.
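To make the strictness trade-off concrete, here is a small sketch that treats "strictness" as a single confidence threshold, which is an assumption on our part and not a description of Recommind's actual config settings or C++ API. The scores and the hand-checked relevance judgments are invented; in a real evaluation they would come from running the categorizer on a sample of your own corpus.

```python
def count_errors(scores, relevant, threshold):
    """Return (false_positives, false_negatives) for one category.

    scores[i]   -- categorizer confidence that item i belongs to the category
    relevant[i] -- True if a human reviewer says item i really belongs
    threshold   -- hypothetical "strictness" knob: tag items scoring >= threshold
    """
    false_pos = sum(1 for s, r in zip(scores, relevant) if s >= threshold and not r)
    false_neg = sum(1 for s, r in zip(scores, relevant) if s < threshold and r)
    return false_pos, false_neg

# Invented sample: seven items with confidence scores and reviewer judgments.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
relevant = [True, True, False, True, False, True, False]

for threshold in (0.25, 0.50, 0.70):
    fp, fn = count_errors(scores, relevant, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

Tightening the threshold trades false positives for false negatives, and vice versa, so the right setting depends on which kind of mistake costs you more.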

Notably, Recommind does a lot of business with law firms and legal departments, which use its search software to categorize e-mail (and to do more sophisticated things, such as divining who an organization's domain experts are, based on correspondence). Yet some customers are content just to have the software separate content into two categories: garbage, and content that clearly should be saved.

If it's true, as some research indicates, that 1 GB of data can cost up to $20,000 to collect, process, review, and retain, then the $150K entry fee for MindServer Categorization would seem quite reasonable. (Bear in mind, maintenance is another ~$30K per year on top of that.) But one wonders how long it will be before entity-extraction software, auto-taggers, RDF extractors, and the like become commoditized through Open Source. Also, Recommind and Autonomy face a different sort of competition from the likes of Thomson Reuters, whose Calais project provides what amounts to semantic analysis as a service. (While it's true you might not want to send your entire corporate e-mail archive over the wire to the Calais service, nevertheless you might very well want to stream select RSS or Atom feeds through it -- and many people already do, apparently.) 
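A back-of-the-envelope version of that comparison, using only the figures quoted above, looks like this. The `fraction_saved` parameter is our own hypothetical addition: it stands for the share of the per-gigabyte cost the software would actually eliminate, which nobody has measured here, and the $20,000 figure is an upper bound ("up to"), so treat the results as best-case.

```python
LICENSE_USD = 150_000              # entry fee quoted above
MAINTENANCE_USD_PER_YEAR = 30_000  # approximate annual maintenance quoted above
COST_PER_GB_USD = 20_000           # "up to" figure, so an upper bound

def breakeven_gb(years_of_maintenance, fraction_saved=1.0):
    """Gigabytes of data at which avoided cost equals software spend.

    fraction_saved is a hypothetical knob: the share of the per-GB cost
    that auto-categorization would actually eliminate (1.0 = all of it).
    """
    spend = LICENSE_USD + MAINTENANCE_USD_PER_YEAR * years_of_maintenance
    return spend / (COST_PER_GB_USD * fraction_saved)

print(breakeven_gb(0))        # license only
print(breakeven_gb(1))        # license plus one year of maintenance
print(breakeven_gb(1, 0.5))   # same, if only half the per-GB cost is avoided
```

Even under the less generous assumption, the break-even volume is a small fraction of a typical corporate e-mail archive, which is why the price looks reasonable if the per-gigabyte figure holds.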

At the moment, the company with the greatest exposure (and therefore the most to lose) in this field is Autonomy, whose IDOL technology has cemented the company's reputation for intelligent information retrieval. It will be interesting to see whether upstart Recommind can put a dent in Autonomy's semantic suit of armor -- or whether the two companies are, in fact, destined to remain in separate categories forever.

