Kas Thomas
18-Aug-2009
Tags: Enterprise Search, IDOL Server, Mindserver Enterprise Search, Publishing-Media
Anyone who's been involved in a corporate-taxonomy project knows exactly how the terms "tedium," "tiresome," and "taxonomy" are related. Each derives from the other.
At some point, technology should remove the need for taxonomy projects, even if it hasn't -- yet.
Help is on the way, though -- assuming you have, say, $150K (plus or minus a Toyota) to spend. Today, San Francisco-based Recommind, Inc. (one of the vendors we cover in our Search & Information Access Report) is introducing MindServer Categorization, a software system that does just what its name implies: It analyzes content, discovers logical categories within the content, and auto-tags each content item according to category relatedness.
Although it's being introduced today as a standalone product, MindServer Categorization -- technically speaking -- is not new. The product has been sold in Germany for years, where major media companies have used it to auto-categorize news feeds. Today's release represents the first time MindServer Categorization has been localized into English and productized for a general market (i.e., not just media firms).
Recommind is not the only company with auto-categorization technology, of course. (Autonomy, often seen on shortlists next to Recommind, is a familiar source of such technology.) But unlike others, Recommind uses PLSA (Probabilistic Latent Semantic Analysis) as a basis for category discovery, which means, among other things, that Recommind's software requires no training: It doesn't need to be exposed to a "training set" (or sets), have access to a preexisting taxonomy, or know about keywords. In fact, MindServer Categorization is not only self-training but also language-agnostic. In theory, the underlying algorithms can discriminate categories in any corpus, regardless of what language the corpus is in.
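To make the "no training set" idea concrete, here is a minimal sketch of textbook PLSA fitted with expectation-maximization on a tiny word-count matrix. It is not Recommind's implementation: the function names, the toy corpus, and the choice of two topics are all assumptions made for illustration. The only input is raw word counts, with no labels, taxonomy, or keyword list.

```python
import random

EPS = 1e-12  # tiny smoothing constant so no probability collapses to exactly zero


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row) + EPS * len(row)
    return [(x + EPS) / total for x in row]


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit textbook PLSA with EM on a dense document-by-word count matrix.

    Returns (p_topic_given_doc, p_word_given_topic), both as lists of rows.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random starting distributions; EM refines them from the counts alone.
    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]

    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: how responsible is each topic for this (doc, word) pair?
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                for z in range(n_topics):
                    weight = n * joint[z] / total
                    acc_z_d[d][z] += weight
                    acc_w_z[z][w] += weight
        # M-step: re-estimate both distributions from the weighted counts.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


# Hypothetical toy corpus: rows are documents, columns are word counts.
TOY_COUNTS = [
    [4, 3, 0, 0],
    [3, 5, 0, 0],
    [0, 0, 5, 2],
    [0, 0, 3, 4],
]
doc_topics, topic_words = plsa(TOY_COUNTS, n_topics=2)
```

Tagging each document with its highest-probability topic would be one simple way to turn `doc_topics` into categories. How MindServer Categorization maps latent topics to tags is not something this sketch claims to show.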
Exactly how accurate the system is, you'll have to determine yourself by testing it against a corpus or two of your own. The rate of false positives and false negatives will vary according to the characteristics of the corpus and the tuning parameters you specify. (You can relax or tighten the system's "strictness" through config settings and a C++ API.) Don't expect this -- or any other -- auto-categorization system to be perfect, or anything close to it.
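The trade-off behind a "strictness" setting can be sketched with a plain score cut-off. Everything here is hypothetical: the scores, the relevance labels, and the `threshold` parameter are invented for illustration and do not correspond to any actual MindServer config setting or API call.

```python
def confusion_at(scored_items, threshold):
    """Count (false positives, false negatives) when tagging items at or above threshold."""
    false_pos = sum(
        1 for score, relevant in scored_items if score >= threshold and not relevant
    )
    false_neg = sum(
        1 for score, relevant in scored_items if score < threshold and relevant
    )
    return false_pos, false_neg


# Hypothetical (category score, truly belongs to category) pairs.
SAMPLE = [
    (0.95, True),
    (0.80, True),
    (0.70, False),
    (0.60, True),
    (0.40, False),
    (0.30, True),
    (0.10, False),
]

loose = confusion_at(SAMPLE, threshold=0.20)   # (2, 0): nothing missed, two wrong tags
middle = confusion_at(SAMPLE, threshold=0.50)  # (1, 1)
strict = confusion_at(SAMPLE, threshold=0.75)  # (0, 2): no wrong tags, two items missed
```

Tightening the cut-off trades false positives for false negatives, which is why running the same comparison on a hand-checked sample of your own corpus is the only way to pick a setting.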
Notably, Recommind does a lot of business with law firms and legal departments, which use its search software to categorize e-mail and to do more sophisticated things, such as divining who an organization's domain experts are based on correspondence. Yet some customers are content just to have Recommind's software separate content into two categories: garbage, and content that clearly should be saved.
If it's true, as some research indicates, that 1 GB of data can cost up to $20,000 to collect, process, review, and retain, then the $150K entry fee for MindServer Categorization would seem quite reasonable. (Bear in mind, maintenance is another ~$30K per year on top of that.) But one wonders how long it will be before entity-extraction software, auto-taggers, RDF extractors, and the like become commoditized through open source. Also, Recommind and Autonomy face a different sort of competition from the likes of Thomson Reuters, whose Calais project provides what amounts to semantic analysis as a service. (You might not want to send your entire corporate e-mail archive over the wire to the Calais service, but you might very well want to stream select RSS or Atom feeds through it -- and many people already do, apparently.)
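A back-of-the-envelope version of that cost comparison, using only the figures quoted above. Two assumptions are baked in: the $20,000 is an upper bound, so these break-even volumes are best-case, and the software is assumed to eliminate the full per-gigabyte cost of whatever data it filters out. The first-year figure also assumes maintenance is charged from year one.

```python
def break_even_gb(software_cost, cost_per_gb):
    """Gigabytes of avoided handling at which the software pays for itself."""
    return software_cost / cost_per_gb


LICENSE_FEE = 150_000        # entry fee quoted in the article (USD)
ANNUAL_MAINTENANCE = 30_000  # approximate yearly maintenance (USD)
MAX_COST_PER_GB = 20_000     # upper-bound handling cost per GB (USD)

license_only = break_even_gb(LICENSE_FEE, MAX_COST_PER_GB)
first_year = break_even_gb(LICENSE_FEE + ANNUAL_MAINTENANCE, MAX_COST_PER_GB)

print(license_only)  # 7.5
print(first_year)    # 9.0
```

At a lower real-world cost per gigabyte the break-even volume rises proportionally, so the fee looks reasonable mainly for organizations handling large volumes of reviewable data.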
At the moment, the company with the greatest exposure (and therefore the most to lose) in this field is Autonomy, whose IDOL technology has cemented the company's reputation for intelligent information retrieval. It will be interesting to see whether upstart Recommind can put a dent in Autonomy's semantic suit of armor -- or whether the two companies are, in fact, destined to remain in separate categories forever.