Adriaan Bloem
8-Apr-2009
Tags: Enterprise Search, Industry Standards, Marketplace at Large, IDOL Server, ISYS Enterprise Access Suite, Lucene and Solr, Secure Enterprise Search 11g
One of those easy-to-overlook but important details of a search engine: will it actually read your files? You may be interested in Lucene, but you'll have to find a way to feed it Office documents and PDFs.
Search engines don't index the Word document or PDF directly; they index text. This is where document filters come into play. These do their best to get the text from the file (and usually some metadata, such as an "author" field). If you've ever tried to open some exotic document format in a plain text editor (e.g., Notepad or vi), you'll understand this can be far from trivial: many of these formats aren't very straightforward.
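To make the idea of a filter concrete, here is a minimal sketch in Python for one comparatively friendly format, the Office 2007 .docx file, which is a ZIP archive of XML parts. The function name extract_docx is hypothetical and this is not how any of the commercial filters mentioned here work; it only illustrates the "file in, text plus metadata out" contract, and it ignores tables, footnotes, headers and everything else a real filter has to handle.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the WordprocessingML body and the Dublin Core metadata.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_docx(source):
    """Return (text, metadata) for a .docx given as a path or file-like object.

    Hypothetical helper for illustration only: it reads paragraph text from
    word/document.xml and the author from docProps/core.xml, nothing more.
    """
    with zipfile.ZipFile(source) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        for para in body.iter(f"{{{W_NS}}}p"):
            # A paragraph is split into runs; join their text nodes back together.
            runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
            paragraphs.append("".join(runs))

        metadata = {}
        if "docProps/core.xml" in archive.namelist():
            core = ET.fromstring(archive.read("docProps/core.xml"))
            creator = core.find(f".//{{{DC_NS}}}creator")
            if creator is not None and creator.text:
                metadata["author"] = creator.text

    return "\n".join(paragraphs), metadata
```

Calling extract_docx on a file returns the plain text an indexer would consume and a small metadata dictionary. Even this easy case needs knowledge of the container, two XML vocabularies and the way text is split into runs; older binary formats offer far less to hold on to.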
Finding the text isn't the only problem. There are quite a few complications: reading across two- or three-column layouts; what to do with footnotes; or what to index, period. Spreadsheets are troublesome, but what do you make of images, audio, video? And for many scenarios (like indexing a file share) there will be exotic file types to deal with. (I recall the comments at a municipality once: "But we don't have any exotic file types." Three months later, a full crawl unearthed a stack of CAD/CAM files that were vital for planning.) To make matters worse, file formats change with the software versions that come out (will the converter read Office 2007 or just Office 95?).
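One way to avoid the municipality's surprise is to inventory a file share before settling on a set of filters. The sketch below is an assumption about how one might do that in Python, not a feature of any product discussed here; the function names and the idea of a "supported" extension set are hypothetical, and a file extension is only a rough hint of the real format.

```python
import os
from collections import Counter


def inventory_extensions(root):
    """Count files under root by lower-cased extension ("(none)" if absent)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts


def unsupported(counts, supported):
    """Return the extensions, with counts, that no available filter covers."""
    return {ext: n for ext, n in counts.items() if ext not in supported}
```

Running inventory_extensions over the share and passing the result to unsupported, together with the list of extensions your filters claim to read, gives a quick view of what a full crawl would otherwise only reveal months later.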
Since it's complicated to build and maintain good filters, most vendors buy them off-the-shelf. As I've talked about before, the market has been cornered by Oracle (with the INSO filters) and Autonomy (with the KeyView filters). Almost all the search engines out there use either Oracle's or Autonomy's converters. A notable exception is Microsoft, which has its own standard for this, IFilters. But IFilters are of varying quality, they don't always work with every Microsoft software product, and you may very well have to build a custom filter yourself for some ancient or rare software.
And there's ISYS -- probably the only vendor we cover in our Search & Information Access Report that has developed converters for over 200 document types entirely by itself. (Even Oracle and Autonomy didn't really build filters themselves -- they bought the companies that produced them.)
It makes sense, then, that ISYS now tries to bank on that hidden capital. The vendor announced last week it's releasing its File Readers as a separately available product. It'll be interesting to see these show up in Lucene implementations (and in content management systems embedding search). More options means more choice. Black may be the fastest drying paint, but maybe you can now have that Model T in purple again.
Copyright Real Story Group 2001 - 2012. All rights reserved.