Real Story Group. Make Better Technology Decisions.

Adriaan Bloem

Read me that file so I can index it, please

8-Apr-2009

Tags: Enterprise Search, Industry Standards, Marketplace at Large, IDOL Server, ISYS Enterprise Access Suite, Lucene and Solr, Secure Enterprise Search 11g

One of those easy-to-overlook but important details of a search engine: will it actually read your files? You may be interested in Lucene, but you'll have to find a way to feed it Office documents and PDFs.

Search engines don't directly index the Word document or PDF; they index text. This is where document filters come into play. These do their best to get the text from the file (and usually some metadata, such as an "author" field). If you've ever tried to open some exotic document format in a plain text editor (e.g., Notepad or vi), you'll understand this can be far from trivial: many of these formats aren't very straightforward.
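To make the idea of a filter concrete, here is a minimal sketch in Python using only the standard library. It assumes the simplest possible setup: one filter per file extension, each returning plain text plus whatever metadata it can find. The names `FILTERS`, `extract`, `filter_plain_text`, and `filter_docx` are hypothetical and are not taken from any vendor's product. The `.docx` case works because an Office 2007 document is a zip archive of XML parts; commercial filters handle far more formats and edge cases than this.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# Dublin Core namespace used for dc:creator in docProps/core.xml
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def filter_plain_text(data: bytes) -> dict:
    """Trivial filter: the file already is text."""
    return {"text": data.decode("utf-8", errors="replace"), "metadata": {}}


def filter_docx(data: bytes) -> dict:
    """Pull paragraph text and the author out of a .docx (a zip of XML parts)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        for para in body.iter(W_NS + "p"):
            # A paragraph's text is spread over one or more w:t elements.
            runs = [node.text or "" for node in para.iter(W_NS + "t")]
            paragraphs.append("".join(runs))
        metadata = {}
        if "docProps/core.xml" in archive.namelist():
            core = ET.fromstring(archive.read("docProps/core.xml"))
            creator = core.find(DC_NS + "creator")
            if creator is not None and creator.text:
                metadata["author"] = creator.text
    return {"text": "\n".join(paragraphs), "metadata": metadata}


# Hypothetical registry (an assumption for this sketch): extension -> filter
FILTERS = {".txt": filter_plain_text, ".docx": filter_docx}


def extract(filename: str, data: bytes) -> dict:
    """Dispatch to a filter by extension; raise if no filter is registered."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    if ext not in FILTERS:
        raise ValueError(f"no filter for {ext or 'extensionless'} files")
    return FILTERS[ext](data)
```

The indexer would then feed `extract(name, data)["text"]` to the search engine and store the metadata as fields. Anything without a registered filter raises an error, which is exactly the gap the rest of this post is about.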

The problem isn't just finding the text. There are quite a few complications: reading across two- or three-column layouts; what to do with footnotes; or what to index, period. Spreadsheets are troublesome, but what do you make of images, audio, video? And for many scenarios (like indexing a file share) there will be exotic file types to deal with. (I recall the comments at a municipality once: "But we don't have any exotic file types." Three months later, a full crawl unearthed a stack of CAD/CAM files that were vital for planning.) To make matters worse, file formats change with the software versions that come out (will the converter read Office 2007 or just Office 95?).
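One way to find out what a file share really contains before committing to a set of filters is to look at the first bytes of each file rather than trusting its extension. The sketch below is a rough illustration in Python; the function names and the labels it returns are assumptions made for this example. It recognizes only three well-known signatures: the OLE2 compound file used by the older binary Office formats, the zip container used by Office 2007 (and by many unrelated formats), and PDF.

```python
# Leading byte signatures for a few common container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Office files
ZIP_MAGIC = b"PK\x03\x04"  # zip container: Office 2007, but also much else
PDF_MAGIC = b"%PDF-"


def sniff_format(header: bytes) -> str:
    """Classify a file by its first bytes; labels are ad hoc for this sketch."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def tally_formats(samples):
    """Count formats across (name, header) pairs, e.g. gathered during a crawl."""
    counts = {}
    for _name, header in samples:
        kind = sniff_format(header)
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

A large "unknown" bucket in such a tally is the early warning that the municipality in the anecdote never got. Note that the same extension can land in different buckets: a `.doc` saved by an old Word version and a `.docx` from Office 2007 need entirely different filters.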

Since it's complicated to build and maintain good filters, most vendors buy them off-the-shelf. As I've talked about before, the market has been cornered by Oracle (with the INSO filters) and Autonomy (with the KeyView filters). Almost all the search engines out there use either Oracle's or Autonomy's converters. A notable exception is Microsoft, which has its own standard for this, IFilters. But IFilters are of varying quality, they don't always work with every Microsoft software product, and you may very well have to build a custom filter yourself for some ancient or rare software.

And there's ISYS -- probably the only vendor we cover in our Search & Information Access Report that has developed converters for over 200 document types entirely by itself. (Even Oracle and Autonomy didn't really build filters themselves -- they bought the companies that produced them.)

It makes sense, then, that ISYS now tries to bank on that hidden capital. The vendor announced last week it's releasing its File Readers as a separately available product. It'll be interesting to see these show up in Lucene implementations (and in content management systems embedding search). More options means more choice. Black may be the fastest drying paint, but maybe you can now have that Model T in purple again.

Copyright Real Story Group 2001 - 2012. All rights reserved.
