Real Story Group. Make Better Technology Decisions.

Kas Thomas

Enterprise Search Scalability: A Big Issue

9-Jul-2008

Tags: Enterprise Search, Implementation, Industry Standards, Information Architecture, Selecting Technology

I was talking to a search vendor the other day who said something that really got my attention. He remarked that a customer recently came to him and asked what it would take, in terms of software, hardware, and time, to index 30 billion documents. Mind you, this was not some hypothetical exercise. The question came from somebody whose company actually has 30 billion documents under management.

Consider the dimensions of the problem. Assuming (for purposes of argument) you could index a thousand documents per second on one machine, it would take nearly a full year (about 347 days) just to build the index for 30 billion docs. Even if the solution scales linearly, building (or rebuilding) the index would keep a 100-machine server farm busy for about three and a half days.
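
For anyone who wants to check the arithmetic or plug in their own numbers, here is a minimal sketch in Python. The function name and the perfectly linear scaling across machines are assumptions made for illustration; real farms rarely scale that cleanly.

```python
SECONDS_PER_DAY = 86_400


def days_to_index(total_docs: int, docs_per_second: float, machines: int = 1) -> float:
    """Days needed to index total_docs, assuming perfectly linear scaling across machines."""
    return total_docs / (docs_per_second * machines) / SECONDS_PER_DAY


# The figures from the scenario above: 30 billion docs at 1,000 docs/second per machine.
single = days_to_index(30_000_000_000, 1_000)
farm = days_to_index(30_000_000_000, 1_000, machines=100)

print(f"1 machine:    {single:.0f} days")   # about 347 days
print(f"100 machines: {farm:.1f} days")     # about 3.5 days
```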

That's considering the scenario in a static context only. In the real world, of course, documents are revised (some frequently, others never, most somewhere in between). New docs enter the system. Old ones are dropped. Otherwise-unchanged docs are moved to new locations. Unless you can update your index(es) incrementally, in real time, as docs are added, deleted, modified, or moved, you have an index shelf-life problem.

The traditional answer to the shelf-life problem is to rebuild the index every few days (or every night, if resources allow). At the level of ten or twenty thousand docs, a total rebuild of the index every few days isn't a huge issue. But once you get beyond a few million docs, performing total rebuilds on a frequent basis quickly becomes a worst practice (if indeed it's practicable at all). At some point, you need the ability to do incremental indexing.
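
To make "incremental indexing" concrete, here is a toy sketch of the idea in Python: an in-memory inverted index that adds, replaces, or removes one document at a time and skips documents whose content has not changed. The class and method names are hypothetical, the tokenizer is a naive whitespace split, and no particular vendor's product works exactly this way.

```python
import hashlib
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated per document instead of rebuilt."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.doc_terms = {}               # doc id -> terms indexed for that doc
        self.doc_hash = {}                # doc id -> digest of last indexed content

    def upsert(self, doc_id, text):
        """Index a new or changed doc. Returns False if the content is unchanged."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.doc_hash.get(doc_id) == digest:
            return False                  # nothing to do; no re-indexing cost
        self.delete(doc_id)               # drop stale postings for a revised doc
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = terms
        self.doc_hash[doc_id] = digest
        return True

    def delete(self, doc_id):
        """Remove one doc's postings without touching any other doc."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        self.doc_hash.pop(doc_id, None)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


index = IncrementalIndex()
index.upsert("doc-1", "quarterly sales report")
index.upsert("doc-1", "quarterly sales report")    # unchanged, skipped
index.upsert("doc-1", "quarterly revenue report")  # revised, re-indexed in place
print(index.search("sales"), index.search("revenue"))
```

If the document ID is a file path, a moved doc shows up here as a delete followed by an upsert. The point is only that the cost of an update is proportional to the docs that changed, not to the size of the whole collection.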

But, someone will ask, can't you just throw more machine resources (threads, memory, cycles) at the problem? Yes and no. If you're spidering files over the wire, bandwidth exhaustion becomes an issue. If you're indexing files locally, there's an OS-imposed limit on how many files you can have open at once. There's also the question of how much file data you can hold in memory. The reason this is important is that some search systems (quite a few, actually) need to load an entire document into memory before the doc can be indexed. If you're indexing 10-megabyte PDFs, it might not matter how many threads you have available. (Note, incidentally, that most docs occupy a lot more space in memory than on disk.) And anyway, the CPU can execute only so many instructions per second, no matter how many docs you can load at once.
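
The difference between loading a whole document and stream-processing it is easy to see in a small sketch. The Python function below (a hypothetical helper, not taken from any search product) reads a text source in fixed-size chunks and yields tokens as it goes, so memory use depends on the chunk size rather than on the size of the document. It assumes plain text and whitespace tokenization; extracting text from a format like PDF is a separate and harder problem.

```python
import io
from typing import Iterator, TextIO


def stream_tokens(handle: TextIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield whitespace-separated tokens, reading at most chunk_size characters at a time."""
    carry = ""  # a token that may continue into the next chunk
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer = carry + chunk
        parts = buffer.split()
        # If the buffer does not end in whitespace, its last token may be incomplete.
        if parts and not buffer[-1].isspace():
            carry = parts.pop()
        else:
            carry = ""
        yield from parts
    if carry:
        yield carry


# A tiny chunk size is used here only to show that tokens split across chunks are rejoined.
sample = io.StringIO("enterprise search does not scale by magic")
print(list(stream_tokens(sample, chunk_size=4)))
```

One caveat: a pathological input with no whitespace at all would still accumulate in `carry`, so a production tokenizer would need a cap on token length.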

I bring all this up for a couple of reasons. First, if you're shopping for a search solution, you need to regard the various vendors' performance claims with more than a modicum of caution. No two search scenarios are the same, obviously. But more than that, the parameters that affect scalability and performance are numerous and non-obvious (and their interactions subtle), tending to moot most performance claims straight out of the gate.

Takeaway No. 1: If you care about performance (and you should), do your own testing. Insist on it as part of any product evaluation.

Takeaway No. 2: Get your programmers involved in the evaluation process early. Some of these issues require computer-science expertise to evaluate properly.

Also (very important), when shopping for a search solution, don't buy for your present needs. Shop for your future needs. Your company probably has ten times more content under management today than it had just five years ago. Five years from now, it could have ten times more than today. Will your search solution scale appropriately? More particularly, how will it scale? Will it scale linearly? Will it hit a brick wall?
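
To put a number on that growth: tenfold in five years works out to roughly 58 percent more content every year. A quick sketch in Python, where the 50 million starting figure is purely hypothetical:

```python
def annual_growth_factor(multiple: float, years: float) -> float:
    """Constant yearly factor that compounds to `multiple` over `years`."""
    return multiple ** (1 / years)


def project(current_docs: float, factor: float, years: int) -> float:
    """Document count after compounding `factor` for `years` years."""
    return current_docs * factor ** years


factor = annual_growth_factor(10, 5)  # tenfold growth over five years
print(f"Yearly growth: {(factor - 1) * 100:.0f}%")

# Hypothetical starting point: 50 million docs under management today.
for year in range(1, 6):
    print(f"Year {year}: {project(50_000_000, factor, year):,.0f} docs")
```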

If I were searching for a search solution, I'd ask every vendor a few simple questions:

  • How big is your biggest customer installation and what did it take to build it?
  • Can your system do incremental indexing? How often is a full rebuild required?
  • Does your indexer need to read a document into memory (whole) before indexing it, or can files be stream-processed?
  • What's the largest document your system can index without either choking or stopping after a particular number of characters?
  • How does indexing performance change as the index gets bigger? (Not just "does it slow down?" but how does it slow? Linearly? Exponentially? If it's the latter, you're going to hit a brick wall.)
  • And: Do you support 64-bit architectures?
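
On the "how does it slow?" question, you don't have to take anyone's word for it. One rough approach is to time full index builds at several collection sizes during your own testing and look at the slope on a log-log scale. The Python sketch below does that with a plain least-squares fit. The function name and the timing figures are made up for illustration; substitute your own measurements.

```python
import math


def scaling_exponent(sizes, seconds):
    """Least-squares slope of log(seconds) against log(size).

    A slope near 1 suggests build time grows linearly with collection size;
    a slope near 2 suggests quadratic growth.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Made-up measurements: collection size (docs) and total build time (seconds).
sizes = [100_000, 1_000_000, 10_000_000]
linear_times = [100, 1_000, 10_000]
worrying_times = [100, 10_000, 1_000_000]

print(f"{scaling_exponent(sizes, linear_times):.2f}")    # slope near 1
print(f"{scaling_exponent(sizes, worrying_times):.2f}")  # slope near 2
```

A slope well above 1, or one that keeps rising as you add larger test sizes, is the early warning of the brick wall mentioned above. Three data points is the bare minimum; more sizes give a more trustworthy picture.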

Those are just a few conversation-starters. For more (lots more), be sure to consult our Enterprise Search Report 2008. (You can get a free sample of it online here.) And if you end up evaluating one or more search offerings in depth, please drop us a line and let us know what you learned. We're always interested in your feedback.

Copyright Real Story Group 2001 - 2012. All rights reserved.
