GSA6: Google Billions, Revisited

Last week, I posted a highly critical comment on Google's marketing of the Appliance, version 6. My main qualm is that the hyperbole makes it very hard to understand what they're actually selling. What you get with a GSA is not exactly how it looks on YouTube (well, the box is, but not necessarily the internals).

Of course, in my quest to get you the real story, I'm not going to leave it at "press releases and documentation don't match up". The interesting bit is what the software is actually capable of; even more interesting is what customers are doing with it in reality.

For now, I'll zoom in on what made the headlines: the Appliance's new capability to index billions of documents, rather than the 30 million of the previous version. I noted two things about this:

  • The Dynamic Scalability feature, according to the version 6.0 documentation, "enables multiple Google Search Appliances to work together to scale up to 30 million documents and provide unified search results" (not billions);
  • Being able to index billions of documents is, in general (and this applies to all vendors), a rather meaningless statement, since it really depends on what you're indexing (I used the example of 10-digit phone numbers vs. 40 MB PDFs).

Google got in touch with me to explain this, and this led to two surprises.

First of all: Dynamic Scalability is, in fact, the feature that would enable indexing billions of documents, and this isn't a beta feature. So what about the documentation's reference to a 30 million document limit? As it turns out, this is an error in Google's documentation. (For now, the error is still in the "Guide to Software Release 6.0", but I've been told this will be corrected.) According to Google, there is no hardwired limit to the number of documents you can index using multiple machines (as long as you buy lots and lots of Appliances to do it on, of course).

Secondly, about the difference between indexing 10-digit phone numbers and 40 MB PDFs: I've been told that the Appliance's hardware is carefully over-spec'ed to handle the load Google claims it can deal with. (The Dell PowerEdge R710s the vendor ships would outperform many commodity servers.) My 40 MB comment was a bit of a jab: an Appliance won't index documents larger than 30 MB. But as Google explained, the limit has been set so they can guarantee that when they say a GB-7007 can index 10 million documents, it can actually index 10 million of those 30 MB PDFs when that's what you need to do. And to be fair, if large documents are an issue for you, you'll want to read our Search & Information Access Report product evaluations carefully, since most enterprise search products have similar limits.
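To put rough numbers on the gap between those two extremes, here's a back-of-the-envelope sketch. The assumptions are mine, not Google's: decimal units, 10 bytes per phone number, and every PDF sitting exactly at the 30 MB cap. It only measures raw source content and says nothing about index size or how the Appliance stores anything.

```python
# Back-of-the-envelope only: raw source volume for the same document count.
# All figures below are illustrative assumptions, not Google specifications.

MB = 10**6   # decimal megabyte
TB = 10**12  # decimal terabyte

DOC_COUNT = 10_000_000  # the 10 million documents quoted for a GB-7007


def raw_corpus_bytes(doc_count: int, bytes_per_doc: int) -> int:
    """Total bytes of source content if every document has the same size."""
    return doc_count * bytes_per_doc


# Assumption: a 10-digit phone number stored as 10 single-byte characters.
phone_numbers = raw_corpus_bytes(DOC_COUNT, 10)

# Assumption: every PDF is exactly at the 30 MB per-document limit.
large_pdfs = raw_corpus_bytes(DOC_COUNT, 30 * MB)

print(f"phone numbers: {phone_numbers / MB:.0f} MB")
print(f"30 MB PDFs:    {large_pdfs / TB:.0f} TB")
print(f"ratio:         {large_pdfs // phone_numbers:,}x")
```

Under these assumptions the same "10 million documents" is either about 100 MB or about 300 TB of source material, which is why a bare document count tells you so little.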

In the end, of course, the proof will be in the pudding: even if the software is capable of tying together 38 appliances to index a billion documents, that doesn't mean you'd actually want to. Issues that are minor on a smaller corpus suddenly become major problems at that scale, and I'm looking forward to seeing how real enterprises fare when deploying a cluster of GSAs for such high volumes.

And if anything: you still shouldn't believe the hype. Google's "billion document index" headline was syndicated across hundreds of news sources before even Google itself found out its documentation contradicted this. You'll want to be sure to get your information from a reliable source.

