Never Mind the Quality, Feel the Width - Big Data's Emerging Problem

Big Data may be a buzzword, but it's certainly generating interesting discussion. Over the last month or two I've been party to a number of really interesting sessions - such as the CW500 event I mentioned previously - and with recent acquisitions in this space, the question is becoming less about whether Big Data is possible and more about how it can be applied in the enterprise.

The Problem of Data Quality for Unstructured Content

For me, this raises the question of quality - especially when dealing with unstructured data.

"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question." (Charles Babbage, Passages from the Life of a Philosopher).

Babbage's thoughts on the subject of data quality were neatly summarized by George Fuechsel a century or so later as "garbage in, garbage out," or "GIGO."

Understanding of data quality in the world of structured data (think ERP, CRM, BI) has reached a very high level of maturity. Unfortunately, the same cannot be said for the world of unstructured data, a.k.a. content. Ensuring the same level of quality for unstructured data - so that poor inputs don't skew subsequent analysis - is much harder.

Use Cases on Offer

Listening to Oracle discuss the possible outcomes from Big Data, you hear many references to use cases such as "smart meters" in domestic scenarios, or medical sensing equipment attached to patients. When scaled out, these examples can certainly produce vast quantities of data, and that data will almost certainly provide valuable insight once analyzed.

I would argue, though, that these are pretty limited use cases that simply extend existing applications.

They ignore the massive amount of true content, from short-form social posts to long-form document text. Is this because such content is inherently not useful, or because the problem of quality makes it too hard to glean actionable results?

Reading IBM's own commentary on how it plans to apply its newly acquired Vivisimo technology to this problem suggests that the company has at least recognized the issue. IBM envisions Vivisimo as a kind of content curation tool: federating sources and assembling data sets that have been filtered for quality and faceted into logical collections. That appears sensible in theory, but it raises a question: why Vivisimo, rather than IBM's pre-existing Content Analytics/Omnifind technologies? Might Enterprise Search find a new role across the board in this emerging area?
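To make the curation idea concrete, here is a minimal sketch of what "filter for quality, then facet into collections" might look like in practice. This is not IBM's actual pipeline; the `Document` structure, the scoring heuristics, and the threshold are all illustrative assumptions, standing in for the far richer signals (duplication, provenance, language detection, spam classifiers) a real curation tool would use.

```python
from dataclasses import dataclass

@dataclass
class Document:
    source: str   # e.g. "twitter", "sharepoint", "crm-notes"
    text: str

def quality_score(doc: Document) -> float:
    """Crude quality heuristic: longer, more language-like text scores higher.
    A stand-in for the richer signals a real curation tool would apply."""
    words = doc.text.strip().split()
    if not words:
        return 0.0
    # Fraction of tokens that look like words (punctuation stripped).
    alpha_ratio = sum(w.strip(".,;:!?").isalpha() for w in words) / len(words)
    # Very short fragments score low; saturate at ~20 words.
    length_factor = min(len(words) / 20.0, 1.0)
    return alpha_ratio * length_factor

def curate(docs: list[Document], threshold: float = 0.3) -> dict[str, list[Document]]:
    """Drop low-quality documents, then facet the survivors by source."""
    collections: dict[str, list[Document]] = {}
    for doc in docs:
        if quality_score(doc) >= threshold:
            collections.setdefault(doc.source, []).append(doc)
    return collections

if __name__ == "__main__":
    docs = [
        Document("twitter", "gr8 #!!! lol"),
        Document("sharepoint", "Quarterly maintenance report covering turbine "
                               "inspections, observed anomalies, and follow-up "
                               "actions recommended by the field engineering team."),
    ]
    for source, kept in curate(docs).items():
        print(source, len(kept))  # only the sharepoint document survives
```

The point of the sketch is the GIGO guard: low-quality fragments are excluded before they ever reach the analytics layer, rather than being cleaned up after they have already skewed the results.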

What You Should Do

Right now there is a paucity of solid business cases for Big Data in the enterprise. There is certainly no shortage of ideas and theories, but customers are still primarily sandboxing subsets of data, looking for indications that a demonstrable return on investment is to come. As you look for suitable use cases, and as your Big Data explorations turn more to unstructured data, remember GIGO and don't lose sight of data quality.

