Challenge of Scale within the Enterprise, Part 2 - Search

Following on from Tony's opening missive on scalability in the modern enterprise, let's turn to the challenge of enterprise search at scale.

There are probably some of you thinking, "Ah well, we've got scale nailed. For we are steeped in the history of very big things." These would be the people who think of information in terms of its weight.

Weight you see is a very tempting measure of information. For one it seems precise: we have X petabytes of stuff. And said stuff is secure and safely tucked up on its load-balanced servers, chilled by air conditioning and fed through a system of uninterruptible power supplies. For another, weight is easy to report upon, plan for, and generally treat as a fait accompli just in and of itself.

Credit where credit is due, it was Gartner's Doug Laney who first coined the now widely used triumvirate of "Volume, Velocity, and Variety" -- the 3Vs -- back in 2001 to describe what he believed were the emerging challenges for enterprises around data sources, structures and formats.

A decade or so on, those challenges have not gone away or been magically solved. Indeed, as I've discussed a few times of late, they are fueling the current vendor gold-rush to have an answer to what is now snappily referred to as "Big Data." However, the three Vs -- here augmented with a couple of additions of my own -- still represent a useful way of examining the issues of scale within the enterprise with reference to search technologies, and are far more enlightening than weight alone.

Volume, Velocity and Variety - The Original Gangsters

Ten years on from this trio's inception, they are worth revisiting and perhaps defining in a concise way with respect to enterprise search. None is more important or more pressing than any other -- most organizations will see the pressures exerted by each to a greater or lesser extent -- but each presents a distinct challenge for the enterprise, even if treated in isolation.

Volume

We lavish so much attention on the creation of our content that we're really reluctant to delete any of it. I say this as someone who in his own filing system (and I use the term loosely) retains documentation for household gadgets that I no longer own, receipts for work on properties I've not lived in for a decade, and at least five years' worth of birthday cards. My own content retention policy, therefore, needs a little work. Excessive retention is only one factor that affects volume; we're also creating more content, and that content (in the case of, say, video) often requires far greater amounts of physical space to store.

The obvious result is that as the volume of data we might wish to store continues to escalate, the technology designed to retrieve specific bits of information has to cope with larger indexes, and needs greater precision in determining what it is that you are after.

Velocity

Not only are we creating more content, but the rate (frequency) at which we are doing this is also increasing. General technological adoption within the workplace -- allied with new collaboration, social, and messaging technology -- puts added pressure on the aforementioned retention. Perhaps more importantly, it has created additional use cases for enterprise search, especially around compliance and discovery.

If this alone weren't enough of a challenge, the modern field marketeer's toolkit now adds new complexity for enterprise architects. Including externally-sourced social media data in the mix -- for applications such as Social Media Monitoring -- not only adds volume, but huge extra velocity in terms of the rapidity with which these new data sets get updated and queried.

Variety

As I've already discussed, the variety of the content required in our indexes is changing. The complexity of this variety is actually pretty well understood.  Connectors that can traverse a wide variety of repositories to collect information have become fairly mature.

But what is changing is the sheer number of those repositories, both those within the enterprise and those outside. Additionally, many of those that are employed within the enterprise inconveniently reside physically outside your firewall. Think of anything from Salesforce for CRM to Yammer for microblogging. These tools often expose an API of varying degrees of quality for integration purposes, but in general (and ignoring my previous warning on APIs) just managing the number of sources, their data formats, and any necessary transformations is a headache -- even more so at scale.

Veracity and Vocabulary - The New Guns

As the challenges presented by that original trio multiply at scale, I'll add another couple of candidates to the stack:

  1. Veracity: how do you manage the accuracy and "truth" of the content you are indexing?
  2. Vocabulary: how do you cope with complexities in language (both regional and technical)?

Veracity

If we only consider intra-organizational data -- that created within the enterprise -- you face a massive challenge to interrogate information for authority. There are of course occasions when the most recent voice on a subject can proxy for the most authoritative. But relying purely upon recency when poring through search results for definitive knowledge is far from ideal. Or consider another attempt: for a long time SharePoint Search overweighted documents based on their proximity to root folders. For most customers, that didn't turn out too well.

Authority is a complex concept, made up of not only the information itself, but also who produced it, when it was produced and, of course, what might have occurred subsequently. We're almost entirely dependent on data quality -- the governance of the information within an index or set of indexes -- to help guide us. Add to this the increasing focus on extra-organizational data (such as that created within social networks like Twitter) and measuring authority becomes even more daunting.
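To make the point concrete, here is a minimal sketch of what blending those ingredients might look like, as opposed to ranking on recency alone. Everything in it is an assumption made for illustration: the `Doc` fields, the role names and weights, the one-year half-life and the 60/40 blend are hypothetical, and none of it describes how any particular search product scores results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    title: str
    author_role: str          # hypothetical governance metadata
    published: date
    superseded: bool = False  # has something later replaced this document?


# Hypothetical weights: who produced the content.
ROLE_WEIGHT = {"policy_owner": 1.0, "subject_expert": 0.8, "contributor": 0.4}


def authority_score(doc: Doc, today: date, half_life_days: int = 365) -> float:
    """Blend author role and recency, then penalize superseded documents."""
    age_days = max((today - doc.published).days, 0)
    recency = 0.5 ** (age_days / half_life_days)   # halves every half-life
    role = ROLE_WEIGHT.get(doc.author_role, 0.2)   # unknown roles score low
    score = 0.6 * role + 0.4 * recency
    if doc.superseded:
        score *= 0.25
    return score


today = date(2024, 1, 1)
policy = Doc("Travel policy", "policy_owner", date(2022, 1, 1))
chatter = Doc("Re: travel policy?", "contributor", today)

ranked = sorted([chatter, policy],
                key=lambda d: authority_score(d, today),
                reverse=True)
print([d.title for d in ranked])
```

With these made-up weights, the two-year-old policy from its owner still outranks today's casual reply, which is the behavior a recency-only ranking cannot give you. The sketch only works if the role and superseded metadata exist and are trustworthy, which is exactly the data quality dependency described above.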

Vocabulary

Today, very few enterprises of any scale are contained within the borders of one linguistic territory, and even those that define a single "business language" will naturally still generate a good deal of important information and communication in local languages.

On top of national language vocabularies, we have technical and organizational vocabularies which may or may not be shared across all these languages. Indeed, some are so arcane and complex that they might be considered almost anti-languages of their own, making it impossible to assume that you can easily map them into a workable hierarchy for clustering results.
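For the easy cases, the usual starting point is a controlled vocabulary that maps local-language and jargon variants onto a shared concept at query time. The sketch below assumes a small hand-maintained dictionary; the `VOCAB` entries and the function names are hypothetical, and a flat lookup like this does nothing for the arcane vocabularies that resist mapping altogether.

```python
# Hypothetical controlled vocabulary: concept -> known variants
# (local languages plus in-house jargon).
VOCAB = {
    "invoice": {"invoice", "rechnung", "facture", "inv"},
    "purchase order": {"purchase order", "bestellung", "po"},
}


def build_lookup(vocab):
    """Invert the vocabulary so any variant leads back to its concept."""
    lookup = {}
    for concept, variants in vocab.items():
        for variant in variants | {concept}:
            lookup[variant.lower()] = concept
    return lookup


def expand_query(terms, vocab):
    """Return every known variant for each term; unknown terms pass through."""
    lookup = build_lookup(vocab)
    expanded = set()
    for term in terms:
        key = term.lower()
        concept = lookup.get(key)
        if concept is None:
            expanded.add(key)
        else:
            expanded |= {v.lower() for v in vocab[concept]} | {concept}
    return expanded


print(sorted(expand_query(["Rechnung", "overdue"], VOCAB)))
```

A colleague searching for "Rechnung" now also matches documents that say "invoice" or "facture", while "overdue" passes through untouched because nobody has mapped it. At scale the hard part is less the lookup than keeping such a vocabulary governed and current.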

What does all this mean for search at scale?

The crux of what I'm suggesting is that search at any size already means addressing many difficult things. When you make those difficult things bigger -- either in importance to a line of business or in terms of sheer scale -- then you end up with a big difficult thing. You don't have to be an analyst to work that sum out.

It becomes vital, then, to understand first what you are trying to achieve, what your panacea looks like, before you even start to think about acquiring tools. For each data source you throw into enterprise indexes, ask yourself: why is it needed? How can you measure the quality of what it contains and gain a proper understanding of what it means in the context of the questions your colleagues are posing?

Scale increases the weight factor for sure, but Occam's razor still applies. On the enterprise data side, experts learned long ago not to treat all data equally; apply the same sharp focus to your unstructured information as well.

