Content cleanup in the former East Germany

There's no time like the holidays for catching up on back issues of The Economist (don't worry, we're baking cookies, too), and this morning I found myself engrossed by a tale of pattern matching. No, not pattern matching of snowflakes or Christmas knits, but of a set of documents ripped into 600 million pieces by East Germany's State Security Service (better known as the Stasi), back when the Berlin Wall was being torn down and the mob was at the gates. The Stasi were afraid of documents falling into the wrong hands, so when the shredders failed, they frantically resorted to tearing up documents piece by piece. And you thought getting your enterprise search engine to pull off late-binding security was tough?

In a project currently underway at Berlin's Fraunhofer Institute for Production Systems and Design Technology, software is being used to find patterns in these millions of Stasi-created fragments of paper and re-assemble them, jigsaw-puzzle style. In going through the fragments, the software is grouping the scanned shreds of paper together by identifying patterns in handwriting, color, paper texture, even ink color. Then, once a group of related shreds is found, the software puzzles the papers together. In their haste, the Stasi actually helped this process quite a bit -- most of the fragments of the same document were found in the same bag. Or bucket. Category. Taxonomy facet, if you will.
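For the curious, the grouping step can be pictured with a few lines of Python. This is only a toy sketch, not the Fraunhofer software: the `Shred` record, its fields, and the sample values are all made up for illustration, and real features such as handwriting or paper texture would come from image analysis rather than tidy labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shred:
    """Hypothetical record for one scanned fragment."""
    shred_id: str
    bag: str          # the sack the fragment was found in
    paper_color: str
    ink_color: str


def group_shreds(shreds):
    """Bucket shred IDs by (bag, paper color, ink color)."""
    groups = defaultdict(list)
    for shred in shreds:
        key = (shred.bag, shred.paper_color, shred.ink_color)
        groups[key].append(shred.shred_id)
    return dict(groups)


# Invented sample data: four fragments from two bags.
shreds = [
    Shred("a", "bag-1", "white", "blue"),
    Shred("b", "bag-1", "yellow", "black"),
    Shred("c", "bag-1", "white", "blue"),
    Shred("d", "bag-2", "white", "blue"),
]
groups = group_shreds(shreds)
```

Each resulting group is a much smaller jigsaw puzzle than the full pile, which is the whole point of sorting before assembling.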

Like enterprise search tools that perform some sort of text mining and subsequent clustering -- such as Autonomy, FAST or Endeca -- this software has the capacity to learn and refine what it puts together, identifying new content as more or less like the original items in the set. When it gets confused (such as when a document has distorted or torn edges), it refers the act of judgement to a human being. But what's especially interesting about this software is that it actually spawns slightly altered versions of itself that compete for computer time on the basis of success at finding matches. Now that's something I'd love to see from my local enterprise search vendor.
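That "spawn and compete" idea resembles a simple evolutionary loop. Here is a rough sketch of the general technique, and emphatically not the actual Stasi-puzzle code: I'm assuming each variant is nothing more than a similarity threshold, that success is measured against a small hand-labelled sample, and that surviving to the next round stands in for winning computer time. Every name and number below is hypothetical.

```python
import random


def success_rate(threshold, samples):
    """Fraction of labelled pairs that a matcher with this threshold gets right."""
    correct = sum(
        (similarity >= threshold) == is_match
        for similarity, is_match in samples
    )
    return correct / len(samples)


def evolve(thresholds, samples, generations, rng):
    """Keep the more successful variants and let each spawn an altered copy."""
    for _ in range(generations):
        ranked = sorted(
            thresholds,
            key=lambda t: success_rate(t, samples),
            reverse=True,
        )
        # Only the better half earns more "computer time".
        survivors = ranked[: max(1, len(ranked) // 2)]
        # Each survivor spawns a slightly altered version, clamped to [0, 1].
        children = [
            min(1.0, max(0.0, t + rng.uniform(-0.05, 0.05)))
            for t in survivors
        ]
        thresholds = survivors + children
    return max(thresholds, key=lambda t: success_rate(t, samples))


# Invented sample: (similarity score, whether a human confirmed the match).
samples = [
    (0.90, True), (0.80, True), (0.60, True),
    (0.55, False), (0.40, False), (0.30, False),
]
initial = [0.20, 0.50, 0.90, 0.95]
best = evolve(initial, samples, generations=20, rng=random.Random(42))
```

Because the best variant always survives a round, the winner can never do worse than the best starting variant; the mutations only give it a chance to do better.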

There are a few lessons to be learned here. First, this is a multi-year project with dedicated resources, which is more than most companies are willing to commit to their own document scanning and indexing efforts. Second, while pattern matching may seem like an exact way to search for things, there are always factors in play that require judgement and refinement -- be it subtle linguistic differences, synonyms, or even how someone happened to tear something up.

And finally -- although history will surely welcome the Stasi's carelessness -- you should never take content security and storage lightly. You may think content is "secure enough," until you realize just how good your new enterprise search tool is at indexing all your content, but how bad it is at tying into your ACLs and showing the right results only to those who should see them.
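To make the ACL point concrete, here is a minimal sketch of trimming results at query time. It assumes the ACLs are available as a plain mapping from document ID to the set of groups allowed to read it; the function name, document IDs, and group names are invented, and no real search product's API is being shown.

```python
def visible_results(hits, acl, user_groups):
    """Return only the hits the user may see, preserving ranking order.

    A document with no ACL entry is hidden (deny by default).
    """
    allowed = []
    for doc_id in hits:
        doc_groups = acl.get(doc_id, set())
        if doc_groups & user_groups:  # at least one group in common
            allowed.append(doc_id)
    return allowed


# Hypothetical ACLs and a ranked hit list straight from the index.
acl = {
    "memo-1": {"hr"},
    "memo-2": {"hr", "engineering"},
    "memo-3": {"legal"},
}
hits = ["memo-1", "memo-2", "memo-3", "memo-4"]
results = visible_results(hits, acl, {"engineering"})
```

Note the deny-by-default choice: a document the index knows about but the ACL source does not ("memo-4" here) stays hidden, which is usually the safer failure mode.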

Now, why can't I get my snowflake cookies to all look exactly alike?

