Time to Tame the Apache Menagerie

08-Dec-2009

Subscribers to our Search and Information Access Research are well aware that we've been increasing our coverage of Apache Lucene lately, in keeping with the phenomenal -- and still growing -- popularity of Apache's well-known open-source search engine.

This has led to a coverage conundrum (of sorts) for us, inasmuch as it is no longer possible to cover Lucene properly without also devoting a good deal of discussion to closely related projects like Nutch and (especially) Solr. This becomes problematic at times, not just because we're in essence covering multiple projects under one conceptual umbrella, but because the functional and architectural boundaries between things like Lucene, Nutch, and Solr -- though well understood by developers -- are easily blurred in a semi-technical writeup unless special care is taken to distinguish search server from search engine, crawler from parser, and so on.

Some of these bits are unique to Lucene (the "engine" part, for example, consisting of the indexer and query framework), others are unique to Solr (e.g., the "query server" bits that handle data-fetching and -passing over HTTP), and still others (like UI widgets for faceted search) aren't there at all -- you have to build them yourself.
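
To make the "query server" idea concrete, here is a minimal Python sketch of what talking to Solr over HTTP tends to look like from the client side. It only builds the request URL; nothing is sent. The host, port, and field names (title, category) are assumptions made up for illustration, and the helper function is hypothetical rather than part of any Solr client library. The point is simply that the client deals in URLs and query parameters, while the Lucene index sits behind the server.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, facet_field=None, rows=10):
    """Build a URL for Solr's /select request handler (hypothetical helper)."""
    # q, rows, and wt are standard Solr query parameters.
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field is not None:
        # Solr returns facet counts; rendering them in a UI is up to you.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr instance and assumed field names.
url = build_solr_query_url(
    "http://localhost:8983/solr", "title:lucene", facet_field="category"
)
print(url)
```

Note that the facet parameters only ask the server for counts. Turning those counts into clickable widgets is the part you would still have to build yourself.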

In short, as we expand our coverage of Lucene, we find ourselves investing ever-greater amounts of time and care in tiptoeing the conceptual boundaries around Solr, Nutch, Lucene, Hadoop, and so on. We think we do a pretty good job. But it's surprising how many people (including us, at times) still have trouble keeping the various pieces of the Apache search world straight.

Our job isn't made easier by the Apache Software Foundation's laissez-faire attitude toward project naming, which has led to an out-of-control zoo of projects whose names are occasionally sensible but more often opaque: Hadoop, Mahout, Tika, Lenya, James, Mina... and the list goes on.

There's a longstanding tradition in R&D (and elsewhere, of course) of using whimsical, short, purposely obscure code names for projects early in their lifetimes. And that's fine for prototypes and pre-release versions of software. But a mature product needs a mature name, preferably something descriptive and apropos. For example, Droids is not an entirely inappropriate name for Apache's autonomous-robots project. It's at least semantically aligned with the domain. But even if you know enough Hindi to figure out that Mahout is a term for the driver of an elephant, you're not likely to divine that it is also an open-source project for distributed machine learning algorithms on the Hadoop platform (and you shouldn't then be forced to look up what Hadoop means, and so on).

So, Suggestion No. 1 for Apache: When a project graduates from incubation, give it a real name.

It would also help if Apache namespaced subprojects and/or related projects in a logical fashion -- a fashion that shows the relationship. For example, would it hurt to call Solr "Lucene Search Server" -- or at least "Lucene Solr"? Solr is, after all, strictly dependent on Lucene, much the way Sling is dependent on Jackrabbit.

Suggestion No. 2: Make dependencies evident in project names. It helps people understand what the projects are about.

If the world is headed toward a Lucene-* stack (as it surely is), wouldn't it be nice to be able to refer to it that way? If people are having a hard time understanding that Solr is a search server, wouldn't it make sense to put "server" in the name? Bottom line, a rational namespace for Apache projects would be a big win for all concerned.

Those of us who regularly tiptoe the boundaries around Apache's zoo of related projects would like to occasionally REST our feet.

Real Story Group

Strong opinions. Candid advice.
