Thoughts about technology, business and all that's life.

This blog has moved to

Monday, April 20, 2009

Burst of activity in Lucene

There is a large amount of work being done in Lucene 2.9, in which a large portion is related to adding support for near real-time search.

To put it very simply, search engines transfer a lot of work from query-time to index-time. The reason this is done, is to speed up queries at the cost of adding documents slower. Until now, Lucene based systems have had problems with dealing with scenarios in which the searchers need to see the changes instantly (think Twitter Search). There exist a variety of tricks and techniques to acheive this even now. However, near real-time search support in Lucene itself is a boon to all those people who have been building and managing such systems because the grunt work will be done by Lucene itself.

This is still under development and will probably take a few more months to mature. Solr will benefit from it as well but before that can happen, a lot of work will be needed under the hood particularly in the way Solr handles its caching.

Michael McCandless has summarized the current state of Lucene trunk in this email on java-dev mailing list. In fact, there is so much activity that, at times, it becomes very difficult to follow all the excellent discussions that go on. There are some very talented people on that forum and it is a lot of learning for a guy like me, who started with Solr and is still trying to find his way in the Lucene code base.

Lucene 2.9 will bring huge improvements and I'm looking forward to working with other Solr developers to integrate them with Solr.

Wednesday, April 08, 2009

Google App Engine and Maven

Google has announced support for building Java applications on the App Engine platform. This is great news for new App Engine developers and especially for those Java developers who had to learn Python to use App Engine.

I created a project for App Engine using Maven for builds. These were the steps I needed to follow:

1. Publish the App Engine libraries to the local Maven repository. Goto the app-engine-java-sdk directory (where app-engine sdk is installed) and execute the following commands:
mvn install:install-file -Dfile=lib/appengine-tools-api.jar -DartifactId=appengine-tools -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/shared/appengine-local-runtime-shared.jar -DartifactId=appengine-local-runtime-shared -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/user/appengine-api-1.0-sdk-1.2.0.jar -DartifactId=appengine-sdk-1.2.0-api -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/user/orm/ -DgroupId=org.datanucleus -DartifactId=datanucleus-appengine -Dpackaging=jar -DgeneratePom=true

2. Create a maven pom file. This is the one that I used:

<project xmlns=""





3. Create the standard maven directory structure and add the pom.xml in the same directory as the src directory.

You're done!

I tested this with a simple servlet based application and it worked fine. I did not test the JPA/JDO integration so it might be a little rough around the edges. But it should work for the most part. Note, App Engine supports Java 6. If you want to use Java 6, you can change the "source" and "target" in the build section to 1.6 instead of 1.5

Apache Mahout 0.1 Released

Apache Mahout 0.1 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

This is also the first public release of Taste collaborative filtering project ever since it was donated to Apache Mahout last year.

From the official announce email:
The Apache Lucene project is pleased to announce the release of Apache Mahout 0.1. Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. The first public release includes implementations for clustering, classification, collaborative filtering and evolutionary programming.

Highlights include:
  1. Taste Collaborative Filtering
  2. Several distributed clustering implementations: k-Means, Fuzzy k-Means, Dirchlet, Mean-Shift and Canopy
  3. Distributed Naive Bayes and Complementary Naive Bayes classification implementations
  4. Distributed fitness function implementation for the Watchmaker evolutionary programming library
  5. Most implementations are built on top of Apache Hadoop ( for scalability
Look at the announcement for more details -

There is a lot of interest in Mahout from the community and it had a successful year with the Google Summer of Code 2008 program. This year again, there have been multiple proposals and I'm sure that great things are on the way.

The Apache Mahout Wiki has a lot of good documentation on the project as well as on machine learning in general. Their mailing list is very active and of course, they have some great people involved, see the committers page. I would encourage every student interested in machine learning to participate in the project.

I wish good luck to the project and the people involved in it. Keep up the great work!

Tuesday, April 07, 2009

Tagging and Excluding Filters

Multi-select faceting is a new feature in the, soon to be released, Solr 1.4. It introduces support for tagging and excluding filters which enables us to request facets on a super-set of results from Solr.

The Problem

Out-of-the-box support for faceted search is a very compelling enhancement that Solr provides on top of Lucene. I highly recommend reading through the excellent article by Yonik on faceted search at Lucid Imagination's website, if you are not familiar with it.

Faceting on a field provides a list of (term,document-count) pairs for a given field. However, the returned facet results are always calculated on the current resultset. Therefore, whatever the current results are, the facets are always in sync with the results. This is both an advantage as well as a disadvantage.

Let us take the search UI for finding used vehicles on the website. There are facets on the seller's location and the vehicle's model. Let us assume that the Solr query to show that page looks like the following:

What happens when you select a model by clicking on, say "Impala"? The facet for vehicle model disappears. Why? The reason is that now only "Impala" is being shown and there are no other models present in the current result set. The Solr query looks like the following now:

So what is wrong with this? Nothing really. Except that for ease of navigation, you may still want to show all other models and document-counts which were being shown in the super-set of the current results (the previous page). But, as we noted a while back, the facets are shown for the current result set, in which all the models are Impala. If we attempt to facet on models field with the filter query applied, we will get a list of all models. But, except for "Impala", all other models will have a zero document count.

Solution #1 - Make another Solr query

Make another call to Solr without the filter query to get the other values. Our example query would look like:
The rows=0 is specified because we don't really want the actual results, just the facets for the model field. This is a solution that can be used with any version of Solr. However, it is one additional HTTP request. Even though it is a bit inconvenient, this is usually fast enough. However, an additional call is expensive if you are using Solr's Distributed Search which will send one or more queries to each shard.

Solution #2 - Tag and exclude filters

This is where multi-select faceting support comes in handy. With Solr 1.4, it is possible to tag the filter queries with a name. Then we can exclude one or more tagged queries when requesting for facets. All of this happens through additional metadata that is added to request parameters through a syntax called Local Params.

Let us go step-by-step and change the query in the above example and see how the request to Solr will look like.

1. The original request in the above example without tagging:
2. The filter query tagged with 'impala':
3. The facet field with the 'impala' filter query excluded:
Now, with this one query, you can get the facets for current results as well as for the super-set without the need to make another call to Solr. If you want Solr to return this particular facet field under an alternate name, you can add a 'key=alternative-name' local param. For example, the following Solr query will return the 'models' facet under the name of 'allModels':
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala key=allModels}model&facet.mincount=1&fq={!tag=impala}model:Impala
Tagging, filtering and renaming is not just limited to facet fields. It can be used with facet queries, facet prefixes and date faceting too.

This is another cool contribution by Yonik (also see my previous post). I'm really looking forward to the Solr 1.4 release. It is bringing a bunch of very useful features including the super-easy-to-setup Java based replication. But more on that in a later post.

Sunday, April 05, 2009

Inside Solr: Improvements in Faceted Search Performance

Yonik Seeley recently implemented a new method for faceting which will be available in Solr 1.4 (yet to be released). It is optimized for faceting on multi-valued fields with large number of unique terms but relatively low number of terms per document. The new method has made a large improvement in performance for faceted search and has cut memory usage at the same time.


When you facet on a field, Solr gets the list of terms in a field across all documents, executes a Filter Query for each term, caches the set of documents matching the filter, intersects it against the current result set and gives the count of documents matched for each term after the intersection. This works great for fields which have few unique values. However, it requires a large amount of memory and time when the field has a large number of unique values.


The new method uses an UnInvertedField data structure. In very basic terms, for each document, it maintains a list of term numbers that are contained in that document. There is some pre-computation involved in building up this data structure, which is done lazily for each field, when needed. If a term is contained in too many documents, it is not un-inverted. In this new method, when you facet on a field, Solr iterates through all the documents, summing up the number of occurrences of each term. The terms which were skipped while building the data structure use the older set intersection method during faceting.

This data structure is very well optimized. It doesn't really store the actual terms (string). Each term number is encoded as a variable-length delta from the previous term number. A TermIndex is used to convert term numbers into the actual value for only those terms which are needed after faceting is completed (the current page of facet results). The concept is simple but if not implemented in an efficient way, it may impair performance rather than improve it. Therefore, there are a lot of important optimizations in the code.


Yonik benchmarked the performance of the new method against the old and his tests show a lot of improvement in faceting performance, sometimes by an order of magnitude (upto 50x). The improvement is much more significant as the number of unique tokens are increased.

For a comprehensive performance study, see the comment on the Jira issue about performance here and the document here.

There are a few ideas in the code comments which give directions on possible future optimizations. But the improvement from the old method are already quite massive, probably the law of diminishing returns will hold true here.

The structure is thrown away and re-created lazily on a commit. There might be a few concerns around the garbage accumulated by the (re)-creation of the many arrays needed for this structure. However, the performance gain is significant enough to warrant the trade-off.


The new method has been made the default method for faceting on non-boolean fields in the trunk code. It will be released with Solr 1.4 but it is already available in the trunk and nightly builds. If you are comfortable using the nightly builds, you are welcome to try it out.

A new request parameter has been introduced to switch to the old method if needed. Use facet.method=fc for the new method (default) and facet.method=enum for the old one.

Note - "Inside Solr" is a new feature that I hope to write regularly. It is intended to give updates about new features or improvements in Solr and at the same time, to describe the implementation details in a simple way. I invite you to give feedback through comments and tell me about what you would want to read about Solr.

Wednesday, April 01, 2009

The architecture behind popular websites

Sharing a few interesting articles I read in the past few weeks on the interweb about Twitter, LinkedIn, Ebay and Google.

Improving running components at Twitter describes the evolution of Twitter's technology and about their new message queue server, named Kestrel, written in approximately 1.5K lines of Scala.

LinkedIn Communication Architecture details the heavy usage of Java, Tomcat, Jetty, Lucene, Spring and ActiveJMX at LinkedIn. Oracle and MySQL are used for data storage. They have made heavy customizations to Lucene for their near real-time indexing needs. They have open-sourced their Lucene modifications in the form of Zoie on Google Code. The upcoming Lucene In Action 2 has a case-study on how Zoie builds upon Lucene.

The eBay way is a presentation on eBay's realtime personalization system. This mammoth system handles 4 billion reads/writes per day. The interesting thing about this system is that it uses the MySQL memory engine as a caching tier in front of a persistent tier. Some critical data is replicated (presumably on the cache tier as they talk about doubling memory needs). They encountered problems with the single-threaded MySQL replication, so it is managed through dual writes instead (the second write can be asynchronous). The system is capable of automatic redistribution of data if a node goes down.

Jeff Dean's WSDM keynote slides on the evolution of Google's search infrastructure are perhaps the most interesting of all. It has gone through a number of iterations over the years. I was surprised to know that their complete index is served out of memory. Although it makes sense with the fact that as they increased the number of nodes, they crossed a point where they had enough combined memory to hold the index completely.

About Me

My photo
Committer on Apache Solr. Principal Software Engineer at AOL.

Twitter Updates

    follow me on Twitter

    Recently shared stories

    Recent questions on Apache Solr

    Recent development in Apache Solr