Thoughts about technology, business and all that's life.

This blog has moved to http://shal.in.

Thursday, November 26, 2009

Apache Lucene Java 3.0 Released


Apache Lucene Java 3.0.0 has been released. Lucene Java 3.0.0 is mostly a clean-up release without any new features. It paves the way for refactoring and adding new features without the shackles of backwards compatibility. All APIs deprecated in Lucene 2.9 have been removed and Lucene Java has officially moved to Java 5 as the minimum requirement.

See the announcement email for more details. Congratulations Lucene Devs!

Wednesday, November 18, 2009

Apache Mahout 0.2 Released



Apache Mahout 0.2 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

From the project website:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.2.

Highlights include:

  • Significant performance increase (and API changes) in collaborative filtering engine
  • K-nearest-neighbor and SVD recommenders
  • Much code cleanup, bug fixing
  • Random forests, frequent pattern mining using parallel FP growth
  • Latent Dirichlet Allocation
  • Updates for Hadoop 0.20.x
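If you want a quick taste of the collaborative filtering engine, a user-based recommender can be wired up in a few lines. The sketch below is only illustrative: it assumes the 0.2 Taste API (which changed in this release, so check the javadocs) and a made-up ratings.csv file of userId,itemId,preference lines.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv is a hypothetical file of userId,itemId,preference lines
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 5 recommendations for user 1234
    List<RecommendedItem> items = recommender.recommend(1234L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
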

Details on what's included can be found in the release notes.

Downloads are available from the Apache Mirrors


Tuesday, November 10, 2009

Apache Solr 1.4 Released


From the official announcement:

Apache Solr 1.4 has been released and is now available for public download!
http://www.apache.org/dyn/closer.cgi/lucene/solr/

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

New Solr 1.4 features include
  • Major performance enhancements in indexing, searching, and faceting
  • Revamped all-Java index replication that's simple to configure and can replicate configuration files
  • Greatly improved database integration via the DataImportHandler
  • Rich document processing (Word, PDF, HTML) via Apache Tika
  • Dynamic search results clustering via Carrot2
  • Multi-select faceting (support for multiple items in a single category to be selected)
  • Many powerful query enhancements, including ranges over arbitrary functions, and nested queries of different syntaxes
  • Many other plugins including Terms for auto-suggest, Statistics, TermVectors, Deduplication
Performance Enhancements
  1. A simple FieldCache load test
  2. Filtered query performance increases
  3. Solr scalability improvements
  4. Solr faceted search performance improvements
  5. Improvements in Solr Faceting Search
Revamped All-Java Replication
  1. SolrReplication wiki page
  2. Works on Microsoft Windows Platforms too!
DataImportHandler improvements
  1. What's new in DataImportHandler in Solr
  2. DataImportHandler wiki page
Rich document processing
  1. ExtractingRequestHandler Wiki page
  2. Posting Rich Documents to Apache Solr using SolrJ and Solr Cell
Dynamic Search Results Clustering
  1. ClusteringComponent Wiki page
  2. Solr's new Clustering Capabilities
Multi-select Faceting
  1. Local params for faceting
  2. Tagging and excluding filters
Query Enhancements
  1. Ranges over functions
  2. Nested query support for any type of query parser (via QParserPlugin). Quotes will often be necessary to encapsulate the nested query if it contains reserved characters. Example: _query_:"{!dismax qf=myfield}how now brown cow"
New Plugins
  1. TermsComponent (can be used for auto-suggest)
  2. TermVectorComponent
  3. Statistics
  4. Deduplication
SolrJ - Java client
  1. Faster, more efficient Binary Update format
  2. Javabean (POJO) binding support
  3. Fast multi-threaded updates through StreamingUpdateSolrServer
  4. Simple round-robin load balancing client - LBHttpSolrServer
  5. Stream documents through an Iterator API
  6. Many performance optimizations
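As a small illustration of the SolrJ improvements, here is a minimal indexing sketch using StreamingUpdateSolrServer. It assumes a Solr 1.4 server at http://localhost:8983/solr and a schema with id and name fields; adjust both for your setup.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrJExample {
  public static void main(String[] args) throws Exception {
    // queue of 20 documents, 4 background threads draining the queue
    SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);
    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("name", "Document number " + i);
      server.add(doc);   // buffered and streamed by the background threads
    }
    server.commit();
  }
}
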
Miscellaneous
  1. Rollback command in UpdateHandler
  2. More configurable logging through the use of SLF4J library
  3. 'commitWithin' parameter on add document command allows setting a per-request auto-commit time limit.
  4. TokenFilter factories for Arabic language
  5. Improved Thai language tokenization (SOLR-1078)
  6. Merge multiple indexes
  7. Expunge Deletes command
Upgrade instructions

Although Solr 1.4 is backwards-compatible with previous releases, users are encouraged to read the upgrading notes in the Solr Change Log.

There are so many more new features, optimizations, bug fixes and refactorings that it is not possible to cover them all in a single blog post.

A large amount of effort has gone into this release. Many congratulations to the entire Solr community for making this happen!

Great things are planned for the next release and it is a great time to get involved. See http://wiki.apache.org/solr/HowToContribute for how to get started.

Enjoy Solr 1.4 and let us know on the mailing lists if you have any questions!

Tuesday, October 27, 2009

Why you should contribute to Open Source

Note: The following material and presentation was prepared for students of the Indian Institute of Information Technology (IIIT), Allahabad. The aim was to get them excited about contributing to open source projects, and in particular to Apache Lucene, Solr and Hadoop. The first talk, titled "Why you should contribute to Open Source", was aimed at freshmen and has no technical content. The second, titled "Get involved with the Apache Software Foundation", was given to sophomore, junior and senior students and goes into some basic technical information on the Apache Lucene, Solr and Hadoop projects. The following post comprises some notes that I put together for the talks.

Work on what you like, when you like

Everybody wants to work on "cool" products. However, the reality is that most of you will get stuck in a job which, although it may pay well, will hardly be about the things you wanted to work on. In your course, you will learn about algorithms, distributed systems, natural language processing, information retrieval, bio-informatics and other areas of computer science and its applications, but in real life, the majority of the work done in software companies has little direct application of the things you will learn in your course.

Most of the time you will be using things built by others and writing glue code to build the things your company's business needs. This is not to say that all that knowledge will go to waste; it will definitely help you become a better programmer and you should learn it, but there's a fair chance that it may not be used directly in your job.

Open Source projects offer you a chance to work on something that you want rather than something that others want you to work on. It is a great opportunity to work on something that is both cool and useful as well as to associate with a well known brand and all the publicity and goodwill it brings. You are free to pick and choose between the thousands of open source projects out there. Moreover, you are free to decide on how much you want to contribute. You won't have a boss and you won't have the pressure of deadlines and schedules.

Development in the "real" world

Academic projects are insufficient to impart many of the skills that you'd need once you start developing software full-time. Many of these skills are "social" rather than technical in nature but are at least as important.

Most academic projects are "toy" projects. By that, I mean that their whole life cycle revolves around you. You are the designer, developer, tester and also the user. As a result, there are a few key things missing in those projects.
  • No build system - Makefiles? Ant? Maven? Very few students are familiar with using them. Don't even ask about creating a build from scratch. "Hey! Just open those files in a text editor or an IDE and hack away" is not an unusual thing to hear
  • No source control - CVS? SVN? Git? A single person writing all the code or >80% of the code is very common
  • No bug tracker - "It is never going to be used after we demo it to the professors"
  • No user documentation - maybe you will write a research paper detailing your findings but there is little or no documentation written for "other" people
  • No mailing lists or forums for support - Nobody but you is going to use it
Moreover, under these circumstances, you never learn how to:
  • Discuss technical design or issues in writing
  • Resolve conflicts in matters of design, architecture and a project's road map.
  • Build usable interfaces (whether command line options or a GUI or an API)
  • Write proper error handling and logging code
  • Identify hooks for monitoring systems in production
  • Think about backup and recovery
  • Identify components which can be extended or replaced to add or modify functionality of the system
Open source projects are the real deal. If you are involved for long enough, you will either see or be a part of many such discussions and conflicts. All of the above skills are things you will need when you get around to software development in the real world.

Learn from the best

How many great developers do you know about? How many of them work or have worked on an open source project? I bet there are many names common to both the lists.

Open Source development will help you observe how experienced developers work and their various ways of designing, coding and discussing solutions. You will learn new ideas and new ways of solving problems. The second and probably more important part is that many smart programmers will be looking over your code and will provide review comments which will help you improve yourself. You will learn more efficient or shorter (or both) ways to solve the same problem. That kind of feedback is invaluable to a budding programmer.

I know that I've learned a great deal since I got involved in Apache Solr.

Build a publicly verifiable resume

What you put in your resume are things like contact information, performance in academia, programming languages you know, projects you've worked on and other such stuff. There is very little in this document which can be verified easily. This is a problem for you as well as for the prospective employer because:
  1. It may not represent you, your skills and your hard work sufficiently
  2. It makes hiring a game of chance for the prospective employer and prevents them from making more informed decisions
The best thing about contributing to an open source project is that everything you do is public. So you can say things like the following:
  • I have worked on this project for the last two years
  • I wrote features X, Y and Z on Project P
  • I have over two hundred posts on the user forum or mailing list
  • I have commit access to the project
  • I am the expert because "I wrote it"
And your prospective employer can search and verify such things easily. Congratulations, you have just landed on top of the stack of resumes!

Companies will find you

When a company evaluates that an open source Project X can save them a lot of money, it is likely that they will hire a few people who have experience on project X and can support its use internally. Many such companies also allow their developers to work on the project either part-time or full-time. And who else is more qualified to work on the project but you - an existing contributor!

More and more companies are starting up around providing training, consulting and support for open source projects. Many such companies exclusively hire existing contributors.

Even if an open source project is not used directly inside the company, many tech companies hire open source contributors because:
  1. Hiring popular open source developers makes them cooler in the eyes of other developers
  2. Developers who contribute to open source projects are good programmers

I'm sure there are many more reasons other than the ones I've given here. In the end, contributing to an open source project is a good investment of your time and it may well be your big ticket to finding that great job. Good Luck!

Monday, September 28, 2009

What's new in DataImportHandler in Solr 1.4

DataImportHandler is an Apache Solr module that provides a configuration-driven way to import data from databases, XML and other sources into Solr, supporting both "full builds" and incremental delta imports.

A large number of new features have been added since DataImportHandler first shipped in Solr 1.3.0. Here's a quick look at the major new features:

Error Handling & Rollback

The ability to control behavior on errors was an oft-requested feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify the following as an attribute on the "entity" tag:
  1. onError="abort" - Aborts the import process
  2. onError="skip" - Skips the current document
  3. onError="continue" - Continues as if the error never occurred
All errors are still logged regardless of the selected option. When an import aborts, either due to an error or a user command, all changes to the index since the last commit are rolled back.
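
For example, a database entity that should simply skip bad rows would be configured like this (the entity name, data source and query below are made up):

<entity name="item" dataSource="db" query="select * from item" onError="skip">
<field column="id" name="id"/>
</entity>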

Event Listeners

An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:

public void onEvent(Context ctx);

For example, the listener can be specified as:
<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">
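Writing a listener is just a matter of implementing that one method. A minimal sketch, assuming the interface and Context live in the org.apache.solr.handler.dataimport package as they do in the 1.4 code base:

package com.foo;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

public class StartListener implements EventListener {
  public void onEvent(Context ctx) {
    // called when the import starts; the Context gives access to
    // request parameters, the variable resolver and so on
    System.out.println("Import started at " + System.currentTimeMillis());
  }
}
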
Push data to Solr through DataImportHandler

In Solr 1.3, DataImportHandler was pull based only. If you wanted to push data to Solr, e.g. through an HTTP POST request, you had no choice but to convert it to Solr's update XML format or CSV format. That meant that all the DataImportHandler goodness was not available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.

Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index it:

<root>
<b>
<id>1</id>
<c>Hello C1</c>
</b>
<b>
<id>2</id>
<c>Hello C2</c>
</b>
</root>

We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:

<dataConfig>
<dataSource type="ContentStreamDataSource" name="c"/>
<document>
<entity name="b" dataSource="c" processor="XPathEntityProcessor"
forEach="/root/b">
<field column="desc" xpath="/root/b/c"/>
<field column="id" xpath="/root/b/id"/>
</entity>
</document>
</dataConfig>
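
With this configuration in place, the XML from the previous snippet can be posted to the DataImportHandler endpoint like any other HTTP request. Here is a rough sketch using plain java.net; it assumes the handler is registered at /dataimport in solrconfig.xml and Solr is listening on localhost:8983.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PushToDIH {
  public static void main(String[] args) throws Exception {
    String xml = "<root><b><id>1</id><c>Hello C1</c></b><b><id>2</id><c>Hello C2</c></b></root>";
    URL url = new URL("http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
    OutputStream out = conn.getOutputStream();
    out.write(xml.getBytes("UTF-8"));   // the posted body becomes the content stream read by ContentStreamDataSource
    out.close();
    System.out.println("Response code: " + conn.getResponseCode());
  }
}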

More Power to Transformers

New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.

New DataSources
  • FieldReaderDataSource - Reads data from an entity's field. This can be used, for example, to read XMLs stored in databases.
  • ContentStreamDataSource - Accept HTTP POST data in a content stream (described above)
New EntityProcessors
  • PlainTextEntityProcessor - Reads from any DataSource and outputs a String
  • MailEntityProcessor (experimental) - Indexes mail from POP/IMAP sources into a Solr index. Since it requires extra dependencies, it is available as a separate package called "solr-dataimporthandler-extras".
  • LineEntityProcessor - Streams lines of text from a given file to be indexed directly or for processing with transformers and child entities.
New Transformers
  • HTMLStripTransformer - Strips HTML tags from input text using Solr's HTMLStripCharFilter
  • ClobTransformer - Read strings from Clob types in databases.
  • LogTransformer - Log data in a given template format. Very useful for debugging.
Apart from the above new features, there have been numerous bug fixes, optimizations and refactorings. In particular:
  • Optimized defaults for database imports
  • Delta imports consume less memory
  • A 'deltaImportQuery' attribute has been introduced which is used for delta imports along with 'deltaQuery' instead of DataImportHandler manipulating the SQL itself (which was error-prone for complex queries). Using only 'deltaQuery' without a 'deltaImportQuery' is deprecated and will be removed in future releases.
  • The 'where' attribute has been deprecated in favor of 'cacheKey' and 'cacheLookup' attributes making CachedSqlEntityProcessor easier to understand and use.
  • Variables placed in DataSources, EntityProcessor and Transformer attributes are now resolved making very dynamic configurations possible.
  • JdbcDataSource can lookup javax.sql.DataSource using JNDI
  • A revamped EntityProcessor API for ease in creating custom EntityProcessors
There are many more changes, see the changelog for the complete list. There's a new DIHQuickStart wiki page which can help you get started faster by providing cheat sheet solutions. Frequently asked questions along with their answers are recorded in the new DataImportHandlerFaq wiki page.

A big THANKS to all the contributors and users who have helped us by giving patches, suggestions and bug reports!

Future Roadmap

Once Solr 1.4 is released, there are a slew of features targeted for Solr 1.5, including:
  • Multi-threaded indexing
  • Integration with Solr Cell to import binary and/or structured documents such as Office, Word, PDF and other proprietary formats
  • DataImportHandler as an API which can be used for creating Lucene indexes (independent of Solr) and as a companion to SolrJ (for true push support). It will also be possible to extend it for other document-oriented, de-normalized data stores such as CouchDB.
  • Support for reading Gzipped files
  • Support for scheduling imports
  • Support for Callable statements (stored procedures)
If you have any feature requests or contributions in mind, do let us know on the solr-user mailing list.

Saturday, September 26, 2009

Apache Lucene 2.9 Released


Apache Lucene 2.9 has been released. Apache Lucene is a high performance, full-featured text search engine library written entirely in Java.

From the official announcement email:

Lucene 2.9 comes with a bevy of new features, including:
  • Per segment searching and caching (can lead to much faster reopen among other things)
  • Near real-time search capabilities added to IndexWriter
  • New Query types
  • Smarter, more scalable multi-term queries (wildcard, range, etc)
  • A freshly optimized Collector/Scorer API
  • Improved Unicode support and the addition of Collation contrib
  • A new Attribute based TokenStream API
  • A new QueryParser framework in contrib with a core QueryParser replacement impl included.
  • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required.
  • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
  • New fast-vector-highlighter for large documents
  • Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.
  • And many, many more features, bug fixes, optimizations, and various improvements.
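To make the numeric field item concrete, here is a small sketch against the 2.9 API that indexes a price as a trie-encoded numeric field and runs a NumericRangeQuery over it (the field name and values are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NumericFieldExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    // trie-encoded numeric field, no textual pre-processing needed
    doc.add(new NumericField("price").setDoubleValue(149.99));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir, true);
    NumericRangeQuery query = NumericRangeQuery.newDoubleRange("price", 100.0, 200.0, true, true);
    TopDocs hits = searcher.search(query, 10);
    System.out.println("Matches: " + hits.totalHits);
    searcher.close();
  }
}
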
Look at the release announcement for more details.

Congratulations to the Lucene team! Great work as always.

This is also the last minor release which supports the Java 1.4 platform. The next release will be 3.0, in which deprecated APIs will be removed and Lucene will officially move to Java 5.0 as the minimum requirement.

Solr 1.4 is not far behind and we hope to release it within two weeks.

Monday, July 06, 2009

Thoughts on Tomcat

I saw an advertisement today for taking a survey on Tomcat to help define its future direction. I don't usually click on ads but this one seemed interesting, so I did. It was a short one (thanks guys!) so I didn't mind completing it. What I did not like so much was the focus of the questions on how Tomcat can compete with "enterprise" application servers. What is "enterprise" anyway? If it's about performance then Tomcat is enterprise ready. It doesn't really matter if other commercial servers can do a few more requests per second on artificial tests. Tomcat is free (as in freedom)! That's a huge advantage.

Tomcat is the most widely used application deployment platform at AOL. As with most other web companies, we don't really need all the cruft, cost and complexity which "enterprise" application servers bring in. It is a tried and tested platform with good performance characteristics, easy administration and monitoring. It scales well enough. We run thousands of Tomcats serving hundreds of millions of pages. Since it is free, we don't need to scale up by buying expensive servers; we can just scale out by adding more low cost servers (which, by the way, also adds redundancy).

I don't think Tomcat's goal should be to add features (read: complexity) to compete with the so-called enterprise application servers. It should continue to focus on being a performant Servlet/JSP container with easy development, administration and monitoring support. What I'd like to see added to Tomcat:
  • Easy ways to use Tomcat in an embedded fashion (like Jetty)
  • Improve Tomcat manager
  • Easy configuration for my webapp (Properties vs. JNDI/context)
I'm not a guy who adds features just for the heck of them. So I'll give use-cases for each of the above requests. These use-cases come from my own recent work.

Easy ways to use Tomcat in an embedded fashion

I don't see myself shipping a product with an embedded Tomcat, but I've frequently needed an embedded container for unit/integration testing REST APIs. Sometimes I've used Jetty, other times I've mocked stuff. All of my production deployments use Tomcat, so it is only natural that I use Tomcat for integration testing. Solr uses Jetty for testing and for providing a standalone example, which works great. I like the easily embeddable nature of Jetty. However, I also believe that part of the reason behind Jetty's popularity was that Tomcat was not embeddable enough. It had a lot of strongly coupled extra features (and a lot of related code) which were not needed. Valves, realms, JNDI contexts, authentication and clusters are things which are generally not needed in embedded scenarios. Note that embedding Tomcat is possible, but it is not documented that well and there is no easy way to find out all the dependency jars I'd need to do it. The last time I did this successfully with Maven, I had to track down the dependencies myself and add each one to my application to make it work. So "easy" for me means: publish the latest jars to Maven, use the dependency structure that Maven provides, make it easy for me to remove the extra features I don't need, focus on keeping the artifacts smaller, and have good documentation on how to use the API.

Improve Tomcat Manager

A few months ago, I worked on a deployment application to push code updates to Tomcat servers across data centers. The use-case is simple. I want to update my application's code without causing downtime. So I drain traffic away from individual servers, update the code, verify that it is in fact updated, and redirect traffic back to the server. I worked with the features provided by the Tomcat manager application. Not too many people actually use the Manager, in the name of security, but that's a separate topic. I wanted to add some custom commands to the manager and I couldn't because it was not designed to be extensible. In the end, I had to copy code from Tomcat's sources and modify it to make it work. This is an area which could use some improvement. Coupled with good documentation on how to use the manager application securely, it has the potential to be used more widely. I want to use the Tomcat manager application only from certain whitelisted IPs and only with SSL. Sounds simple, but it was damn hard to get it working the way we wanted.

Easy configuration for my webapp

Configuration is a difficult issue. There are always so many right ways depending on who you ask. I just need to provide some key/value pairs to my application which change rarely, but when they do, I'd really like them to be reloaded without bringing my application down. I'd really like to push those into the war, but then I'd have different wars for different environments (dev/qa/prod) and that'd make some people very nervous (why?). I could use JNDI, but that is much more complicated to manage than it needs to be for my simple use-case. Sysadmins don't like XML, and that is a well-known fact. It's easier for everybody to modify properties files rather than an XML file for simple key/value pairs. I want to hot reload them, just like Tomcat hot-loads wars dropped into the webapps directory, but I guess you can't do that. So I wrote my own small library to read properties files from a certain location, checking every few minutes for changes to the file. If Tomcat itself had something similar, I'd just use that. I think it might be a very common use-case.
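
For the curious, that library amounts to little more than the following sketch: keep the current Properties in a volatile field, watch the file's last-modified timestamp from a timer, and swap in a fresh copy when it changes (file location and interval are up to you):

import java.io.File;
import java.io.FileInputStream;
import java.util.Properties;
import java.util.Timer;
import java.util.TimerTask;

public class ReloadingProperties {
  private final File file;
  private volatile Properties props = new Properties();
  private volatile long lastLoaded = 0;

  public ReloadingProperties(File file, long checkIntervalMillis) {
    this.file = file;
    reload();
    // daemon timer so it does not keep the webapp's JVM alive
    Timer timer = new Timer("props-reloader", true);
    timer.schedule(new TimerTask() {
      public void run() {
        if (ReloadingProperties.this.file.lastModified() > lastLoaded) {
          reload();
        }
      }
    }, checkIntervalMillis, checkIntervalMillis);
  }

  private synchronized void reload() {
    try {
      Properties fresh = new Properties();
      FileInputStream in = new FileInputStream(file);
      try {
        fresh.load(in);
      } finally {
        in.close();
      }
      props = fresh;
      lastLoaded = System.currentTimeMillis();
    } catch (Exception e) {
      // keep serving the old values if the reload fails
      e.printStackTrace();
    }
  }

  public String get(String key) {
    return props.getProperty(key);
  }
}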

On a related note, I'm very excited about tomcat-lite, Comet/Bayeux and the new servlet API (asynchronous servlets) coming into Tomcat. I also wish for an easy way to write non-HTTP applications on top of Tomcat's NIO stack (again, de-coupling may help), but that may be asking too much. I know it's all do-ocracy and I'm not doing my part. Someday I hope to contribute code rather than just ideas and complaints. For now, this is all I have.

Thursday, May 28, 2009

Solr in PHP/Drupal, Ruby/Sunspot and Python/Haystack

Adoption of Apache Solr is accelerating. Being accessible through HTTP makes it possible for Solr (a Java webapp) to be used with any language. All you need is support for making HTTP calls and parsing one of the many available formats such as XML or JSON.
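
As a trivial illustration, here is what a search request looks like from plain Java with no client library at all, just an HTTP GET and the JSON response writer (assuming a Solr instance at localhost:8983 with something indexed):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class RawSolrQuery {
  public static void main(String[] args) throws Exception {
    String q = URLEncoder.encode("ipod solr", "UTF-8");
    URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&wt=json&rows=10");
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);   // raw JSON; hand it to any JSON parser
    }
    reader.close();
  }
}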

Drupal

Drupal is one of the most popular open source CMSes. It is written in PHP and boasts a huge user and developer base. Recently, the Drupal community has integrated Apache Solr into Drupal for vertical search. The integration is available as a Drupal module at http://drupal.org/project/apachesolr. There are some excellent tutorials available on how to get started with it, as well as a hosted solution by Acquia.

Ruby

Ruby integration has been present in Solr for a long time. There is a module called solr-ruby as well as acts_as_solr. Solr even has a Ruby response writer which outputs search results serialized in Ruby. Blacklight is an open source project built in Ruby that I know uses Solr. Today, I came to know about Sunspot, a Solr powered search engine for Ruby. More details in this article in LinuxMag.

Python

Solr has a Python response writer as well as many clients. See http://wiki.apache.org/solr/SolPython for details. Reddit is one site that uses Solr with a Python front-end application. There is also Haystack for Django, which can use Solr among other engines such as Xapian and Whoosh.

Solr 1.4 is nearing release with a number of features and performance improvements. On the other hand, Lucene is getting ready for near real-time search as well. Things are getting interesting in the Solr world!

Monday, April 20, 2009

Burst of activity in Lucene

There is a large amount of work being done in Lucene 2.9, a large portion of which is related to adding support for near real-time search.

To put it very simply, search engines transfer a lot of work from query-time to index-time. This is done to speed up queries at the cost of adding documents more slowly. Until now, Lucene based systems have had problems dealing with scenarios in which searchers need to see changes instantly (think Twitter Search). There exist a variety of tricks and techniques to achieve this even now. However, near real-time search support in Lucene itself is a boon to all those people who have been building and managing such systems, because the grunt work will be done by Lucene itself.

This is still under development and will probably take a few more months to mature. Solr will benefit from it as well but before that can happen, a lot of work will be needed under the hood particularly in the way Solr handles its caching.

Michael McCandless has summarized the current state of Lucene trunk in this email on the java-dev mailing list. In fact, there is so much activity that, at times, it becomes very difficult to follow all the excellent discussions that go on. There are some very talented people on that forum and there is a lot to learn for a guy like me, who started with Solr and is still trying to find his way in the Lucene code base.

Lucene 2.9 will bring huge improvements and I'm looking forward to working with other Solr developers to integrate them with Solr.

Wednesday, April 08, 2009

Google App Engine and Maven


Google has announced support for building Java applications on the App Engine platform. This is great news for new App Engine developers and especially for those Java developers who had to learn Python to use App Engine.

I created a project for App Engine using Maven for builds. These were the steps I needed to follow:

1. Publish the App Engine libraries to the local Maven repository. Go to the app-engine-java-sdk directory (where the App Engine SDK is installed) and execute the following commands:
mvn install:install-file -Dfile=lib/appengine-tools-api.jar -DgroupId=com.google -DartifactId=appengine-tools -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/shared/appengine-local-runtime-shared.jar -DgroupId=com.google -DartifactId=appengine-local-runtime-shared -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/user/appengine-api-1.0-sdk-1.2.0.jar -DgroupId=com.google -DartifactId=appengine-sdk-1.2.0-api -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/user/orm/datanucleus-appengine-1.0.0.final.jar -DgroupId=org.datanucleus -DartifactId=datanucleus-appengine -Dversion=1.0.0.final -Dpackaging=jar -DgeneratePom=true

2. Create a Maven POM file. This is the one that I used:

<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.shalin</groupId>
<artifactId>test</artifactId>
<packaging>war</packaging>
<version>1.0</version>
<name>Test</name>
<url>http://shalinsays.blogspot.com</url>

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>

<dependency>
<groupId>com.google</groupId>
<artifactId>appengine-tools</artifactId>
<version>1.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google</groupId>
<artifactId>appengine-local-runtime-shared</artifactId>
<version>1.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google</groupId>
<artifactId>appengine-sdk-1.2.0-api</artifactId>
<version>1.2.0</version>
<scope>compile</scope>
</dependency>

<dependency>
<artifactId>standard</artifactId>
<groupId>taglibs</groupId>
<version>1.1.2</version>
<type>jar</type>
<scope>runtime</scope>
</dependency>
<dependency>
<artifactId>jstl</artifactId>
<groupId>javax.servlet</groupId>
<version>1.1.2</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-el_1.0_spec</artifactId>
<version>1.0.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-jsp_2.1_spec</artifactId>
<version>1.0.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-servlet_2.5_spec</artifactId>
<version>1.2</version>
<scope>provided</scope>
</dependency>

<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-jpa_3.0_spec</artifactId>
<version>1.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-jta_1.1_spec</artifactId>
<version>1.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.datanucleus</groupId>
<artifactId>datanucleus-appengine</artifactId>
<version>1.0.0.final</version>
<scope>compile</scope>
</dependency>

</dependencies>
<build>
<finalName>test</finalName>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.5</source>
<target>1.5</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
3. Create the standard Maven directory structure and place the pom.xml in the same directory as the src directory.
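
For reference, the resulting layout looks roughly like this; appengine-web.xml is the App Engine deployment descriptor that sits next to web.xml:

test/
    pom.xml
    src/
        main/
            java/                (servlets and other Java sources)
            webapp/
                index.jsp
                WEB-INF/
                    web.xml
                    appengine-web.xml

Running "mvn package" builds target/test.war along with the exploded webapp under target/test, which you can point the App Engine SDK's appcfg script at to deploy.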

You're done!

I tested this with a simple servlet based application and it worked fine. I did not test the JPA/JDO integration so it might be a little rough around the edges, but it should work for the most part. Note that App Engine supports Java 6; if you want to use it, change the "source" and "target" in the build section to 1.6 instead of 1.5.

Apache Mahout 0.1 Released


Apache Mahout 0.1 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

This is also the first public release of the Taste collaborative filtering project since it was donated to Apache Mahout last year.

From the official announcement email:
The Apache Lucene project is pleased to announce the release of Apache Mahout 0.1. Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. The first public release includes implementations for clustering, classification, collaborative filtering and evolutionary programming.

Highlights include:
  1. Taste Collaborative Filtering
  2. Several distributed clustering implementations: k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy
  3. Distributed Naive Bayes and Complementary Naive Bayes classification implementations
  4. Distributed fitness function implementation for the Watchmaker evolutionary programming library
  5. Most implementations are built on top of Apache Hadoop (http://hadoop.apache.org) for scalability
Look at the announcement for more details - http://www.nabble.com/-ANNOUNCE--Apache-Mahout-0.1-Released-td22937220.html

There is a lot of interest in Mahout from the community and it had a successful year with the Google Summer of Code 2008 program. This year again, there have been multiple proposals and I'm sure that great things are on the way.

The Apache Mahout Wiki has a lot of good documentation on the project as well as on machine learning in general. Their mailing list is very active and of course, they have some great people involved, see the committers page. I would encourage every student interested in machine learning to participate in the project.

I wish good luck to the project and the people involved in it. Keep up the great work!

Tuesday, April 07, 2009

Tagging and Excluding Filters

Multi-select faceting is a new feature in the soon-to-be-released Solr 1.4. It introduces support for tagging and excluding filters, which enables us to request facets on a super-set of the results from Solr.

The Problem

Out-of-the-box support for faceted search is a very compelling enhancement that Solr provides on top of Lucene. I highly recommend reading through the excellent article by Yonik on faceted search at Lucid Imagination's website, if you are not familiar with it.

Faceting on a field provides a list of (term, document-count) pairs for that field. However, the returned facet results are always calculated on the current result set. Therefore, whatever the current results are, the facets are always in sync with them. This is both an advantage and a disadvantage.

Let us take the search UI for finding used vehicles on the Vast.com website. There are facets on the seller's location and the vehicle's model. Let us assume that the Solr query to show that page looks like the following:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1


What happens when you select a model by clicking on, say, "Impala"? The facet for vehicle model disappears. Why? The reason is that now only "Impala" is being shown and there are no other models present in the current result set. The Solr query now looks like the following:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala

So what is wrong with this? Nothing really. Except that, for ease of navigation, you may still want to show all the other models and document-counts which were shown in the super-set of the current results (the previous page). But, as we noted a while back, the facets are shown for the current result set, in which all the models are Impala. If we attempt to facet on the model field with the filter query applied, we will get a list of all models, but except for "Impala", all other models will have a zero document count.

Solution #1 - Make another Solr query

Make another call to Solr without the filter query to get the other values. Our example query would look like:
q=chevrolet&facet=true&facet.field=model&facet.mincount=1&rows=0
The rows=0 is specified because we don't really want the actual results, just the facets for the model field. This is a solution that can be used with any version of Solr, but it means one additional HTTP request. Even though that is a bit inconvenient, it is usually fast enough. An additional call is expensive, however, if you are using Solr's Distributed Search, which will send one or more queries to each shard.

Solution #2 - Tag and exclude filters

This is where multi-select faceting support comes in handy. With Solr 1.4, it is possible to tag filter queries with a name. Then we can exclude one or more tagged queries when requesting facets. All of this happens through additional metadata that is added to request parameters through a syntax called Local Params.

Let us go step-by-step and change the query in the above example to see how the request to Solr changes.

1. The original request in the above example without tagging:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala
2. The filter query tagged with 'impala':
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq={!tag=impala}model:Impala
3. The facet field with the 'impala' filter query excluded:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala}model&facet.mincount=1&fq={!tag=impala}model:Impala
Now, with this one query, you can get the facets for the current results as well as for the super-set without the need to make another call to Solr. If you want Solr to return this particular facet field under an alternate name, you can add a 'key=alternative-name' local param. For example, the following Solr query will return the 'model' facet under the name 'allModels':
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala key=allModels}model&facet.mincount=1&fq={!tag=impala}model:Impala
Tagging, excluding and renaming are not limited to facet fields; they can be used with facet queries, facet prefixes and date faceting too.
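
If you use SolrJ, the same request can be built programmatically; the local params simply become part of the parameter values. A sketch against the 1.4 client API, using the field names from the example above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiSelectFacetExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("chevrolet");
    query.setFacet(true);
    query.setFacetMinCount(1);
    query.addFacetField("location");
    query.addFacetField("model");                            // counts within the filtered results
    query.addFacetField("{!ex=impala key=allModels}model");  // counts with the 'impala' filter excluded
    query.addFilterQuery("{!tag=impala}model:Impala");
    QueryResponse response = server.query(query);
    System.out.println(response.getFacetField("model").getValues());
    System.out.println(response.getFacetField("allModels").getValues());
  }
}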

This is another cool contribution by Yonik (also see my previous post). I'm really looking forward to the Solr 1.4 release. It is bringing a bunch of very useful features including the super-easy-to-setup Java based replication. But more on that in a later post.

Sunday, April 05, 2009

Inside Solr: Improvements in Faceted Search Performance

Yonik Seeley recently implemented a new method for faceting which will be available in Solr 1.4 (yet to be released). It is optimized for faceting on multi-valued fields with a large number of unique terms but a relatively low number of terms per document. The new method has made a large improvement in faceted search performance and has cut memory usage at the same time.

Background

When you facet on a field, Solr gets the list of terms in the field across all documents, executes a filter query for each term, caches the set of documents matching each filter, intersects it against the current result set and gives the count of documents matched for each term after the intersection. This works great for fields which have few unique values. However, it requires a large amount of memory and time when the field has a large number of unique values.

UnInvertedField

The new method uses an UnInvertedField data structure. In very basic terms, for each document, it maintains a list of term numbers that are contained in that document. There is some pre-computation involved in building up this data structure, which is done lazily for each field, when needed. If a term is contained in too many documents, it is not un-inverted. In this new method, when you facet on a field, Solr iterates through all the documents, summing up the number of occurrences of each term. The terms which were skipped while building the data structure use the older set intersection method during faceting.

This data structure is very well optimized. It doesn't really store the actual terms (string). Each term number is encoded as a variable-length delta from the previous term number. A TermIndex is used to convert term numbers into the actual value for only those terms which are needed after faceting is completed (the current page of facet results). The concept is simple but if not implemented in an efficient way, it may impair performance rather than improve it. Therefore, there are a lot of important optimizations in the code.
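
To make the encoding idea concrete, here is a toy version of variable-length delta coding over sorted term numbers. This is a simplification for illustration only, not Solr's actual code:

import java.io.ByteArrayOutputStream;

public class DeltaEncodingToy {
  // encode each gap between consecutive term numbers as a vInt:
  // 7 bits of payload per byte, high bit set on all but the last byte
  static byte[] encode(int[] sortedTermNumbers) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int termNumber : sortedTermNumbers) {
      int delta = termNumber - previous;
      previous = termNumber;
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80);
        delta >>>= 7;
      }
      out.write(delta);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // term numbers for one document: small gaps encode to one byte each
    int[] termNumbers = {3, 5, 130, 131, 99000};
    System.out.println(encode(termNumbers).length + " bytes for " + termNumbers.length + " terms");
  }
}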

Performance

Yonik benchmarked the performance of the new method against the old, and his tests show a large improvement in faceting performance, sometimes by an order of magnitude (up to 50x). The improvement becomes much more significant as the number of unique terms increases.

For a comprehensive performance study, see the comment on the Jira issue about performance here and the document here.

There are a few ideas in the code comments which give directions on possible future optimizations. But the improvements over the old method are already quite massive; the law of diminishing returns will probably hold true here.

The structure is thrown away and re-created lazily on a commit. There might be a few concerns around the garbage accumulated by the (re)-creation of the many arrays needed for this structure. However, the performance gain is significant enough to warrant the trade-off.

Conclusion

The new method has been made the default method for faceting on non-boolean fields in the trunk code. It will be released with Solr 1.4 but it is already available in the trunk and nightly builds. If you are comfortable using the nightly builds, you are welcome to try it out.

A new request parameter has been introduced to switch to the old method if needed. Use facet.method=fc for the new method (default) and facet.method=enum for the old one.

Note - "Inside Solr" is a new feature that I hope to write regularly. It is intended to give updates about new features or improvements in Solr and at the same time, to describe the implementation details in a simple way. I invite you to give feedback through comments and tell me about what you would want to read about Solr.

Wednesday, April 01, 2009

The architecture behind popular websites

Sharing a few interesting articles I read in the past few weeks on the interweb about Twitter, LinkedIn, Ebay and Google.

Improving running components at Twitter describes the evolution of Twitter's technology and their new message queue server, named Kestrel, written in approximately 1.5K lines of Scala.

LinkedIn Communication Architecture details the heavy usage of Java, Tomcat, Jetty, Lucene, Spring and ActiveMQ at LinkedIn. Oracle and MySQL are used for data storage. They have made heavy customizations to Lucene for their near real-time indexing needs and have open-sourced their Lucene modifications in the form of Zoie on Google Code. The upcoming Lucene In Action 2 has a case-study on how Zoie builds upon Lucene.

The eBay way is a presentation on eBay's realtime personalization system. This mammoth system handles 4 billion reads/writes per day. The interesting thing about this system is that it uses the MySQL memory engine as a caching tier in front of a persistent tier. Some critical data is replicated (presumably on the cache tier, as they talk about doubling memory needs). They encountered problems with single-threaded MySQL replication, so replication is managed through dual writes instead (the second write can be asynchronous). The system is capable of automatically redistributing data if a node goes down.

Jeff Dean's WSDM keynote slides on the evolution of Google's search infrastructure are perhaps the most interesting of all. It has gone through a number of iterations over the years. I was surprised to learn that their complete index is served out of memory, although it makes sense: as they increased the number of nodes, they crossed a point where the combined memory was enough to hold the entire index.

Tuesday, February 24, 2009

Helpful hints on Large Solr Indexes and Schema Design

Solr user Lance Norskog has been kind enough to contribute documentation on working with large Solr indexes and on schema design. Very useful documentation which, no doubt, will be made more comprehensive with time.

Update - Mark Miller has written a very nice article on Scaling Lucene and Solr at the Lucid Imagination website.

Saturday, February 21, 2009

Google Summer of Code 2009 at Apache

The Google Summer of Code program is back again this year and Apache is looking for students interested in contributing and making money through the program.

The Apache Software Foundation received quite a few excellent proposals from students who went on to do a lot of great work last year. Take a look at last year's proposals to get a feel for the level of competition. I'm sure there will be quite a few this year as well. A wiki page has been put up which will list all the proposals.

You can come up with your own proposal as well and add it to the wiki. However, the ASF being a community driven ecosystem, it is highly recommended that you drop a line to the project mailing lists and get feedback on your proposal. That way, you will have time to convince one or more committers to sign up as mentors for your proposal. They will help you develop your proposal as well as guide you along the project with regular reviews and feedback. If your proposal attracts no mentors, it cannot be accepted for the program.

Open Source is a different ball game than academic projects and the code itself is a small part. One needs to write unit tests to inspire confidence in the code before it can be incorporated in a project. If other developers are interested in your project, they'll want to collaborate with you. With each patch, you'll get review comments which you may need to incorporate. There are very few places, if any, where you can get such great feedback on your work and that too, absolutely free.

Users will need documentation and tutorials about your code before they can start using it. Sometimes, one also needs to create working examples to demonstrate usage and features. Users will ask questions on your features, post bug reports and suggest enhancements. It is the open source way to courteously answer them and guide them to solutions. As the feature matures, the community also benefits from best practices, FAQs and guidelines on performance optimization. Ultimately, it is well worth the effort to learn the open source way of developing software.

I've been thinking about a few features which can help Solr but more on that later. For now, see the announcement on solr-dev mailing list on GSOC 2009 and reply with your ideas if you are interested.

Grant has also written a useful post with advice to aspiring GSOC participants on his blog.

Monday, February 16, 2009

Announcing my return to blogging

Yes, it has been a long long time since my last post. I guess I lost interest in writing about the myriad of things out there. But, I sure did not lose interest in reading and learning about them.

I work at AOL Bangalore Development Center as a Software Engineer on a variety of cool projects. Life is great, work is fun and I'm having a good time. They pay me to work on such interesting things that I'd probably do them for free anyways :)

A lot of my work revolves around open source projects, chief among them being Apache Solr. I started using Apache Solr (more on this in another post) as part of my job. Through a stroke of luck, my colleague Noble Paul and I built DataImportHandler for Solr and contributed it back. Between then and now, both of us have been actively involved in working on Solr as a small part of our day job and a big part of our spare time. These days, I'm a committer on the project and spend a large amount of time adding new features, fixing bugs and answering user queries on the user mailing list.

Expect a lot of posts related to search in general and Apache Solr in particular. From time to time, I'll keep writing posts on the myriad of things I keep doing and reading.

About Me

Committer on Apache Solr. Principal Software Engineer at AOL.
