Thoughts about technology, business and all that's life.

This blog has moved to http://shal.in.

Monday, September 28, 2009

What's new in DataImportHandler in Solr 1.4

DataImportHandler is an Apache Solr module that provides a configuration-driven way to import data from databases, XML and other sources into Solr, both as "full builds" and as incremental delta imports.

Many new features have been added since DataImportHandler first shipped in Solr 1.3.0. Here's a quick look at the major ones:

Error Handling & Rollback

The ability to control behavior on errors was an oft-requested feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify one of the following as an attribute on the "entity" tag:
  1. onError="abort" - Aborts the import process
  2. onError="skip" - Skips the current document
  3. onError="continue" - Continues as if the error never occurred
All errors are still logged regardless of the selected option. When an import aborts, either due to an error or a user command, all changes to the index since the last commit are rolled back.
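
As a rough illustration, here is how the attribute might look in a data-config.xml. The entity names, queries and JDBC settings are made up for the example; only the onError attribute and its three values come from the list above.

```xml
<dataConfig>
  <!-- Hypothetical JDBC settings; replace with your own driver and URL -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
  <document>
    <!-- An error while reading items aborts the whole import -->
    <entity name="item" query="select * from item" onError="abort">
      <!-- An error while reading features is logged, then the import carries on -->
      <entity name="feature"
              query="select description from feature where item_id='${item.id}'"
              onError="continue"/>
    </entity>
  </document>
</dataConfig>
```

Since the option is set per entity, a strict policy on the root entity can be combined with a more forgiving one on optional child entities.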

Event Listeners

An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:

public void onEvent(Context ctx);

For example, the listener can be specified as:
<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">

Push data to Solr through DataImportHandler

In Solr 1.3, DataImportHandler was pull-based only. If you wanted to push data to Solr, e.g. through an HTTP POST request, you had no choice but to convert it to Solr's update XML format or CSV format. That meant that none of the DataImportHandler goodness was available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.

Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index it:

<root>
  <b>
    <id>1</id>
    <c>Hello C1</c>
  </b>
  <b>
    <id>2</id>
    <c>Hello C2</c>
  </b>
</root>

We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:

<dataConfig>
  <dataSource type="ContentStreamDataSource" name="c"/>
  <document>
    <entity name="b" dataSource="c" processor="XPathEntityProcessor"
            forEach="/root/b">
      <field column="desc" xpath="/root/b/c"/>
      <field column="id" xpath="/root/b/id"/>
    </entity>
  </document>
</dataConfig>

More Power to Transformers

New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.
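
To make this concrete, here is a sketch using a ScriptTransformer function (JavaScript support needs a Java 6 runtime). It assumes the row-skipping flag is named $skipRow, which is the name I recall from the DataImportHandler wiki; the function name, entity, query and column name are invented for the example.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical function: skip rows that have no title.
    function skipUntitled(row) {
      if (row.get('title') == null) {
        // Assumed flag name; asks DataImportHandler to skip this row
        row.put('$skipRow', 'true');
      }
      return row;
    }
  ]]></script>
  <!-- Hypothetical JDBC settings -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
  <document>
    <entity name="item" query="select id, title from item"
            transformer="script:skipUntitled"/>
  </document>
</dataConfig>
```

The same pattern should apply to the other flags (the wiki lists $skipDoc, $deleteDocById and $deleteDocByQuery, and the changelog mentions $stopTransform), but do check the names against the Solr version you are running.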

New DataSources
  • FieldReaderDataSource - Reads data from an entity's field. This can be used, for example, to read XML stored in a database.
  • ContentStreamDataSource - Accepts HTTP POST data in a content stream (described above)
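
A minimal sketch of the FieldReaderDataSource case, assuming a table with an XML column shaped like the <root><b>...</b></root> sample above. The table, column and data source names are hypothetical, and the dataField attribute is written as I recall it from the wiki, so treat it as an assumption to verify.

```xml
<dataConfig>
  <!-- Hypothetical JDBC source holding rows with an XML column -->
  <dataSource name="db" driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
  <!-- Reads its input from a field of the parent entity -->
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <entity name="doc" dataSource="db" query="select id, xml_body from docs">
      <!-- dataField (assumed attribute) points at the parent column containing the XML -->
      <entity name="parsed" dataSource="fld" processor="XPathEntityProcessor"
              dataField="doc.xml_body" forEach="/root/b">
        <field column="desc" xpath="/root/b/c"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```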

New EntityProcessors
  • PlainTextEntityProcessor - Reads from any DataSource and outputs a String
  • MailEntityProcessor (experimental) - Indexes mail from POP/IMAP sources into a Solr index. Because it requires extra dependencies, it is available as a separate package called "solr-dataimporthandler-extras".
  • LineEntityProcessor - Streams lines of text from a given file to be indexed directly or for processing with transformers and child entities.

New Transformers
  • HTMLStripTransformer - Strips HTML tags from input text using Solr's HTMLStripCharFilter
  • ClobTransformer - Reads strings from Clob types in databases.
  • LogTransformer - Logs data in a given template format. Very useful for debugging.
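
Here is a sketch of how the three transformers might be combined on one entity. The table and column names are invented, and the clob, stripHTML, logTemplate and logLevel attributes are written as I remember them from the wiki, so verify them against your version.

```xml
<entity name="article" query="select id, body, summary from article"
        transformer="ClobTransformer,HTMLStripTransformer,LogTransformer"
        logTemplate="Indexed article ${article.id}" logLevel="info">
  <!-- Convert the CLOB column to a String first, then strip HTML from it -->
  <field column="body" clob="true" stripHTML="true"/>
  <field column="summary" stripHTML="true"/>
</entity>
```

Transformers run in the order they are listed, which is why ClobTransformer comes before HTMLStripTransformer here.
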
Apart from the above new features, there have been numerous bug fixes, optimizations and refactorings. In particular:
  • Optimized defaults for database imports
  • Delta imports consume less memory
  • A 'deltaImportQuery' attribute has been introduced which is used for delta imports along with 'deltaQuery' instead of DataImportHandler manipulating the SQL itself (which was error-prone for complex queries). Using only 'deltaQuery' without a 'deltaImportQuery' is deprecated and will be removed in future releases.
  • The 'where' attribute has been deprecated in favor of the 'cacheKey' and 'cacheLookup' attributes, making CachedSqlEntityProcessor easier to understand and use.
  • Variables placed in DataSource, EntityProcessor and Transformer attributes are now resolved, making very dynamic configurations possible.
  • JdbcDataSource can look up a javax.sql.DataSource using JNDI
  • Revamped EntityProcessor APIs for ease in creating custom EntityProcessors
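
A sketch that pulls a few of these items together, under stated assumptions: the JNDI name, tables and columns are hypothetical, and the jndiName attribute and the ${dataimporter...} variables are written as I recall them from the wiki.

```xml
<dataConfig>
  <!-- Look up a container-managed javax.sql.DataSource via JNDI (hypothetical name) -->
  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/exampleDS"/>
  <document>
    <!-- deltaQuery finds the changed ids; deltaImportQuery fetches each changed row -->
    <entity name="item" pk="id"
            query="select * from item"
            deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="select * from item where id='${dataimporter.delta.id}'">
      <!-- Cached child entity: rows keyed on item_id, looked up by the parent's id -->
      <entity name="feature" processor="CachedSqlEntityProcessor"
              query="select * from feature"
              cacheKey="item_id" cacheLookup="item.id"/>
    </entity>
  </document>
</dataConfig>
```

Column name case can differ between databases, so the names used in the variables may need adjusting to match what your JDBC driver returns.
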
There are many more changes; see the changelog for the complete list. The new DIHQuickStart wiki page can help you get started faster with cheat-sheet solutions, and frequently asked questions are answered on the new DataImportHandlerFaq wiki page.

A big THANKS to all the contributors and users who have helped us by giving patches, suggestions and bug reports!

Future Roadmap

Once Solr 1.4 is released, a slew of features is targeted for Solr 1.5, including:
  • Multi-threaded indexing
  • Integration with Solr Cell to import binary and/or structured documents such as Office, Word, PDF and other proprietary formats
  • DataImportHandler as an API which can be used for creating Lucene indexes (independent of Solr) and as a companion to Solrj (for true push support). It will also be possible to extend it for other document oriented, de-normalized data stores such as CouchDB.
  • Support for reading Gzipped files
  • Support for scheduling imports
  • Support for Callable statements (stored procedures)
If you have any feature requests or contributions in mind, do let us know on the solr-user mailing list.

Saturday, September 26, 2009

Apache Lucene 2.9 Released


Apache Lucene 2.9 has been released. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

From the official announcement email:

Lucene 2.9 comes with a bevy of new features, including:
  • Per segment searching and caching (can lead to much faster reopen among other things)
  • Near real-time search capabilities added to IndexWriter
  • New Query types
  • Smarter, more scalable multi-term queries (wildcard, range, etc)
  • A freshly optimized Collector/Scorer API
  • Improved Unicode support and the addition of Collation contrib
  • A new Attribute based TokenStream API
  • A new QueryParser framework in contrib with a core QueryParser replacement impl included.
  • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required.
  • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
  • New fast-vector-highlighter for large documents
  • Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.
  • And many, many more features, bug fixes, optimizations, and various improvements.
Look at the release announcement for more details.

Congratulations to the Lucene team! Great work as always.

This is also the last minor release that supports the Java 1.4 platform. The next release will be 3.0, in which deprecated APIs will be removed and Lucene will officially move to Java 5.0 as the minimum requirement.

Solr 1.4 is not far behind and we hope to release it within two weeks.

About Me

Committer on Apache Solr. Principal Software Engineer at AOL.