Shalin Says...: What's new in DataImportHandler in Solr 1.4

DataImportHandler is an Apache Solr module that provides a configuration-driven way to import data from databases, XML and other sources into Solr, both as "full builds" and as incremental delta imports.

Many new features have been added since DataImportHandler was introduced in Solr 1.3.0. Here's a quick look at the major ones:

Error Handling & Rollback

The ability to control behavior on errors was an oft-requested feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify the following as an attribute on the "entity" tag:

onError="abort" - Aborts the import process
onError="skip" - Skips the current document
onError="continue" - Continues as if the error never occurred

All errors are still logged regardless of the selected option. When an import aborts, either due to an error or a user command, all changes to the index since the last commit are rolled back.
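
As a minimal sketch of how this looks in a DataImportHandler configuration, the entity below skips any document that fails instead of aborting the whole import. The entity, table and column names are made up for illustration; only the onError attribute comes from the feature described above.

<entity name="item" query="select id, name from item" onError="skip">
<!-- a row that causes an error is logged and its document is skipped -->
<field column="id" name="id"/>
<field column="name" name="name"/>
</entity>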

Event Listeners

An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:

public void onEvent(Context ctx);

For example, the listener can be specified as:

<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">

Push data to Solr through DataImportHandler

In Solr 1.3, DataImportHandler was pull based only. If you wanted to push data to Solr, e.g. through an HTTP POST request, you had no choice but to convert it to Solr's update XML format or CSV format. That meant that all the DataImportHandler goodness was not available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.

Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index:


<root>
<b>
<id>1</id>
<c>Hello C1</c>
</b>
<b>
<id>2</id>
<c>Hello C2</c>
</b>
</root>

We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:


<dataConfig>
<dataSource type="ContentStreamDataSource" name="c"/>
<document>
<entity name="b" dataSource="c" processor="XPathEntityProcessor"
      forEach="/root/b">
<field column="desc" xpath="/root/b/c"/>
<field column="id" xpath="/root/b/id"/>
</entity>
</document>
</dataConfig>
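
For the push to work, the handler itself has to be registered in solrconfig.xml. A sketch of that registration follows, assuming the configuration above is saved as data-config.xml and the handler is exposed at /dataimport; both names are choices made here for illustration, not requirements.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<!-- file name is an assumption; point this at wherever the dataConfig lives -->
<str name="config">data-config.xml</str>
</lst>
</requestHandler>

With a setup along these lines, the XML document would be sent as the body of a POST request to that handler together with a full-import command.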

More Power to Transformers

New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.

New DataSources

FieldReaderDataSource - Reads data from an entity's field. This can be used, for example, to read XMLs stored in databases.
ContentStreamDataSource - Accepts HTTP POST data in a content stream (described above)

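To make the FieldReaderDataSource case concrete, here is a rough sketch that reads XML stored in a database column and parses it with a nested XPathEntityProcessor. The JDBC driver, URL, user, table and column names are placeholders invented for this example, and the XML is assumed to have the same shape as the sample document shown earlier.

<dataConfig>
<!-- "db" is a regular JDBC source; driver, url and user are placeholders -->
<dataSource name="db" driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa"/>
<!-- "fld" reads its input from a field of the parent entity -->
<dataSource name="fld" type="FieldReaderDataSource"/>
<document>
<entity name="doc" dataSource="db" query="select id, xml_data from docs">
<!-- dataField names the parent field that holds the XML to parse -->
<entity name="b" dataSource="fld" processor="XPathEntityProcessor"
      dataField="doc.xml_data" forEach="/root/b">
<field column="desc" xpath="/root/b/c"/>
</entity>
</entity>
</document>
</dataConfig>
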
New EntityProcessors

PlainTextEntityProcessor - Reads from any DataSource and outputs a String
MailEntityProcessor (experimental) - Indexes mails from POP/IMAP sources into a Solr index. Since it requires extra dependencies, it is available as a separate package called "solr-dataimporthandler-extras".
LineEntityProcessor - Streams lines of text from a given file to be indexed directly or for processing with transformers and child entities.
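
As an illustration of LineEntityProcessor, the sketch below indexes one document per line of a text file. The file path and the target field name "title" are assumptions for this example; the processor places each line in a column named rawLine, and skipLineRegex is used here to ignore comment lines.

<dataConfig>
<dataSource name="f" type="FileDataSource" encoding="UTF-8"/>
<document>
<!-- each line of the (hypothetical) file becomes one row -->
<entity name="line" processor="LineEntityProcessor" dataSource="f"
      url="/tmp/titles.txt" skipLineRegex="^#.*">
<field column="rawLine" name="title"/>
</entity>
</document>
</dataConfig>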

New Transformers

HTMLStripTransformer - Strips HTML tags from input text using Solr's HTMLStripCharFilter
ClobTransformer - Reads strings from Clob types in databases.
LogTransformer - Logs data in a given template format. Very useful for debugging.
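
The three transformers can be chained on a single entity. The following is only a sketch: the table, columns and log message are invented, and it assumes the body column is a Clob containing HTML.

<entity name="article" query="select id, body from article"
      transformer="ClobTransformer,HTMLStripTransformer,LogTransformer"
      logTemplate="Imported article ${article.id}" logLevel="info">
<field column="id"/>
<!-- read the Clob as a string first, then strip the HTML tags from it -->
<field column="body" clob="true" stripHTML="true"/>
</entity>

Transformers run in the order listed, which is why ClobTransformer comes before HTMLStripTransformer here.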

Apart from the above new features, there have been numerous bug fixes, optimizations and refactorings. In particular:

Optimized defaults for database imports
Delta imports consume less memory
A 'deltaImportQuery' attribute has been introduced which is used for delta imports along with 'deltaQuery' instead of DataImportHandler manipulating the SQL itself (which was error-prone for complex queries). Using only 'deltaQuery' without a 'deltaImportQuery' is deprecated and will be removed in future releases.
The 'where' attribute has been deprecated in favor of 'cacheKey' and 'cacheLookup' attributes making CachedSqlEntityProcessor easier to understand and use.
Variables placed in DataSource, EntityProcessor and Transformer attributes are now resolved, making very dynamic configurations possible.
JdbcDataSource can look up a javax.sql.DataSource using JNDI
Revamped EntityProcessor APIs for ease in creating custom EntityProcessors
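
Several of these changes can be seen together in one configuration. The sketch below is illustrative only: the JNDI name, tables and columns are hypothetical, and it assumes a last_modified column exists to drive the delta.

<dataConfig>
<!-- look up the container-managed javax.sql.DataSource via JNDI -->
<dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/exampleDb"/>
<document>
<!-- deltaQuery finds changed ids; deltaImportQuery fetches each changed row -->
<entity name="item" pk="id"
      query="select * from item"
      deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
      deltaImportQuery="select * from item where id = '${dataimporter.delta.id}'">
<!-- cached child rows keyed on item_id and looked up by the parent's id -->
<entity name="feature" processor="CachedSqlEntityProcessor"
      query="select item_id, description from feature"
      cacheKey="item_id" cacheLookup="item.id"/>
</entity>
</document>
</dataConfig>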

There are many more changes; see the changelog for the complete list. There's a new DIHQuickStart wiki page which can help you get started faster by providing cheat sheet solutions. Frequently asked questions along with their answers are recorded in the new DataImportHandlerFaq wiki page.

A big THANKS to all the contributors and users who have helped us by giving patches, suggestions and bug reports!

Future Roadmap

Once Solr 1.4 is released, there is a slew of features targeted for Solr 1.5, including:

Multi-threaded indexing
Integration with Solr Cell to import binary and/or structured documents such as Office, Word, PDF and other proprietary formats
DataImportHandler as an API which can be used for creating Lucene indexes (independent of Solr) and as a companion to Solrj (for true push support). It will also be possible to extend it for other document oriented, de-normalized data stores such as CouchDB.
Support for reading Gzipped files
Support for scheduling imports
Support for Callable statements (stored procedures)

If you have any feature requests or contributions in mind, do let us know on the solr-user mailing list.

2 comments:

Unknown said...: Great article. Thanks Shalin. I have one question though. Is deltaQuery deprecated or has its usage just changed from Solr 1.3?

I've been using a pre release Solr 1.4 version of the DataImportHandler. As in the wiki example here, http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example, I use deltaQuery to get a list of the indexed ids for rows which have been updated in my database and then pass those ids into the deltaImportQuery. Is this functionality scheduled to change in the final release of Solr 1.4?

Thanks again,
Tim Garafola; 11:19 PM
Shalin Shekhar Mangar said...: Thanks Tim.

I'm sorry I wasn't very clear. Using only deltaQuery without specifying deltaImportQuery is the approach which has been deprecated. Using deltaImportQuery along with deltaQuery (as you are doing) is the preferred approach.

I've updated the post to reflect this.; 9:08 PM

Monday, September 28, 2009