A large number of new features have been added to DataImportHandler since it was introduced in Solr 1.3.0. Here's a quick look at the major ones:
Error Handling & Rollback
The ability to control behavior on errors was an oft-requested feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify one of the following as an attribute on the "entity" element:
- onError="abort" - Aborts the import process
- onError="skip" - Skips the current document
- onError="continue" - Continues as if the error never occurred
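As a minimal sketch of how these options might be combined, the fragment below aborts the import if the parent entity fails but carries on if a child lookup fails. The table names, column names and JDBC settings are hypothetical placeholders, not taken from a real schema.

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; substitute your own driver and URL -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
  <document>
    <!-- An error while reading an item stops the whole import -->
    <entity name="item" query="select id, name from item" onError="abort">
      <!-- An error while reading features is ignored and the import carries on -->
      <entity name="feature"
              query="select description from feature where item_id='${item.id}'"
              onError="continue"/>
    </entity>
  </document>
</dataConfig>
```

Replacing "continue" with "skip" on the child entity would instead drop the current document when the lookup fails.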
Event Listeners
An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:
public void onEvent(Context ctx);
For example, the listener can be specified as:
<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">
Push data to Solr through DataImportHandler
In Solr 1.3, DataImportHandler was pull based only. If you wanted to push data to Solr, e.g. through an HTTP POST request, you had no choice but to convert it to Solr's update XML format or CSV format. That meant that all the DataImportHandler goodness was not available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.
Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index:
<root>
<b>
<id>1</id>
<c>Hello C1</c>
</b>
<b>
<id>2</id>
<c>Hello C2</c>
</b>
</root>
We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:
<dataConfig>
<dataSource type="ContentStreamDataSource" name="c"/>
<document>
<entity name="b" dataSource="c" processor="XPathEntityProcessor"
forEach="/root/b">
<field column="desc" xpath="/root/b/c"/>
<field column="id" xpath="/root/b/id"/>
</entity>
</document>
</dataConfig>
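For the POST to reach DataImportHandler, the handler has to be registered in solrconfig.xml. A minimal sketch follows, assuming the handler is mapped to /dataimport and the configuration above is saved as data-config.xml (both names are assumptions chosen for illustration).

```xml
<!-- solrconfig.xml: register DataImportHandler under an assumed path -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Assumed file name for the dataConfig shown above -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With this in place, the XML document would be sent as the body of a POST request to that handler with command=full-import, and ContentStreamDataSource hands the posted stream to the XPathEntityProcessor.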
More Power to Transformers
New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.
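One way to try these flags without writing Java is a ScriptTransformer function, sketched below. It sets the $skipDoc flag so that the current document is not indexed. The "status" column, the table and the JDBC settings are hypothetical, and the sketch assumes a JVM with JavaScript scripting support.

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; substitute your own driver and URL -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
  <script><![CDATA[
    // Hypothetical rule: do not index rows whose status column is "draft"
    function skipDrafts(row) {
      if (String(row.get('status')) == 'draft') {
        row.put('$skipDoc', 'true');  // flag variable read by DataImportHandler
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" transformer="script:skipDrafts"
            query="select id, status, title from item"/>
  </document>
</dataConfig>
```

The other flags, such as $skipRow, $deleteDocById, $deleteDocByQuery and $stopTransform, are emitted the same way, by putting the flag name into the row as a key.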
New DataSources
- FieldReaderDataSource - Reads data from an entity's field. This can be used, for example, to read XMLs stored in databases.
- ContentStreamDataSource - Accepts HTTP POST data in a content stream (described above)
New EntityProcessors
- PlainTextEntityProcessor - Reads from any DataSource and outputs a String
- MailEntityProcessor (experimental) - Indexes mail from POP/IMAP sources into a Solr index. Since it requires extra dependencies, it is available as a separate package called "solr-dataimporthandler-extras".
- LineEntityProcessor - Streams lines of text from a given file to be indexed directly or for processing with transformers and child entities.
New Transformers
- HTMLStripTransformer - Strips HTML tags from input text using Solr's HTMLStripCharFilter
- ClobTransformer - Reads strings from Clob types in databases.
- LogTransformer - Logs data in a given template format. Very useful for debugging.
Other Improvements
- Optimized defaults for database imports
- Delta imports consume less memory
- A 'deltaImportQuery' attribute has been introduced which is used for delta imports along with 'deltaQuery' instead of DataImportHandler manipulating the SQL itself (which was error-prone for complex queries). Using only 'deltaQuery' without a 'deltaImportQuery' is deprecated and will be removed in future releases.
- The 'where' attribute has been deprecated in favor of the 'cacheKey' and 'cacheLookup' attributes, making CachedSqlEntityProcessor easier to understand and use.
- Variables placed in DataSource, EntityProcessor and Transformer attributes are now resolved, making highly dynamic configurations possible.
- JdbcDataSource can look up a javax.sql.DataSource using JNDI
- Revamped EntityProcessor APIs that make it easier to create custom EntityProcessors
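To show how several of the items above fit together, here is a minimal sketch of a database entity that uses a JNDI-provided connection, the deltaQuery/deltaImportQuery pair and the three new transformers. The JNDI name, table and column names are hypothetical. Depending on the database, column names may be returned in upper case, in which case the variable names would need to match.

```xml
<dataConfig>
  <!-- Assumption: the container exposes a javax.sql.DataSource under this JNDI name -->
  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/exampleDb"/>
  <document>
    <!-- deltaQuery finds the keys of changed rows;
         deltaImportQuery then fetches each changed row by its key -->
    <entity name="item" pk="id"
            transformer="ClobTransformer,HTMLStripTransformer,LogTransformer"
            logTemplate="Imported item ${item.id}" logLevel="debug"
            query="select id, description from item"
            deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="select id, description from item where id='${dataimporter.delta.id}'">
      <!-- clob="true" converts the Clob to a String; stripHTML="true" then removes markup -->
      <field column="description" clob="true" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Transformers run in the order listed, so the Clob is converted to a String before HTML stripping and logging take place.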
A big THANKS to all the contributors and users who have helped us by giving patches, suggestions and bug reports!
Future Roadmap
Once Solr 1.4 is released, there are a slew of features targeted for Solr 1.5, including:
- Multi-threaded indexing
- Integration with Solr Cell to import binary and/or structured documents such as Office, Word, PDF and other proprietary formats
- DataImportHandler as an API which can be used for creating Lucene indexes (independent of Solr) and as a companion to Solrj (for true push support). It will also be possible to extend it for other document oriented, de-normalized data stores such as CouchDB.
- Support for reading Gzipped files
- Support for scheduling imports
- Support for Callable statements (stored procedures)
2 comments:
Great article. Thanks Shalin. I have one question though. Is deltaQuery deprecated or has its usage just changed from Solr 1.3?
I've been using a pre-release Solr 1.4 version of the DataImportHandler. As in the wiki example here, http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example, I use deltaQuery to get a list of the indexed ids for rows which have been updated in my database and then pass those ids into the deltaImportQuery. Is this functionality scheduled to change in the final release of Solr 1.4?
Thanks again,
Tim Garafola
Thanks Tim.
I'm sorry I wasn't very clear. Using only deltaQuery without specifying deltaImportQuery is the approach which has been deprecated. Using deltaImportQuery along with deltaQuery (as you are doing) is the preferred approach.
I've updated the post to reflect this.