A lot of work is going into Lucene 2.9, and a large portion of it is related to adding support for near real-time search.
To put it very simply, search engines transfer a lot of work from query-time to index-time. This speeds up queries at the cost of making document additions slower. Until now, Lucene-based systems have had trouble with scenarios in which searchers need to see changes instantly (think Twitter Search). A variety of tricks and techniques exist to achieve this even now. However, near real-time search support in Lucene itself is a boon to everyone who has been building and managing such systems, because the grunt work will be done by Lucene itself.
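To make the trade-off concrete, here is a toy sketch in Java. It is not Lucene's API or implementation: the class `ToyIndex`, its `Reader`, and the methods `addDocument`, `openReader` and `search` are all hypothetical names made up for illustration. It assumes a tiny in-memory index where all the expensive work (tokenizing, building postings lists) happens when a document is added, so a query is just a map lookup, and where a reader only sees the documents that existed when it was opened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory inverted index. Illustrative only; not how Lucene is built. */
class ToyIndex {
    // term -> ids of the documents containing it, maintained at index time
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Index-time work: tokenize the text and update the postings lists. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Skip the id if this term already appeared earlier in the same document.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) docs.add(docId);
        }
        return docId;
    }

    /** Opens a point-in-time reader: a copy of the postings as they are right now. */
    Reader openReader() {
        Map<String, List<Integer>> snapshot = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            snapshot.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return new Reader(snapshot);
    }

    /** A searcher's view of the index, frozen at the moment it was opened. */
    static final class Reader {
        private final Map<String, List<Integer>> snapshot;

        Reader(Map<String, List<Integer>> snapshot) {
            this.snapshot = snapshot;
        }

        /** Query-time work: a single map lookup. */
        List<Integer> search(String term) {
            return snapshot.getOrDefault(term.toLowerCase(), Collections.emptyList());
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Lucene adds near real-time search");
        Reader before = index.openReader();

        index.addDocument("Solr is built on Lucene");
        Reader after = index.openReader();

        System.out.println(before.search("lucene")); // prints [0]
        System.out.println(after.search("lucene"));  // prints [0, 1]
    }
}
```

The older reader never sees the second document; a searcher has to open a new reader to pick up changes. This toy copies the whole index every time a reader is opened, which is exactly the kind of cost that makes "instant" visibility hard once an index grows large.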
Near real-time search is still under development and will probably take a few more months to mature. Solr will benefit from it as well, but before that can happen, a lot of work will be needed under the hood, particularly in the way Solr handles its caching.
Michael McCandless has summarized the current state of Lucene trunk in this email on the java-dev mailing list. In fact, there is so much activity that, at times, it becomes very difficult to follow all the excellent discussions that go on. There are some very talented people on that list, and there is a lot to learn for a guy like me, who started with Solr and is still trying to find his way around the Lucene code base.
Lucene 2.9 will bring huge improvements and I'm looking forward to working with other Solr developers to integrate them with Solr.
This blog has moved to http://shal.in.
Monday, April 20, 2009
About Me
- Shalin Shekhar Mangar
- Committer on Apache Solr. Principal Software Engineer at AOL.
2 comments:
Nice to see this being added. The long turnaround time for updates was the thing that forced us away from Lucene in a project a couple of years ago.
Indeed. This is a hard problem to solve for the general case. A variety of techniques are being tried and benchmarked. The benchmark module is also being enhanced simultaneously to create automated and repeatable tests.
In a recent test, a writer was asked for a reader every 3 seconds, and re-opening the underlying reader took 700 ms. The good news is that the re-open time remained fairly constant as the size of the index increased.