Apache Lucene has been developed and maintained for more than 20 years, and for many developers it is an integral part of their website and application builds. At its core, Lucene is a Java-based software library for full-text indexing and search.
It lets you add search capabilities to websites or applications from Java code. It takes content and adds it to a full-text index, which can then be queried. That content can be ingested from any number of sources, such as SQL or NoSQL databases, or even from the website itself.
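The add-then-query flow described above can be sketched in a few lines of plain Java. This is a toy illustration and not Lucene's API or implementation: the class TinyInvertedIndex and its add and search methods are hypothetical names, the tokenizer is a simple split on non-word characters, and the index lives entirely in memory. It maps each term to the ids of the documents that contain it, so a query is a lookup rather than a scan of the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy index (hypothetical, not Lucene): maps each term to the sorted ids of documents containing it. */
class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Splits the text into lower-cased terms and records the document id under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Looks up a single term; no document text is scanned at query time. */
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(0, "Lucene is a search library");
        index.add(1, "A book has an index at the back");
        index.add(2, "Lucene searches an inverted index");

        System.out.println(index.search("lucene")); // [0, 2]
        System.out.println(index.search("index"));  // [1, 2]
        System.out.println(index.search("solr"));   // []
    }
}
```

A real engine adds much more on top of this, such as text analysis, relevance scoring and on-disk storage, but the basic shape of indexing first and querying the index afterwards is the same.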
Doug Cutting first wrote the software in Java in 1999, and the project joined the Apache Software Foundation in 2001. To this day it remains one of the most active projects within the Apache Software Foundation.
Last year alone, nine versions were released, four committers became project management committee (PMC) members and seven community members became official committers. Ports of Lucene are also available in the following programming languages: Perl, C++, Python, Object Pascal, Ruby and PHP.
Forks include a version from Benipal Technologies, which states: “We are heavy Lucene users and have forked the Lucene / SOLR source code to create a high volume, high performance search cluster with MapReduce, HBase and katta integration, achieving indexing speeds as high as 3000 Documents per second with sub 20ms response times on 100 Million + indexed documents.”
One of the main reasons Apache Lucene is held in such high regard is that it returns search results quickly.
It does so because, rather than scanning the text or content directly, it searches an index built from that content. Known as an inverted index, it works in much the same way as the index at the back of a book: it maps each term to the places where that term appears. The engine itself is robust, and while it is commonly used with one thread per query, it can also execute a single query concurrently using multiple threads.
Michael McCandless, a PMC member and committer for the Apache Lucene project, explains this multi-threaded execution in detail in a blog post: “Lucene’s IndexSearcher class, responsible for executing incoming queries to find their top matching hits from your index, accepts an optional Executor (e.g. a thread pool) during construction.
“If you pass an Executor and your CPUs are idle enough (i.e. your server is well below its red-line QPS throughput capacity), Lucene will use multiple concurrent threads to find the top overall hits for each query.”
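The idea in the quote can be illustrated without Lucene itself. The following is a minimal plain-Java sketch, not Lucene's implementation: it assumes the index is split into independent slices (here just lists of strings), searches each slice as a separate task on a caller-supplied thread pool, and merges the per-slice hits into one top-N list. The names ConcurrentSearchSketch, Hit, searchSlice and search are hypothetical, and the scoring (counting occurrences of the term) is only a stand-in for real relevance scoring. The sketch takes an ExecutorService rather than a plain Executor so that it can collect results through Futures.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy sketch (hypothetical, not Lucene): one query searched across slices in parallel. */
class ConcurrentSearchSketch {

    /** Hypothetical hit type: a document and its score for the query term. */
    static final class Hit {
        final String doc;
        final int score;

        Hit(String doc, int score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Scores one slice; the score is simply how often the term occurs in a document. */
    static List<Hit> searchSlice(List<String> slice, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (String doc : slice) {
            int score = 0;
            for (String word : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.equals(wanted)) {
                    score++;
                }
            }
            if (score > 0) {
                hits.add(new Hit(doc, score));
            }
        }
        return hits;
    }

    /** Submits one task per slice, waits for all of them, then keeps the topN best hits. */
    static List<Hit> search(List<List<String>> slices, String term, int topN, ExecutorService executor) {
        List<Future<List<Hit>>> pending = new ArrayList<>();
        for (List<String> slice : slices) {
            Callable<List<Hit>> task = () -> searchSlice(slice, term);
            pending.add(executor.submit(task));
        }
        List<Hit> merged = new ArrayList<>();
        try {
            for (Future<List<Hit>> future : pending) {
                merged.addAll(future.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("slice search failed", e);
        }
        // Highest score first; the sort is stable, so ties keep slice order.
        merged.sort(Comparator.comparingInt((Hit h) -> h.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<List<String>> slices = List.of(
                List.of("lucene lucene search", "java library"),
                List.of("lucene index", "search search search engine"));
            for (Hit hit : search(slices, "search", 2, pool)) {
                System.out.println(hit.score + "  " + hit.doc);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

As the quote notes, the extra threads only help when the CPUs have spare capacity; on a server already near its throughput limit, splitting one query across threads mostly adds coordination overhead.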