April 4, 2023 (updated 22 May 2023, 11:02am)

What is Apache Lucene?

First written in 1999 by Doug Cutting, Apache Lucene is still going strong

By Tech Monitor Staff

Apache Lucene is a full-text search library maintained by the Apache Software Foundation. As a full-featured text search library, it searches a set of text documents for one or more keywords specified by the user.

But what is it exactly, and how does it work?

What is Apache Lucene?

Apache Lucene is a free, open-source software library. When it was first released in 1999 it was written entirely in Java, and Java remains its core language, but ports and bindings such as Lucene.NET and PyLucene now make it usable from other programming languages too.

Lucene was originally written by Doug Cutting and joined the Apache Software Foundation in 2001. To this day, it remains one of the most active projects in the Apache family.

Lucene allows developers to add search capabilities to websites or applications. It takes content and adds it to a full-text index, which can then be queried. That content can be ingested from any number of sources, such as a SQL database or the website itself.
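
As a rough illustration of the idea of a full-text index, here is a minimal Python sketch. It is a toy inverted index, not Lucene's API or its actual data structures, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical content, e.g. rows pulled from a SQL table or crawled pages.
docs = {
    1: "Lucene adds search to applications",
    2: "Content can come from a SQL database",
    3: "Search the website itself",
}
index = build_index(docs)
print(search(index, "search"))  # [1, 3]
```

The index is built once from the content, and each query then becomes a lookup rather than a scan of every document.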

What’s the use of Apache Lucene?

Lucene is not limited to web content: it can also be used with archives, libraries and servers, where it can index and search material such as HTML documents, e-mail and PDF files once their text has been extracted.

To understand how Lucene works, it helps to know its main components: the index, the documents it holds, and segments. The core of Lucene is the index, which is vital for search because every searchable document is stored in it. Before any search can run, the user must first create the index and add all the required documents to it. Each document is divided into fields, which hold information and keywords such as author names, titles and file names.
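
The relationship between documents and fields can be sketched as follows. This is an illustrative Python model only, assuming each document is a plain dictionary of field names to text; the field names mirror the examples above and the sample values are invented.

```python
from collections import defaultdict

# Hypothetical documents; each one is a set of named fields.
documents = [
    {"title": "Intro to search", "author": "Ada Smith", "filename": "intro.txt"},
    {"title": "Indexing basics", "author": "Bob Jones", "filename": "index.txt"},
    {"title": "Search at scale", "author": "Ada Smith", "filename": "scale.txt"},
]

def index_documents(documents):
    """Build postings keyed by (field, term) so a search can target one field."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        for field, value in doc.items():
            # set() ensures each document is listed once per term.
            for term in set(value.lower().split()):
                postings[(field, term)].append(doc_id)
    return postings

def search_field(postings, field, term):
    """Return the ids of documents whose given field contains the term."""
    return postings.get((field, term.lower()), [])

postings = index_documents(documents)
print(search_field(postings, "author", "ada"))    # [0, 2]
print(search_field(postings, "title", "search"))  # [0, 2]
```

Keeping the field name as part of the key is what lets a search for an author ignore matches in titles or file names.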

Once documents have been added to the index, tokenisation takes place. This breaks the text of each field into individual terms, which is what makes it possible to search by entering a single word or number. One of the simplest and most common approaches is whitespace tokenisation, in which a term ends wherever a space occurs. In “White car”, for instance, the space between the two words marks the end of the first term. Depending on the analyser in use, Lucene typically also converts capital letters to lowercase so that a search matches regardless of case.
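
The whitespace strategy and the lowercasing step can be sketched in a few lines of Python. The function names here are hypothetical, and a real Lucene analyser does considerably more than this.

```python
def whitespace_tokenize(text):
    """Split text into terms wherever whitespace occurs."""
    return text.split()

def analyze(text):
    """Tokenise on whitespace, then lowercase every term."""
    return [token.lower() for token in whitespace_tokenize(text)]

print(analyze("White car"))  # ['white', 'car']
```

Because the same analysis is applied to both the stored text and the search input, a search for "white" or "WHITE" would find the same term.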

In practice, the user enters search terms into a text box; these search terms are called queries. A query can consist of one or more words, optionally combined with operators such as “AND” and “OR” and with symbols like a plus (+) or a minus (-), which mark a term as required or excluded.
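
To show how the plus and minus symbols might shape a search, here is a small Python sketch. It is a simplified matcher written for this article, not Lucene's query parser: it handles only bare words, `+term` and `-term`, and does not interpret the words "and" or "or" as operators.

```python
def parse_query(query):
    """Split a query into required (+), prohibited (-) and optional terms."""
    required, prohibited, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, prohibited, optional

def matches(text, query):
    """Return True if the text satisfies the query."""
    words = set(text.lower().split())
    required, prohibited, optional = parse_query(query)
    if any(term in words for term in prohibited):
        return False
    if not all(term in words for term in required):
        return False
    if required:
        return True
    # With no required terms, at least one optional term must match.
    return any(term in words for term in optional)

docs = ["white car", "red car", "white bicycle"]
print([d for d in docs if matches(d, "+car -red")])  # ['white car']
```

A query of plain words, such as `white car`, behaves like an "or" in this sketch: any document containing either word matches.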

One of Lucene’s strengths is incremental indexing: individual entries can be added, modified or removed without having to rebuild the whole index in a batch.
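
The sketch below illustrates incremental indexing with a toy in-memory index in Python. The class and method names are hypothetical. It also takes a shortcut Lucene does not: it edits its postings in place, whereas Lucene writes new segments and marks deleted documents, merging segments later.

```python
from collections import defaultdict

class TinyIndex:
    """A toy index that supports adding, updating and deleting single documents."""

    def __init__(self):
        self.docs = {}                    # doc_id -> original text
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        """Add a document; if the id already exists, replace the old version."""
        if doc_id in self.docs:
            self.delete(doc_id)
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove one document without touching the rest of the index."""
        text = self.docs.pop(doc_id, "")
        for term in set(text.lower().split()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

idx = TinyIndex()
idx.add(1, "white car")
idx.add(2, "red car")
print(idx.search("car"))  # [1, 2]
idx.add(2, "red bicycle")  # modify document 2 only
print(idx.search("car"))  # [1]
idx.delete(1)              # remove document 1 only
print(idx.search("car"))  # []
```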

What’s the difference between Solr and Elasticsearch?

Both Apache Solr and Elastic’s Elasticsearch are built on top of Apache Lucene and, therefore, offer similar features and functions.

Solr, for instance, is an open-source search server that exposes Lucene’s functions over HTTP requests. Elasticsearch is also a search server, built for distributed use; it exposes a REST API that exchanges JSON and stores schema-free JSON documents, much like a NoSQL document store.
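
To give a feel for the difference, the Python sketch below builds (but does not send) a search request for each server. The host names, the Solr core `articles` and the Elasticsearch index `articles` are assumptions made up for illustration; the `/select` handler with a `q` parameter and the `_search` endpoint with a `match` query are the standard forms, but check the documentation for the version you run.

```python
import json
from urllib.parse import urlencode

def solr_select_url(base_url, core, query, rows=10):
    """Build a Solr search URL: the query travels as HTTP parameters."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/solr/{core}/select?{params}"

def elasticsearch_search_request(base_url, index, field, text):
    """Build an Elasticsearch search URL and its JSON request body."""
    url = f"{base_url}/{index}/_search"
    body = json.dumps({"query": {"match": {field: text}}})
    return url, body

# Hypothetical local servers on their default ports.
print(solr_select_url("http://localhost:8983", "articles", "title:lucene"))
url, body = elasticsearch_search_request(
    "http://localhost:9200", "articles", "title", "lucene"
)
print(url)
print(body)
```

In the Solr case the whole query fits in the URL, while the Elasticsearch request carries its query as a JSON document in the body.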

In terms of use, Solr has traditionally been geared towards enterprise text search with advanced information retrieval (IR). It works best with large amounts of static data and suits corporations or large working groups, in part because it also handles Rich Text Format (RTF) documents. Elasticsearch, on the other hand, revolves around scaling, data analytics and processing time series data to uncover meaningful insights and patterns. It is better suited to modern web applications where data is in JSON format.

When it comes to searching, both Solr and Elasticsearch offer near real-time (NRT) search and build their capabilities on Lucene’s. Solr, for instance, includes a sample search UI, Velocity Search, which demonstrates capabilities such as searching, faceting, highlighting and geo search. Elasticsearch’s aggregation framework is slightly more powerful.
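
Faceting, one of the capabilities mentioned above, means counting search hits by the value of some field. The Python sketch below shows the idea on invented data; it is not how Solr or Elasticsearch implement faceting or aggregations, and the function and field names are hypothetical.

```python
from collections import Counter

# Hypothetical documents with a field suitable for faceting.
docs = [
    {"title": "white car", "category": "vehicles"},
    {"title": "white paint", "category": "diy"},
    {"title": "red car", "category": "vehicles"},
    {"title": "white van", "category": "vehicles"},
]

def facet_counts(docs, term, facet_field):
    """Count the documents matching the term, grouped by a field's value."""
    hits = [d for d in docs if term in d["title"].lower().split()]
    return Counter(d[facet_field] for d in hits)

print(facet_counts(docs, "white", "category"))
# Counter({'vehicles': 2, 'diy': 1})
```

A search interface would typically display these counts next to the results so that users can narrow a search by category.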
