December 11, 2019

The Top Five Apache Software Projects in 2019: From Kafka to ZooKeeper

"We are heavy Lucene users and have forked the Lucene / SOLR source code to create a high volume, high performance search cluster with MapReduce"

By CBR Staff Writer

The Apache Software Foundation is 20 years old this year and now supports more than 350 open source projects, all maintained by a community of more than 770 individual members and 7,000 committers distributed across six continents. Here are the top five Apache projects of 2019, as listed by the foundation.


Top Five Apache Software Projects in 2019

1: Hadoop

Released in 2006, Apache Hadoop is an open source software library for the distributed processing of large datasets across clusters of computers using simple programming models. A key feature of Hadoop is that the library itself detects and handles failures at the application level. Essentially, it is a framework that facilitates distributed big data storage and processing.


The Java-based framework includes a storage layer called the Hadoop Distributed File System (HDFS). The file system splits large files into blocks, which are then spread across different nodes in a computer cluster. Hadoop Common forms the base of the framework, as it holds the common libraries and files that support the other Hadoop modules.
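
The block-splitting idea can be illustrated with a small Python sketch. This is a toy model only and not the HDFS API: the function names, the 4-byte block size and the node names are assumptions made for illustration, and the round-robin placement stands in for whatever policy a real cluster applies.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign each block index to a node in round-robin order (toy policy)."""
    placement = {node: [] for node in nodes}
    for index, node in zip(range(len(blocks)), cycle(nodes)):
        placement[node].append(index)
    return placement


# A 10-byte "file" cut into 4-byte blocks and spread over two hypothetical nodes
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(blocks, ["node-a", "node-b"])
```

Joining the blocks in index order reproduces the original file, which is why a reader only needs the block list and the placement map to reassemble it.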

Since Hadoop has the most visits and downloads of all of Apache’s software offerings, it’s no surprise that a long list of companies rely on it for their data storage and processing needs.

One such user is Adobe, which notes: “We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We constantly write data to Apache HBase and run MapReduce jobs to process then store it back to Apache HBase or external systems.”


2: Kafka

Apache Kafka – developed in 2011 – is a distributed streaming platform that lets developers publish and subscribe to streams of records, in a manner similar to a message queue. Kafka is used to build data pipelines that stream in real time. It is also used to create applications that react to or transform an ingested real-time data stream.


Kafka is written in the Scala and Java programming languages. It stores streams of records in a cluster in categories called topics; each record consists of a key, a value and a timestamp. It runs using four core APIs: Producer, Consumer, Streams and Connector. Kafka is used by many companies as a fault-tolerant publish-subscribe messaging system, as well as a means to run real-time analytics on data streams.
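
To make the topic and record model concrete, here is a minimal in-memory sketch in Python. It is not the Kafka client API: the Record and Topic classes and the publish and read_from methods are hypothetical names chosen for illustration, and a real cluster adds partitioning, replication and disk storage on top of this idea.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    timestamp: float


@dataclass
class Topic:
    name: str
    log: list = field(default_factory=list)

    def publish(self, key, value, timestamp=None):
        """Append a record to the log and return its offset."""
        stamp = time.time() if timestamp is None else timestamp
        self.log.append(Record(key, value, stamp))
        return len(self.log) - 1

    def read_from(self, offset):
        """Return every record at or after offset; reading removes nothing."""
        return self.log[offset:]


topic = Topic("page-views")
topic.publish("user-1", "/home", timestamp=1.0)
topic.publish("user-2", "/pricing", timestamp=2.0)

# Two independent subscribers, each starting from its own offset
first_reader = topic.read_from(0)
late_reader = topic.read_from(1)
```

Because reading leaves the log intact, a subscriber that falls behind or fails can simply start again from an earlier offset.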

The open-source software is used by LinkedIn, which originally developed the platform, for activity stream data and operational metrics. Twitter uses it as part of its processing and archival infrastructure: “Because Kafka writes the messages it receives to disk and supports keeping multiple copies of each message, it is a durable store. Thus, once the information is in it we know that we can tolerate downstream delays or failures by processing, or reprocessing, the messages later.”

3: Lucene

Lucene is a search engine software library that provides a Java-based search and indexing platform. The engine supports ranked searching as well as a number of query types, such as phrase queries, wildcard queries, proximity queries and range queries. Apache estimates that a Lucene index takes up roughly 20-30 percent of the size of the text indexed.
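
As a rough illustration of what a phrase query involves, the Python sketch below builds a positional inverted index and checks for terms that appear consecutively. It is a toy and not the Lucene API: build_index and phrase_query are hypothetical names, the sample documents are invented, and the sketch leaves out ranking, analysis and the other query types mentioned above.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to {doc_id: [positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return sorted ids of documents that contain the phrase's terms in a row."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # Every later term must sit exactly one position after the previous one
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "open source search engine",
    2: "search the open web",
    3: "engine search source",
}
index = build_index(docs)
matches = phrase_query(index, "search engine")
```

Document 3 contains both words but in the wrong order, so only document 1 matches the phrase.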

Lucene was first written in Java in 1999 by Doug Cutting, before the project joined the Apache Software Foundation in 2001. Ports are now available in other programming languages, including Perl, C++, Python, Object Pascal, Ruby and PHP.

Lucene is used by Benipal Technologies, which states: “We are heavy Lucene users and have forked the Lucene / SOLR source code to create a high volume, high performance search cluster with MapReduce, HBase and katta integration, achieving indexing speeds as high as 3000 Documents per second with sub 20 ms response times on 100 Million + indexed documents.”

4: POI

POI is an open-source API that programmers use to manipulate file formats related to Microsoft Office, such as the Office Open XML standards and Microsoft’s OLE 2 Compound Document format. With POI, programmers can create, display and modify Microsoft Office files using Java programs.

The German railway company Deutsche Bahn is among the major users. It is building a software toolchain to help establish a pan-European train protection system.

A part of that chain is a “domain-specific specification processor which reads the relevant requirements documents using Apache POI, enhances them and ultimately stores their contents as ReqIF. Contrary to DOC, this XML-based file format allows for proper traceability and versioning in a multi-tenant environment. Thus, it lends itself much better to the management and interchange of large sets of system requirements. The resulting ReqIF files are then consumed by the various tools in the later stages of the software development process.”

The name POI is an acronym for “Poor Obfuscation Implementation”, a joke by the original developers that the file formats they handled appeared to be deliberately obfuscated.

5: ZooKeeper

ZooKeeper is a centralised service for maintaining configuration information in distributed systems. It acts as a hierarchical key-value store, which is used for storing, managing and retrieving data. Essentially, ZooKeeper is used to synchronise applications that are distributed across a cluster.

Working in conjunction with Hadoop, it effectively acts as a centralised repository where distributed applications can store and retrieve data.
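
The hierarchical key-value idea can be sketched in a few lines of Python. This is an in-memory toy and not the ZooKeeper client API: the ToyTree class, its method names and the example paths are assumptions for illustration, and the sketch omits everything that makes the real service distributed.

```python
class ToyTree:
    """A path-addressed store in which every node needs an existing parent."""

    def __init__(self):
        self._nodes = {"/": b""}

    def create(self, path, data=b""):
        """Add a node; fail if it already exists or its parent is missing."""
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self._nodes:
            raise KeyError(f"node already exists: {path}")
        if parent not in self._nodes:
            raise KeyError(f"parent does not exist: {parent}")
        self._nodes[path] = data

    def get(self, path):
        """Return the data stored at path."""
        return self._nodes[path]

    def set(self, path, data):
        """Overwrite the data of an existing node."""
        if path not in self._nodes:
            raise KeyError(f"no such node: {path}")
        self._nodes[path] = data

    def children(self, path):
        """Return the sorted names of the direct children of path."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = ToyTree()
tree.create("/app")
tree.create("/app/config", b"timeout=30")
tree.create("/app/workers")
tree.set("/app/config", b"timeout=60")
```

Any application that shares the tree can read "/app/config" to pick up the same setting, which is the sense in which the store acts as a central repository.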

AdroitLogic, an enterprise integration and B2B service provider, states that it uses: “ZooKeeper to implement node coordination, in clustering support. This allows the management of the complete cluster, or any specific node – from any other node connected via JMX. A Cluster wide command framework developed on top of the ZooKeeper coordination allows commands that fail on some nodes to be retried etc.”
