What is Apache Kafka? In short, it is a way of moving data between systems – for example, between applications and servers.
It is often used to make multiple systems talk to each other smoothly: an intermediary between multiple data producers and consumers.
Despite this, it is not underpinned by a centralised process.
Rather, it is typically run as a cluster on one or more servers, which can span multiple datacenters.
What is Apache Kafka Used for?
Kafka was originally used to underpin feeds of website activity (e.g. page views, searches, or other actions).
It was designed to handle high volume activity tracking.
It is also used for operational monitoring data, and to collect log files off servers and put them in a central place.
Many businesses deploy it to underpin an external commit-log for a distributed system (working as a re-syncing mechanism to restore data on failed nodes).
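The re-syncing idea can be illustrated with a minimal sketch. This is a toy in-memory model, not the Kafka client API: the `CommitLog` class, the `rebuild_state` helper and the sample keys are all hypothetical names chosen for illustration. It assumes state is a simple key-value map in which later records overwrite earlier ones.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy append-only log; a stand-in for a single partition, not Kafka itself."""
    records: list = field(default_factory=list)

    def append(self, key, value):
        # Records are only ever added at the end; the offset is the position.
        self.records.append((key, value))
        return len(self.records) - 1

    def read_from(self, offset):
        # Return every (offset, key, value) at or after the given offset.
        return [(i, k, v) for i, (k, v) in enumerate(self.records) if i >= offset]


def rebuild_state(log, from_offset=0, state=None):
    """Replay the log to reconstruct a node's key-value state."""
    state = {} if state is None else dict(state)
    for _, key, value in log.read_from(from_offset):
        state[key] = value  # later records overwrite earlier ones
    return state


log = CommitLog()
log.append("user:1", "active")
log.append("user:2", "active")
log.append("user:1", "suspended")

# A node that crashed and lost its in-memory state replays from offset 0.
restored = rebuild_state(log)
print(restored)
```

Because the log is never edited in place, a node that kept a snapshot could also replay from a later offset rather than from the start.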
Who Created Apache Kafka?
Kafka was originally developed by LinkedIn.
It was open sourced in 2011, and graduated as a top-level Apache project in October 2012.
Seven years later, it remains one of the Apache Software Foundation’s top five projects, alongside Hadoop, Lucene, POI and ZooKeeper.
Thousands of companies are built heavily on Kafka, from Netflix to Airbnb, via LinkedIn.
In the UK, it underpins real-time analytics and predictive maintenance for British Gas.
The Kafka micro-site lists several main distributions, including:
The Confluent Platform
The Cloudera Kafka distribution
The Stratio Kafka source for Ubuntu and for RHEL
IBM Event Streams, built on Apache Kafka
The Strimzi distribution
As Confluent puts it: “At its heart lies the humble, immutable commit log, and from there you can subscribe to it, and publish data to any number of systems or real-time applications.
“Unlike messaging queues, Kafka is a highly scalable, fault tolerant distributed system, allowing it to be deployed for applications like managing passenger and driver matching at Uber.”
Kafka has four key APIs:
The Producer API allows an application to publish a stream of records to one or more Kafka topics.
The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
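The publish and subscribe roles described above can be sketched with a toy model. This is not the Kafka client library: `ToyBroker`, `publish`, `poll`, the group names and the `page-views` topic are all hypothetical, and the sketch assumes a single in-memory process with no partitions or replication. It shows only the core idea that producers append to a topic while each consumer group tracks its own read position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a Kafka cluster; every name here is hypothetical."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> ordered records
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, record):
        # Producer side: append a record to the end of a topic.
        self.topics[topic].append(record)

    def poll(self, group, topic):
        # Consumer side: return records this group has not yet seen,
        # then advance the group's offset past them.
        start = self.offsets[(group, topic)]
        records = self.topics[topic][start:]
        self.offsets[(group, topic)] = start + len(records)
        return records


broker = ToyBroker()
broker.publish("page-views", {"user": "alice", "page": "/home"})
broker.publish("page-views", {"user": "bob", "page": "/search"})

# Two independent consumer groups each receive the full stream.
analytics_batch = broker.poll("analytics", "page-views")
audit_batch = broker.poll("audit", "page-views")

broker.publish("page-views", {"user": "alice", "page": "/cart"})
analytics_next = broker.poll("analytics", "page-views")  # only the new record
```

Reading does not remove records from the topic, which is why the second group still sees everything the first group has already consumed.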
One of its most popular uses now is for stream processing, or querying continuous data streams to detect changing conditions.
Many users of Kafka process data in pipelines consisting of multiple stages, where raw input data is consumed and then aggregated, enriched, or otherwise transformed into new “topics” for further follow-up processing.
A stream processing library called Kafka Streams is available in Apache Kafka to perform this kind of data processing.
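A multi-stage pipeline of this kind can be sketched in plain Python. This is not Kafka Streams, which is a Java library. The stage functions `enrich` and `count_by_country`, the lookup table and the sample events are hypothetical, and lists stand in for topics. The sketch assumes each stage reads one topic and writes its results to the next.

```python
from collections import Counter


def enrich(events, country_by_user):
    # Stage 1: attach a country to each raw event (the lookup table is made up).
    for event in events:
        yield {**event, "country": country_by_user.get(event["user"], "unknown")}


def count_by_country(events):
    # Stage 2: running aggregate, emitting an updated count after every event.
    counts = Counter()
    for event in events:
        counts[event["country"]] += 1
        yield (event["country"], counts[event["country"]])


raw_topic = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/search"},
    {"user": "alice", "page": "/cart"},
]
country_by_user = {"alice": "NL", "bob": "UK"}

# Each stage reads one "topic" and writes the next.
enriched_topic = list(enrich(raw_topic, country_by_user))
counts_topic = list(count_by_country(enriched_topic))
print(counts_topic)
```

Emitting an updated count after every event, rather than one total at the end, is what lets a downstream consumer react to changing conditions as they happen.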
The Netherlands’ Rabobank uses Apache Kafka Streams to alert customers in real time to financial events.
Pinterest uses Apache Kafka and Kafka Streams to power real-time, predictive budgeting in its advertising infrastructure.
The New York Times uses it to store and distribute content, in real time, to the various applications and systems it provides to users.
A rich ecosystem of APIs, plugins, management consoles and cloud integrations has built up around Kafka. These are catalogued on the project’s ecosystem page.
Documentation for developers to get started is available on the Apache Kafka website.