Apache Kafka is an open-source event streaming platform that enables organizations to build real-time data streams and store data. It was developed at LinkedIn and released as open source in 2011. Since then, Kafka has become a widely used event streaming solution.
How is Kafka structured?
Apache Kafka is built as a distributed system consisting of servers and clients.
The servers, the so-called brokers, receive the data and store it with timestamps in topics, which creates the data streams. A cluster can hold many topics, which are distinguished by their topic name. In a company, for example, there could be one topic with data from production and another topic with data from sales. The brokers store these messages and can also distribute them across the cluster in order to spread the load more evenly.
The system that writes data into Kafka is called the producer. Its counterparts are the so-called consumers, which read the data streams and process the data, for example by storing it. Reading comes with a special feature that distinguishes Kafka: the consumer does not have to read messages continuously, but can also be run only at certain times. The intervals depend on how current the data needs to be for the application.
To ensure that the consumer really reads all messages, the messages are “numbered” with the so-called offset, i.e. an integer that increases with every new message, from the first message to the most recent one. When we set up a consumer, it “subscribes” to the topic and remembers the earliest available offset. After it has processed messages, it remembers which offsets it has already read and can pick up exactly where it left off the next time it is started.
This makes it possible, for example, to run the consumer for ten minutes at the top of every hour. It then registers that, say, ten messages have not yet been processed, reads these ten messages, and sets its own offset equal to that of the last message. In the following hour, the consumer can again determine how many new messages have been added and process them one after the other.
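The offset bookkeeping described above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not the Kafka client API: `TopicLog` and `PeriodicConsumer` are hypothetical names, and the consumer here stores the offset of the next message to read (the last processed offset plus one).

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """In-memory stand-in for a topic (hypothetical, not a Kafka class)."""
    messages: list = field(default_factory=list)

    def append(self, value):
        """Store a message and return the offset it was written at."""
        self.messages.append(value)
        return len(self.messages) - 1


@dataclass
class PeriodicConsumer:
    """Consumer that remembers where it stopped reading."""
    log: TopicLog
    next_offset: int = 0  # earliest offset when the consumer is first set up

    def lag(self):
        """Number of messages written but not yet processed."""
        return len(self.log.messages) - self.next_offset

    def poll(self):
        """Read everything that arrived since the last run and advance the offset."""
        batch = self.log.messages[self.next_offset:]
        self.next_offset += len(batch)
        return batch


log = TopicLog()
consumer = PeriodicConsumer(log)

# Ten messages arrive before the consumer's first hourly run.
for i in range(10):
    log.append(f"event-{i}")
pending_before_first_run = consumer.lag()
first_run = consumer.poll()

# Three more messages arrive before the next run.
for i in range(10, 13):
    log.append(f"event-{i}")
second_run = consumer.poll()
```

The second call to `poll()` returns only the three new messages, because the consumer resumes at the offset it stored after the first run.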
A topic can also be divided into so-called partitions, which can be used to parallelize processing because the partitions can be stored on different brokers. Offsets are then counted separately within each partition. This also allows several consumers to read from a topic at the same time, and the total storage space for a topic can be scaled more easily.
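How messages might be spread across partitions can be illustrated with a small sketch. It assumes that each message carries a key and that the partition is chosen by hashing that key. The function names are hypothetical, and `zlib.crc32` is only a stand-in for a stable hash, not the hash function Kafka itself uses.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash (illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, num_partitions):
    """Distribute (key, value) records over partitions, keeping arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


records = [("sales", 1), ("production", 2), ("sales", 3), ("production", 4)]
partitions = route(records, num_partitions=3)

for index, partition in enumerate(partitions):
    print(index, partition)
```

Because the same key always hashes to the same partition, all messages for one key stay in order within that partition, while different partitions can be read in parallel.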
What are the capabilities of Apache Kafka?
Apache Kafka offers two different types of topics, namely “normal” and “compacted” topics. Normal topics have a maximum storage size and often also a retention time for which data is kept. When the storage limit is reached, the oldest events, i.e. those with the smallest offsets, are deleted to make room for new events.
Compacted topics work differently. They have neither a time nor a storage limit. Instead, Kafka retains at least the most recent message for each key and may discard older messages with the same key. They are therefore similar to a database in that the current state of the data is retained. Because of these compacted topics, some companies, such as Uber, are already using Kafka as a data lake. The data stream can also be queried using SQL via the additional KSQL functionality (now called ksqlDB).
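The difference between the two topic types can be made concrete with a simplified sketch. It models a log as a list of `(offset, key, value)` tuples. The helper names are hypothetical, and counting messages instead of bytes is a simplification of how a real storage limit would be measured.

```python
def apply_size_retention(log, max_messages):
    """Normal topic: drop the oldest records (smallest offsets) beyond the limit."""
    overflow = max(0, len(log) - max_messages)
    return log[overflow:]


def compact(log):
    """Compacted topic: keep only the most recent record for every key."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values(), key=lambda record: record[0])


log = [
    (0, "user-1", "a"),
    (1, "user-2", "b"),
    (2, "user-1", "c"),  # newer value for user-1
    (3, "user-3", "d"),
]

retained = apply_size_retention(log, max_messages=2)
compacted = compact(log)
```

In this sketch, `retained` holds only the records at offsets 2 and 3, whereas `compacted` still has one record for each of the three keys and has dropped only the outdated value for `user-1`.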
What applications can be implemented with Kafka?
Kafka is used to build and analyze real-time data streams. It is therefore well suited to applications in which data timeliness plays a major role. These include, among others:
- Production line monitoring
- Analysis of website tracking data in real time
- Merging data from different sources
- Change data capture from databases to detect changes
In addition, machine learning is increasingly becoming a focus of Apache Kafka applications. An e-commerce website, for example, runs many ML models whose results must be available in real time. Such websites often have a recommendation function that suggests suitable products to customers based on their previous journey.
The model therefore needs the previous events in real time to start a calculation. The website, in turn, needs the result of the model as quickly as possible in order to display the recommended products to the customer. Apache Kafka lends itself to both of these data streams.
What are the benefits of using Apache Kafka?
Kafka’s widespread popularity is due in part to these advantages:
- Scalability: Thanks to the architecture with topics and partitions, Kafka is horizontally scalable in many respects, such as storage space and performance.
- Possibility of data storage: Compacted topics make it possible to store data permanently.
- Ease of use: The concept of producers and consumers is easy to understand and implement.
- Fast processing: In real-time applications, not only the fast transport of data is of great importance, but also fast processing. With the so-called Streams API (Kafka Streams), Apache Kafka offers an easy way to process real-time data.
This is what you should take with you
- Apache Kafka is an open-source event streaming platform that enables companies to build real-time data streams and store the data.
- In short, producers write their data into so-called topics, which can be read by consumers.
- In normal topics, the oldest data is deleted once a certain amount of storage has been reached. In compacted topics, the most recent message for each key is never deleted. Therefore, they are suitable for long-term data storage.
- Apache Kafka is popular among users mainly because of its ease of use and scalability.