20.05.2021 | Sascha Müller | comment icon 0 Comment

Kafka Cluster Setup with Docker and Docker Compose

Today I’m going to show you how to setup a local Apache Kafka cluster for development using Docker and Docker Compose.

I assume you have a basic understanding of Docker and Docker Compose and already got it installed. So let’s get started.

What is Apache Kafka?

Kafka was initially developed at LinkedIn as a messaging queue. It was then open sourced and given to the Apache Software Foundation in 2011 and quickly became a full-fledged event streaming platform. It is used to store and process data in (almost) real-time, all while being highly scalable and fault tolerant. Apache Kafka combines three key capabilities:

  • to publish and subscribe to streams of events
  • to store streams of events
  • to process streams of events

Kafka is used in a variety of sectors including:

  • Processing payments and other transactions in real-time
  • Track and monitor things, e.g. cars and trucks
  • Capture and analyze sensor data from IoT devices
  • and much more

There are also lots of commercial offerings available, even one from the original developers of Kafka who founded a company especially for that – Confluent. They offer a complete event streaming platform in the cloud without the management overhead (setting up Kafka securely at scale is no easy task). Also they offer advanced tools and services to make data ingestion, integration and processing easier.

Why should you bother?

Publish / Subscribe is a design pattern that provides a framework for exchanging messages between publishers and subscribers. It relies on a message broker that relays messages from the publisher to the subscribers. This allows for loose coupling of services.

Let’s say we were to develop a few services:

  • an account service that lets customers add and change personal data like “address” and “phone number”
  • an invoice service, that takes the data from the account service to generate and send invoices to your customers.

Now, imagine a customer (Alice) changes her invoice address. Right after that, our account service goes down, but your invoice service wants to issue a new invoice to Alice. It tries to reach the account service for the invoice address but fails. If your invoice service has it’s own data source, it can use the address of Alice from there, but that might be outdated, therefore sending the invoice to the wrong address.

Current microservice architecture without Kafka
Current microservice architecture without Kafka

Kafka can help with that in that way, that you can use loose coupling. Alice changes her address and the account service sends this information to Kafka. The invoice service is notified of the new address of Alice through Kafka, no matter if the account service is down or not.

Also, we can now add other services that need to know of Alice’s new address. All services, that are interested in the address change, just need to register with Kafkas “address”-topic and will get notified in case of a change. That way, adding more services won’t put extra pressure on the account service, ultimately leading to scalability. 

Proposed architecture to mitigate downtime
Proposed architecture to mitigate downtime

Kafka can also be used to take stress off a system. If a producer produces more messages than a consuming system can handle, Kafka can be used to store and therefor “buffer” these messages until the consumer can keep up again or you can spin up another instance. This happens a lot in Big Data applications and microservice or IoT architectures, where some systems produce a lot of messages and others do the “heavy lifting”.

Setup a Kafka cluster

Kafka uses Apache Zookeeper for storing meta information about topics, for leader election1 and so on. Communication between Zookeeper and the brokers relies on docker network, so make sure they use the same network.
I’m going to use pre-built docker images from bitnami as they are very stable and up to date.

Let’s start with a simple docker-compose.yml file:

As you can see, we create a new network “kafka-cluster” and a service “zookeeper”. Zookeeper uses the aforementioned network. We set the container_name to “zookeeper” and allow access to port 2181.

Next, we’re going to add one Kafka broker with the name kafka1 to the services section:

We tell Kafka where Zookeeper can be reached and that it should wait on Zookeeper to start up (depends_on: – zookeeper). Also “kafka1” uses the same network as “zookeeper”. The other configuration options define where Kafka can be reached and how. There are a lot more settings you can tinker with, but that’ll do for now.

You could run your cluster with “docker-compose up -d” now, but it wouldn’t be much fun. Only one Kafka server doesn’t make a cluster, right?

So, we’re going to add two more brokers to the docker-compose.yml file and also a little management tool called Kafdrop. Kafdrop is a web UI that displays information such as brokers, topics, partitions and consumers. It also lets you manage the cluster to a certain extend.

Kafdrop UI
Kafdrop UI

Here’s the complete docker-compose file:

Pay attention to one little detail: in the environment section of the Kafka brokers we gave each broker a different port to use on the host system. Kafdrop uses these to connect to each broker individually as will your producers and consumers.

Setting up a Kafka cluster on individual machines however is a whole different level as you have to pay special attention to network configuration and network speed, broker timings, disk-space and so much more. It’s also highly dependent on your topics, the number of topics and the size of your messages etc. and is far beyond this introduction.

For example, if you set “log.flush.interval.ms” to a high value, more messages get stored in your RAM, before they are written to disk. Setting this to a low value, your RAM usage will be lower, but your disk will get more stressed.

This is just one of over a hundred different settings to take care and be aware of.

Run the Kafka cluster

If you issue a “docker-compose up -d” now, Zookeeper, the Kafka brokers and Kafdrop should start up and should shortly be ready to use. Type http://localhost:9000 into your browser to go to the Kafdrop UI.

Next up in this series: “How to create a Java Producer and Consumer”. This is what what Kafka clients are called – producers and consumers. There will also be a post about producers and consumers with Spring Boot.

See you next time.

1: Kafka manages topics which are divided into separate partitions – meaning a topic is spread over a number of “buckets” on different brokers. Each partition has exactly one partition leader which handles all read/write requests for that partition. If that leader fails for any reason the cluster needs to elect a new leader for that partition. This process is called leader election. You can read more about that topic here: http://kafka.apache.org/documentation/#replication

apache zookeeper cluster docker docker compose kafka

Leave a Comment