In the world of streaming, Kafka has made its way into the Big Data hall of fame. Kafka 0.10 introduced Kafka Streams, a simple stream processing library (yes, not a framework, but a really simple library). This post intends to demonstrate how easy it can be to build a stream processing application with Kafka Streams.
The increased popularity of Kafka is due not only to its simplicity, but also to the way it models streams as a distributed, persisted commit log. The Kafka server handles replication, partitioning and data distribution for you, and it scales extremely well.
I’ve been using Kafka in production for the past two years. During these years, Kafka evolved a lot, especially in terms of the API. The introduction of Kafka Streams adds just one more layer on top of Kafka: a simple library that provides the common primitives for dealing with data (e.g. transformation, filtering, map-reduce). As expected, reading and writing data to Kafka topics with this library is incredibly simple.
Stream processing these days
Today we can see an explosion of frameworks for real-time stream processing, the most popular being Apache projects (e.g. Apache Storm, Apache Samza, Apache Flink, Apache Spark). They all have their particularities, advantages and drawbacks. They mostly differ in the complexity of the API, the ability to consume from different sources and the pre-integration with other Big Data frameworks (e.g. Hadoop). The real main difference, from my point of view, lies in the deployment and management of clustered processes.
The main difficulty our teams faced with this kind of framework was how to integrate it with existing infrastructure automation tools (e.g. Chef) so that we could perform production releases with the click of a button, at least when compared to the deployment of regular services.
If you look at your company’s streaming big picture and you see that all kinds of messaging are taken care of by Kafka, most likely Kafka Streams will be enough for processing those streams and scaling at will. Kafka will handle the distribution of data across the multiple application instances and will reassign the partitions if a node crashes. Obviously you will have to take care of some parts of resiliency yourself (e.g. retries).
In my view, this is a trade-off. If you are using a framework like Apache Storm, you are taking advantage of Nimbus managing the processes across a cluster automatically (e.g. process placement, restart of dead nodes, etc.), but that also comes with the cost of a more complex deployment infrastructure to build and manage. Even if you need that kind of dynamics, you can still use Kafka Streams combined with general-purpose resource management frameworks like Kubernetes or Mesos, especially if you are already using that kind of framework for something else.
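As an illustration of the kind of resiliency you may need to write yourself, here is a minimal retry sketch in plain Java. It is not part of Kafka Streams: the class RetrySketch, the method withRetries and its parameters are hypothetical names, and doubling the delay after each failure is just one possible policy.

```java
import java.util.concurrent.Callable;

class RetrySketch {

    // Runs the action up to maxAttempts times. After each failed attempt
    // (except the last) it sleeps, doubling the delay every time.
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs *= 2;
                }
            }
        }
        // Every attempt failed: surface the last error to the caller.
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical flaky call that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok after " + calls[0] + " attempts";
        }, 5, 10);
        System.out.println(result);
    }
}
```

In a real application you would probably retry only on errors you know to be transient and cap the total delay.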
What are we building?
Let’s build the “Hello World” of stream processing applications (Word Count) using Kafka Streams. We will have a simple application reading from a Kafka topic, counting words and publishing a count per word to an output topic. I used Docker to run the Kafka and Zookeeper servers, but feel free to try it in any other environment of your choice.
So if you are using Docker, you can just create a compose file that uses Wurstmeister’s Kafka and Zookeeper images.
To start Kafka and Zookeeper, just start the compose environment with docker-compose up.
If you want it to run in the background, append the -d flag to that command.
Kafka Streams Word Counter
So first we need to set up our Maven dependencies properly. We will just need the kafka-streams dependency (groupId org.apache.kafka, artifactId kafka-streams).
And then we create a simple class with a main function that will do the following:
- Consume a string from the topic called words-topic
- Split it into multiple words
- Map each word as a key
- Count by key
- Transform the values to a string in the format word:count (just because)
- Publish the result to the topic called counts-topic
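To make the middle steps concrete, here is a rough sketch of the split, key, count and format logic in plain Java. It deliberately uses only the JDK rather than the Kafka Streams API, so it is a stand-in for the transformation, not the application itself: the names WordCountSketch, countWords and format are hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {

    // Splits every line into words, uses each word as the key and counts by key.
    // A TreeMap keeps the words sorted so the output order is predictable.
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    // Turns each (word, count) pair into a string in the format word:count.
    public static List<String> format(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for messages consumed from words-topic.
        List<String> lines = Arrays.asList("hello kafka streams", "hello world");
        // Stand-in for publishing to counts-topic.
        format(countWords(lines)).forEach(System.out::println);
    }
}
```

One difference worth keeping in mind: this sketch counts a finite list once, whereas a streaming application keeps running and emits updated counts as new messages arrive.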
With a small amount of code and setup we built a stream processing application. If we have two of these applications running and we kill one of them on purpose, Kafka will automatically fail over the partitions of the dead node to the other one.
This application lives as a standalone application, as if it were a simple web service that can be easily started and stopped, but reading from a stream instead of processing HTTP requests.
It’s far simpler to build a Chef recipe for deploying this application than it is to deploy an entire Storm cluster and submit the topologies to it upon a release. But as I mentioned, they are different things and a framework like Storm provides many more features, so it’s perfectly legitimate if you want to do that.
To easily validate the above application as a proof of concept, one can use the kafkacat command: produce a few lines to words-topic in producer mode (-P) and read the results from counts-topic in consumer mode (-C).
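To picture what that failover means, here is a toy model in plain Java. It is not how Kafka actually assigns partitions (the real assignment logic is more involved): it only spreads partitions round-robin over whichever instances are alive. The class FailoverSketch, the method assign and the instance names app-1 and app-2 are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FailoverSketch {

    // Spreads partitions 0..partitionCount-1 round-robin over the live instances.
    public static Map<String, List<Integer>> assign(List<String> liveInstances, int partitionCount) {
        if (liveInstances.isEmpty()) {
            throw new IllegalArgumentException("at least one live instance is required");
        }
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String instance : liveInstances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = liveInstances.get(partition % liveInstances.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> instances = new ArrayList<>(Arrays.asList("app-1", "app-2"));
        // Two instances share the four partitions.
        System.out.println(assign(instances, 4));
        // Simulate killing app-2: the survivor takes over every partition.
        instances.remove("app-2");
        System.out.println(assign(instances, 4));
    }
}
```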
For unit testing, I’d recommend taking a look at Testing Topologies in Kafka Streams by @madewithtea, who built a simple library in Scala that allows one to create unit tests for streaming applications without launching a full or embedded Kafka cluster.
In this post we saw how easy it is to build a stream processing application using Kafka Streams. Whether to use Kafka Streams or a full stream processing framework is a trade-off that depends on many factors (deployment dynamics, stream sources, etc.). I’d say it will mostly depend on the use case, the streaming big picture of your company and the business impact of changing and maintaining each solution.
If you are dealing only with Kafka for messaging, it’s probably worth considering Kafka Streams as an option, as it’s simple to build and deploy. Of course it does not offer the process management facilities that a framework like Apache Storm does, but if your case does not need them, it is a good way to keep things simple.
Streaming out. Have fun!