Beer reviews and other ramblings

Niceties missing from Apache Kafka

Kafka is awesome, but it’s not new. It’s a technology that has been gaining traction for some time, and has been adopted by many of the biggest names in the tech world. But it doesn’t have all the bells and whistles of newer “modern” technologies. Distributed systems are difficult to create, and some are more polished than others. So this is a short, non-exhaustive list of features I’ve found lacking in Kafka.

To be clear, these are niceties. Kafka developers seem to have focused thus far on stabilizing core functionality and maintaining API compatibility, and for that I’m grateful. As of this writing, we’ve encountered no bugs with Kafka itself. It works as advertised, which can be a real relief with software.

I’m hesitant to compare Kafka to Elasticsearch, so this isn’t a comparison. That being said, Elasticsearch is a lot shinier than Kafka, and this post was inspired by some of the things I’ve come to take for granted in Elasticsearch. Obviously their functionality is different, but they are both distributed systems, and Elasticsearch is significantly more user-friendly.

Health check

Is the cluster healthy? Good luck finding out. Kafka doesn’t have a native health check API or command. One way to determine health is to run kafka-topics --describe and compare the total number of replicas for each partition to the number of in-sync replicas (ISR). If there are out-of-sync replicas, then the partition (and thus the cluster) is more prone to data loss. If there is no leader for a partition, then the partition is unable to serve requests, and your cluster is unhealthy.

What, you want something simple, like green?

Me too. Where I work (at ZipRecruiter) we use an open source tool, kafka-health-check, which can run alongside each of your brokers (though that isn’t required) and monitors them to determine a health status (green, yellow or red). It then provides a REST API around which you can build monitoring/alerting. It adds essential functionality to Kafka, and it’s free. That being said, it would be best if this were a native feature.
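To make the replica-versus-ISR comparison concrete, here is a rough sketch of a status check built on the text printed by kafka-topics --describe. The line format in the regex and the green/yellow/red mapping are my assumptions for illustration (the output differs a little between Kafka versions), and this is not how kafka-health-check itself is implemented.

```python
import re

# One partition line of `kafka-topics --describe` output (format assumed), e.g.
#   Topic: foo  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+"
    r"Replicas:\s*(?P<replicas>[\d,]*)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def parse_ids(csv_ids):
    """Turn a string like '1,2,3' into a set of broker ids."""
    return {int(x) for x in csv_ids.split(",") if x}


def cluster_health(describe_output):
    """Return 'red' if any partition has no leader, 'yellow' if any
    partition has fewer in-sync replicas than replicas, else 'green'."""
    status = "green"
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # topic summary lines and blank lines
        # Older versions print -1 for a missing leader, newer ones print "none".
        if match.group("leader") in ("-1", "none"):
            return "red"
        replicas = parse_ids(match.group("replicas"))
        isr = parse_ids(match.group("isr"))
        if len(isr) < len(replicas):
            status = "yellow"
    return status
```

You would feed it the captured output of the describe command and alert on anything other than green.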

Automatic partition reassignment

There will come a time when you need to scale up your Kafka cluster. So you think “I’ll add more brokers” and call it a day. Except your new brokers aren’t doing anything yet. You’ll need to generate new partition assignments for your topics to include the new brokers, then tell Kafka to execute the partition reassignments to actually move partitions.
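For a sense of what a partition assignment looks like, here is a minimal sketch that builds one as JSON, in the shape I understand kafka-reassign-partitions to accept as its reassignment file. The build_reassignment function and its naive round-robin placement are hypothetical. It ignores racks and the current placement, so it may move far more data than necessary; treat it as an illustration of the file format, not a replacement for the tool's own --generate option.

```python
import json


def build_reassignment(topic, num_partitions, replication_factor, broker_ids):
    """Spread each partition's replicas round-robin over the given brokers."""
    brokers = sorted(broker_ids)
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds the number of brokers")
    partitions = []
    for partition in range(num_partitions):
        # Start each partition on the next broker and wrap around the list.
        replicas = [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical topic with 4 partitions, now spread over brokers 1 to 4.
    print(json.dumps(build_reassignment("foo", 4, 2, [1, 2, 3, 4]), indent=2))
```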

Want to check on the status of the partition migration? You’ll need to whip up some code to get a percentage complete, since the existing tools just print a list of in-progress and done partitions. Want estimated time remaining or transfer rates? You’ll need to calculate those yourself based on data from observability tools.
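The "whip up some code" part can be quite small. The sketch below counts done and in-progress lines in the text printed when verifying a reassignment, and estimates time remaining from a byte count and a transfer rate that you would have to pull from your own monitoring. The phrases it matches are assumptions about the tool's wording, which has changed between Kafka versions, and failed partitions are simply not counted.

```python
def reassignment_progress(verify_output):
    """Return (done, total, percent_complete) for a verify run's output."""
    done = in_progress = 0
    for line in verify_output.splitlines():
        lowered = line.lower()
        if "in progress" in lowered:
            in_progress += 1
        elif "completed successfully" in lowered or "is complete" in lowered:
            done += 1
    total = done + in_progress
    percent = 100.0 * done / total if total else 0.0
    return done, total, percent


def eta_seconds(bytes_remaining, bytes_per_second):
    """Estimate seconds remaining; None if nothing is being transferred."""
    if bytes_per_second <= 0:
        return None
    return bytes_remaining / bytes_per_second
```

The byte counts have to come from somewhere else (broker disk usage or network metrics, for example), which is exactly the inconvenience described above.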

Easy metrics

“How is Kafka performing?”, you might wonder.

“I’ll check the metrics API”, you might think.

Kafka doesn’t have a standard human-friendly metrics API. It exports metrics via JMX (Java Management Extensions), which is not a human-friendly protocol; it was designed with distributed debugging and monitoring in mind. It was last updated in 2008*.

So, how do you inspect metrics? You’ll need something to query the JMX endpoint of the Kafka process, then export the metrics to a monitoring system. At work we use jmxtrans (which is open source) to send metrics to Graphite. Other options are Datadog (I’ve heard good things) and the Confluent Control Center, which is available only to Enterprise customers ($$$) of Confluent.

Ideally, there should be a kafka-metrics tool which prints some of these metrics to stdout in plain text or JSON format. There is kafka.tools.JmxTool, but it requires knowing which metrics (MBeans) you care about beforehand. kafka-metrics could wrap JmxTool with a good list of default metrics to query, and ideally would be updated to optionally produce JSON.
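As a rough idea of the JSON half of that wish, here is a sketch that converts CSV-style text (a header row of attribute names followed by rows of samples, which is how I understand JmxTool to print its results) into JSON. The jmx_csv_to_json name and the assumed input format are mine; nothing like this ships with Kafka.

```python
import csv
import io
import json


def to_number(value):
    """Convert a cell to float when possible, otherwise leave it alone."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


def jmx_csv_to_json(csv_text):
    """Turn CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {name: to_number(value) for name, value in row.items()}
        for row in reader
    ]
    return json.dumps(rows, indent=2)
```

Pipe the tool's output into it and you have something that jq or a dashboard can consume.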

* old != bad

Modern service management

Software crashes and requires restarting – it’s a fact of life.

Kafka comes with a SysV-style init bash script for managing the Kafka service, and by default the Kafka process is not restarted automatically if it crashes. And yes, it crashes. It’s Java, after all. Eventually you’ll run out of heap space or some other resource (we have run out of mmap pointers in the past).

With a more recent service management utility like Upstart or systemd, it’s really easy to configure the service to restart automatically. With SysV init, you need to modify /etc/inittab to add a line for Kafka. And then reboot, apparently? It would be nice if configs were included for modern service management systems.

Version command

Which version of Kafka is running? Regrettably, there’s no --version argument for any Kafka executables. There are at least three ways to determine which version of Kafka you have:

1. With JmxTool, which is really designed for printing metrics. For example, this command will print the version (and timestamp) repeatedly until stopped:

2. Consult the package manager of your OS. If you’re using the Confluent Platform distribution of Kafka, you’ll then need to associate the Confluent version with an actual Kafka release.
3. Inspect the classpath of the running process.
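The third option can be scripted. This sketch pulls the version out of a classpath string, relying on the core jar being named kafka_<scala version>-<kafka version>.jar (for example kafka_2.11-1.0.0.jar). The function name and the example paths are hypothetical, and distributions that append a suffix to the version may need a tweak.

```python
import re

# Core Kafka jar, e.g. kafka_2.11-1.0.0.jar (Scala 2.11, Kafka 1.0.0).
KAFKA_JAR = re.compile(
    r"kafka_(?P<scala>\d+\.\d+(?:\.\d+)?)-(?P<kafka>\d+(?:\.\d+)+)[^/:]*\.jar"
)


def kafka_version_from_classpath(classpath):
    """Return the Kafka version found in a ':'-separated classpath, or None."""
    for entry in classpath.split(":"):
        filename = entry.rsplit("/", 1)[-1]
        match = KAFKA_JAR.search(filename)
        if match:
            return match.group("kafka")
    return None
```

On Linux the classpath itself can be read from the process arguments, for instance in the output of ps for the Kafka process.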

The good news is that there’s an open pull request to add a --version command. Unfortunately, if you’re on an old version you obviously won’t be able to use it. You can, however, use that to your advantage: if you run some-kafka-tool --version and it doesn’t work, you know you are running a version earlier than 1.0.1.

In Conclusion

While I think the above features are important, their (hopefully temporary) absence is not going to prevent me from using, or advocating in favor of, Kafka. It’s a scalable and robust tool, and we’re building a lot of infrastructure with it at work. More on that to come. Probably.



2 thoughts on “Niceties missing from Apache Kafka”

  • Very good post.

    Two comments:
    – systemd scripts will be available with the next Confluent Open Source release, coming in a few weeks.
    – Confluent Auto Databalancer automates the partition rebalancing (but it is only available in the enterprise edition).

    I think some of your other missing features will also come in the next releases.

    • Kai, thanks for the comments! That’s great news about the systemd script. Can I convince you to ship an Upstart script too? And do you have any more details on which of the features will be added in the future?
