What makes Kafka high in throughput?

Lots of detail on what makes Kafka different from, and faster than, other messaging systems is in Jay Kreps' blog post here:

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

There are actually a lot of design differences that make Kafka perform well, including but not limited to:

  • Maximized use of sequential disk reads and writes
  • Zero-copy processing of messages
  • Use of Linux OS page cache rather than Java heap for caching
  • Partitioning of topics across multiple brokers in a cluster
  • Smart client libraries that offload certain functions from the brokers
  • Batching of multiple published messages into fewer network round trips to the broker (see the producer sketch after this list)
  • Support for multiple in-flight messages
  • Prefetching data into client buffers for faster subsequent requests.
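To make the batching point concrete, here is a minimal producer sketch using the standard Java kafka-clients API. The broker address, topic name, and tuning values are illustrative assumptions, not recommendations; the idea is simply that the producer trades a few milliseconds of latency for fewer, larger requests:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Throughput-oriented settings: wait up to 10 ms to fill a batch,
        // allow batches up to 64 KB, and compress them on the wire.
        props.put("linger.ms", "10");
        props.put("batch.size", String.valueOf(64 * 1024));
        props.put("compression.type", "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                // Sends are asynchronous; records are grouped into per-partition
                // batches, so many records share one network round trip.
                producer.send(new ProducerRecord<>("events", Integer.toString(i), "payload-" + i));
            }
            producer.flush(); // push out any records still sitting in a batch
        }
    }
}
```

Even with linger.ms left at its default of 0 the producer still batches whatever accumulates while a send is in flight; the deliberate linger just makes the throughput/latency trade explicit.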

It's largely marketing that Kafka is fast for a message broker. For example, IBM MessageSight appliances did 13M msgs/sec with microsecond latency in 2013, on one machine, a year before Kreps even published the benchmark post linked above: https://www.zdnet.com/article/ibm-launches-messagesight-appliance-aimed-at-m2m/

Kafka is good for a lot of things. True low-latency messaging is not one of them. You flatly can't use batch delivery (e.g. a range of offsets) in any purely latency-centric environment. When an event arrives, delivery must be attempted immediately if you want the lowest latency. That doesn't mean waiting around a couple of seconds to batch-read a block of events, or eating the overhead of requesting every message individually. Try using Kafka with an offset range of 1 (i.e. one message per fetch) if you want to compare it to a normal push-based broker and you'll see what I mean.
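If you want to run that comparison yourself, a rough consumer sketch tuned for single-message delivery might look like the following (Java kafka-clients; the broker, group, topic, and exact timings are illustrative assumptions):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LowLatencyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "latency-test");            // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Latency-oriented settings: don't wait for data to accumulate.
        props.put("fetch.min.bytes", "1");   // broker replies as soon as any data exists
        props.put("fetch.max.wait.ms", "1"); // don't let the broker hold the fetch open
        props.put("max.poll.records", "1");  // hand back one record per poll

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    long approxLagMs = System.currentTimeMillis() - record.timestamp();
                    System.out.printf("offset=%d approx-lag=%dms%n", record.offset(), approxLagMs);
                }
            }
        }
    }
}
```

Even with these settings you are still polling, so every message pays a request/response round trip, which is exactly the overhead a push-based broker avoids.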

Instead, I recommend focusing on the thing pull-based stream buffering does give you:

  • Replayability!!!

Personally, I think this makes downstream data engineering systems a bit easier to build in the face of failure, particularly since you don't have to rely on their built-in replication models (if they even have any). For example, it's very easy for me to consume messages, lose the disks, restore the machine, and replay the lost data. The data streams become the single source of truth against which other systems can synchronize, and that is exceptionally useful!
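As a sketch of what that replay looks like (again Java kafka-clients; the topic, partition, and starting position are illustrative), a rebuilt consumer just seeks back to an earlier offset and re-reads, provided the data is still inside the topic's retention window:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayAfterFailure {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("enable.auto.commit", "false");          // we manage position explicitly
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("events", 0); // illustrative topic/partition

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));

            // Rewind to wherever the rebuilt system last had good data.
            // Here we replay from the beginning; seek(tp, lastGoodOffset) also works.
            consumer.seekToBeginning(Collections.singletonList(tp));

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Re-apply each event to rebuild the downstream state.
                    System.out.printf("replaying offset %d: %s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```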

There's no free lunch in messaging: pull and push each have their advantages and disadvantages relative to the other. It might not surprise you that people have also tried push-pull hybrids, and those are no free lunch either :).

Tags:

Apache Kafka