How to decide Kafka Cluster size
I had recently worked with kafka and these are my observations.
Each topic is divided into partitions and all the partitions of a topic are distributed across kafka brokers; first of all these help to save topics whose size is larger than the capacity of a single kafka broker and also they increase the consumer parallelism.
To increase the reliability and fault tolerance,replications of the partitions are made and they do not increase the consumer parallelism.The thumb rule is a single broker can host only a single replica per partition. Hence Number of brokers must be >= No of replicas
All partitions are spread across all the available brokers,number of partitions can be irrespective of number of brokers but number of partitions must be equal to the number of consumer threads in a consumer group(to get best throughput)
The cluster size should be decided keeping in mind the throughput you want to achieve at consumer.
As I understand, getting good throughput from Kafka doesn't depend only on the cluster size; there are others configurations which need to be considered as well. I will try to share as much as I can.
Kafka's throughput is supposed to be linearly scalabale with the numbers of disk you have. The new multiple data directories feature introduced in Kafka 0.8 allows Kafka's topics to have different partitions on different machines. As the partition number increases greatly, so do the chances that the leader election process will be slower, also effecting consumer rebalancing. This is something to consider, and could be a bottleneck.
Another key thing could be the disk flush rate. As Kafka always immediately writes all data to the filesystem, the more often data is flushed to disk, the more "seek-bound" Kafka will be, and the lower the throughput. Again a very low flush rate might lead to different problems, as in that case the amount of data to be flushed will be large. So providing an exact figure is not very practical and I think that is the reason you couldn't find such direct answer in the Kafka documentation.
There will be other factors too. For example the consumer's fetch
size, compressions, batch size for asynchronous producers, socket buffer sizes etc.
Hardware & OS will also play a key role in this as using Kafka in a Linux based environment is advisable due to its pageCache mechanism for writing data to the disk. Read more on this here
You might also want to take a look at how OS flush behavior play a key role into consideration before you actually tune it to fit your needs. I believe it is key to understand the design philosophy, which makes it so effective in terms of throughput and fault-tolerance.
Some more resource I find useful to dig in
- https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
- http://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/
- https://grey-boundary.io/load-testing-apache-kafka-on-aws/
- https://cwiki.apache.org/confluence/display/KAFKA/Performance+testing