Is Apache Kafka appropriate for use as an unordered task queue?

Using Kafka for a task queue is a bad idea. Use RabbitMQ instead; it does the job much better and more elegantly.

Although you can use Kafka as a task queue, you will run into some issues. By design, Kafka does not allow a single partition to be consumed by more than one consumer in the same consumer group, so if a partition fills up with tasks while the consumer that owns it is busy, the tasks in that partition will starve. It also means that the tasks in a topic will not be consumed in the same order in which they were produced, which can cause serious problems if the tasks need to be consumed in a specific order. To fully guarantee ordering in Kafka you must have only one consumer and one partition, which means serial consumption by a single node. With multiple consumers and multiple partitions, the order of task consumption is not guaranteed at the topic level.
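To make the starvation and ordering point concrete, here is a toy simulation in plain Python. It does not use any Kafka client; the function names, the round-robin producer and the "tasks per tick" speeds are all assumptions made up for illustration.

```python
from collections import deque

def produce_round_robin(tasks, num_partitions):
    """Spread tasks over partitions in round-robin order (stand-in for a keyless producer)."""
    partitions = [deque() for _ in range(num_partitions)]
    for i, task in enumerate(tasks):
        partitions[i % num_partitions].append(task)
    return partitions

def consume(partitions, speeds):
    """Each partition is owned by exactly one consumer.
    speeds[i] is how many tasks consumer i finishes per tick (hypothetical unit).
    Returns the order in which tasks complete across the whole topic."""
    done = []
    while any(partitions):
        for part, speed in zip(partitions, speeds):
            for _ in range(speed):
                if part:
                    done.append(part.popleft())
    return done

tasks = list(range(8))
partitions = produce_round_robin(tasks, num_partitions=2)
# Consumer 0 is slow (1 task per tick), consumer 1 is fast (3 tasks per tick).
order = consume(partitions, speeds=[1, 3])
print(order)  # [0, 1, 3, 5, 2, 7, 4, 6]
```

Order is preserved inside each partition, but not across the topic, and tasks 4 and 6 keep waiting on the slow consumer even though the fast one has nothing left to do.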

In fact, Kafka topics are not queues in the computer science sense. A queue means first in, first out, and that is not what you get from Kafka at the topic level.

Another issue is that it is difficult to change the number of partitions dynamically, whereas adding or removing workers should be dynamic. To ensure that new workers will get tasks in Kafka, you have to set the number of partitions to the maximum possible number of workers. This is not elegant.
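A rough sketch of why the partition count caps the number of useful workers. The `assign_partitions` function below is a simplified, hypothetical stand-in for a consumer-group assignment, not Kafka's actual assignor; it only encodes the rule that each partition goes to exactly one worker.

```python
def assign_partitions(num_partitions, workers):
    """Give each partition to exactly one worker, round-robin.
    Simplified illustration, not Kafka's real assignment logic."""
    assignment = {w: [] for w in workers}
    for p in range(num_partitions):
        assignment[workers[p % len(workers)]].append(p)
    return assignment

small = assign_partitions(4, ["w1", "w2"])
large = assign_partitions(4, ["w1", "w2", "w3", "w4", "w5", "w6"])

# Workers beyond the partition count receive nothing to do.
idle = [w for w, parts in large.items() if not parts]
print(small)  # {'w1': [0, 2], 'w2': [1, 3]}
print(idle)   # ['w5', 'w6']
```

With 4 partitions, scaling from 2 to 6 workers leaves two of them idle, which is why the partition count has to be sized for the largest worker pool you expect.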

So the bottom line: use RabbitMQ or another queue instead.

Having said all of that, Samza (by LinkedIn) uses Kafka as a sort of streaming-based task queue.

Edit (scale considerations): I forgot to mention that Kafka is a big data / large-scale tool. If your job rate is huge, Kafka might be a good option for you despite what I wrote earlier, since dealing with huge scale is very challenging and Kafka is very good at it. At smaller scales (say, up to a few dozen or a few hundred jobs per second), Kafka is again a poor choice compared to RabbitMQ.

There is a lot of discussion on this topic revolving around the order of execution of tasks in a work or task queue. I would put forth the notion that order of execution should not be a feature of a work queue.

A work queue is a means of controlling resource usage by applying a controllable number of worker threads to the completion of distinct tasks. Enforcing a processing order on tasks in a queue means you are also enforcing a completion order on them, which effectively means that tasks would always be processed sequentially, with the next task starting only after the END of the preceding one. In other words, you have a single-threaded task queue.

If order of execution is important for some of your tasks, then each of those tasks should add the next task in the sequence to the work queue upon its own completion. Alternatively, support a Sequential Job type that, when processed, runs a list of jobs sequentially on one worker.
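One way to picture the first option, as a minimal sketch using Python's standard `queue` and `threading` modules. The `(name, next_task)` tuple format and the `run_pool` helper are hypothetical, chosen only to keep the example short.

```python
import queue
import threading

def run_pool(initial_tasks, num_workers=3):
    """Run tasks on a pool of worker threads. A task is a (name, next_task) pair;
    next_task is put on the queue only after the current task has finished."""
    work = queue.Queue()
    finished = []
    lock = threading.Lock()

    def worker():
        while True:
            task = work.get()
            if task is None:           # sentinel: shut this worker down
                work.task_done()
                return
            name, next_task = task
            with lock:
                finished.append(name)  # stand-in for actually processing the task
            if next_task is not None:
                work.put(next_task)    # the follow-up step enters the queue only now
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in initial_tasks:
        work.put(task)
    work.join()                        # waits for all tasks, including chained ones
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return finished

chain = ("step-1", ("step-2", ("step-3", None)))
finished = run_pool([chain, ("independent-a", None), ("independent-b", None)])
```

The queue itself never orders anything: the chained steps complete in sequence only because each one is enqueued by its predecessor, while the independent tasks are free to run on any worker at any time.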

In no way should the work queue itself order any of its work. The next available processor should always take the next task, with no regard to what happened before it or what happens after it completes.

I was also looking at Kafka as a basis for a work queue, but the more I research it, the less it looks like the right platform.

I see it mainly being used as a means of synchronizing disparate resources and not so much as a means of executing disparate job requests.

Another area that I think is important in a work queue is support for task prioritization. For example, if I have 20 tasks in the queue and a new task arrives with a higher priority, I want that task to jump to the front of the line to be picked up by the next available worker. Kafka would not allow this.
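For reference, this is roughly the behaviour being asked for, sketched with Python's standard `heapq` module. The `PriorityTaskQueue` class and the convention that a lower number means more urgent are assumptions made for the example.

```python
import heapq
import itertools

class PriorityTaskQueue:
    """In-memory priority queue; lower priority number = more urgent (assumed convention)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def put(self, task, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

q = PriorityTaskQueue()
for i in range(20):
    q.put(f"task-{i}")          # 20 ordinary tasks already waiting
q.put("urgent", priority=0)     # arrives last, but with higher priority

first = q.get()    # "urgent" jumps the line
second = q.get()   # "task-0", the oldest ordinary task
```

A Kafka partition is an append-only log read by offset, so there is no equivalent of inserting a record ahead of the ones already written.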

I would say that this depends on the scale. How many tasks do you anticipate in a unit of time?

What you describe as your end goal is basically how Kafka works by default. When you produce messages, the default (and most widely used) option is to let the partitioner choose the partition for messages without a key, which it does in a round-robin fashion, keeping the partitions evenly used (so you can avoid specifying a partition yourself).
The main purpose of partitions is to parallelize the processing of messages, so you should use them that way.
The other common use of partitions is to ensure that certain messages are consumed in the same order in which they were produced: you choose a partitioning key so that all such messages end up in the same partition. For example, using userId as the key ensures that all messages for a given user are processed in order.
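The two partitioning modes described above can be sketched in a few lines of Python. This is a toy model, not Kafka's partitioner: the `ToyPartitioner` class is hypothetical, and `zlib.crc32` is used here only as a convenient stable hash, which is an assumption and not the hash function Kafka clients use.

```python
import itertools
import zlib

class ToyPartitioner:
    """Round-robin for keyless messages, stable hash for keyed messages."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: spread messages evenly across partitions.
            return next(self._round_robin)
        # Same key always maps to the same partition, preserving per-key order.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

p = ToyPartitioner(num_partitions=3)
keyless = [p.partition() for _ in range(6)]            # [0, 1, 2, 0, 1, 2]
user_a = {p.partition("user-42") for _ in range(5)}    # a single partition
```

For an unordered task queue the keyless path is the relevant one; the keyed path matters only when tasks for the same entity (such as one userId) must be handled in order.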


