Effective strategy to avoid duplicate messages in apache kafka consumer
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even in order to attempt to prevent duplicates you would need to use the simple consumer. How this approach works is for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as unit application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be more well spent and your application will be much more reliable if you simply design it to be idempotent.
This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
- Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
- Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
- Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
- The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe's deduplicate on the consumer side. And we use Redis to deduplicate Kafka message.
Assume the Message class has a member called 'uniqId', which is filled by the producer side and is guaranteed to be unique. We use a 12 length random string. (regexp is '^[A-Za-z0-9]{12}$'
)
The consumer side use Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code did detect duplicate messages several times when Kafka(version 0.8.x) had situations. With our input/output balance audit log, no message lost or dup happened.
There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once