Guidelines for handling TimeoutException in a Kafka producer?
"What are the general causes of these Timeout exceptions?"
The most common cause I have seen is stale metadata: one broker went down, and the topic partitions on that broker were failed over to other brokers, but the client's metadata was not updated properly, so it still tries to talk to the failed broker, either to fetch metadata or to publish a message. That causes the timeout exception.
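If stale metadata is the suspect, one knob worth checking is metadata.max.age.ms, which controls how often the producer proactively refreshes its metadata even without errors (the default is 300000, i.e. 5 minutes). A minimal sketch of a producer config that refreshes more aggressively; the value here is illustrative, not a recommendation:

```properties
# Refresh metadata at least every 60 s instead of the default 5 min
# (illustrative value; tune for your environment)
metadata.max.age.ms=60000
```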
Network connectivity issues. These can be easily diagnosed with
telnet broker_host broker_port
The broker is overloaded. This can happen if the broker is saturated with a high workload or hosts too many topic partitions.
To handle the timeout exceptions, the general practice is:
Rule out broker-side issues: make sure the topic partitions are fully replicated and the brokers are not overloaded
Fix host name resolution or network connectivity issues if there are any
Tune parameters such as request.timeout.ms and delivery.timeout.ms. In my past experience, the default values work fine in most cases.
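For reference, a sketch of how the timeout-related producer settings relate to each other; the values shown are the documented defaults in recent Kafka client versions, so changing them only makes sense once broker-side and network issues are ruled out:

```properties
# How long the producer waits for the response to a single request (default 30 s)
request.timeout.ms=30000
# Upper bound on the total time to report success or failure for a send(),
# including retries and time spent queued in the buffer (default 2 min)
delivery.timeout.ms=120000
# Backoff between retry attempts (default 100 ms)
retry.backoff.ms=100
```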
The default Kafka config values, both for producers and brokers, are conservative enough that, under general circumstances, you shouldn't run into any timeouts. Those problems typically point to a flaky/lossy network between the producer and the brokers.
The exception you're getting, Failed to update metadata, usually means one of the brokers is not reachable by the producer, so it cannot fetch the metadata.
For your second question, Kafka will automatically retry to send messages that were not fully ack'ed by the brokers. It's up to you if you want to catch and retry when you get a timeout on the application side, but if you're hitting 1+ min timeouts, retrying is probably not going to make much of a difference. You're going to have to figure out the underlying network/reachability problems with the brokers anyway.
In my experience, usually the network problems are:
- Port 9092 is blocked by a firewall, either on the producer side, on the broker side, or somewhere in the middle (try
nc -z broker-ip 9092
from the server running the producer)
- DNS resolution is broken, so even though the port is open, the producer cannot resolve the hostname to an IP address.
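Both checks can be scripted; a minimal sketch, assuming a Unix-like host with getent and bash available (BROKER_HOST and BROKER_PORT are placeholders, substitute your broker's address):

```shell
# Placeholder broker address; substitute your own
BROKER_HOST=localhost
BROKER_PORT=9092

# Check DNS resolution (prints the resolved address, or reports failure)
getent hosts "$BROKER_HOST" || echo "DNS resolution failed for $BROKER_HOST"

# Check TCP reachability, equivalent to nc -z, using bash's built-in /dev/tcp
if timeout 2 bash -c "exec 3<>/dev/tcp/$BROKER_HOST/$BROKER_PORT" 2>/dev/null; then
  echo "port $BROKER_PORT reachable on $BROKER_HOST"
else
  echo "port $BROKER_PORT blocked or closed on $BROKER_HOST"
fi
```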
I suggest using the following properties when constructing the producer config:
Need acks from the partition leader
kafka.acks=1
Maximum number of retries the Kafka producer will make to send the message and receive acks from the leader
kafka.retries=3
Request timeout for each individual request
timeout.ms=200
Wait before sending the next request again; this avoids sending requests in a tight loop
retry.backoff.ms=50
Upper bound to finish all the retries
dataLogger.kafka.delivery.timeout.ms=1200
producer.send(record, new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    // e is null on success, non-null on failure
    if (e == null) {
      logger.debug(s"KafkaLogger: message sent $record to topic ${recordMetadata.topic()}, partition ${recordMetadata.partition()}, offset ${recordMetadata.offset()}")
    } else {
      logger.error(s"Exception while sending message $record to error topic: $e")
    }
  }
})
Close the producer with a timeout:
producer.close(1000, TimeUnit.MILLISECONDS)