Why is random jitter applied to back-off strategies?
Randomization keeps the retries from several clients from all happening at the same time.
More information on Exponential Backoff And Jitter can be found here: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Suppose you have multiple clients that send messages that collide. They all decide to back off. If they all use the same deterministic algorithm to decide how long to wait, they will all retry at the same moment, resulting in another collision. Adding a random factor spreads the retries apart.
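As a rough sketch of the difference (function names and the base/cap values here are just illustrative, not from any particular library), compare a purely deterministic delay with a "full jitter" delay:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Deterministic exponential backoff: every client waits the same amount."""
    return min(cap, base * 2 ** attempt)

def jittered_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: pick a random delay in [0, deterministic delay)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Two clients that collided on attempt 3:
# deterministic -> both wait exactly 8.0 s and collide again;
# jittered      -> e.g. 2.7 s and 6.1 s, so the retries no longer coincide.
print(backoff_delay(3), backoff_delay(3))    # 8.0 8.0
print(jittered_delay(3), jittered_delay(3))  # two different values
```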
It smooths traffic on the resource being requested.
If your request fails at a particular time, there's a good chance other requests are failing at almost exactly the same time. If all of these requests follow the same deterministic back-off strategy (say, retrying after 1, 2, 4, 8, 16... seconds), then everyone who failed the first time will retry at almost exactly the same time, and there's a good chance there will be more simultaneous requests than the service can handle, resulting in more failures. This same cluster of simultaneous requests can recur repeatedly, and likely fail repeatedly, even if the overall level of load on the service outside of those retry spikes is small.
By introducing jitter, the initial group of failing requests may be clustered in a very small window, say 100ms, but with each retry cycle the cluster spreads into a larger and larger time window, reducing the size of the spike at any given time. The service is likely to be able to handle the requests once they are spread over a sufficiently large window.
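A minimal retry loop along these lines might look like the following (a sketch only; the function name, exception handling, and limits are assumptions rather than a specific library's API). The cap grows 1, 2, 4, 8, 16 seconds as above, but the actual sleep is drawn uniformly from that range, so each retry cycle spreads the clients across a wider window:

```python
import random
import time

def call_with_retries(request, max_attempts=5, base=1.0, cap=16.0):
    """Retry a failing callable with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Deterministic ceiling grows 1, 2, 4, 8, ... seconds; the real
            # wait is a random point below it, so simultaneous failures do
            # not all come back at the same instant.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```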