How to tune TCP for high-frequency connections between two nodes
In our data center we have an F5 running on BigIP hardware that acts as the single ingress point for HTTPS requests from client machines in our various office locations across the country.
If this single front-end point stays single when it passes connections down to the back end, why are you surprised by the hiccups? Especially when the connection rate is "possibly 100+ per second".
Your setup basically squeezes one set with high cardinality (the client connections) into another with significantly lower cardinality (the connections from the single front end to the back end).
ultimately only reduce the chance of those "collisions"
This is fundamental to how packet-switched networks work. At the Ethernet level, for example, there are collisions too. Randomness is inevitable, and TCP/IP is built to deal with it. IP itself was actually not designed with LANs in mind (though it works great there too).
So yes, "Add more source IPs and/or make Traefik listen on multiple ports" is a pretty reasonable way to go.
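To put rough numbers on why that suggestion helps, here is a minimal sketch. It assumes the "collisions" are reuses of the same (source IP, source port, destination IP, destination port) combination. The port range is the Linux default ephemeral range, used purely for illustration; the range the F5 actually uses may differ, and the function names are made up for this example.

```python
def tuple_space(source_ips: int, source_ports: int, dest_ips: int, dest_ports: int) -> int:
    """Count the distinct (src IP, src port, dst IP, dst port) combinations."""
    return source_ips * source_ports * dest_ips * dest_ports


def seconds_until_reuse(space: int, connections_per_second: float) -> float:
    """Best case: time until a combination must repeat, if each is used once in turn."""
    return space / connections_per_second


# Assumption: Linux default ephemeral port range 32768-60999 (illustrative only).
EPHEMERAL_PORTS = 60999 - 32768 + 1

# One source IP talking to one Traefik IP on one port.
single = tuple_space(1, EPHEMERAL_PORTS, 1, 1)

# Two source IPs, and Traefik listening on two ports.
widened = tuple_space(2, EPHEMERAL_PORTS, 1, 2)

if __name__ == "__main__":
    for label, space in (("single", single), ("widened", widened)):
        print(label, space, "combinations,",
              round(seconds_until_reuse(space, 100)), "s until reuse at 100 conn/s")
```

Doubling the source IPs and the listening ports quadruples the space, so at the same connection rate each combination comes around four times less often. As the quoted line says, that only reduces the chance of a collision; it does not eliminate it.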
Although I also think adding more IP addresses is the simplest way forward, have you considered exploring reusing TCP connections between the F5 and the Traefik nodes instead of creating a new one per external request?
I'm not sure how the F5 supports that, but maybe it's as simple as switching to HTTP/2 between the F5 and the Traefik nodes. See https://developers.google.com/web/fundamentals/performance/http2#one_connection_per_origin
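To illustrate what connection reuse buys you, here is a small self-contained sketch using plain HTTP/1.1 keep-alive from Python's standard library. It is not F5 or Traefik configuration. The local test server and the names RecordingHandler, fetch_reusing_connection and demo are made up for this example. The point is only that several requests can share one TCP connection, and therefore one source port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Toy back end that records the client source port of every request."""
    protocol_version = "HTTP/1.1"  # needed for keep-alive in http.server
    seen_client_ports = []

    def do_GET(self):
        RecordingHandler.seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_reusing_connection(host, port, paths):
    """Send one GET per path over a single TCP connection; return the status codes."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses


def demo():
    """Run three requests against a throwaway local server."""
    RecordingHandler.seen_client_ports.clear()
    server = ThreadingHTTPServer(("127.0.0.1", 0), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        statuses = fetch_reusing_connection(
            "127.0.0.1", server.server_port, ["/a", "/b", "/c"])
    finally:
        server.shutdown()
        server.server_close()
    return statuses, set(RecordingHandler.seen_client_ports)


if __name__ == "__main__":
    statuses, ports = demo()
    print(statuses, "distinct client ports:", len(ports))
```

All three requests arrive from the same client port, so only one TCP handshake (one SYN) was needed. With a new connection per request, each request would need its own handshake and source port.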
It turns out there was a very simple solution to this problem after all, which we figured out after working with the Traefik vendor for a while. It also turns out that the fact that we are running Traefik in Docker does matter. The problem and solution are very specific to our setup, but I still want to document it here in case others encounter the same thing. Nevertheless, this does not invalidate the other, more general recommendations, as collisions of instance IDs are a real problem.
Long story short: All Traefik instances are configured as host-constrained containers (i.e. tied to specific hosts) running in a Docker Swarm cluster. Traefik instances need to expose a port at host level so that they become reachable from the F5, which obviously is not a Docker Swarm participant. Those exposed ports had been configured in ingress mode, which was not only unnecessary (no need to route traffic through the Docker Swarm ingress network) but was also the cause of the dropped/ignored SYN packets. Once we switched the port mode to host, the delays disappeared.
Before:
    ports:
      - target: 8080
        published: 8080
        protocol: tcp
        mode: ingress
After:
    ports:
      - target: 8080
        published: 8080
        protocol: tcp
        mode: host
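If you want to check for this kind of delay yourself, a rough sketch is to time plain TCP connects against the published port from a machine on the same network path. A dropped SYN usually shows up as a connect that takes about a second or more, because the client has to retransmit it. The 0.9 second threshold assumes an initial SYN retransmission timeout of roughly 1 second (the Linux default), and the host name in the usage comment is a placeholder. Both are assumptions for illustration, not values from our setup.

```python
import socket
import time


def connect_time(host, port, timeout=5.0):
    """Return the seconds taken to complete one TCP handshake with host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connected; close immediately
    return time.monotonic() - start


def slow_connects(samples, threshold=0.9):
    """Return the samples at or above the threshold (assumed ~1 s SYN retransmit)."""
    return [s for s in samples if s >= threshold]


# Hypothetical usage (replace the placeholder host with one of your Traefik hosts):
#   samples = [connect_time("traefik-host.example", 8080) for _ in range(1000)]
#   print(len(slow_connects(samples)), "of", len(samples), "connects were slow")
```

This only measures handshake time from wherever the script runs, so it complements a packet capture on the Docker host rather than replacing it.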