HAProxy graceful reload with zero packet loss
Solution 1:
According to https://github.com/aws/opsworks-cookbooks/pull/40 and the mailing-list post it refers to, http://www.mail-archive.com/[email protected]/msg06885.html, you can:
iptables -I INPUT -p tcp --dport $PORT --syn -j DROP
sleep 1
service haproxy restart
iptables -D INPUT -p tcp --dport $PORT --syn -j DROP
This drops incoming SYN packets for the duration of the restart, so clients retransmit their SYN until it reaches the new process.
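The four commands above leave the DROP rule in place if the restart step aborts the script. A minimal sketch of a wrapper that always removes the rule is shown here; the `run`, `syn_drop` and `restart_with_syn_drop` names, the `DRY_RUN` switch and the default port are assumptions made for illustration, not part of the linked pull request.

```shell
#!/bin/sh
# Sketch only: drop SYNs around a restart and always remove the rule again.
# PORT, DRY_RUN and all function names are illustrative assumptions.
PORT="${PORT:-80}"

# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# syn_drop ACTION: ACTION is -I (insert the rule) or -D (delete it).
syn_drop() {
    run iptables "$1" INPUT -p tcp --dport "$PORT" --syn -j DROP
}

restart_with_syn_drop() {
    syn_drop -I || return 1
    run sleep 1
    run service haproxy restart
    status=$?
    # Remove the rule whether or not the restart succeeded.
    syn_drop -D
    return "$status"
}
```

Calling `restart_with_syn_drop` with `DRY_RUN=1` set prints the commands instead of running them, which is a cheap way to review them before touching a live firewall.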
Solution 2:
Yelp shared a more sophisticated approach based on meticulous testing. The blog article is a deep dive, and well worth the time investment to fully appreciate it.
True Zero Downtime HAProxy Reloads
tl;dr: use Linux tc (traffic control) and iptables to temporarily queue SYN packets while HAProxy is reloading and has two PIDs attached to the same port (SO_REUSEPORT).
I'm not comfortable re-publishing the entire article on ServerFault; nevertheless, here are a few excerpts to pique your interest:
By delaying SYN packets coming into our HAProxy load balancers that run on each machine, we are able to minimally impact traffic during HAProxy reloads, which allows us to add, remove, and change service backends within our SOA without fear of significantly impacting user traffic.
# plug_manipulation.sh
nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --buffer
service haproxy reload
nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --release-indefinite
# setup_iptables.sh
iptables -t mangle -I OUTPUT -p tcp -s 169.254.255.254 --syn -j MARK --set-mark 1
# setup_qdisc.sh
## Set up the queuing discipline
tc qdisc add dev lo root handle 1: prio bands 4
tc qdisc add dev lo parent 1:1 handle 10: pfifo limit 1000
tc qdisc add dev lo parent 1:2 handle 20: pfifo limit 1000
tc qdisc add dev lo parent 1:3 handle 30: pfifo limit 1000
## Create a plug qdisc with 1 meg of buffer
nl-qdisc-add --dev=lo --parent=1:4 --id=40: plug --limit 1048576
## Release the plug
nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --release-indefinite
## Set up the filter, any packet marked with “1” will be
## directed to the plug
tc filter add dev lo protocol ip parent 1:0 prio 1 handle 1 fw classid 1:4
Gist: https://gist.github.com/jolynch/97e3505a1e92e35de2c0
Cheers to Yelp for sharing such amazing insights.
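One thing worth guarding against with the plug approach is a failed reload leaving SYN packets buffered. The following is a sketch of my own, not something from the Yelp article: it reuses the exact `nl-qdisc-add` arguments from the excerpt and releases the plug regardless of the reload result. The `run`, `plug` and `reload_behind_plug` names and the `DRY_RUN` switch are assumptions.

```shell
#!/bin/sh
# Sketch only: buffer SYNs, reload, and always release the plug afterwards.
# Function names and DRY_RUN are illustrative assumptions.

# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# plug FLAG: FLAG is --buffer or --release-indefinite, as in the excerpt.
plug() {
    run nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug "$1"
}

reload_behind_plug() {
    plug --buffer || return 1
    run service haproxy reload
    status=$?
    # Release the queued SYNs even if the reload failed.
    plug --release-indefinite
    return "$status"
}
```

This assumes the qdisc and iptables setup from the excerpts has already been applied.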
Solution 3:
There is another, much simpler way to reload HAProxy with true zero downtime: iptables flipping (the article is Unbounce's response to Yelp's solution). It is cleaner than the accepted answer because it does not need to drop any packets, which can cause problems with long reloads.
Briefly, the solution consists of the following steps:
- Run a pair of HAProxy instances: an active one that receives traffic and a standby one that receives none.
- Reconfigure (reload) the standby instance at any time.
- Once the standby is ready with the new config, divert all NEW connections to it, making it the new active. Unbounce provides a bash script that does the flip with a few simple iptables commands.
- For a moment you have two active instances. Wait until the connections still open on the old active instance have closed. How long that takes depends on your service's behaviour and keep-alive settings.
- Traffic to the old active instance stops and it becomes the new standby; you are back at step 1.
Moreover, the solution can be adapted to any kind of service (nginx, Apache, etc.) and is more fault tolerant, as you can test the standby configuration before it goes online.
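To give a rough idea of what the flip step can look like, here is a minimal sketch. It is not the Unbounce script: the public port 80, the instance ports 8081 and 8082, the use of a nat-table REDIRECT rule, and the `run`, `redirect_rule` and `flip_to` names are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: flip NEW connections between two local HAProxy instances.
# Ports and function names are illustrative assumptions.
PUBLIC_PORT="${PUBLIC_PORT:-80}"

# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# redirect_rule ACTION PORT: ACTION is -I or -D, PORT is an instance port.
redirect_rule() {
    run iptables -t nat "$1" PREROUTING -p tcp --dport "$PUBLIC_PORT" \
        -j REDIRECT --to-ports "$2"
}

# flip_to NEW_PORT OLD_PORT: insert the new rule first so there is never
# a moment without a redirect, then delete the old one.
flip_to() {
    redirect_rule -I "$1" || return 1
    redirect_rule -D "$2"
}
```

With 8081 currently active, `flip_to 8082 8081` would send new connections to the instance on 8082. The nat table is only consulted for the first packet of a connection, so established connections keep going to the old instance until they close.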
Solution 4:
Edit: My answer makes the assumption that the kernel only sends traffic to the most recent port to be opened with SO_REUSEPORT, whereas it actually sends traffic to all processes as described in one of the comments. In other words, the iptables dance is still required. :(
If you're on a kernel that supports SO_REUSEPORT, then this problem shouldn't happen.
The steps HAProxy takes when it restarts are:
1) Try setting SO_REUSEPORT when opening the port (https://github.com/haproxy/haproxy/blob/3cd0ae963e958d5d5fb838e120f1b0e9361a92f8/src/proto_tcp.c#L792-L798)
2) Try opening the port (will succeed with SO_REUSEPORT)
3) If it didn't succeed, signal the old process to close its port, wait 10ms and try it all again. (https://github.com/haproxy/haproxy/blob/3cd0ae963e958d5d5fb838e120f1b0e9361a92f8/src/haproxy.c#L1554-L1577)
SO_REUSEPORT was first supported in the Linux 3.9 kernel, but some distros have backported it. For example, EL6 kernels from 2.6.32-417.el6 support it.
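A quick way to check the 3.9 threshold from a script could look like the sketch below. The `kernel_at_least` name is made up for illustration, and because of backports such as the EL6 one, a "too old" result does not prove that SO_REUSEPORT is missing.

```shell
#!/bin/sh
# Sketch only: compare a kernel release string against MAJOR.MINOR.
# kernel_at_least is an illustrative helper, not an existing tool.

# kernel_at_least RELEASE MAJOR MINOR
kernel_at_least() {
    release="$1"
    major="${release%%.*}"      # text before the first dot
    rest="${release#*.}"        # text after the first dot
    minor="${rest%%[!0-9]*}"    # leading digits of the remainder
    [ "$major" -gt "$2" ] ||
        { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}
```

Usage would be something like `kernel_at_least "$(uname -r)" 3 9 && echo "mainline SO_REUSEPORT expected"`.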
Solution 5:
I'll explain my setup and how I solved the graceful reloads:
I have a typical setup with 2 nodes running HAProxy and keepalived. Keepalived tracks interface dummy0, so I can run "ifconfig dummy0 down" to force a switchover.
The real problem is that, for reasons I don't understand, a "haproxy reload" still drops all the ESTABLISHED connections :( I tried the "iptables flipping" proposed by gertas, but I ran into issues because it performs NAT on the destination IP address, which is not suitable in some scenarios.
Instead, I decided to use a CONNMARK dirty hack to mark packets belonging to NEW connections, and then redirect those marked packets to the other node.
Here's the iptables ruleset:
iptables -t mangle -A PREROUTING -i eth1 -d 123.123.123.123/32 -m conntrack --ctstate NEW -j CONNMARK --set-mark 1
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -i eth1 -p tcp --tcp-flags FIN FIN -j MARK --set-mark 2
iptables -t mangle -A PREROUTING -i eth1 -p tcp --tcp-flags RST RST -j MARK --set-mark 2
iptables -t mangle -A PREROUTING -i eth1 -m mark ! --mark 0 -j TEE --gateway 192.168.0.2
iptables -t mangle -A PREROUTING -i eth1 -m mark --mark 1 -j DROP
The first two rules mark the packets belonging to the new flows (123.123.123.123 is the keepalived VIP that the HAProxy frontends bind to).
The third and fourth rules mark FIN/RST packets. (I don't know why, but the TEE target "ignores" FIN/RST packets.)
The fifth rule sends a duplicate of all marked packets to the other HAProxy (192.168.0.2).
The sixth rule drops packets belonging to new flows so they do not reach their original destination.
Remember to disable rp_filter on the interfaces, or the kernel will drop those martian packets.
And last but not least, mind the returning packets! In my case there is asymmetric routing (requests go client -> haproxy1 -> haproxy2 -> webserver, and replies go webserver -> haproxy1 -> client), but it doesn't cause any problems. It works fine.
I know the most elegant solution would be to use iproute2 to do the divert, but it only worked for the first SYN packet. When it received the ACK (3rd packet of the 3-way handshake), it didn't mark it :( I couldn't spend much time investigating; as soon as I saw it working with the TEE target, I left it there. Of course, feel free to try it with iproute2.
Basically, the "graceful reload" works like this:
- I enable the iptables ruleset and immediately see the new connections going to the other HAproxy.
- I keep an eye on "netstat -an | grep ESTABLISHED | wc -l" to supervise the "draining" process.
- Once there are just a few (or zero) connections, I run "ifconfig dummy0 down" to force keepalived to fail over, so all traffic goes to the other HAProxy.
- I remove the iptables ruleset.
- (Only for a "non-preempting" keepalived config) I run "ifconfig dummy0 up".
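The "keep an eye on netstat" step can be scripted. The sketch below is one possible way to do it, assuming Linux `netstat -ant` column order (local address in column 4, state in column 6); the `count_established` and `wait_for_drain` names, the threshold argument and the 5-second poll interval are my own choices for illustration.

```shell
#!/bin/sh
# Sketch only: wait until few ESTABLISHED connections remain on the VIP.
# Function names, threshold and poll interval are illustrative assumptions.

# count_established VIP: reads `netstat -ant` style lines on stdin and
# prints how many ESTABLISHED connections have VIP as their local address.
count_established() {
    awk -v vip="$1" '
        $6 == "ESTABLISHED" && index($4, vip ":") == 1 { n++ }
        END { print n + 0 }'
}

# wait_for_drain VIP THRESHOLD: poll until the count is at or below THRESHOLD.
wait_for_drain() {
    while [ "$(netstat -ant | count_established "$1")" -gt "$2" ]; do
        sleep 5
    done
}
```

For example, `wait_for_drain 123.123.123.123 0 && ifconfig dummy0 down` would trigger the failover only after the node has fully drained. Unlike the plain grep, this counts only connections on the VIP, so SSH sessions and the like do not hold up the drain.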
The iptables ruleset can be easily integrated into a start/stop script:
#!/bin/sh
case "$1" in
    start)
        echo "Redirection for new sessions is enabled"
        # echo 0 > /proc/sys/net/ipv4/tcp_fwmark_accept
        # Disable reverse-path filtering so the kernel does not drop the diverted packets
        for f in /proc/sys/net/ipv4/conf/*/rp_filter; do echo 0 > "$f"; done
        iptables -t mangle -A PREROUTING -i eth1 ! -d 123.123.123.123 -m conntrack --ctstate NEW -j CONNMARK --set-mark 1
        iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
        iptables -t mangle -A PREROUTING -i eth1 -p tcp --tcp-flags FIN FIN -j MARK --set-mark 2
        iptables -t mangle -A PREROUTING -i eth1 -p tcp --tcp-flags RST RST -j MARK --set-mark 2
        iptables -t mangle -A PREROUTING -i eth1 -m mark ! --mark 0 -j TEE --gateway 192.168.0.2
        iptables -t mangle -A PREROUTING -i eth1 -m mark --mark 1 -j DROP
        ;;
    stop)
        # Delete the rules in the reverse order they were added
        iptables -t mangle -D PREROUTING -i eth1 -m mark --mark 1 -j DROP
        iptables -t mangle -D PREROUTING -i eth1 -m mark ! --mark 0 -j TEE --gateway 192.168.0.2
        iptables -t mangle -D PREROUTING -i eth1 -p tcp --tcp-flags RST RST -j MARK --set-mark 2
        iptables -t mangle -D PREROUTING -i eth1 -p tcp --tcp-flags FIN FIN -j MARK --set-mark 2
        iptables -t mangle -D PREROUTING -j CONNMARK --restore-mark
        iptables -t mangle -D PREROUTING -i eth1 ! -d 123.123.123.123 -m conntrack --ctstate NEW -j CONNMARK --set-mark 1
        echo "Redirection for new sessions is disabled"
        ;;
    *)
        echo "Usage: $0 {start|stop}" >&2
        exit 1
        ;;
esac