Improving TCP performance over a gigabit network with lots of connections and high traffic of small packets
Solution 1:
The problem might be that you are getting too many interrupts on your network card. If bandwidth is not the problem, interrupt frequency is the problem:
Turn up send/receive buffers on the network card
ethtool -g eth0
Will show you the current settings (256 or 512 entries). You can probably raise these to 1024, 2048 or 3172; going higher probably does not make sense. This is just a ring buffer that only fills up if the server is not able to process incoming packets fast enough.
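To raise them, a minimal sketch (assuming the interface is eth0 and the driver's pre-set maximums, shown by ethtool -g, allow these values):
ethtool -G eth0 rx 2048 tx 2048
ethtool -g eth0
The second command simply re-checks that the new ring sizes took effect.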
If the buffer starts to fill, flow control is an additional means to tell the router or switch to slow down:
Turn on flow control in/outbound on the server and the switch/router-ports it is attached to.
ethtool -a eth0
Will probably show:
Pause parameters for eth0:
Autonegotiate: on
RX: on
TX: on
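To enable pause frames on the server side, a sketch assuming the interface is eth0 and the driver supports changing these settings:
ethtool -A eth0 autoneg on rx on tx on
ethtool -a eth0
The switch/router side has to be configured as well, otherwise the setting has no effect.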
Check /var/log/messages for the current setting of eth0. Look for something like:
eth0: Link is up at 1000 Mbps, full duplex, flow control tx and rx
If you don't see tx and rx, your network admins have to adjust the values on the switch/router. On Cisco that is receive/transmit flow control set to on.
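For illustration only, a rough sketch of what that might look like on a Cisco IOS switch port (the interface name is hypothetical, and on many Catalyst models only the receive direction is configurable, so check your platform's documentation):
interface GigabitEthernet0/1
 flowcontrol receive on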
Beware: changing these values will bring your link down and up for a very short time (less than 1 s).
If all this does not help, you can also lower the speed of the network card to 100 Mbit (do the same on the switch/router ports):
ethtool -s eth0 autoneg off && ethtool -s eth0 speed 100
But in your case I would say: raise the receive buffers in the NIC ring buffer.
Solution 2:
The following might not be the definitive answer, but it will definitely put forth some ideas.
Try adding these to /etc/sysctl.conf:
## tcp selective acknowledgements.
net.ipv4.tcp_sack = 1
## enable window scaling
net.ipv4.tcp_window_scaling = 1
## do not cache metrics from closed connections
net.ipv4.tcp_no_metrics_save = 1
Selective TCP acknowledgements are good for optimal performance on high-bandwidth networks, but beware of other drawbacks. The benefits of window scaling are described here. As for the third sysctl option: by default, TCP saves various connection metrics in the route cache when the connection closes, so that connections established in the near future can use these to set initial conditions. Usually this increases overall performance, but it may sometimes cause performance degradation. If set, TCP will not cache metrics on closing connections.
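A minimal sketch of applying and verifying the settings (assuming they were added to /etc/sysctl.conf):
sysctl -p /etc/sysctl.conf
sysctl net.ipv4.tcp_sack net.ipv4.tcp_window_scaling net.ipv4.tcp_no_metrics_save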
Check with
ethtool -k ethX
to see if offloading is enabled or not. TCP checksum offload and large segment offload are supported by the majority of today's Ethernet NICs, and apparently Broadcom also supports them.
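A sketch of toggling the common offloads (assuming the interface is eth0; the exact feature names, and whether they can be changed at all, depend on the driver):
ethtool -K eth0 tso on gso on gro on
ethtool -k eth0 | grep -i offload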
Try using the tool
powertop
while the network is idle and when network saturation is reached. This will definitely show whether NIC interrupts are the culprit. Device polling is an answer to such a situation. FreeBSD supports a polling switch right inside ifconfig, but Linux has no such option. Consult this to enable polling. It says Broadcom also supports polling, which is good news for you.
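To see whether the NIC interrupt rate really is the problem, a simple check (assuming the interface shows up as eth0 in /proc/interrupts) is to watch the interrupt counters grow while the box is under load:
grep eth0 /proc/interrupts
watch -n 1 'grep eth0 /proc/interrupts'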
The jumbo packet tweak might not cut it for you since you mentioned your traffic consists mostly of small packets. But hey, try it out anyway!
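If you do want to try it, a sketch (assuming the interface is eth0 and that every device on the path, including the switch ports, is configured for jumbo frames):
ip link set dev eth0 mtu 9000
ip link show eth0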
Solution 3:
I noticed in the list of tweaks that timestamps are turned off; please do not do that. Turning them off is an old throwback to the days of yore when bandwidth was really expensive and people wanted to save a few bytes per packet. These days the TCP stack uses timestamps, for example, to tell whether a packet arriving for a socket in "CLOSE_WAIT" is an old packet for the connection or a new packet for a new connection, and they help in RTT calculations. Saving the few bytes for a timestamp is NOTHING compared to what IPv6 addresses are going to add. Turning off timestamps does more harm than good.
This recommendation for turning off timestamps is just a throwback that keeps getting passed from one generation of sysadmin to the next, sort of an "urban legend".
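To check whether timestamps are currently enabled and to turn them back on at runtime, a quick sketch:
sysctl net.ipv4.tcp_timestamps
sysctl -w net.ipv4.tcp_timestamps=1
Also remove any net.ipv4.tcp_timestamps = 0 line from /etc/sysctl.conf so the setting survives a reboot.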
Solution 4:
You need to distribute the load across all CPU cores. Start 'irqbalance'.
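A minimal sketch, assuming a distribution where irqbalance ships as a systemd service:
systemctl enable --now irqbalance
grep eth0 /proc/interrupts
The second command lets you check how the NIC's interrupts are distributed across the CPU columns.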
Solution 5:
In my case only a single tuning:
net.ipv4.tcp_timestamps = 0
made a very big and useful change; site loading time decreased by 50%.
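For reference, a sketch of applying that single setting at runtime and persisting it (Solution 3 above argues against disabling timestamps, so measure the effect in your own environment):
sysctl -w net.ipv4.tcp_timestamps=0
echo "net.ipv4.tcp_timestamps = 0" >> /etc/sysctl.conf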