Tuesday, December 4, 2012

OpenVZ latest kernels screw networking?

We've been puzzled recently by the overall performance of our networking. We run a Mellanox fabric of four SX1016 switches, which are 64-port 10G switches. Three of them are trunked to the fourth with eight ports each, meaning that within a single switch the remaining 56 ports can push up to 560 Gbit/s, while between switches you're limited to roughly 80 Gbit/s over the trunk. The way our environment is distributed, we should get the best of both worlds.

In practice, however, we see most nodes' traffic in the 0.5-1.5 Gbit/s range, which is really odd. We had been suspecting the switches for a long while and have about four different Mellanox tickets open; both the switch and 10G card firmwares now running in production were largely created by Mellanox because of our issues.

But today, as Ilja was debugging another issue at one of his 1G deployments, he noticed a weird network performance drop. Even with basic tests he couldn't get 1G line speed, not even close. The Bugzilla ticket in question: http://bugzilla.openvz.org/show_bug.cgi?id=2454

He tried repeating the test here in our datacenter on some spare older nodes with 1G networking and was able to reproduce the issue, which disappeared with a kernel downgrade. He also tested speeds between 1G and 10G nodes and was getting really bad results. So next up we planned to test it inside the 10G fabric. I ran a test between two 10G nodes and, no matter how I tried, I was hard pressed to see more than half a Gbit/s. I then decided to run direct HN <-> VZ container tests, as those have been shown to run without overhead, so on 10G we should easily be able to get 9+ Gbit/s.
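The test itself is nothing fancy; it boils down to plain iperf between the hardware node and the container, roughly like this (a sketch with placeholder addresses; the real runs and their options are in the output below):

# inside the container: start an iperf server
iperf -s -w 256k

# on the hardware node: run the client against the container's address
iperf -w 256k -c <container-ip> -i 5 -t 30

# and the reverse direction (container as client), the case where Ilja saw the worst numbers
iperf -w 256k -c <hn-ip> -i 5 -t 30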

Well that's a nice thought, but this is the actual result:



[root@wn-v-5492 ~]# iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  512 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 10.10.23.164 port 34441 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec   235 MBytes   394 Mbits/sec
[  3]  5.0-10.0 sec   910 MBytes  1.53 Gbits/sec
[  3] 10.0-15.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 15.0-20.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 20.0-25.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 25.0-30.0 sec  1.05 GBytes  1.80 Gbits/sec
[  3]  0.0-30.0 sec  5.29 GBytes  1.52 Gbits/sec
[root@wn-v-5492 ~]# 

The interesting thing to notice here is, firstly, that it takes a moment for the speed to pick up, but we can live with that. More importantly, it is then pretty much hard capped at 1.8 Gbit/s. The node was doing nothing else at the time, so it wasn't resource constrained.

Another interesting thing to note: if you go back to Ilja's Bugzilla post, which was written before I even started testing, there's a telling quote:

"Another symptom - reversing the iperf testing direction (using VM as iperf client, and remote physical server as iperf server) results in even worse results, which are quite consistent: ~180Mbps"


As he was testing on a 1 Gbit/s network, those ~180 Mbit/s amount to 18% of line rate. We had just tested the exact same thing on completely different hardware and got 18% of line rate as well (1.8 out of 10 Gbit/s). That's too consistent to be a coincidence, so we started to look into kernels. The one we were running at the time was 2.6.32-042stab059.7, which is a rebase of 2.6.32-279.1.1.el6. We downgraded to 2.6.32-042stab053.5, which is a rebase of 2.6.32-220.7.1.el6. Rerunning the test:


iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  512 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 10.10.23.164 port 33264 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  1.24 GBytes  2.13 Gbits/sec
[  3]  5.0-10.0 sec  1.39 GBytes  2.39 Gbits/sec
[  3] 10.0-15.0 sec  1.42 GBytes  2.43 Gbits/sec
[  3] 15.0-20.0 sec  1.41 GBytes  2.42 Gbits/sec
[  3] 20.0-25.0 sec  1.42 GBytes  2.44 Gbits/sec
[  3] 25.0-30.0 sec  1.42 GBytes  2.44 Gbits/sec
[  3]  0.0-30.0 sec  8.30 GBytes  2.38 Gbits/sec

So, an improvement, but not quite line speed yet. Next up we rebooted two of the nodes into vanilla CentOS 6.3 live CDs to test stock kernels and see whether that gets the speed up. Our local RTT is about 0.05 ms, and a simple calculation shows we only need a ~64 KB TCP window to reach 10G, so in theory no tuning is needed for the vanilla kernel:

Window = BW * RTT = 10 Gbit/s * 0.05 ms = 1.25 GB/s * 0.05 ms = 1.28 MB/ms * 0.05 ms = 0.064 MB ≈ 65.5 KB.

The vanilla kernel usually ships with tcp_rmem and tcp_wmem around 4K / 16K / 4M, so the 16K default window would indeed give about 2.4 Gbit/s, but setting the window size larger should give the full 10G. However, at least our first tests with the vanilla kernel came up with nothing promising: we couldn't get past about 2.5 Gbit/s, and even with multiple parallel streams we were at best hitting 4.5 Gbit/s (the kind of tuning and multi-stream run we tried is sketched below). I'll continue the post when I continue the debugging tomorrow...
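For reference, the tuning and multi-stream testing amounts to roughly the following (a sketch only: the peer address matches the runs above, but the 16 MB buffer maximums are just example values, not a recommendation):

# check the current autotuning limits (min / default / max, in bytes)
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# temporarily raise the maximums so a large -w actually takes effect
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# single stream with a bigger window, then four parallel streams
iperf -w 4M -c 192.168.2.98 -i 5 -t 30
iperf -w 4M -c 192.168.2.98 -i 5 -t 30 -P 4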

1 comment:

  1. To get a bit of background on IRQ and SMP affinity, I suggest reading http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux plus the links at the end of that entry.
