Wednesday, December 5, 2012

Network debugging for 10Gbit/s

So today we decided to take the network testing seriously. We connected two servers to the generic infrastructure over 1G (to be able to pull software from the internet) and ran a direct cable between them for the 10G link. This way we can verify, purely at the OS + driver + NIC firmware level, that we can actually get to 10 Gbit/s, and from there expand the test to include a switch.

To make sure we didn't have any stale software we reinstalled both nodes with CentOS 6.3, with no kernel tunings. We also compared the sysctl -a output of the two servers, and though there were minor differences, none of them should be relevant.
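For the record, the comparison was nothing fancy, roughly along these lines (assuming root ssh access from one node to the other):

sysctl -a | sort > /tmp/sysctl.d-01
ssh wn-d-98 'sysctl -a | sort' > /tmp/sysctl.d-98
diff /tmp/sysctl.d-01 /tmp/sysctl.d-98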

We then launched a few baseline tests. First of all, we ran iperf locally inside each server (client and server on the same host) to see how much the machine itself can handle. With a 256 KB window size both did at least 17 Gbit/s.
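The receiving end in each of these tests is simply an iperf server, started with something along the lines of:

iperf -s -w 256k

The local run on wn-d-01: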


[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1.77 GBytes  15.2 Gbits/sec
[  3]  1.0- 2.0 sec  2.37 GBytes  20.4 Gbits/sec
[  3]  2.0- 3.0 sec  2.42 GBytes  20.8 Gbits/sec
[  3]  3.0- 4.0 sec  2.40 GBytes  20.6 Gbits/sec
[  3]  4.0- 5.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  5.0- 6.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  6.0- 7.0 sec  2.39 GBytes  20.6 Gbits/sec
[  3]  7.0- 8.0 sec  2.38 GBytes  20.5 Gbits/sec
[  3]  8.0- 9.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  9.0-10.0 sec  2.38 GBytes  20.5 Gbits/sec
[  3]  0.0-10.0 sec  23.3 GBytes  20.0 Gbits/sec
[root@wn-d-01 ~]#

Then, running the test between the two machines over the 1 Gbit/s link first, to check that there is no generic foul play at work cutting speed to 25% or 33%, we nicely saw 942 Mbit/s:


[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   112 MBytes   943 Mbits/sec
[  3]  1.0- 2.0 sec   112 MBytes   942 Mbits/sec
[  3]  2.0- 3.0 sec   112 MBytes   942 Mbits/sec
[  3]  3.0- 4.0 sec   112 MBytes   942 Mbits/sec
[  3]  4.0- 5.0 sec   112 MBytes   942 Mbits/sec
[  3]  5.0- 6.0 sec   112 MBytes   941 Mbits/sec
[  3]  6.0- 7.0 sec   112 MBytes   942 Mbits/sec
[  3]  7.0- 8.0 sec   112 MBytes   942 Mbits/sec
[  3]  8.0- 9.0 sec   112 MBytes   942 Mbits/sec
[  3]  9.0-10.0 sec   112 MBytes   942 Mbits/sec
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec
[root@wn-d-01 ~]# 

So now we fired up the default in-kernel driver for our Mellanox 10G card.
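Which driver and firmware versions are actually in use can be confirmed with ethtool (eth2 here and throughout being the 10G interface on our boxes):

ethtool -i eth2

The first test of the new link was a plain ping: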


[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms

So an RTT of 0.2 ms means that a 256 KB window should be more than enough for a single TCP stream to fill a 10 Gbit/s link.
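The back-of-the-envelope rule is throughput ≤ window / RTT, so:

256 KB / 0.2 ms = 262144 * 8 bits / 0.0002 s ≈ 10.5 Gbit/s

In other words the window is not what should limit us here; the link itself is. So let's see if that actually works (remember, it's a direct attached cable, nothing else running on the servers):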


[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   290 MBytes  2.43 Gbits/sec
[  3]  1.0- 2.0 sec   296 MBytes  2.49 Gbits/sec
[  3]  2.0- 3.0 sec   191 MBytes  1.61 Gbits/sec
[  3]  3.0- 4.0 sec   320 MBytes  2.68 Gbits/sec
[  3]  4.0- 5.0 sec   232 MBytes  1.95 Gbits/sec
[  3]  5.0- 6.0 sec   161 MBytes  1.35 Gbits/sec
[  3]  6.0- 7.0 sec   135 MBytes  1.13 Gbits/sec
[  3]  7.0- 8.0 sec   249 MBytes  2.09 Gbits/sec
[  3]  8.0- 9.0 sec   224 MBytes  1.88 Gbits/sec
[  3]  9.0-10.0 sec   182 MBytes  1.53 Gbits/sec
[  3]  0.0-10.0 sec  2.23 GBytes  1.91 Gbits/sec
[root@wn-d-01 ~]#

Not even close. So next up we installed the official Mellanox kernel modules. With those we could also increase the window size to 1-2 MB (the stock setup somehow capped it at 256 KB; see the note after the sysctl list below), though this shouldn't matter. The first test looked promising for the first few seconds:


[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1021 MBytes  8.57 Gbits/sec
[  3]  1.0- 2.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  2.0- 3.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  3.0- 4.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  4.0- 5.0 sec   933 MBytes  7.82 Gbits/sec
[  3]  5.0- 6.0 sec   278 MBytes  2.33 Gbits/sec
[  3]  6.0- 7.0 sec   277 MBytes  2.32 Gbits/sec
[  3]  7.0- 8.0 sec   277 MBytes  2.32 Gbits/sec
[  3]  8.0- 9.0 sec   276 MBytes  2.32 Gbits/sec
[  3]  9.0-10.0 sec   277 MBytes  2.33 Gbits/sec

No matter how hard we tried, we couldn't repeat the 9.47 Gbit/s speed. Digging into the Mellanox network performance tuning guide, I first raised the default kernel parameters to the recommended values, but that had absolutely no impact on throughput.

The tunings they recommend:

# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0

# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0

# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000

# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216

# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1
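As a side note, the net.core.rmem_max / net.core.wmem_max lines above are presumably also what lifted the window-size cap mentioned earlier: iperf requests its window via setsockopt(), and the kernel silently clamps that request to these maximums, so with stock values a 1-2 MB request simply doesn't get through. The limits currently in effect are easy to check:

sysctl net.core.rmem_max net.core.wmem_max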

However, what did have some impact was turning off adaptive interrupt moderation, though only for a short time. We were getting 7 Gbit/s from one node to the other, but the other direction could do 7 Gbit/s only for a few seconds before hiccuping and dropping back to 2.3 Gbit/s:


iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   856 MBytes  7.18 Gbits/sec
[  3]  1.0- 2.0 sec   855 MBytes  7.17 Gbits/sec
[  3]  2.0- 3.0 sec   879 MBytes  7.37 Gbits/sec
[  3]  3.0- 4.0 sec   902 MBytes  7.57 Gbits/sec
[  3]  4.0- 5.0 sec   854 MBytes  7.16 Gbits/sec
[  3]  5.0- 6.0 sec   203 MBytes  1.71 Gbits/sec
[  3]  6.0- 7.0 sec   306 MBytes  2.56 Gbits/sec
[  3]  7.0- 8.0 sec   852 MBytes  7.15 Gbits/sec
[  3]  8.0- 9.0 sec   799 MBytes  6.70 Gbits/sec
[  3]  9.0-10.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 10.0-11.0 sec   323 MBytes  2.71 Gbits/sec
[  3] 11.0-12.0 sec   278 MBytes  2.33 Gbits/sec
[  3] 12.0-13.0 sec   277 MBytes  2.32 Gbits/sec
[  3] 13.0-14.0 sec   277 MBytes  2.32 Gbits/sec
...

Reading further, I turned adaptive RX moderation off and set fixed coalescing values instead:

ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
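Whether the new values actually took can be verified with the query form of the same command, which prints the current coalescing settings (adaptive-rx, rx-usecs, rx-frames and friends):

ethtool -c eth2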

Running two parallel streams in a bidirectional test gives a pretty good result.

wn-d-98 as client, wn-d-01 as server:

# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[  3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 5.0 sec  2.76 GBytes  4.73 Gbits/sec
[  3]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  0.0- 5.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 10.0-15.0 sec  2.76 GBytes  4.73 Gbits/sec
[  3] 10.0-15.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 10.0-15.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[  3]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[SUM]  0.0-20.0 sec  22.1 GBytes  9.47 Gbits/sec

The other direction:

------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[  4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  0.0- 5.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  5.0-10.0 sec  2.76 GBytes  4.73 Gbits/sec
[SUM]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 10.0-15.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4] 10.0-15.0 sec  2.76 GBytes  4.73 Gbits/sec
[SUM] 10.0-15.0 sec  5.52 GBytes  9.47 Gbits/sec
[  3] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[  4] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 15.0-20.0 sec  5.52 GBytes  9.48 Gbits/sec
[  4]  0.0-20.0 sec  11.0 GBytes  4.73 Gbits/sec
[SUM]  0.0-20.0 sec  22.1 GBytes  9.47 Gbits/sec

So with two streams we can saturate the network. Testing again with one stream we got d-98 -> d-01 at 9.5 Gbit/s, but the reverse direction was at 2.3 Gbit/s. Running d-01 -> d-98 with -P 2 got to 9.5 Gbit/s again. Bizarre. The first thing to test now is what happens after a reboot.

After the reboot we see pretty much the same state. A single stream is poor; two streams get 9.5 Gbit/s, at least initially, and then speeds slow down if you run them in parallel in both directions. The tuning script was deliberately not made reboot-safe, exactly so we could see whether it had any effect; same for the interrupt moderation settings. Setting both up again we get back to the previous state, where single-stream speeds are 2-3 Gbit/s when run simultaneously in both directions and two parallel streams get 9.5 Gbit/s in both directions.

Update: it seems the traffic speed is relatively unstable. I also tried moving to the OpenVZ kernel and, with the same tunings and interrupt settings, got the following result. Notice how the speed varies, and how it jumps up when traffic starts flowing in the other direction (the incoming connections appear in the output around the 25-30 second mark):


# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[  3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 5.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3]  0.0- 5.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM]  0.0- 5.0 sec  5.19 GBytes  8.92 Gbits/sec
[  4]  5.0-10.0 sec  1.76 GBytes  3.03 Gbits/sec
[  3]  5.0-10.0 sec  1.76 GBytes  3.03 Gbits/sec
[SUM]  5.0-10.0 sec  3.53 GBytes  6.06 Gbits/sec
[  4] 10.0-15.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3] 10.0-15.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM] 10.0-15.0 sec  5.19 GBytes  8.92 Gbits/sec
[  4] 15.0-20.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 15.0-20.0 sec   693 MBytes  1.16 Gbits/sec
[SUM] 15.0-20.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 20.0-25.0 sec   694 MBytes  1.17 Gbits/sec
[  3] 20.0-25.0 sec   694 MBytes  1.16 Gbits/sec
[SUM] 20.0-25.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[  6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[  4] 25.0-30.0 sec  1.85 GBytes  3.18 Gbits/sec
[  3] 25.0-30.0 sec  2.42 GBytes  4.16 Gbits/sec
[SUM] 25.0-30.0 sec  4.28 GBytes  7.35 Gbits/sec
[  3] 30.0-35.0 sec  2.36 GBytes  4.06 Gbits/sec
[  4] 30.0-35.0 sec  1.70 GBytes  2.93 Gbits/sec
[SUM] 30.0-35.0 sec  4.06 GBytes  6.98 Gbits/sec
[  3] 35.0-40.0 sec  2.83 GBytes  4.86 Gbits/sec
[  4] 35.0-40.0 sec  2.10 GBytes  3.61 Gbits/sec
[SUM] 35.0-40.0 sec  4.93 GBytes  8.47 Gbits/sec
[  4] 40.0-45.0 sec   820 MBytes  1.38 Gbits/sec
[  3] 40.0-45.0 sec  1.37 GBytes  2.36 Gbits/sec
[SUM] 40.0-45.0 sec  2.17 GBytes  3.73 Gbits/sec
[  4] 45.0-50.0 sec  2.30 GBytes  3.96 Gbits/sec
[  3] 45.0-50.0 sec  3.02 GBytes  5.20 Gbits/sec
[SUM] 45.0-50.0 sec  5.33 GBytes  9.15 Gbits/sec
[  4] 50.0-55.0 sec  1.37 GBytes  2.36 Gbits/sec
[  3] 50.0-55.0 sec  2.00 GBytes  3.43 Gbits/sec
[SUM] 50.0-55.0 sec  3.37 GBytes  5.79 Gbits/sec
[  4]  0.0-30.9 sec  12.4 GBytes  3.46 Gbits/sec
[  6]  0.0-30.9 sec  12.6 GBytes  3.50 Gbits/sec
[SUM]  0.0-30.9 sec  25.0 GBytes  6.96 Gbits/sec
[  4] 55.0-60.0 sec  2.63 GBytes  4.51 Gbits/sec
[  3] 55.0-60.0 sec  2.89 GBytes  4.96 Gbits/sec
[SUM] 55.0-60.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 60.0-65.0 sec  2.60 GBytes  4.47 Gbits/sec
[  3] 60.0-65.0 sec  2.60 GBytes  4.47 Gbits/sec
[SUM] 60.0-65.0 sec  5.20 GBytes  8.94 Gbits/sec
[  4] 65.0-70.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 65.0-70.0 sec   696 MBytes  1.17 Gbits/sec
[SUM] 65.0-70.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 70.0-75.0 sec   858 MBytes  1.44 Gbits/sec
[  3] 70.0-75.0 sec   858 MBytes  1.44 Gbits/sec
[SUM] 70.0-75.0 sec  1.67 GBytes  2.88 Gbits/sec
[  4] 75.0-80.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3] 75.0-80.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 75.0-80.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 80.0-85.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3] 80.0-85.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM] 80.0-85.0 sec  5.19 GBytes  8.92 Gbits/sec
[  3] 85.0-90.0 sec   694 MBytes  1.16 Gbits/sec
[  4] 85.0-90.0 sec   697 MBytes  1.17 Gbits/sec
[SUM] 85.0-90.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 90.0-95.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 90.0-95.0 sec   694 MBytes  1.17 Gbits/sec
[SUM] 90.0-95.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 95.0-100.0 sec   696 MBytes  1.17 Gbits/sec
[  4]  0.0-100.0 sec  32.6 GBytes  2.80 Gbits/sec
[  3] 95.0-100.0 sec   694 MBytes  1.16 Gbits/sec
[SUM] 95.0-100.0 sec  1.36 GBytes  2.33 Gbits/sec
[  3]  0.0-100.0 sec  36.7 GBytes  3.15 Gbits/sec
[SUM]  0.0-100.0 sec  69.3 GBytes  5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]# 

Reducing rx-usecs and rx-frames to 0, as recommended in the performance guide, I can't really get past 3 Gbit/s. So it does point towards some issue with interrupts.
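A quick way to see where the NIC interrupts actually land is /proc/interrupts; assuming the MSI-X vectors are named after the interface, something like this while a test is running:

watch -n1 'grep eth2 /proc/interrupts'

shows the per-CPU counters for each of the card's queues. If those counters wander across cores from run to run, that would fit the instability we're seeing.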

So, as a final test, since the Mellanox driver package provides scripts to set the IRQ affinity of Mellanox interfaces, I tried pinning the interrupts and retesting. On both nodes:


# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1

done.
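
All the script really does is write a CPU mask into /proc/irq/<N>/smp_affinity for each of the interface's interrupts; mask 0x1 means core 0. The manual equivalent for a single IRQ would be, for example:

echo 1 > /proc/irq/120/smp_affinity

One caveat: if irqbalance is running it may rewrite these masks behind your back, so it probably has to be stopped (service irqbalance stop) for the pinning to stick.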

And the result:


[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  5.51 GBytes  9.46 Gbits/sec
[  3]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 10.0-15.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 15.0-20.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 20.0-25.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 25.0-30.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 30.0-35.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 35.0-40.0 sec  5.51 GBytes  9.47 Gbits/sec
^C[  3]  0.0-41.6 sec  45.8 GBytes  9.47 Gbits/sec
[root@wn-d-01 ~]# 



This works with both single and multiple streams, in one direction and in both at once. Yay! Now we just have to solve this on ALL the nodes :)
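In practice that presumably boils down to making three things persistent on every node: the sysctl values (e.g. via /etc/sysctl.conf), the coalescing settings and the IRQ pinning, the latter two reapplied at boot. A rough sketch of the boot-time part, using our interface name:

# e.g. from /etc/rc.local
ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
set_irq_affinity_cpulist.sh 0 eth2   # script shipped with the Mellanox driver package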
