So today we decided to take the network testing seriously. We connected two servers to the generic 1G infrastructure (to access software from the internet) and ran a direct cable between them for 10G. This way we can test the OS + driver + NIC firmware level in isolation to confirm we can get to 10Gbit/s, and from there expand the test to include a switch.
To make sure we didn't have any stale software we reinstalled both nodes with CentOS 6.3 with no kernel tunings. We also compared the sysctl -a output of the two servers; though there were minor differences, none of them should prove relevant.
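One way to do such a comparison is to dump `sysctl -a` from each node into a file and diff the sorted dumps so only real differences show up. A sketch with tiny hypothetical dumps standing in for the real ones (the file names and contents here are made up):

```shell
# Sketch: compare per-node sysctl dumps; only genuinely differing keys appear.
# In practice each file would come from `sysctl -a | sort` on its node.
printf 'kernel.hostname = wn-d-01\nnet.ipv4.tcp_sack = 1\n' > /tmp/d01.sysctl
printf 'kernel.hostname = wn-d-98\nnet.ipv4.tcp_sack = 1\n' > /tmp/d98.sysctl
diff /tmp/d01.sysctl /tmp/d98.sysctl || true  # expect only per-host keys to differ
```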
We then launched a few baseline tests. First, we ran iperf locally inside each server to see how much the server itself can handle. With a 256KB window size both did at least 17Gbit/s:
[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1.77 GBytes 15.2 Gbits/sec
[ 3] 1.0- 2.0 sec 2.37 GBytes 20.4 Gbits/sec
[ 3] 2.0- 3.0 sec 2.42 GBytes 20.8 Gbits/sec
[ 3] 3.0- 4.0 sec 2.40 GBytes 20.6 Gbits/sec
[ 3] 4.0- 5.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 5.0- 6.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 6.0- 7.0 sec 2.39 GBytes 20.6 Gbits/sec
[ 3] 7.0- 8.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 8.0- 9.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 9.0-10.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 0.0-10.0 sec 23.3 GBytes 20.0 Gbits/sec
[root@wn-d-01 ~]#
We then ran the test between them, first over the 1Gbit/s link, to check that there is no generic foul play at work cutting speed to 25% or 33%. We saw a clean 942Mbit/s:
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 112 MBytes 943 Mbits/sec
[ 3] 1.0- 2.0 sec 112 MBytes 942 Mbits/sec
[ 3] 2.0- 3.0 sec 112 MBytes 942 Mbits/sec
[ 3] 3.0- 4.0 sec 112 MBytes 942 Mbits/sec
[ 3] 4.0- 5.0 sec 112 MBytes 942 Mbits/sec
[ 3] 5.0- 6.0 sec 112 MBytes 941 Mbits/sec
[ 3] 6.0- 7.0 sec 112 MBytes 942 Mbits/sec
[ 3] 7.0- 8.0 sec 112 MBytes 942 Mbits/sec
[ 3] 8.0- 9.0 sec 112 MBytes 942 Mbits/sec
[ 3] 9.0-10.0 sec 112 MBytes 942 Mbits/sec
[ 3] 0.0-10.0 sec 1.10 GBytes 942 Mbits/sec
[root@wn-d-01 ~]#
So now we fired up the default kernel driver for our Mellanox 10G card and first tested the link with ping:
[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms
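The relevant sanity check here is the bandwidth-delay product: a single TCP stream cannot exceed the window size divided by the RTT. A quick back-of-the-envelope calculation:

```shell
# Single-stream TCP ceiling = window / RTT (bandwidth-delay product bound).
# 256 KB window, ~0.2 ms RTT (the steady-state ping figure above):
awk 'BEGIN { window_bits = 256 * 1024 * 8; rtt_s = 0.0002;
             printf "%.1f Gbit/s\n", window_bits / rtt_s / 1e9 }'
# prints 10.5 Gbit/s
```

So at this RTT a 256 KB window comfortably covers a 10G link; any shortfall has to come from somewhere else.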
So with an RTT of ~0.2ms, a 256KB window caps a single stream at window/RTT, which works out to right around 10Gbit/s, so the window shouldn't be the bottleneck. Let's see if that actually works (remember, it's a direct attached cable and nothing else is running on the servers):
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 290 MBytes 2.43 Gbits/sec
[ 3] 1.0- 2.0 sec 296 MBytes 2.49 Gbits/sec
[ 3] 2.0- 3.0 sec 191 MBytes 1.61 Gbits/sec
[ 3] 3.0- 4.0 sec 320 MBytes 2.68 Gbits/sec
[ 3] 4.0- 5.0 sec 232 MBytes 1.95 Gbits/sec
[ 3] 5.0- 6.0 sec 161 MBytes 1.35 Gbits/sec
[ 3] 6.0- 7.0 sec 135 MBytes 1.13 Gbits/sec
[ 3] 7.0- 8.0 sec 249 MBytes 2.09 Gbits/sec
[ 3] 8.0- 9.0 sec 224 MBytes 1.88 Gbits/sec
[ 3] 9.0-10.0 sec 182 MBytes 1.53 Gbits/sec
[ 3] 0.0-10.0 sec 2.23 GBytes 1.91 Gbits/sec
[root@wn-d-01 ~]#
Not even close. So next we installed the official Mellanox kernel modules. With those we could also increase the window size to 1-2MB (the default driver had somehow capped it at 256KB), though this shouldn't matter. The first test looked promising for the first few seconds:
[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1021 MBytes 8.57 Gbits/sec
[ 3] 1.0- 2.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 2.0- 3.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 3.0- 4.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 4.0- 5.0 sec 933 MBytes 7.82 Gbits/sec
[ 3] 5.0- 6.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 6.0- 7.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 7.0- 8.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 8.0- 9.0 sec 276 MBytes 2.32 Gbits/sec
[ 3] 9.0-10.0 sec 277 MBytes 2.33 Gbits/sec
No matter how hard we tried we couldn't repeat the 9.47Gbit/s speed. Digging into the Mellanox network performance tuning guide, I first raised the default kernel parameters to their recommended values, but that had absolutely no impact on throughput.
The tunings they recommend:
# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0
# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1
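Note that `sysctl -w` settings like these are runtime-only and vanish on reboot (which is deliberate here, as discussed below). If they were ever to be made permanent, they would go into /etc/sysctl.conf in key = value form; a sketch that just prints that form for a subset of the tunings (redirect to the file as root):

```shell
# Print the reboot-safe /etc/sysctl.conf form of (a subset of) the tunings.
cat <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
EOF
```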
What did have some impact was turning off adaptive interrupt moderation, though only for a short while. We got 7Gbit/s from one node to the other, but the reverse direction could hold 7Gbit/s only for a few seconds before hiccuping and dropping back down to 2.3Gbit/s:
iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 856 MBytes 7.18 Gbits/sec
[ 3] 1.0- 2.0 sec 855 MBytes 7.17 Gbits/sec
[ 3] 2.0- 3.0 sec 879 MBytes 7.37 Gbits/sec
[ 3] 3.0- 4.0 sec 902 MBytes 7.57 Gbits/sec
[ 3] 4.0- 5.0 sec 854 MBytes 7.16 Gbits/sec
[ 3] 5.0- 6.0 sec 203 MBytes 1.71 Gbits/sec
[ 3] 6.0- 7.0 sec 306 MBytes 2.56 Gbits/sec
[ 3] 7.0- 8.0 sec 852 MBytes 7.15 Gbits/sec
[ 3] 8.0- 9.0 sec 799 MBytes 6.70 Gbits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 10.0-11.0 sec 323 MBytes 2.71 Gbits/sec
[ 3] 11.0-12.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 12.0-13.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 13.0-14.0 sec 277 MBytes 2.32 Gbits/sec
...
Reading further, I turned adaptive moderation off and set static interrupt coalescing values:
ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
Running two parallel streams in a bidirectional test gives a pretty good result.
wn-d-98 as client, wn-d-01 as server:
# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[ 3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
The other direction:
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[ 4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 10.0-15.0 sec 5.52 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 15.0-20.0 sec 5.52 GBytes 9.48 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.73 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
So with two streams we can saturate the link. Testing again with one stream, d-98 -> d-01 ran at 9.5Gbit/s, but the reverse direction was stuck at 2.3Gbit/s. Running d-01 -> d-98 with -P 2 got back to 9.5Gbit/s. Bizarre. The next test is to see what happens after a reboot.
After the reboot we see pretty much the same state: single stream is slow, two streams reach 9.5Gbit/s at least initially, and speeds drop if you run them in both directions in parallel. The tuning script was deliberately not made restart-safe, exactly so we could see whether it had any effect; same for the adaptive interrupt settings. Setting both up again brings us back to the previous state: single-stream speeds of 2-3Gbit/s when run simultaneously, and two parallel streams reaching 9.5Gbit/s in both directions.
Update: the traffic speed seems relatively unstable. I also tried moving to the OpenVZ kernel and, with the same tunings and adaptive interrupt settings, got the following result (notice how the speed varies and how it jumps when reverse-direction traffic starts partway through the run):
# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[ 3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 0.0- 5.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[ 3] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[SUM] 5.0-10.0 sec 3.53 GBytes 6.06 Gbits/sec
[ 4] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 10.0-15.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 15.0-20.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 15.0-20.0 sec 693 MBytes 1.16 Gbits/sec
[SUM] 15.0-20.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 20.0-25.0 sec 694 MBytes 1.17 Gbits/sec
[ 3] 20.0-25.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 20.0-25.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[ 6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[ 4] 25.0-30.0 sec 1.85 GBytes 3.18 Gbits/sec
[ 3] 25.0-30.0 sec 2.42 GBytes 4.16 Gbits/sec
[SUM] 25.0-30.0 sec 4.28 GBytes 7.35 Gbits/sec
[ 3] 30.0-35.0 sec 2.36 GBytes 4.06 Gbits/sec
[ 4] 30.0-35.0 sec 1.70 GBytes 2.93 Gbits/sec
[SUM] 30.0-35.0 sec 4.06 GBytes 6.98 Gbits/sec
[ 3] 35.0-40.0 sec 2.83 GBytes 4.86 Gbits/sec
[ 4] 35.0-40.0 sec 2.10 GBytes 3.61 Gbits/sec
[SUM] 35.0-40.0 sec 4.93 GBytes 8.47 Gbits/sec
[ 4] 40.0-45.0 sec 820 MBytes 1.38 Gbits/sec
[ 3] 40.0-45.0 sec 1.37 GBytes 2.36 Gbits/sec
[SUM] 40.0-45.0 sec 2.17 GBytes 3.73 Gbits/sec
[ 4] 45.0-50.0 sec 2.30 GBytes 3.96 Gbits/sec
[ 3] 45.0-50.0 sec 3.02 GBytes 5.20 Gbits/sec
[SUM] 45.0-50.0 sec 5.33 GBytes 9.15 Gbits/sec
[ 4] 50.0-55.0 sec 1.37 GBytes 2.36 Gbits/sec
[ 3] 50.0-55.0 sec 2.00 GBytes 3.43 Gbits/sec
[SUM] 50.0-55.0 sec 3.37 GBytes 5.79 Gbits/sec
[ 4] 0.0-30.9 sec 12.4 GBytes 3.46 Gbits/sec
[ 6] 0.0-30.9 sec 12.6 GBytes 3.50 Gbits/sec
[SUM] 0.0-30.9 sec 25.0 GBytes 6.96 Gbits/sec
[ 4] 55.0-60.0 sec 2.63 GBytes 4.51 Gbits/sec
[ 3] 55.0-60.0 sec 2.89 GBytes 4.96 Gbits/sec
[SUM] 55.0-60.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[ 3] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[SUM] 60.0-65.0 sec 5.20 GBytes 8.94 Gbits/sec
[ 4] 65.0-70.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 65.0-70.0 sec 696 MBytes 1.17 Gbits/sec
[SUM] 65.0-70.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[ 3] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[SUM] 70.0-75.0 sec 1.67 GBytes 2.88 Gbits/sec
[ 4] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 75.0-80.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 80.0-85.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 3] 85.0-90.0 sec 694 MBytes 1.16 Gbits/sec
[ 4] 85.0-90.0 sec 697 MBytes 1.17 Gbits/sec
[SUM] 85.0-90.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 90.0-95.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 90.0-95.0 sec 694 MBytes 1.17 Gbits/sec
[SUM] 90.0-95.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 95.0-100.0 sec 696 MBytes 1.17 Gbits/sec
[ 4] 0.0-100.0 sec 32.6 GBytes 2.80 Gbits/sec
[ 3] 95.0-100.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 95.0-100.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 3] 0.0-100.0 sec 36.7 GBytes 3.15 Gbits/sec
[SUM] 0.0-100.0 sec 69.3 GBytes 5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]#
Reducing rx-usecs and rx-frames to 0, as recommended in the performance guide, I can't really get past 3Gbit/s. So it does point towards some issue with interrupts.
So as a final test, since the Mellanox driver package provides scripts to set IRQ affinity for Mellanox interfaces, I tried pinning the interrupts and retesting. On both nodes:
# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1
done.
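What the helper script does can be sketched as parsing /proc/interrupts for the interface's IRQ numbers and writing a CPU mask into each /proc/irq/&lt;n&gt;/smp_affinity. In the sketch below a hypothetical two-line excerpt stands in for the live file, and the writes are printed rather than performed (pipe the output to sh as root to actually apply them):

```shell
# Parse the IRQ numbers for eth2 out of a (hypothetical) /proc/interrupts
# excerpt and print the smp_affinity writes pinning them all to CPU0 (mask 0x1).
sample='120:   0   0   PCI-MSI-edge   eth2-0
121:   0   0   PCI-MSI-edge   eth2-1'
echo "$sample" | awk '/eth2/ { sub(/:$/, "", $1);
    printf "echo 1 > /proc/irq/%s/smp_affinity\n", $1 }'
```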
And the result:
[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 5.51 GBytes 9.46 Gbits/sec
[ 3] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 20.0-25.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 25.0-30.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 30.0-35.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 35.0-40.0 sec 5.51 GBytes 9.47 Gbits/sec
^C[ 3] 0.0-41.6 sec 45.8 GBytes 9.47 Gbits/sec
[root@wn-d-01 ~]#
It works with both single and multiple streams, in one direction and both. Yay! Now we just have to solve this on ALL nodes :)
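The roll-out could be as simple as looping over the nodes with ssh. A dry-run sketch (the node names and script path are assumptions; it prints each command instead of running it, so drop the echo to actually execute):

```shell
# Dry-run: print the per-node command that would apply the IRQ pinning fix.
for node in wn-d-01 wn-d-98; do
    echo ssh "$node" "set_irq_affinity_cpulist.sh 0 eth2"
done
```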