Today I ran into a situation where gluster was stuck. You could check info and status, but nothing more spectacular. When you tried enabling profiling or heal you'd get in log:
[2012-12-13 15:40:07.881363] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3ce6b9c-6297-4e77-924c-b44e2c13e58f, lock held by: c3ce6b9c-6297-4e77-924c-b44e2c13e58f
As can be seen the node itself can't get a lock that it itself is holding. Googling around it seems that the error comes from a command that failed to complete correctly and didn't release the lock. The web talks that this lock should auto-disappear in 30 minutes, but that's not the best solution. Also I was able to get rid of the lock a few times with
gluster volume statedump home0
but not always. When I managed to clear the lock things seemed to work (profiling started and gave info etc), but when I ran again
gluster volume heal home0
it locked up again. In the end the only help that I got from #gluster was to unmount all clients, stop all gluster processes and restart them. That indeed cleared the locks and heal command also works, but it's not really a solution I'd like to ever see again. I've written to gluster-users list and will update this entry if I get an actual way that this should have been cleared.
Thursday, December 13, 2012
Wednesday, December 5, 2012
Network debugging for 10Gbit/s
So today we decided to take the network testing seriously. We connected two servers with 1G network to the generic infrastructure (to access software from internet) and then a direct cable between them for 10G. This way we should be able to test exclusively the OS + driver + NIC fw level that we can get to 10Gbit and from there expand the test to include a switch.
To make sure we didn't have any stale software we reinstalled both nodes with CentOS 6.3 with no kernel tunings. We also compared sysctl -a output on the two servers and though there were minor differences none of them should prove relevant.
We then launched a few baseline tests. First of all running iperf locally inside the server to see how much the server itself can handle. With 256KB window size both did at least 17Gbit/s
[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1.77 GBytes 15.2 Gbits/sec
[ 3] 1.0- 2.0 sec 2.37 GBytes 20.4 Gbits/sec
[ 3] 2.0- 3.0 sec 2.42 GBytes 20.8 Gbits/sec
[ 3] 3.0- 4.0 sec 2.40 GBytes 20.6 Gbits/sec
[ 3] 4.0- 5.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 5.0- 6.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 6.0- 7.0 sec 2.39 GBytes 20.6 Gbits/sec
[ 3] 7.0- 8.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 8.0- 9.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 9.0-10.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 0.0-10.0 sec 23.3 GBytes 20.0 Gbits/sec
[root@wn-d-01 ~]#
Running then the test between them firstly on the 1Gbit/s to see that there is no generic foul play at work that cuts speed to 25% or 33% we saw nicely 942Mbit/s speeds:
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 112 MBytes 943 Mbits/sec
[ 3] 1.0- 2.0 sec 112 MBytes 942 Mbits/sec
[ 3] 2.0- 3.0 sec 112 MBytes 942 Mbits/sec
[ 3] 3.0- 4.0 sec 112 MBytes 942 Mbits/sec
[ 3] 4.0- 5.0 sec 112 MBytes 942 Mbits/sec
[ 3] 5.0- 6.0 sec 112 MBytes 941 Mbits/sec
[ 3] 6.0- 7.0 sec 112 MBytes 942 Mbits/sec
[ 3] 7.0- 8.0 sec 112 MBytes 942 Mbits/sec
[ 3] 8.0- 9.0 sec 112 MBytes 942 Mbits/sec
[ 3] 9.0-10.0 sec 112 MBytes 942 Mbits/sec
[ 3] 0.0-10.0 sec 1.10 GBytes 942 Mbits/sec
[root@wn-d-01 ~]#
So now we fired up the default kernel driver for our Mellanox 10G card and tested the link firstly with ping:
[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms
So an RTT of 0.2ms means that with 256KB window size you get 9.8Gbit/s. So let's see if that actually works (remember, it's direct attached cable, nothing else running on the servers):
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 290 MBytes 2.43 Gbits/sec
[ 3] 1.0- 2.0 sec 296 MBytes 2.49 Gbits/sec
[ 3] 2.0- 3.0 sec 191 MBytes 1.61 Gbits/sec
[ 3] 3.0- 4.0 sec 320 MBytes 2.68 Gbits/sec
[ 3] 4.0- 5.0 sec 232 MBytes 1.95 Gbits/sec
[ 3] 5.0- 6.0 sec 161 MBytes 1.35 Gbits/sec
[ 3] 6.0- 7.0 sec 135 MBytes 1.13 Gbits/sec
[ 3] 7.0- 8.0 sec 249 MBytes 2.09 Gbits/sec
[ 3] 8.0- 9.0 sec 224 MBytes 1.88 Gbits/sec
[ 3] 9.0-10.0 sec 182 MBytes 1.53 Gbits/sec
[ 3] 0.0-10.0 sec 2.23 GBytes 1.91 Gbits/sec
[root@wn-d-01 ~]#
Not even close. So next up we installed Mellanox official kernel modules. With those we could also increase the window size to 1-2MB etc (which the default somehow capped at 256KB). Though this shouldn't matter. The first test looked promising the first few seconds:
[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1021 MBytes 8.57 Gbits/sec
[ 3] 1.0- 2.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 2.0- 3.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 3.0- 4.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 4.0- 5.0 sec 933 MBytes 7.82 Gbits/sec
[ 3] 5.0- 6.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 6.0- 7.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 7.0- 8.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 8.0- 9.0 sec 276 MBytes 2.32 Gbits/sec
[ 3] 9.0-10.0 sec 277 MBytes 2.33 Gbits/sec
No matter how hard we tried we couldn't repeat the 9.47Gb/s speeds. Digging into Mellanox network performance tuning guide I first set the default kernel parameters according to them to higher values however that had absolutely no impact on throughput.
The tunings they recommend:
# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0
# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1
However what did impact somewhat is turning off adaptive interrupt moderation. Though only for a short time. We're getting 7Gbit/s from one node to the other, but the other direction was able to do 7 Gbit/s only for a few seconds before hiccuping and going down to 2.3Gbit/s again:
iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 856 MBytes 7.18 Gbits/sec
[ 3] 1.0- 2.0 sec 855 MBytes 7.17 Gbits/sec
[ 3] 2.0- 3.0 sec 879 MBytes 7.37 Gbits/sec
[ 3] 3.0- 4.0 sec 902 MBytes 7.57 Gbits/sec
[ 3] 4.0- 5.0 sec 854 MBytes 7.16 Gbits/sec
[ 3] 5.0- 6.0 sec 203 MBytes 1.71 Gbits/sec
[ 3] 6.0- 7.0 sec 306 MBytes 2.56 Gbits/sec
[ 3] 7.0- 8.0 sec 852 MBytes 7.15 Gbits/sec
[ 3] 8.0- 9.0 sec 799 MBytes 6.70 Gbits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 10.0-11.0 sec 323 MBytes 2.71 Gbits/sec
[ 3] 11.0-12.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 12.0-13.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 13.0-14.0 sec 277 MBytes 2.32 Gbits/sec
...
Reading further I changed the adaptive interrupt moderation to this:
ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
Running two parallel streams and bidirectional test gives a pretty good result.
wn-d-98 as client, wn-d-01 as server:
# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[ 3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
The other direction:
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[ 4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 10.0-15.0 sec 5.52 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 15.0-20.0 sec 5.52 GBytes 9.48 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.73 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
So with two streams we can saturate the network. Testing again with 1 stream we got d-98 -> d-01 at 9.5Gb/s, but the reverse was at 2.3 Gb/s. Running d-01 -> d-98 at -P 2 got to 9.5G again. Bizarre. The first test now is to see what happens after reboot.
After reboot we see pretty much the same state. Single stream sucks, two streams get 9.5G at least initially and then speeds slow down if you run them parallel in both directions. The tuning script was not made restart safe exactly to see if it had any effect. Same for the adaptive interrupts. Setting both up again we get back to the previous state where single stream speeds are 2-3Gbit/s when run at the same time and two parallel streams get 9.5G in both directions.
Update: it seems the traffic speed is relatively unstable. I tried to move to OpenVZ kernel as well and got the result that with the same tunings and adaptive interrupts I got the following result (notice the varying speed and how it jumps when the other direction traffic is sent, which I've highlighted in red):
# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[ 3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 0.0- 5.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[ 3] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[SUM] 5.0-10.0 sec 3.53 GBytes 6.06 Gbits/sec
[ 4] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 10.0-15.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 15.0-20.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 15.0-20.0 sec 693 MBytes 1.16 Gbits/sec
[SUM] 15.0-20.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 20.0-25.0 sec 694 MBytes 1.17 Gbits/sec
[ 3] 20.0-25.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 20.0-25.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[ 6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[ 4] 25.0-30.0 sec 1.85 GBytes 3.18 Gbits/sec
[ 3] 25.0-30.0 sec 2.42 GBytes 4.16 Gbits/sec
[SUM] 25.0-30.0 sec 4.28 GBytes 7.35 Gbits/sec
[ 3] 30.0-35.0 sec 2.36 GBytes 4.06 Gbits/sec
[ 4] 30.0-35.0 sec 1.70 GBytes 2.93 Gbits/sec
[SUM] 30.0-35.0 sec 4.06 GBytes 6.98 Gbits/sec
[ 3] 35.0-40.0 sec 2.83 GBytes 4.86 Gbits/sec
[ 4] 35.0-40.0 sec 2.10 GBytes 3.61 Gbits/sec
[SUM] 35.0-40.0 sec 4.93 GBytes 8.47 Gbits/sec
[ 4] 40.0-45.0 sec 820 MBytes 1.38 Gbits/sec
[ 3] 40.0-45.0 sec 1.37 GBytes 2.36 Gbits/sec
[SUM] 40.0-45.0 sec 2.17 GBytes 3.73 Gbits/sec
[ 4] 45.0-50.0 sec 2.30 GBytes 3.96 Gbits/sec
[ 3] 45.0-50.0 sec 3.02 GBytes 5.20 Gbits/sec
[SUM] 45.0-50.0 sec 5.33 GBytes 9.15 Gbits/sec
[ 4] 50.0-55.0 sec 1.37 GBytes 2.36 Gbits/sec
[ 3] 50.0-55.0 sec 2.00 GBytes 3.43 Gbits/sec
[SUM] 50.0-55.0 sec 3.37 GBytes 5.79 Gbits/sec
[ 4] 0.0-30.9 sec 12.4 GBytes 3.46 Gbits/sec
[ 6] 0.0-30.9 sec 12.6 GBytes 3.50 Gbits/sec
[SUM] 0.0-30.9 sec 25.0 GBytes 6.96 Gbits/sec
[ 4] 55.0-60.0 sec 2.63 GBytes 4.51 Gbits/sec
[ 3] 55.0-60.0 sec 2.89 GBytes 4.96 Gbits/sec
[SUM] 55.0-60.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[ 3] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[SUM] 60.0-65.0 sec 5.20 GBytes 8.94 Gbits/sec
[ 4] 65.0-70.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 65.0-70.0 sec 696 MBytes 1.17 Gbits/sec
[SUM] 65.0-70.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[ 3] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[SUM] 70.0-75.0 sec 1.67 GBytes 2.88 Gbits/sec
[ 4] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 75.0-80.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 80.0-85.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 3] 85.0-90.0 sec 694 MBytes 1.16 Gbits/sec
[ 4] 85.0-90.0 sec 697 MBytes 1.17 Gbits/sec
[SUM] 85.0-90.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 90.0-95.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 90.0-95.0 sec 694 MBytes 1.17 Gbits/sec
[SUM] 90.0-95.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 95.0-100.0 sec 696 MBytes 1.17 Gbits/sec
[ 4] 0.0-100.0 sec 32.6 GBytes 2.80 Gbits/sec
[ 3] 95.0-100.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 95.0-100.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 3] 0.0-100.0 sec 36.7 GBytes 3.15 Gbits/sec
[SUM] 0.0-100.0 sec 69.3 GBytes 5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]#
reducing the rx-usec and rx-frames to 0 as recommended in the performance guide I can't really get past 3Gbit/s. So it does point towards some issue with interrupts.
So as a final test as Mellanox driver package provide scripts to set IRQ affinity for mellanox interfaces I tried fixing it and retesting. On both nodes:
# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1
done.
And the result:
[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 5.51 GBytes 9.46 Gbits/sec
[ 3] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 20.0-25.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 25.0-30.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 30.0-35.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 35.0-40.0 sec 5.51 GBytes 9.47 Gbits/sec
^C[ 3] 0.0-41.6 sec 45.8 GBytes 9.47 Gbits/sec
[root@wn-d-01 ~]#
works both single and multiple streams and on one and both directions. Yay!. Now we just have to solve this on ALL nodes :)
To make sure we didn't have any stale software we reinstalled both nodes with CentOS 6.3 with no kernel tunings. We also compared sysctl -a output on the two servers and though there were minor differences none of them should prove relevant.
We then launched a few baseline tests. First of all running iperf locally inside the server to see how much the server itself can handle. With 256KB window size both did at least 17Gbit/s
[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1.77 GBytes 15.2 Gbits/sec
[ 3] 1.0- 2.0 sec 2.37 GBytes 20.4 Gbits/sec
[ 3] 2.0- 3.0 sec 2.42 GBytes 20.8 Gbits/sec
[ 3] 3.0- 4.0 sec 2.40 GBytes 20.6 Gbits/sec
[ 3] 4.0- 5.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 5.0- 6.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 6.0- 7.0 sec 2.39 GBytes 20.6 Gbits/sec
[ 3] 7.0- 8.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 8.0- 9.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 9.0-10.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 0.0-10.0 sec 23.3 GBytes 20.0 Gbits/sec
[root@wn-d-01 ~]#
Running then the test between them firstly on the 1Gbit/s to see that there is no generic foul play at work that cuts speed to 25% or 33% we saw nicely 942Mbit/s speeds:
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 112 MBytes 943 Mbits/sec
[ 3] 1.0- 2.0 sec 112 MBytes 942 Mbits/sec
[ 3] 2.0- 3.0 sec 112 MBytes 942 Mbits/sec
[ 3] 3.0- 4.0 sec 112 MBytes 942 Mbits/sec
[ 3] 4.0- 5.0 sec 112 MBytes 942 Mbits/sec
[ 3] 5.0- 6.0 sec 112 MBytes 941 Mbits/sec
[ 3] 6.0- 7.0 sec 112 MBytes 942 Mbits/sec
[ 3] 7.0- 8.0 sec 112 MBytes 942 Mbits/sec
[ 3] 8.0- 9.0 sec 112 MBytes 942 Mbits/sec
[ 3] 9.0-10.0 sec 112 MBytes 942 Mbits/sec
[ 3] 0.0-10.0 sec 1.10 GBytes 942 Mbits/sec
[root@wn-d-01 ~]#
So now we fired up the default kernel driver for our Mellanox 10G card and tested the link firstly with ping:
[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms
So an RTT of 0.2ms means that with 256KB window size you get 9.8Gbit/s. So let's see if that actually works (remember, it's direct attached cable, nothing else running on the servers):
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 290 MBytes 2.43 Gbits/sec
[ 3] 1.0- 2.0 sec 296 MBytes 2.49 Gbits/sec
[ 3] 2.0- 3.0 sec 191 MBytes 1.61 Gbits/sec
[ 3] 3.0- 4.0 sec 320 MBytes 2.68 Gbits/sec
[ 3] 4.0- 5.0 sec 232 MBytes 1.95 Gbits/sec
[ 3] 5.0- 6.0 sec 161 MBytes 1.35 Gbits/sec
[ 3] 6.0- 7.0 sec 135 MBytes 1.13 Gbits/sec
[ 3] 7.0- 8.0 sec 249 MBytes 2.09 Gbits/sec
[ 3] 8.0- 9.0 sec 224 MBytes 1.88 Gbits/sec
[ 3] 9.0-10.0 sec 182 MBytes 1.53 Gbits/sec
[ 3] 0.0-10.0 sec 2.23 GBytes 1.91 Gbits/sec
[root@wn-d-01 ~]#
Not even close. So next up we installed Mellanox official kernel modules. With those we could also increase the window size to 1-2MB etc (which the default somehow capped at 256KB). Though this shouldn't matter. The first test looked promising the first few seconds:
[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1021 MBytes 8.57 Gbits/sec
[ 3] 1.0- 2.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 2.0- 3.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 3.0- 4.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 4.0- 5.0 sec 933 MBytes 7.82 Gbits/sec
[ 3] 5.0- 6.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 6.0- 7.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 7.0- 8.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 8.0- 9.0 sec 276 MBytes 2.32 Gbits/sec
[ 3] 9.0-10.0 sec 277 MBytes 2.33 Gbits/sec
No matter how hard we tried we couldn't repeat the 9.47Gb/s speeds. Digging into Mellanox network performance tuning guide I first set the default kernel parameters according to them to higher values however that had absolutely no impact on throughput.
The tunings they recommend:
# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0
# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1
However what did impact somewhat is turning off adaptive interrupt moderation. Though only for a short time. We're getting 7Gbit/s from one node to the other, but the other direction was able to do 7 Gbit/s only for a few seconds before hiccuping and going down to 2.3Gbit/s again:
iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 856 MBytes 7.18 Gbits/sec
[ 3] 1.0- 2.0 sec 855 MBytes 7.17 Gbits/sec
[ 3] 2.0- 3.0 sec 879 MBytes 7.37 Gbits/sec
[ 3] 3.0- 4.0 sec 902 MBytes 7.57 Gbits/sec
[ 3] 4.0- 5.0 sec 854 MBytes 7.16 Gbits/sec
[ 3] 5.0- 6.0 sec 203 MBytes 1.71 Gbits/sec
[ 3] 6.0- 7.0 sec 306 MBytes 2.56 Gbits/sec
[ 3] 7.0- 8.0 sec 852 MBytes 7.15 Gbits/sec
[ 3] 8.0- 9.0 sec 799 MBytes 6.70 Gbits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 10.0-11.0 sec 323 MBytes 2.71 Gbits/sec
[ 3] 11.0-12.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 12.0-13.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 13.0-14.0 sec 277 MBytes 2.32 Gbits/sec
...
Reading further I changed the adaptive interrupt moderation to this:
ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
Running two parallel streams and bidirectional test gives a pretty good result.
wn-d-98 as client, wn-d-01 as server:
# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[ 3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
The other direction:
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[ 4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 10.0-15.0 sec 5.52 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 15.0-20.0 sec 5.52 GBytes 9.48 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.73 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
So with two streams we can saturate the network. Testing again with 1 stream we got d-98 -> d-01 at 9.5Gb/s, but the reverse was at 2.3 Gb/s. Running d-01 -> d-98 at -P 2 got to 9.5G again. Bizarre. The first test now is to see what happens after reboot.
After reboot we see pretty much the same state. Single stream sucks, two streams get 9.5G at least initially and then speeds slow down if you run them parallel in both directions. The tuning script was not made restart safe exactly to see if it had any effect. Same for the adaptive interrupts. Setting both up again we get back to the previous state where single stream speeds are 2-3Gbit/s when run at the same time and two parallel streams get 9.5G in both directions.
Update: it seems the traffic speed is relatively unstable. I tried to move to OpenVZ kernel as well and got the result that with the same tunings and adaptive interrupts I got the following result (notice the varying speed and how it jumps when the other direction traffic is sent, which I've highlighted in red):
# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[ 3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 0.0- 5.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[ 3] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[SUM] 5.0-10.0 sec 3.53 GBytes 6.06 Gbits/sec
[ 4] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 10.0-15.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 15.0-20.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 15.0-20.0 sec 693 MBytes 1.16 Gbits/sec
[SUM] 15.0-20.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 20.0-25.0 sec 694 MBytes 1.17 Gbits/sec
[ 3] 20.0-25.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 20.0-25.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[ 6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[ 4] 25.0-30.0 sec 1.85 GBytes 3.18 Gbits/sec
[ 3] 25.0-30.0 sec 2.42 GBytes 4.16 Gbits/sec
[SUM] 25.0-30.0 sec 4.28 GBytes 7.35 Gbits/sec
[ 3] 30.0-35.0 sec 2.36 GBytes 4.06 Gbits/sec
[ 4] 30.0-35.0 sec 1.70 GBytes 2.93 Gbits/sec
[SUM] 30.0-35.0 sec 4.06 GBytes 6.98 Gbits/sec
[ 3] 35.0-40.0 sec 2.83 GBytes 4.86 Gbits/sec
[ 4] 35.0-40.0 sec 2.10 GBytes 3.61 Gbits/sec
[SUM] 35.0-40.0 sec 4.93 GBytes 8.47 Gbits/sec
[ 4] 40.0-45.0 sec 820 MBytes 1.38 Gbits/sec
[ 3] 40.0-45.0 sec 1.37 GBytes 2.36 Gbits/sec
[SUM] 40.0-45.0 sec 2.17 GBytes 3.73 Gbits/sec
[ 4] 45.0-50.0 sec 2.30 GBytes 3.96 Gbits/sec
[ 3] 45.0-50.0 sec 3.02 GBytes 5.20 Gbits/sec
[SUM] 45.0-50.0 sec 5.33 GBytes 9.15 Gbits/sec
[ 4] 50.0-55.0 sec 1.37 GBytes 2.36 Gbits/sec
[ 3] 50.0-55.0 sec 2.00 GBytes 3.43 Gbits/sec
[SUM] 50.0-55.0 sec 3.37 GBytes 5.79 Gbits/sec
[ 4] 0.0-30.9 sec 12.4 GBytes 3.46 Gbits/sec
[ 6] 0.0-30.9 sec 12.6 GBytes 3.50 Gbits/sec
[SUM] 0.0-30.9 sec 25.0 GBytes 6.96 Gbits/sec
[ 4] 55.0-60.0 sec 2.63 GBytes 4.51 Gbits/sec
[ 3] 55.0-60.0 sec 2.89 GBytes 4.96 Gbits/sec
[SUM] 55.0-60.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[ 3] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[SUM] 60.0-65.0 sec 5.20 GBytes 8.94 Gbits/sec
[ 4] 65.0-70.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 65.0-70.0 sec 696 MBytes 1.17 Gbits/sec
[SUM] 65.0-70.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[ 3] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[SUM] 70.0-75.0 sec 1.67 GBytes 2.88 Gbits/sec
[ 4] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 75.0-80.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 80.0-85.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 3] 85.0-90.0 sec 694 MBytes 1.16 Gbits/sec
[ 4] 85.0-90.0 sec 697 MBytes 1.17 Gbits/sec
[SUM] 85.0-90.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 90.0-95.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 90.0-95.0 sec 694 MBytes 1.17 Gbits/sec
[SUM] 90.0-95.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 95.0-100.0 sec 696 MBytes 1.17 Gbits/sec
[ 4] 0.0-100.0 sec 32.6 GBytes 2.80 Gbits/sec
[ 3] 95.0-100.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 95.0-100.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 3] 0.0-100.0 sec 36.7 GBytes 3.15 Gbits/sec
[SUM] 0.0-100.0 sec 69.3 GBytes 5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]#
reducing the rx-usec and rx-frames to 0 as recommended in the performance guide I can't really get past 3Gbit/s. So it does point towards some issue with interrupts.
So as a final test as Mellanox driver package provide scripts to set IRQ affinity for mellanox interfaces I tried fixing it and retesting. On both nodes:
# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1
done.
And the result:
[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 5.51 GBytes 9.46 Gbits/sec
[ 3] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 20.0-25.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 25.0-30.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 30.0-35.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 35.0-40.0 sec 5.51 GBytes 9.47 Gbits/sec
^C[ 3] 0.0-41.6 sec 45.8 GBytes 9.47 Gbits/sec
[root@wn-d-01 ~]#
works both single and multiple streams and on one and both directions. Yay!. Now we just have to solve this on ALL nodes :)
Tuesday, December 4, 2012
OpenVZ latest kernels screw networking?
We've been puzzled recently with the overall performance of our networking. We run a mellanox fabric of SX1016 switches (four of them), which are 64 port 10G switches. We trunk three of them with eight ports to the fourth meaning that within a switch you can get up to 560Gbit/s, but between switches you're probably limited to 80Gbit/s. The way our environment is distributed you should get goods from both worlds.
However in practice we see most nodes traffic in the 0.5-1.5Gbit/s range. Which is really really odd. We'd been suspecting the switches etc for a long while and have about 4 different mellanox tickets open including both switch and 10G card firmwares in production now, that were largely created by Mellanox because of our issues.
But today as Ilja was debugging another issue at one of his deployments on 1G he noticed a weird network performance drop. Even with basic tests he couldn't get 1G line speed, not even close. The bugzilla ticket in question: http://bugzilla.openvz.org/show_bug.cgi?id=2454
He tried repeating the test here in our datacenter on some spare older nodes with 1G networking and was able to reproduce the issue, which disappeared with kernel downgrade. He also tested speeds between 1G and 10G and was getting really bad results. So next up we planned to test it inside 10G fabric. I ran a test between two 10G nodes and no matter how I tried I was hard pressed to see more than half a Gbit/s speeds. I then decided to test direct HN <-> VZ container tests as those have been shown to be able to run without overhead so for 10G we should be able to get 9+ Gbit/s easily.
Well that's a nice thought, but this is the actual result:
[root@wn-v-5492 ~]# iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 512 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 10.10.23.164 port 34441 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 235 MBytes 394 Mbits/sec
[ 3] 5.0-10.0 sec 910 MBytes 1.53 Gbits/sec
[ 3] 10.0-15.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 15.0-20.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 20.0-25.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 25.0-30.0 sec 1.05 GBytes 1.80 Gbits/sec
[ 3] 0.0-30.0 sec 5.29 GBytes 1.52 Gbits/sec
[root@wn-v-5492 ~]#
Now the interesting thing to notice here is that firstly it takes a moment for the speed to pick up, but we can live with that. Then however it's pretty hard capped at 1.8Gb/s. This node was doing nothing else at the time, therefore it wasn't resource constraining.
Another interesting thing of note here is that if you go back to Ilja's bugzilla post that was done before I even started testing, then there's a nice quote:
"Another symptom - reversing the iperf testing direction (using VM as iperf client, and remote physical server as iperf server) results in even worse results, which are quite consistent: ~180Mbps"
As he was testing on 1Gbit/s networks he saw a throughput of 18%. We just now tested the exact same thing on a totally different hardware and got 18% throughput. That's just too hard to believe to be a coincidence. So we started to look into kernels. The one we were running at that time was: 2.6.32-042stab059.7, which is a rebase of 2.6.32-279.1.1.el6. We downgraded to 2.6.32-042stab053.5, which is a rebase of 2.6.32-220.7.1.el6. Rerunning the test:
iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 512 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 10.10.23.164 port 33264 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 1.24 GBytes 2.13 Gbits/sec
[ 3] 5.0-10.0 sec 1.39 GBytes 2.39 Gbits/sec
[ 3] 10.0-15.0 sec 1.42 GBytes 2.43 Gbits/sec
[ 3] 15.0-20.0 sec 1.41 GBytes 2.42 Gbits/sec
[ 3] 20.0-25.0 sec 1.42 GBytes 2.44 Gbits/sec
[ 3] 25.0-30.0 sec 1.42 GBytes 2.44 Gbits/sec
[ 3] 0.0-30.0 sec 8.30 GBytes 2.38 Gbits/sec
So, improvement. But not quite line speed yet. Next up we restarted two of the nodes in vanilla CentOS 6.3 liveCD's to test vanilla OS kernels and see if that can get the speed up. Our local RTT times are about 0.05ms meaning that a simple calculation shows that we'll need a basic 64KB TCP window size to get to 10G therefore in theory no tuning is needed for the vanilla kernel:
Window = BW * RTT = 10Gbit/s * 0.05 ms = 1.25 GB/s * 0.05 ms = 1280MB/s * 0.05 ms = 1.28MB /ms * 0.05ms = 0.064 MB = 65.5 KB.
The vanilla kernel usually has tcp_rmem and tcp_wmem of 4K 16K 4M meaning that 16K default window would indeed give about 2.4Gb, but setting the window size larger should give you full 10G. However at least our first tests with vanilla kernel came up with nothing promising. We couldn't get past about 2.5Gbit/s and with multiple streams we were at best hitting 4.5Gbit/s. I'll continue the post when I continue the debugging tomorrow...
However in practice we see most nodes traffic in the 0.5-1.5Gbit/s range. Which is really really odd. We'd been suspecting the switches etc for a long while and have about 4 different mellanox tickets open including both switch and 10G card firmwares in production now, that were largely created by Mellanox because of our issues.
But today as Ilja was debugging another issue at one of his deployments on 1G he noticed a weird network performance drop. Even with basic tests he couldn't get 1G line speed, not even close. The bugzilla ticket in question: http://bugzilla.openvz.org/show_bug.cgi?id=2454
He tried repeating the test here in our datacenter on some spare older nodes with 1G networking and was able to reproduce the issue, which disappeared with kernel downgrade. He also tested speeds between 1G and 10G and was getting really bad results. So next up we planned to test it inside 10G fabric. I ran a test between two 10G nodes and no matter how I tried I was hard pressed to see more than half a Gbit/s speeds. I then decided to test direct HN <-> VZ container tests as those have been shown to be able to run without overhead so for 10G we should be able to get 9+ Gbit/s easily.
Well that's a nice thought, but this is the actual result:
[root@wn-v-5492 ~]# iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 512 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 10.10.23.164 port 34441 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 235 MBytes 394 Mbits/sec
[ 3] 5.0-10.0 sec 910 MBytes 1.53 Gbits/sec
[ 3] 10.0-15.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 15.0-20.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 20.0-25.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 25.0-30.0 sec 1.05 GBytes 1.80 Gbits/sec
[ 3] 0.0-30.0 sec 5.29 GBytes 1.52 Gbits/sec
[root@wn-v-5492 ~]#
Now the interesting thing to notice here is that firstly it takes a moment for the speed to pick up, but we can live with that. Then however it's pretty hard capped at 1.8Gb/s. This node was doing nothing else at the time, therefore it wasn't resource constraining.
Another interesting thing of note here is that if you go back to Ilja's bugzilla post that was done before I even started testing, then there's a nice quote:
"Another symptom - reversing the iperf testing direction (using VM as iperf client, and remote physical server as iperf server) results in even worse results, which are quite consistent: ~180Mbps"
As he was testing on 1Gbit/s networks he saw a throughput of 18%. We just now tested the exact same thing on a totally different hardware and got 18% throughput. That's just too hard to believe to be a coincidence. So we started to look into kernels. The one we were running at that time was: 2.6.32-042stab059.7, which is a rebase of 2.6.32-279.1.1.el6. We downgraded to 2.6.32-042stab053.5, which is a rebase of 2.6.32-220.7.1.el6. Rerunning the test:
iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 512 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 10.10.23.164 port 33264 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 1.24 GBytes 2.13 Gbits/sec
[ 3] 5.0-10.0 sec 1.39 GBytes 2.39 Gbits/sec
[ 3] 10.0-15.0 sec 1.42 GBytes 2.43 Gbits/sec
[ 3] 15.0-20.0 sec 1.41 GBytes 2.42 Gbits/sec
[ 3] 20.0-25.0 sec 1.42 GBytes 2.44 Gbits/sec
[ 3] 25.0-30.0 sec 1.42 GBytes 2.44 Gbits/sec
[ 3] 0.0-30.0 sec 8.30 GBytes 2.38 Gbits/sec
So, improvement. But not quite line speed yet. Next up we restarted two of the nodes in vanilla CentOS 6.3 liveCD's to test vanilla OS kernels and see if that can get the speed up. Our local RTT times are about 0.05ms meaning that a simple calculation shows that we'll need a basic 64KB TCP window size to get to 10G therefore in theory no tuning is needed for the vanilla kernel:
Window = BW * RTT = 10Gbit/s * 0.05 ms = 1.25 GB/s * 0.05 ms = 1280MB/s * 0.05 ms = 1.28MB /ms * 0.05ms = 0.064 MB = 65.5 KB.
The vanilla kernel usually has tcp_rmem and tcp_wmem of 4K 16K 4M meaning that 16K default window would indeed give about 2.4Gb, but setting the window size larger should give you full 10G. However at least our first tests with vanilla kernel came up with nothing promising. We couldn't get past about 2.5Gbit/s and with multiple streams we were at best hitting 4.5Gbit/s. I'll continue the post when I continue the debugging tomorrow...
Slurm and health check
For torque we used to run every 30 minutes a cronjob that checked if the node is working properly and if not it disabled them. With slurm I finally took the time to look for a way to have slurm automatically do it. Discovered it was extremely easy. You just need to add two config lines:
HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300
Now slurm runs every 5 minutes the health check program and if it gets stuck it's killed within 60s. The script has to perform a check and if a check fails it's got to take care of fixing it or disabling the node. It's done fairly simply. For example we check the presence of /hdfs directory for access to storage and if not ask slurm to drain the node:
# Test HDFS
NHDFS=`ls /hdfs|wc -l`
if [ $NHDFS -eq 0 ]; then
scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi
You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:
[root@slurm-1 ~]# sinfo -lN
Tue Dec 4 16:39:01 2012
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,..,5384] 8 main* allocated 32 2:16:1 65536 0 1 (null) none
wn-v-[2072,..,7039] 19 main* drained 24+ 2:12+:1 49152+ 0 1 (null) HDFS mount lost
wn-v-[2324,..,6428] 6 main* idle 32 2:16:1 65536 0 1 (null) none
As you can see nodes that have lost the mount are automatically disabled and drained to be taken care of by you at some point.
HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300
Now slurm runs every 5 minutes the health check program and if it gets stuck it's killed within 60s. The script has to perform a check and if a check fails it's got to take care of fixing it or disabling the node. It's done fairly simply. For example we check the presence of /hdfs directory for access to storage and if not ask slurm to drain the node:
# Test HDFS
NHDFS=`ls /hdfs|wc -l`
if [ $NHDFS -eq 0 ]; then
scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi
You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:
[root@slurm-1 ~]# sinfo -lN
Tue Dec 4 16:39:01 2012
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,..,5384] 8 main* allocated 32 2:16:1 65536 0 1 (null) none
wn-v-[2072,..,7039] 19 main* drained 24+ 2:12+:1 49152+ 0 1 (null) HDFS mount lost
wn-v-[2324,..,6428] 6 main* idle 32 2:16:1 65536 0 1 (null) none
As you can see nodes that have lost the mount are automatically disabled and drained to be taken care of by you at some point.
Monday, December 3, 2012
XFS vs EXT4
I've been hanging around the #gluster chat the past week to make sure we've gotten the gluster installation running smoothly and the one topic that keeps popping up is the underlying brick filesystem. The default recommendation of gluster people is XFS because of an EXT4 bug that causes major mayhem that got introduced in recent kernels and backported to RHEL 6 main branch. Read more about that here:
However I'd not jump to XFS that fast myself. In August-September this year after we rebuilt the datacenter for new power grid, new interconnect and lots of new nodes we were also planning to expand and change our storage to a more distributed model. The system we had been using was 15 storage nodes that in total provided raid5 volumes in total of 750TB. Now instead we planned to move to using pure disks directly fed to Hadoop as data directories and have every block in hadoop replicated at least twice. Now that would cost us ca 30-40% of good capacity (we'd regain the parity and hot spare drives so that alleviated it somewhat), but we'd also add all the workernode disks (each node had 3x 3TB drives of which ca 1TB was dedicated to local scratch space + OS) so the overall raw capacity went up to 2PB giving us 1PB of usable disk space in a distirbuted RAID 10 like environment.
All was good and nice except the power grid wasn't initially well balanced. So when we cranked up the job counts we sometimes hit a region where the breaker would turn off a block of 16-25 servers. As now all servers were also part of the storage that was annoying especially because those were real breakers so someone had to go physically on-site, enable the breaker and try to balance the power grid (luckily we had power phase usage history so could see which phases were over/under balances).
The bad part about the power loss was that it seems XFS was not at all safe against power loss. I don't mind losing the files that were actively being written (that means new files that would never get marked as completed therefore automatically triggering a redownload). However the bad part was that Hadoop claimed every time we lost a block of nodes to have lost anywhere from 300k to 1M blocks. Each block is 128MB. Now this was an annoyance most of the time because every block is double replicated so it just meant a lot of intra nodes traffic until the under replicated blocks got re-replicated. But with that many lost blocks it also meant that at times BOTH copies of a block were killed. When that happened we had data corruption and lost files.
Upon closer inspection we found that blocks were marked lost because the datanodes coming up after the reboot couldn't validate them. The reason was that either the datafile itself had been lost and replaced by a 0 byte file or the metadata file had. If either was lost, the block was considered lost. In theory we may have been able to recover some of them because if we lost both blocks it may well have happened that we lost the datafile for one and the block for the other, but it was way too much manual labor to be worth it.
After about 10 power losses and hundreds of TB of lost data we made a decision to move away from XFS even though this porblem of 0 size files is not supposed to be part of newer kernels and we were running CentOS 6.3 with OpenVZ kernel (not the newest at the time, but still 2.6.32-x kernel that should have all the XFS fixes). So after about a month of migration with decommissioning a block of datanodes, reformatting to ext4 and moving on we now run everything on EXT4. As it is we've not been hit by the above mentioned bug in hadoop related operations (we think) and it'll be something to think when operating glusterfs. Right now it's operated from the storage nodes that still run Scientific Linux 5.7 with kernels that are not affected, but if we plan to move gluster to a similar format like Hadoop it'll mean introducing bricks from newer kernel ext4 partitions and that may be an issue.
Lazy umount led to no NFS locking
Today we had one of the gluster NFS servers go wacko. No better name for it. Using the console it was nice and responsive as well as could talk with outside world, but seems to have had NFS issues. So we sent it to reboot and remounted our /home directory from another gluster node. To do that I performed a lazy umount:
# umount -l /home
the idea of course is to get the fs unmounted, then mount the correct one and as soon as different tasks get their i/o finalized they end up using the new filesystem already. However seems that's not a good idea as even though majority of functionality recovered without any users having to do anything it somehow screwed up the locking. Any task that wanted to set a lock on a file got an access denied error.
Restarting the VM didn't help. On the headnode we saw in dmesg:
[1038502.797998] lockd: cannot monitor 192.168.1.244
[1038508.865280] lockd: cannot monitor 192.168.1.244
which quickly pointed to nfslock service issue. Restarting it fixed everything. Let me now go and bang my head against the wall for a while.
# umount -l /home
the idea of course is to get the fs unmounted, then mount the correct one and as soon as different tasks get their i/o finalized they end up using the new filesystem already. However seems that's not a good idea as even though majority of functionality recovered without any users having to do anything it somehow screwed up the locking. Any task that wanted to set a lock on a file got an access denied error.
Restarting the VM didn't help. On the headnode we saw in dmesg:
[1038502.797998] lockd: cannot monitor 192.168.1.244
[1038508.865280] lockd: cannot monitor 192.168.1.244
which quickly pointed to nfslock service issue. Restarting it fixed everything. Let me now go and bang my head against the wall for a while.
Subscribe to:
Posts (Atom)