Thursday, December 13, 2012
Today I ran into a situation where gluster was stuck. You could still check volume info and status, but nothing more involved. Whenever you tried to enable profiling or trigger a heal, the log showed:
[2012-12-13 15:40:07.881363] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3ce6b9c-6297-4e77-924c-b44e2c13e58f, lock held by: c3ce6b9c-6297-4e77-924c-b44e2c13e58f
As can be seen, the node can't acquire a lock that it itself is already holding. Googling around suggests the error comes from a command that failed to complete correctly and never released the lock. Various posts say the lock should disappear on its own after 30 minutes, but waiting half an hour is hardly a solution. A few times I was also able to get rid of the lock with
gluster volume statedump home0
but not reliably. When I did manage to clear the lock things seemed to work (profiling started and returned data, etc.), but as soon as I ran
gluster volume heal home0
it locked up again. In the end the only advice I got from #gluster was to unmount all clients, stop all gluster processes and restart them. That did clear the locks, and the heal command works as well, but it's not a procedure I ever want to repeat. I've written to the gluster-users list and will update this entry if I learn how this should actually have been cleared.
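For what it's worth, it can be handy to map the UUID from the log message to an actual node (here it was the node itself, but with several peers that's not obvious). On gluster 3.x the local UUID sits in /var/lib/glusterd/glusterd.info and the peers' UUIDs are in the files under /var/lib/glusterd/peers/; the exact paths and key names are an assumption based on a stock packaging layout, so treat this as a rough check:
# which node does the lock-holder UUID belong to?
grep UUID /var/lib/glusterd/glusterd.info
grep -H hostname /var/lib/glusterd/peers/*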
Wednesday, December 5, 2012
Network debugging for 10Gbit/s
So today we decided to take the network testing seriously. We connected two servers to the general infrastructure over 1G (so they can reach the internet for software) and ran a direct cable between them for the 10G link. This way we can test purely at the OS + driver + NIC firmware level whether we can reach 10Gbit/s, and from there expand the test to include a switch.
To make sure we didn't have any stale software we reinstalled both nodes with CentOS 6.3 and no kernel tunings. We also compared sysctl -a output on the two servers; there were minor differences, but none of them looked relevant.
We then ran a few baseline tests. First, iperf locally against the server itself, to see how much the machine can handle at all. With a 256KB window size both did at least 17Gbit/s:
[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1.77 GBytes 15.2 Gbits/sec
[ 3] 1.0- 2.0 sec 2.37 GBytes 20.4 Gbits/sec
[ 3] 2.0- 3.0 sec 2.42 GBytes 20.8 Gbits/sec
[ 3] 3.0- 4.0 sec 2.40 GBytes 20.6 Gbits/sec
[ 3] 4.0- 5.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 5.0- 6.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 6.0- 7.0 sec 2.39 GBytes 20.6 Gbits/sec
[ 3] 7.0- 8.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 8.0- 9.0 sec 2.39 GBytes 20.5 Gbits/sec
[ 3] 9.0-10.0 sec 2.38 GBytes 20.5 Gbits/sec
[ 3] 0.0-10.0 sec 23.3 GBytes 20.0 Gbits/sec
[root@wn-d-01 ~]#
Then we ran the test between the two servers, first over the 1Gbit/s link, to make sure there is no generic foul play at work cutting speeds to 25% or 33%; we saw a clean 942Mbit/s:
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 112 MBytes 943 Mbits/sec
[ 3] 1.0- 2.0 sec 112 MBytes 942 Mbits/sec
[ 3] 2.0- 3.0 sec 112 MBytes 942 Mbits/sec
[ 3] 3.0- 4.0 sec 112 MBytes 942 Mbits/sec
[ 3] 4.0- 5.0 sec 112 MBytes 942 Mbits/sec
[ 3] 5.0- 6.0 sec 112 MBytes 941 Mbits/sec
[ 3] 6.0- 7.0 sec 112 MBytes 942 Mbits/sec
[ 3] 7.0- 8.0 sec 112 MBytes 942 Mbits/sec
[ 3] 8.0- 9.0 sec 112 MBytes 942 Mbits/sec
[ 3] 9.0-10.0 sec 112 MBytes 942 Mbits/sec
[ 3] 0.0-10.0 sec 1.10 GBytes 942 Mbits/sec
[root@wn-d-01 ~]#
Next we brought up the default kernel driver for our Mellanox 10G card and first tested the link with ping:
[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms
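As a quick bandwidth-delay-product sanity check for that RTT, here is a minimal awk sketch (the 10Gbit/s target and 0.2ms RTT are simply the numbers from the ping above):
# TCP window needed to fill 10 Gbit/s at 0.2 ms RTT: bandwidth in bytes/s times RTT
awk 'BEGIN { bw = 10e9; rtt = 0.0002; printf "window needed: %.0f KB\n", bw / 8 * rtt / 1024 }'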
So an RTT of about 0.2ms means that with a 256KB window size you should get roughly 9.8Gbit/s. Let's see if that actually works (remember, it's a direct attached cable and nothing else is running on the servers):
[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 290 MBytes 2.43 Gbits/sec
[ 3] 1.0- 2.0 sec 296 MBytes 2.49 Gbits/sec
[ 3] 2.0- 3.0 sec 191 MBytes 1.61 Gbits/sec
[ 3] 3.0- 4.0 sec 320 MBytes 2.68 Gbits/sec
[ 3] 4.0- 5.0 sec 232 MBytes 1.95 Gbits/sec
[ 3] 5.0- 6.0 sec 161 MBytes 1.35 Gbits/sec
[ 3] 6.0- 7.0 sec 135 MBytes 1.13 Gbits/sec
[ 3] 7.0- 8.0 sec 249 MBytes 2.09 Gbits/sec
[ 3] 8.0- 9.0 sec 224 MBytes 1.88 Gbits/sec
[ 3] 9.0-10.0 sec 182 MBytes 1.53 Gbits/sec
[ 3] 0.0-10.0 sec 2.23 GBytes 1.91 Gbits/sec
[root@wn-d-01 ~]#
Not even close. So next we installed the official Mellanox kernel modules. With those we could also increase the window size to 1-2MB (the stock driver somehow capped it at 256KB), though that shouldn't matter. The first test looked promising for the first few seconds:
[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 1021 MBytes 8.57 Gbits/sec
[ 3] 1.0- 2.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 2.0- 3.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 3.0- 4.0 sec 1.10 GBytes 9.47 Gbits/sec
[ 3] 4.0- 5.0 sec 933 MBytes 7.82 Gbits/sec
[ 3] 5.0- 6.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 6.0- 7.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 7.0- 8.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 8.0- 9.0 sec 276 MBytes 2.32 Gbits/sec
[ 3] 9.0-10.0 sec 277 MBytes 2.33 Gbits/sec
No matter how hard we tried we couldn't reproduce the 9.47Gbit/s speeds. Digging into the Mellanox network performance tuning guide, I first raised the default kernel parameters to the values they recommend; that had absolutely no impact on throughput.
The tunings they recommend:
# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0
# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1
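Note that sysctl -w only changes the running kernel; to keep such settings across reboots they would normally go into /etc/sysctl.conf (or a boot-time tuning script). A couple of lines as a sketch of that format, using the same values as above:
# /etc/sysctl.conf fragment (excerpt, same values as the commands above)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# apply without rebooting:
sysctl -p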
What did have some impact was turning off adaptive interrupt moderation, though only briefly. One direction now gets 7Gbit/s, but the other direction could hold 7Gbit/s only for a few seconds before hiccuping and dropping back to 2.3Gbit/s:
iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 856 MBytes 7.18 Gbits/sec
[ 3] 1.0- 2.0 sec 855 MBytes 7.17 Gbits/sec
[ 3] 2.0- 3.0 sec 879 MBytes 7.37 Gbits/sec
[ 3] 3.0- 4.0 sec 902 MBytes 7.57 Gbits/sec
[ 3] 4.0- 5.0 sec 854 MBytes 7.16 Gbits/sec
[ 3] 5.0- 6.0 sec 203 MBytes 1.71 Gbits/sec
[ 3] 6.0- 7.0 sec 306 MBytes 2.56 Gbits/sec
[ 3] 7.0- 8.0 sec 852 MBytes 7.15 Gbits/sec
[ 3] 8.0- 9.0 sec 799 MBytes 6.70 Gbits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 10.0-11.0 sec 323 MBytes 2.71 Gbits/sec
[ 3] 11.0-12.0 sec 278 MBytes 2.33 Gbits/sec
[ 3] 12.0-13.0 sec 277 MBytes 2.32 Gbits/sec
[ 3] 13.0-14.0 sec 277 MBytes 2.32 Gbits/sec
...
Reading further I changed the adaptive interrupt moderation to this:
ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
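The new coalescing settings can be checked with the query form of the same tool (lowercase -c reads, uppercase -C sets). Since ethtool settings don't survive a reboot, one simple, if crude, way to persist them is to re-run the command from /etc/rc.local; both lines below are just a sketch:
# verify the current interrupt coalescing settings on eth2
ethtool -c eth2
# re-apply at boot (appended once to /etc/rc.local)
echo 'ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32' >> /etc/rc.local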
Running two parallel streams in a bidirectional test then gives a pretty good result.
wn-d-98 as client, wn-d-01 as server:
# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[ 3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
The other direction:
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[ 4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 0.0- 5.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 0.0- 5.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 5.0-10.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 5.0-10.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 4] 10.0-15.0 sec 2.76 GBytes 4.73 Gbits/sec
[SUM] 10.0-15.0 sec 5.52 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 0.0-20.0 sec 11.0 GBytes 4.74 Gbits/sec
[ 4] 15.0-20.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 15.0-20.0 sec 5.52 GBytes 9.48 Gbits/sec
[ 4] 0.0-20.0 sec 11.0 GBytes 4.73 Gbits/sec
[SUM] 0.0-20.0 sec 22.1 GBytes 9.47 Gbits/sec
So with two streams we can saturate the link. Testing again with a single stream we got d-98 -> d-01 at 9.5Gbit/s, but the reverse direction sat at 2.3Gbit/s. Running d-01 -> d-98 with -P 2 got back to 9.5Gbit/s. Bizarre. The next test is to see what happens after a reboot.
After the reboot we see pretty much the same state: a single stream is poor, two streams get 9.5Gbit/s at least initially, and speeds drop if you run them in parallel in both directions. The tuning script was deliberately not made reboot-safe, exactly so we could see whether it had any effect, and the same goes for the adaptive interrupt setting. Applying both again brings us back to the previous state, where single streams do 2-3Gbit/s when run simultaneously in both directions and two parallel streams get 9.5Gbit/s in both directions.
Update: the traffic speed is quite unstable. I also tried moving to the OpenVZ kernel and, with the same tunings and interrupt settings, got the following result (notice how the speed varies, and how it jumps around once traffic in the other direction starts, i.e. from the point where the incoming connections from 192.168.10.2 appear in the output):
# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[ 3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 0.0- 5.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 0.0- 5.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[ 3] 5.0-10.0 sec 1.76 GBytes 3.03 Gbits/sec
[SUM] 5.0-10.0 sec 3.53 GBytes 6.06 Gbits/sec
[ 4] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 10.0-15.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 10.0-15.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 4] 15.0-20.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 15.0-20.0 sec 693 MBytes 1.16 Gbits/sec
[SUM] 15.0-20.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 20.0-25.0 sec 694 MBytes 1.17 Gbits/sec
[ 3] 20.0-25.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 20.0-25.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[ 6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[ 4] 25.0-30.0 sec 1.85 GBytes 3.18 Gbits/sec
[ 3] 25.0-30.0 sec 2.42 GBytes 4.16 Gbits/sec
[SUM] 25.0-30.0 sec 4.28 GBytes 7.35 Gbits/sec
[ 3] 30.0-35.0 sec 2.36 GBytes 4.06 Gbits/sec
[ 4] 30.0-35.0 sec 1.70 GBytes 2.93 Gbits/sec
[SUM] 30.0-35.0 sec 4.06 GBytes 6.98 Gbits/sec
[ 3] 35.0-40.0 sec 2.83 GBytes 4.86 Gbits/sec
[ 4] 35.0-40.0 sec 2.10 GBytes 3.61 Gbits/sec
[SUM] 35.0-40.0 sec 4.93 GBytes 8.47 Gbits/sec
[ 4] 40.0-45.0 sec 820 MBytes 1.38 Gbits/sec
[ 3] 40.0-45.0 sec 1.37 GBytes 2.36 Gbits/sec
[SUM] 40.0-45.0 sec 2.17 GBytes 3.73 Gbits/sec
[ 4] 45.0-50.0 sec 2.30 GBytes 3.96 Gbits/sec
[ 3] 45.0-50.0 sec 3.02 GBytes 5.20 Gbits/sec
[SUM] 45.0-50.0 sec 5.33 GBytes 9.15 Gbits/sec
[ 4] 50.0-55.0 sec 1.37 GBytes 2.36 Gbits/sec
[ 3] 50.0-55.0 sec 2.00 GBytes 3.43 Gbits/sec
[SUM] 50.0-55.0 sec 3.37 GBytes 5.79 Gbits/sec
[ 4] 0.0-30.9 sec 12.4 GBytes 3.46 Gbits/sec
[ 6] 0.0-30.9 sec 12.6 GBytes 3.50 Gbits/sec
[SUM] 0.0-30.9 sec 25.0 GBytes 6.96 Gbits/sec
[ 4] 55.0-60.0 sec 2.63 GBytes 4.51 Gbits/sec
[ 3] 55.0-60.0 sec 2.89 GBytes 4.96 Gbits/sec
[SUM] 55.0-60.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[ 3] 60.0-65.0 sec 2.60 GBytes 4.47 Gbits/sec
[SUM] 60.0-65.0 sec 5.20 GBytes 8.94 Gbits/sec
[ 4] 65.0-70.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 65.0-70.0 sec 696 MBytes 1.17 Gbits/sec
[SUM] 65.0-70.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[ 3] 70.0-75.0 sec 858 MBytes 1.44 Gbits/sec
[SUM] 70.0-75.0 sec 1.67 GBytes 2.88 Gbits/sec
[ 4] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[ 3] 75.0-80.0 sec 2.76 GBytes 4.74 Gbits/sec
[SUM] 75.0-80.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 4] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[ 3] 80.0-85.0 sec 2.60 GBytes 4.46 Gbits/sec
[SUM] 80.0-85.0 sec 5.19 GBytes 8.92 Gbits/sec
[ 3] 85.0-90.0 sec 694 MBytes 1.16 Gbits/sec
[ 4] 85.0-90.0 sec 697 MBytes 1.17 Gbits/sec
[SUM] 85.0-90.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 90.0-95.0 sec 695 MBytes 1.17 Gbits/sec
[ 3] 90.0-95.0 sec 694 MBytes 1.17 Gbits/sec
[SUM] 90.0-95.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 4] 95.0-100.0 sec 696 MBytes 1.17 Gbits/sec
[ 4] 0.0-100.0 sec 32.6 GBytes 2.80 Gbits/sec
[ 3] 95.0-100.0 sec 694 MBytes 1.16 Gbits/sec
[SUM] 95.0-100.0 sec 1.36 GBytes 2.33 Gbits/sec
[ 3] 0.0-100.0 sec 36.7 GBytes 3.15 Gbits/sec
[SUM] 0.0-100.0 sec 69.3 GBytes 5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]#
Reducing rx-usecs and rx-frames to 0, as recommended in the performance guide, I can't really get past 3Gbit/s. So it does point towards some issue with interrupt handling.
As a final test, since the Mellanox driver package provides scripts to set the IRQ affinity of its interfaces, I tried pinning the IRQs and retesting. On both nodes:
# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1
done.
And the result:
[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 5.51 GBytes 9.46 Gbits/sec
[ 3] 5.0-10.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 10.0-15.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 15.0-20.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 20.0-25.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 25.0-30.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 30.0-35.0 sec 5.51 GBytes 9.47 Gbits/sec
[ 3] 35.0-40.0 sec 5.51 GBytes 9.47 Gbits/sec
^C[ 3] 0.0-41.6 sec 45.8 GBytes 9.47 Gbits/sec
[root@wn-d-01 ~]#
It works with single and multiple streams, in one direction and in both at once. Yay! Now we just have to solve this on ALL nodes :)
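For rolling this out, what the script above effectively does is write a CPU mask into /proc/irq/<n>/smp_affinity for every interrupt belonging to eth2. A rough manual equivalent, handy on nodes without the Mellanox scripts (assuming the interrupts show up with eth2 in their name in /proc/interrupts), would be:
# pin all eth2 IRQs to CPU0 (mask 0x1), roughly what set_irq_affinity_cpulist.sh 0 eth2 does
for irq in $(grep eth2 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    echo 1 > /proc/irq/$irq/smp_affinity
done
Keep in mind that a running irqbalance daemon may rewrite these masks, and like the ethtool setting this would need to be reapplied at boot (e.g. from /etc/rc.local).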
Tuesday, December 4, 2012
OpenVZ latest kernels screw networking?
We've been puzzled recently by the overall performance of our networking. We run a Mellanox fabric of four SX1016 switches, which are 64-port 10G switches. Three of them are trunked with eight ports each to the fourth, meaning that within a switch you can get up to 560Gbit/s, while between switches you're limited to roughly 80Gbit/s. The way our environment is distributed, you should get the best of both worlds.
In practice, however, we see most nodes' traffic in the 0.5-1.5Gbit/s range, which is really odd. We had been suspecting the switches for a long while and have about four different Mellanox tickets open; both the switch and 10G card firmwares now in production were largely created by Mellanox because of our issues.
But today, as Ilja was debugging another issue at one of his 1G deployments, he noticed a weird network performance drop. Even with basic tests he couldn't get 1G line speed, not even close. The bugzilla ticket in question: http://bugzilla.openvz.org/show_bug.cgi?id=2454
He repeated the test here in our datacenter on some spare older nodes with 1G networking and was able to reproduce the issue, which disappeared after a kernel downgrade. He also tested speeds between 1G and 10G nodes and got really bad results. So next we planned to test inside the 10G fabric. I ran a test between two 10G nodes and, no matter what I tried, I was hard pressed to see more than half a Gbit/s. I then decided to test directly between a hardware node and a VZ container on it, as that path has been shown to run without overhead, so on 10G we should easily see 9+ Gbit/s.
Well that's a nice thought, but this is the actual result:
[root@wn-v-5492 ~]# iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 512 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 10.10.23.164 port 34441 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 235 MBytes 394 Mbits/sec
[ 3] 5.0-10.0 sec 910 MBytes 1.53 Gbits/sec
[ 3] 10.0-15.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 15.0-20.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 20.0-25.0 sec 1.04 GBytes 1.79 Gbits/sec
[ 3] 25.0-30.0 sec 1.05 GBytes 1.80 Gbits/sec
[ 3] 0.0-30.0 sec 5.29 GBytes 1.52 Gbits/sec
[root@wn-v-5492 ~]#
The interesting things to notice here: first, it takes a moment for the speed to pick up, which we can live with; then, however, it is pretty hard-capped at 1.8Gbit/s. The node was doing nothing else at the time, so this was not a resource constraint.
Another interesting detail: if you go back to Ilja's bugzilla post, written before I even started testing, there is a telling quote:
"Another symptom - reversing the iperf testing direction (using VM as iperf client, and remote physical server as iperf server) results in even worse results, which are quite consistent: ~180Mbps"
He was testing on a 1Gbit/s network and saw about 18% of line speed; we just tested the exact same thing on completely different hardware and also got 18% (1.8Gbit/s out of 10Gbit/s versus ~180Mbit/s out of 1Gbit/s). That's too much to believe as a coincidence, so we started to look into kernels. The one we were running at the time was 2.6.32-042stab059.7, which is a rebase of 2.6.32-279.1.1.el6. We downgraded to 2.6.32-042stab053.5, a rebase of 2.6.32-220.7.1.el6, and reran the test:
iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size: 512 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 10.10.23.164 port 33264 connected with 192.168.2.98 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 1.24 GBytes 2.13 Gbits/sec
[ 3] 5.0-10.0 sec 1.39 GBytes 2.39 Gbits/sec
[ 3] 10.0-15.0 sec 1.42 GBytes 2.43 Gbits/sec
[ 3] 15.0-20.0 sec 1.41 GBytes 2.42 Gbits/sec
[ 3] 20.0-25.0 sec 1.42 GBytes 2.44 Gbits/sec
[ 3] 25.0-30.0 sec 1.42 GBytes 2.44 Gbits/sec
[ 3] 0.0-30.0 sec 8.30 GBytes 2.38 Gbits/sec
So, an improvement, but not line speed yet. Next we booted two of the nodes from vanilla CentOS 6.3 live CDs to test a stock kernel and see whether that gets the speed up. Our local RTT is about 0.05ms, and a simple calculation shows that a 64KB TCP window is enough to reach 10G, so in theory no tuning should be needed for the vanilla kernel:
Window = BW × RTT = 10 Gbit/s × 0.05 ms = 1.25 GB/s × 0.05 ms = 1280 MB/s × 0.05 ms = 1.28 MB/ms × 0.05 ms = 0.064 MB ≈ 65.5 KB
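To double-check what buffer limits a given kernel actually uses (rather than going from remembered defaults), you can simply query them; each value is min, default and max in bytes:
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem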
The vanilla kernel's tcp_wmem defaults are roughly 4K 16K 4M (tcp_rmem's default is somewhat larger), meaning the 16K default window would indeed give about 2.4Gbit/s, but explicitly setting a larger window should give the full 10G. However, at least our first tests with the vanilla kernel came up with nothing promising: we couldn't get past about 2.5Gbit/s, and with multiple streams we were at best hitting 4.5Gbit/s. I'll continue the post when I continue the debugging tomorrow...
Slurm and health check
For torque we used to run a cronjob every 30 minutes that checked whether the node was working properly and disabled it if not. With slurm I finally took the time to look for a way to have slurm do this automatically, and it turned out to be extremely easy. You just need to add two config lines to slurm.conf:
HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300
Now slurm runs the health check program every 5 minutes, and if it gets stuck it is killed within 60s. The script has to perform the checks, and if one fails it has to either fix the problem or disable the node. That's fairly simple to do. For example, we check that the /hdfs directory (our storage mount) is populated and, if it isn't, ask slurm to drain the node:
# Test HDFS: if the /hdfs mount looks empty, ask slurm to drain this node
NODE=$(hostname -s)   # assumption: the slurm node name is the short hostname
NHDFS=$(ls /hdfs | wc -l)
if [ "$NHDFS" -eq 0 ]; then
    scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi
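Another check in the same spirit (purely illustrative, not from our actual script) could drain the node when the local scratch space fills up:
# drain if /tmp is more than 95% full (hypothetical path and threshold)
TMPUSE=$(df -P /tmp | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$TMPUSE" -gt 95 ]; then
    scontrol update NodeName=$NODE State=drain Reason="/tmp almost full"
fi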
You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:
[root@slurm-1 ~]# sinfo -lN
Tue Dec 4 16:39:01 2012
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,..,5384] 8 main* allocated 32 2:16:1 65536 0 1 (null) none
wn-v-[2072,..,7039] 19 main* drained 24+ 2:12+:1 49152+ 0 1 (null) HDFS mount lost
wn-v-[2324,..,6428] 6 main* idle 32 2:16:1 65536 0 1 (null) none
As you can see, nodes that have lost the mount are automatically drained, to be taken care of by you at some point.
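Once the underlying problem is fixed, the node goes back into service with the same scontrol mechanism, for example (the node name here is just an example):
scontrol update NodeName=wn-v-2072 State=resume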
Monday, December 3, 2012
XFS vs EXT4
I've been hanging around the #gluster chat for the past week to make sure our gluster installation runs smoothly, and the one topic that keeps popping up is the underlying brick filesystem. The gluster people's default recommendation is XFS, because of an EXT4 bug causing major mayhem that was introduced in recent kernels and backported to the RHEL 6 main branch. Read more about that here:
However, I wouldn't jump to XFS that fast myself. In August-September this year, after we rebuilt the datacenter for a new power grid, new interconnect and lots of new nodes, we were also planning to expand and change our storage to a more distributed model. The old system was 15 storage nodes whose raid5 volumes provided 750TB in total. Instead, we planned to feed plain disks directly to Hadoop as data directories and have every block in hadoop replicated at least twice. That would cost us ca 30-40% of usable capacity (we'd regain the parity and hot spare drives, which alleviated it somewhat), but we would also add all the worker node disks (each node had 3x 3TB drives, of which ca 1TB was dedicated to local scratch space + OS), so the overall raw capacity went up to 2PB, giving us 1PB of usable space in a distributed, RAID 10-like environment.
All was good and nice, except the power grid wasn't initially well balanced. So when we cranked up the job counts we sometimes hit a region where a breaker would cut power to a block of 16-25 servers. As all servers were now also part of the storage, that was annoying, especially because these were physical breakers: someone had to go on-site, re-enable the breaker and try to rebalance the power grid (luckily we had power phase usage history, so we could see which phases were over- or under-loaded).
The bad part was that XFS turned out not to be at all safe against power loss. I don't mind losing the files that were actively being written (new files that never get marked as completed automatically trigger a redownload). The bad part was that every time we lost a block of nodes, Hadoop claimed to have lost anywhere from 300k to 1M blocks, each block being 128MB. Most of the time this was just an annoyance, because every block is replicated twice, so it only meant a lot of inter-node traffic until the under-replicated blocks were re-replicated. But with that many lost blocks it also happened at times that BOTH copies of a block were killed, and then we had data corruption and lost files.
Upon closer inspection we found that blocks were marked lost because the datanodes coming up after the reboot couldn't validate them. The reason was that either the datafile itself or its metadata file had been replaced by a 0 byte file; if either was gone, the block was considered lost. In theory we might have been able to recover some of them, because when both replicas were lost it may well be that we lost the datafile of one and the metadata file of the other, but it was way too much manual labor to be worth it.
After about 10 power losses and hundreds of TB of lost data we decided to move away from XFS, even though this problem of 0-size files is not supposed to exist in newer kernels and we were running CentOS 6.3 with an OpenVZ kernel (not the newest at the time, but still a 2.6.32-x kernel that should have all the XFS fixes). So, after about a month of migration, decommissioning one block of datanodes at a time, reformatting to ext4 and moving on, we now run everything on EXT4. So far we've not been hit by the above-mentioned EXT4 bug in hadoop-related operations (we think), but it will be something to keep in mind when operating glusterfs. Right now gluster is served from the storage nodes, which still run Scientific Linux 5.7 with kernels that are not affected, but if we move gluster to a layout similar to Hadoop it will mean introducing bricks on ext4 partitions from newer kernels, and that may be an issue.
Lazy umount led to no NFS locking
Today one of the gluster NFS servers went wacko; no better name for it. On the console it was nice and responsive and could talk to the outside world, but it seemed to have NFS issues. So we sent it off to reboot and remounted our /home directory from another gluster node. To do that I performed a lazy umount:
# umount -l /home
The idea, of course, is to get the filesystem unmounted and then mount the correct one, so that as tasks finish their outstanding I/O they end up on the new filesystem. That turned out not to be such a good idea: even though most functionality recovered without users having to do anything, it somehow broke file locking. Any task that tried to take a lock on a file got an access denied error.
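A quick way to check locking from a client (a sketch, not from the original debugging; the test path is arbitrary) is to take a POSIX lock on a file under the NFS-mounted /home, which on NFSv3 goes through lockd/statd:
# should print 'lock ok'; errors out while the locking problem is present
python -c "import fcntl; f = open('/home/locktest', 'w'); fcntl.lockf(f, fcntl.LOCK_EX); print('lock ok')"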
Restarting the VM didn't help. On the headnode we saw in dmesg:
[1038502.797998] lockd: cannot monitor 192.168.1.244
[1038508.865280] lockd: cannot monitor 192.168.1.244
which quickly pointed to an nfslock service issue. Restarting that service fixed everything. Let me now go and bang my head against the wall for a while.
Friday, November 30, 2012
torque with slow response times
Last night torque started to really act up again, with qstat and qsub taking anywhere from 0.1s to tens of minutes, if they finished at all. Of course that screwed up job submissions and caused our site to descend into mayhem. At the very same time slurm seemed to be operating quite nicely.
Tracking it down, it seems this time it is not related to maui prioritization at all (which has been a culprit before), but instead to pbs_server <-> pbs_mom communication. Running strace with timing information, I was able to determine that certain select() and close() syscalls took up 99% of the running time. Here's an example trace:
[root@torque-v-1 ~]# grep select torque.trace |ctime
260.509
[root@torque-v-1 ~]# grep close torque.trace |ctime
258.974
[root@torque-v-1 ~]# ctime torque.trace
524.117
So 519.5s / 524.1s was spent on those select/close calls. And the time spent on those calls seems to be pretty binary. It either takes no time or it takes 5s:
[root@torque-v-1 ~]# grep select torque.trace |cut -f1 -d'.'|sort |uniq -c
690 0
52 5
You'll notice that the majority of calls still succeeded quickly, though. Trying to map out which nodes cause the slow calls didn't lead anywhere: only a few nodes were repeat offenders, and in total 37 different nodes were involved, which points to a more generic network hiccup. So it's either on the server where torque is running or in the network at large.
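A side note on the commands above: ctime is a local helper, not a standard tool. Assuming the trace was captured with relative timestamps (something along the lines of strace -r -f -p <pbs_server pid> -o torque.trace), a minimal equivalent would just sum the first column of the selected lines:
# hypothetical reconstruction of the ctime helper: sum the relative-timestamp column
ctime () { awk '{ sum += $1 } END { printf "%.3f\n", sum }' "$@"; }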
As a first thought, since we've had Mellanox issues before, I took a dump of accsw-1 (the access switch connecting all the service nodes to the interconnect) and restarted it to clear any possible buffers. This time, however, a simple restart of one switch didn't help at all.
Next I assumed it might be OpenVZ limits overflowing, and to save time I simply restarted the container. Right after it came up torque was responding nice and fast, but after a minute or so the response times started to drop and I saw the same 5s calls in the trace again. OK, so another option is that we have a few bad worker nodes, or that the sheer number of them is somehow causing issues. As a first iteration I stopped the pbs_mom processes on all worker nodes.
That improved torque's response time nicely, but also made it quite useless, as no nodes were communicating. Starting them up again one by one, I ran into the same timing issues with as few as 5 nodes. Since then we've been debugging the network as a whole and are seeing very odd hiccups across the board.
*sigh* off to network debugging, which is not trivial at all (if it didn't work at all you could debug it; when it works, but only sort of and with no regular pattern, it's a pain).
EDIT: What finally helped was restarting the hardnode where torque was running. Why that fixed it, even though we saw network issues on other nodes as well, is beyond me. The only other things we did were moving the gluster mount from fuse to NFS and migrating the NAT gateway from this hardnode to another (neither helped before the reboot). It seems to have been some state-related issue that got cleared, which doesn't leave me at all happy with the solution...
Wednesday, November 28, 2012
glusterfs for /home
Well, after spending the whole day re-arranging PATH variables and hacking to make the changes survive relogins, I gave up: some commands required /home-related paths to be relatively high in the lookup list, and those caused a huge amount of ENOENT syscall failures and a huge overhead.
More discussion on #gluster IRC made it clear that since the gluster native mount doesn't support negative lookup caching, this is going to stay slow until that changes. The recommendation was therefore to move to an NFS mount. It doesn't give the same flexibility, as there's some proxying going on through the NFS server, but for /home the difference should be marginal (we're not hitting the network limits anyway). So I moved from the native glusterfs mount to an NFS mount, which on CentOS 6.3 btw required some additional options:
# grep home /etc/fstab
192.168.1.241:/home0 /home nfs tcp,vers=3 0 0
The reason is that the glusterfs NFS server doesn't support the UDP protocol. Since moving to the NFS-based /home the lookup issues have disappeared and everything is nice and fast again, as it was before. Thanks to jdarcy on #gluster.
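For testing before committing it to fstab, the equivalent one-shot mount with the same options and server as above is simply:
mount -t nfs -o tcp,vers=3 192.168.1.241:/home0 /home
and a quick mount | grep home afterwards confirms the filesystem type is nfs rather than fuse.glusterfs.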
Profiling with strace
Trying to figure out the main culprit behind slow CRAB execution, as well as the near-impossible execution times of scram (a toolkit like make, but more flexible and specific to CMS software), I wanted to validate whether the time consumed by stat() calls on non-existent files is significant or whether there are other reasons. As a simple crab -status command took a good 100-140 seconds, there should be plenty to see.
So the first task was to find a simple profiling tool. I started with strace -c, but that only shows system time, which isn't too helpful as it accounted for just 1.6 seconds of the full 2-minute run of the code.
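For completeness, that first attempt looked roughly like the following (the output file name is just an example); -c makes strace print a summary table of syscall counts and times instead of a full trace, and -f follows child processes as well:
strace -c -f -o crab.summary crab -status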
Next up I tried to just look at the relative execution times of the different syscalls with strace -r -t. This prepends every strace line with the time elapsed since the previous call (in seconds), which for a mostly sequential process is close to how long each step took. For example:
6378 0.000000 execve("/opt/software/CRAB2/python/crab", ["crab", "-status"], [/* 156 vars */]) = 0 <0.012506>
6378 0.013422 brk(0) = 0xf79000 <0.000069>
6378 0.000518 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb904bfb000 <0.000143>
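Judging by the pid column and the <...> durations at the end of each line, the actual invocation probably also had -f and -T in it; a reconstruction rather than the literal command would be:
strace -f -r -T -o crab.strace crab -status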
Summing that column over the whole execution seemed to roughly double count the time, but at least it gives a usable relative measure:
$ awk '{print $2}' crab.strace |sort -n|awk '{s+=$1} END { print s}'
208.992
To simplify counting I created an alias that sums up the time:
alias ctime="awk '{s+=\$2} END {print s}'"
So while the code probably ran for only about 104 seconds, the sums are still comparable with each other. Now for the check: did we lose most of the time in failed opens and stats of files that do not exist?
$ grep ENOENT crab.strace |ctime
173.568
And the answer is a clear YES. So next up let's check the various system calls and their time:
$ for i in open stat access; do echo -n " $i -> "; grep " $i(" crab.strace |ctime; done
open -> 139.159
stat -> 39.4473
access -> 0.331267
So now I just need to figure out how to dramatically reduce the list of paths that gets searched for libraries, and I can cut the execution time of the code.
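One way to attack that is to first see which missing paths actually dominate. A pipeline like the following, which just pulls the quoted path out of every failed call and counts repeats, gives a top list of the files being probed in vain:
grep ENOENT crab.strace | grep -o '"[^"]*"' | sort | uniq -c | sort -rn | head -20
From there it's mostly a question of trimming PYTHONPATH and friends so that the real files are found earlier in the search order.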
glusterfs and everyday working speeds
I have been experimenting with glusterfs, as we recently discovered that moving to SLURM means moving to a shared /home directory. Looking at possible free solutions I ended up testing glusterfs. Right now we use a 12-disk configuration (3TB drives) across 6 servers (2 each) with the replication factor set to 3 (this is /home after all).
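For the curious, a distributed-replicated volume of that shape gets created roughly like this; the server and brick names below are made-up placeholders rather than our actual layout, and the brick ordering matters since gluster groups every 3 consecutive bricks into one replica set:
gluster volume create home0 replica 3 srv{1..6}:/bricks/a srv{1..6}:/bricks/b
gluster volume start home0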
The system in general seemed to behave quite well, and a basic rsync from the old /home to the gluster one took ca 12h for 400GB, the majority of which is very small files. That's no extreme speed, but it probably wouldn't have gone much faster with an ordinary server-to-server rsync either, as it's the file sizes and the related iops that matter here.
Before starting I also did a few tests with simple dd'ing of /dev/zero to files in 1, 2, 4 and 8 streams, and found that I could average ca 210MB/s for such streamed writes. That works out to an average of ~50MB/s per gluster brick (x3, as replicas are written to 3 drives in parallel), which isn't terrific, but isn't bad either.
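The streaming tests themselves were nothing fancy; a sketch of the 4-stream variant looks like this (the mount point, file names and sizes are placeholders, and conv=fsync is there so the page cache doesn't flatter the result):
for i in 1 2 3 4; do dd if=/dev/zero of=/mnt/glustertest/zero.$i bs=1M count=4096 conv=fsync & done; wait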
The trouble started when we began to actually use gluster for everyday work and some things were just horribly slow. Hanging around #gluster on IRC I soon found that the main culprit is the stat() syscall, and that the basic ls command is extremely slow because of the --color argument that is aliased in on almost all Linux distributions. To determine the color, every file in a directory gets a stat() call to figure out what it is, and that can take 10s for 100 files, or minutes and minutes on directories containing thousands of files.
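It's easy to confirm that it really is the colorization. In a directory with a few thousand files, compare a plain listing against a forced-color one; redirecting to /dev/null keeps the terminal itself out of the measurement:
time ls --color=never > /dev/null
time ls --color=always > /dev/null
The first only needs the directory entries, the second stat()s every single file to pick its color.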
I've still not figured out how to speed this up in general, as I'm hitting the stat() speed issue on various fronts, but together with JoeJulian I did figure out one cause of slowness. Namely, CRAB (the toolkit we use for data analysis) is a python toolkit, and the PYTHONPATH variable was set up such that imports mostly traversed a good 10-20 filename attempts before finding the module that got loaded. On a local disk that hardly takes any time, so you don't see the overhead. On glusterfs with its default config, however, this causes a problem. For more details check this blog post by Joe Julian: DHT misses are expensive.
So basically a lookup of a non-existent file means its hash points nowhere, and glusterfs then traverses the bricks looking for it. Finally, not finding it anywhere, it reports not found, and python proceeds to the next filename possibility until it finds the actual file and continues. This, however, can take a lot of time. So we disabled the lookup of files that don't match a hash. We're not sure whether it'll come back and bite us in the ass or not, but setting cluster.lookup-unhashed: off on our /home volume seems to have sped up crab operations by a good factor of 5-10.
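For reference, that is a single volume-level toggle, using the same volume name as our /home volume elsewhere in these posts:
gluster volume set home0 cluster.lookup-unhashed off
and it can be switched back to on later if the missing-hash lookups turn out to be needed after all.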
Now if only I could get stat() overall to perform way faster, I'd already be quite satisfied with glusterfs.
The beginning...
The idea of this blog came mostly from people recommending that I share my experiences in tackling various IT related issues. I've been planning something of the kind for a while, so why the heck not. Expect nothing extraordinary, just tidbits of experience in adminning and using a CMS Tier 2 center for data analysis. Managing and using a 5000-core, 2PB system has some quirks on both the user and admin ends, so it might be of use to people who stumble upon these ramblings.