Friday, November 30, 2012

torque with slow response times

Last night torque started to really act up again, with qstat and qsub taking anywhere from 0.1s to tens of minutes, if they finished at all. Of course that screwed up any decent job submission and caused our site to descend into mayhem. At the very same time slurm seemed to be operating quite nicely.

Tracking it down, this time it seems not to be related to maui prioritization at all (which has been a culprit at times), but instead to pbs_server <-> pbs_mom communication. Running strace with timing information I was able to determine that certain select() and close() syscalls took up 99% of the running time. Here's an example trace:

[root@torque-v-1 ~]# grep select torque.trace |ctime
[root@torque-v-1 ~]# grep close torque.trace |ctime
[root@torque-v-1 ~]# ctime torque.trace 
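For context, `ctime` is the small awk alias defined in the strace profiling post below, and the trace itself would have been captured along these lines (the exact strace invocation is my reconstruction, it isn't shown in the post):

```shell
# Attach to the running pbs_server: -r prints the time elapsed since the
# previous syscall, -T appends each call's own duration, -f follows forks.
strace -f -r -T -p "$(pidof pbs_server)" -o torque.trace

# "ctime" just sums the relative-timestamp column (field 2):
alias ctime="awk '{s+=\$2} END {print s}'"
```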

So 519.5s of the 524.1s total was spent on those select/close calls. And the time spent on those calls seems to be pretty binary: a call either takes no time at all or it takes 5s:

[root@torque-v-1 ~]# grep select torque.trace |cut -f1 -d'.'|sort |uniq -c
    690 0
     52 5

You'll notice that the majority of calls still completed quickly, though. Trying to map out which nodes cause this didn't lead anywhere. Only a few nodes were repeat offenders, with a total of 37 different nodes implicated, which points to a more generic network hiccup. So it's either on the server where torque is running or in the wider network.
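One way to attempt that mapping is to count, per peer address, how often each node shows up in the trace's connect() calls; a rough sketch (the `inet_addr` field is what strace prints for AF_INET sockets):

```shell
# Extract the peer IP from every connect() line in the trace and count
# how often each worker node appears.
grep 'connect(' torque.trace \
  | grep -o 'inet_addr("[0-9.]*")' \
  | grep -o '[0-9][0-9.]*' \
  | sort | uniq -c | sort -rn | head
```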

As a first thought, since we've had mellanox issues before, I took a state dump of accsw-1, the access switch connecting all the service nodes to the interconnect, and restarted it to clear any possible buffers. This time, however, a basic one-switch restart didn't help at all.

Next I assumed the OpenVZ limits might have overflowed, and to conserve time I just restarted the container. Initially, as it came up, torque responded nice and fast, but after a minute or so the response time started to degrade and I saw the same 5s calls in the trace. Ok, so another option is that we have a few bad WNs, or that the sheer number of WNs is somehow causing issues. So as a first iteration I stopped the pbs_mom processes on all WNs.

That improved the torque response time nicely, but also made it quite useless, as no nodes were communicating. Starting them up again one by one, I ran into the same timing issues with as few as 5 nodes. Since then we've been debugging the network overall and are seeing very odd hiccups across the board.
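The one-by-one restart can be scripted; a rough sketch (the node list file and the init-script name are my assumptions for a CentOS-style setup):

```shell
# Start pbs_mom on one worker node at a time and watch qstat latency
# after each addition; stop when the 5s stalls reappear.
while read wn; do
  ssh "$wn" 'service pbs_mom start'
  sleep 10                       # let the mom register with pbs_server
  echo -n "$wn: "
  { time qstat >/dev/null; } 2>&1 | grep real
done < worker_nodes.txt
```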

*sigh* off to network debugging, which is not trivial at all (if it didn't work at all you could debug it; when it mostly works, with no regular pattern to the failures, it's a pain).

EDIT: What finally helped was restarting the hardnode where torque was running. Why that fixed it, even though we saw network issues on other nodes too, is beyond me. The only other things we did were moving the gluster mount from fuse to NFS and migrating the NAT gateway from this hardnode to another (neither helped before the reboot). It seems to have been a state-related issue that the reboot cleared, which doesn't leave me at all happy with the solution...

Wednesday, November 28, 2012

glusterfs for /home

Well, after spending the whole day re-arranging PATH variables and hacking things to survive a relogin, I gave up: some commands required /home-related paths to be relatively high in the lookup list, and those caused a huge number of ENOENT syscall failures and a huge overhead.

More discussions on #gluster IRC established that since the gluster native mount doesn't support negative-lookup caching, this is going to stay slow until that changes. The recommendation was to move to an NFS mount instead. It doesn't give the same flexibility, as the traffic gets proxied through the NFS server, but for /home the difference should be marginal (we're not hitting the network limits anyway). So I moved from the native glusterfs mount to an NFS mount, which on CentOS 6.3 btw required some additional options:

# grep home /etc/fstab
/home nfs tcp,vers=3 0 0

The reason is that the glusterfs NFS server doesn't support the UDP protocol. Since moving to the NFS-based /home the lookup issues have disappeared and everything is nice and fast as it was before. Thanks to jdarcy on #gluster.

Profiling with strace

Well, trying to figure out the main culprit behind slow CRAB execution, as well as the nearly impossible execution times of scram (a toolkit like make, but more flexible and specific to CMS software), I wanted to validate whether the time consumed by stat() calls on non-existent files is significant, or whether there are other reasons. As a simple crab -status command took a good 100-140 seconds, there should be plenty to see.

So the first task was to find a simple profiling tool. I started with strace -c, but that only shows system time, which isn't too helpful, as the total system time was 1.6 seconds for the full 2-minute run of the code.

Next I tried to just see the relative execution times of the different syscalls with strace -r -T. With -r every line in the strace output is prepended with the time elapsed since the previous call (in seconds), and -T appends how long each call itself took. For example:

6378       0.000000 execve("/opt/software/CRAB2/python/crab", ["crab", "-status"], [/* 156 vars */]) = 0 <0.012506>
6378       0.013422 brk(0)              = 0xf79000 <0.000069>
6378       0.000518 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb904bfb000 <0.000143>
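The invocation producing this kind of output would be something like the following (my reconstruction; the post doesn't show the exact command):

```shell
# -r: relative timestamp since the previous call (second column)
# -T: per-call wall time appended as <seconds>
# -f: follow children, prefixing each line with the pid
strace -f -r -T -o crab.strace crab -status
```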

Summing over the whole execution with a simple addition seemed to double count the time, but at least it gives relative information:

$ awk '{print $2}' crab.strace |sort -n|awk '{s+=$1} END { print s}'

To simplify counting I created an alias that summed together the time:

alias ctime="awk '{s+=\$2} END {print s}'"

So while the code probably ran for only 104 seconds, the sums are still comparable to each other. So now the check: did we lose most of the time in the failed opens and stats of files that do not exist?

$ grep ENOENT crab.strace |ctime

And the answer is a clear YES. So next up let's check the various system calls and their time:

$ for i in open stat access; do echo -n " $i -> "; grep " $i(" crab.strace |ctime; done
 open -> 139.159
 stat -> 39.4473
 access -> 0.331267

So now I just need to figure out how to dramatically reduce the list of paths that is accessed for libraries and I can reduce the execution time of the code.
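A first step toward that: group the ENOENT failures by directory prefix to see which search-path entries are the worst offenders (a sketch; the cut depth of 4 path components is an arbitrary choice):

```shell
# Pull the quoted path out of each failed call, keep the first few path
# components, and count failures per prefix.
grep ENOENT crab.strace \
  | sed 's/.*"\([^"]*\)".*/\1/' \
  | cut -d/ -f1-4 \
  | sort | uniq -c | sort -rn | head
```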

glusterfs and everyday working speeds

I have been experimenting with glusterfs, as we recently discovered that moving to SLURM means moving to a shared /home directory. Looking at possible free solutions I ended up testing glusterfs. Right now we use a 12-disk configuration (3TB drives) across 6 servers (2 each) with the replication factor set to 3 (this is /home, after all).

The system in general seemed to behave quite well, and a basic rsync from the old /home to the gluster one took ca 12h for 400GB, the majority of it very small files. That's no extreme speed, but it probably wouldn't have been much faster as an ordinary server-to-server rsync either, as it's the file sizes and the related iops that matter here.

I also did a few tests beforehand, simply dd-ing /dev/zero to files in 1, 2, 4, 8 streams, and found that I could average ca 210MB/s for such streamed writes. That comes out to an average of ~50MB/s on the gluster bricks (x3, as replicas are written to 3 drives in parallel), which isn't terrific, but isn't bad either.
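The streamed-write test looked roughly like this (file sizes and target paths are my reconstruction, not the exact commands used):

```shell
# Write with n parallel dd streams into the gluster mount and time the
# whole batch; aggregate MB/s is (n * size) / elapsed.
for n in 1 2 4 8; do
  for i in $(seq 1 "$n"); do
    dd if=/dev/zero of="/home/ddtest.$n.$i" bs=1M count=1024 2>/dev/null &
  done
  echo "streams=$n"
  time wait
  rm -f /home/ddtest."$n".*
done
```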

The trouble started when we actually began using gluster for everyday work and some things were just horribly slow. Hanging around #gluster on IRC I soon found that the main culprit is the stat() syscall, and that the basic ls command is extremely slow because of the --color argument that is aliased in on almost all Linuxes. To determine the color, every file in a directory gets a stat() call, and that can take 10s for 100 files, or minutes and minutes on directories containing thousands of files.
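This is easy to see for yourself: compare the syscall counts ls produces with and without color (the directory path is a placeholder):

```shell
# With --color=always every entry costs a stat/lstat; with --color=never
# ls can get away with just reading the directory entries.
strace -c -e trace=stat,lstat ls --color=always /home/someuser >/dev/null
strace -c -e trace=stat,lstat ls --color=never  /home/someuser >/dev/null

# Or simply bypass the distro's ls alias for big directories:
\ls /home/someuser | head
```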

I've still not figured out how to speed this up in general, as I'm hitting the stat() speed issue on various fronts, but together with JoeJulian I did figure out one cause of slowness. Namely, CRAB (the toolkit we use for data analysis) is a python toolkit, and the PYTHONPATH variable was set up such that an import mostly traversed a good 10-20 filename attempts before finding the module to load. On a local disk that hardly takes any time, so you don't notice the overhead. On glusterfs in its default config, however, it causes a problem. For more details check this blog post by Joe Julian: DHT misses are expensive.

So basically, a check for a non-existent file means the file's hash is not found, and glusterfs then traverses all the bricks looking for it. Finally, not finding it anywhere, it reports not found, and python proceeds to the next filename candidate until it finds the actual file and continues. This can take a lot of time. So we disabled the lookup of files that don't match a hash. We're not sure if it'll come back to bite us in the ass or not, but setting cluster.lookup-unhashed: off on our /home volume seems to have sped up crab operations by a good factor of 5-10.
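For reference, the setting is applied through gluster's volume-option interface (the volume name home is from our setup; adjust to taste):

```shell
# Disable the brick-wide search for filenames whose hash doesn't resolve:
gluster volume set home cluster.lookup-unhashed off

# Verify it took:
gluster volume info home | grep lookup-unhashed
```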

Now if only I can get the stat() overall to perform way faster, then I'd already be quite satisfied with glusterfs.

The beginning...

The idea for this blog came mostly from people recommending that I share my experiences in tackling various IT-related issues. I've been planning something of the kind for a while, so why the heck not. Expect nothing extraordinary, just tidbits of experience in adminning and using a CMS Tier 2 center for data analysis. Managing and using a 5000-core, 2PB system has some quirks at both the user and admin ends, so it might be of use to people who stumble on these ramblings.