Last night torque started to really act up again, with qstat and qsub taking anywhere from 0.1s to tens of minutes, if they finished at all. Of course that screwed up any decent job submission and caused our site to descend into mayhem. At the very same time slurm seems to have been operating quite nicely.
Tracking it down, it seems not to be related to maui prioritization at all this time (which has been a culprit at times), but instead to pbs_server <-> pbs_mom communication. Running an strace with timing information, I was able to determine that certain select() and close() syscalls took up 99% of the running time. Here's an example trace:
[root@torque-v-1 ~]# grep select torque.trace |ctime
260.509
[root@torque-v-1 ~]# grep close torque.trace |ctime
258.974
[root@torque-v-1 ~]# ctime torque.trace
524.117
So 519.5s out of 524.1s was spent on those select/close calls. And the time spent on each call seems to be pretty binary. It either takes no time or it takes 5s:
[root@torque-v-1 ~]# grep select torque.trace |cut -f1 -d'.'|sort |uniq -c
690 0
52 5
You'll notice that the majority of calls did finish instantly, though. Trying to map the nodes that cause this didn't lead anywhere: a total of 37 different nodes were involved and only a few of them were repeat offenders, which points to a more generic network hiccup. So it's either on the server where torque is running or on the network as a whole.
As a first thought, since we've had Mellanox issues before, I took a dump of accsw-1, the switch that connects all the service nodes to the interconnect, and restarted it to clear all possible buffers. However, this time a basic restart of that one switch didn't help at all.
Next I assumed that maybe the OpenVZ limits had overflowed, and to save time I just restarted the container. Initially, as it came up, torque was responding nice and fast, but after a minute or so the response time started to degrade and I started to see the same 5s long calls in the trace. OK, so another option is that we have a few bad WNs, or that the sheer number of WNs is somehow causing issues. So as a first iteration I stopped all pbs_mom processes on all WNs.
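The ctime helper used in the transcripts above isn't shown in this post. As a rough sketch, assuming every line of torque.trace starts with a time in seconds (the cut -f1 -d'.' pipeline relies on the same thing), something like the following would give the same kind of totals and the same 0s/5s split. The names ctime_sketch and bucket_sketch are made up for illustration and are not the original tool.

```shell
# Hypothetical stand-in for the ctime helper (the original is not shown).
# Assumption: each trace line begins with a duration in seconds.
ctime_sketch() {
    # Sum the first whitespace-separated field of every line.
    # Reads the files given as arguments, or stdin if there are none.
    awk '{ total += $1 } END { printf "%.3f\n", total }' "$@"
}

# Count lines per whole-second bucket, printed as "<count> <seconds>",
# similar in spirit to: cut -f1 -d'.' | sort | uniq -c
bucket_sketch() {
    awk '{ n[int($1)]++ } END { for (s in n) print n[s], s }' "$@" | sort -k2,2n
}
```

It would be used the same way as in the transcripts, for example `grep select torque.trace | ctime_sketch` or `grep select torque.trace | bucket_sketch`.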
That seemed to improve the torque response time nicely, but made it quite useless as no nodes were communicating. I started them up again one by one, and already with 5 nodes I ran into the same timing issues. Since then we've been debugging the network overall and are seeing very odd hiccups across the board.
*sigh* off to network debugging that's not trivial at all (if it didn't work at all you could debug it; if it works, but only kind of and with no regular pattern, it's a pain).
EDIT: What finally helped was restarting the hardnode where torque was running. Why that fixed it, even though we also saw network issues on other nodes, is beyond me. The only other things we did were moving the gluster mount from fuse to nfs and migrating the NAT gw from this hardnode to another (neither helped before the reboot). It seems to be a state-related issue that got cleared, which however doesn't leave me at all happy with the solution...
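For the record, the "start them one by one" step can be scripted roughly like this. It's only a sketch: start_node and probe are placeholders, and the ssh/service/qstat commands inside them are my assumptions about a typical setup, so swap in whatever starts pbs_mom on your WNs and whatever command you use to check responsiveness.

```shell
# Placeholder: start pbs_mom on one worker node (the command is an assumption).
start_node() { ssh "$1" service pbs_mom start; }

# Placeholder: the command whose response time we care about.
probe() { qstat; }

# Print how many whole seconds a command takes (output is discarded).
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage: bring_up_nodes THRESHOLD_SECONDS node1 node2 ...
# Starts the nodes one at a time, prints "<node> <probe seconds>" after each,
# and returns 1 as soon as the probe takes THRESHOLD_SECONDS or longer.
bring_up_nodes() {
    threshold=$1
    shift
    for node in "$@"; do
        start_node "$node"
        t=$(elapsed_seconds probe)
        echo "$node $t"
        if [ "$t" -ge "$threshold" ]; then
            return 1
        fi
    done
    return 0
}
```

Something like `bring_up_nodes 5 wn1 wn2 wn3` (node names made up) would then stop at the first node after which the probe takes 5s or more, which at least tells you how many nodes it takes to trigger the problem.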