I have been experimenting with glusterfs, as we recently discovered that moving to SLURM means moving to a shared /home directory. Looking at possible free solutions, I ended up testing glusterfs. Right now we use a 12-disk configuration (3TB drives) across 6 servers (2 disks each) with the replication factor set to 3 (this is /home, after all).
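For reference, a volume of that shape would be created roughly like this. The volume name, hostnames and brick paths below are made up, and with 12 bricks at replica 3 gluster groups each consecutive triplet of bricks into a replica set, so the ordering keeps every replica set on three different servers:

    # Hypothetical layout: 6 servers, 2 bricks each, replica 3 -> 4 replica sets
    gluster volume create home replica 3 \
        server1:/export/brick1 server2:/export/brick1 server3:/export/brick1 \
        server4:/export/brick1 server5:/export/brick1 server6:/export/brick1 \
        server1:/export/brick2 server2:/export/brick2 server3:/export/brick2 \
        server4:/export/brick2 server5:/export/brick2 server6:/export/brick2
    gluster volume start home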
The system in general seemed to behave quite well, and a basic rsync from the old /home to the gluster one took ca 12h for 400GB, the majority of which are very small files. That's not an extreme speed, but it probably wouldn't have gone much faster as an ordinary server-to-server rsync either, since file size, and the IOPS that many small files generate, is what matters here.
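The copy itself was nothing fancy; something along these lines, where the paths and the gluster mount point are only assumptions and the flags are the usual ones for moving a home directory with permissions, hard links, ACLs and xattrs intact:

    # /mnt/home is assumed to be where the gluster volume is mounted
    rsync -aHAX --numeric-ids --progress /home/ /mnt/home/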
I also did a few tests before starting, simply dd'ing /dev/zero to files in 1, 2, 4 and 8 parallel streams, and found that I could average ca 210MB/s for such streamed writes. That works out to an average of ~50MB/s per gluster brick (the data is written 3x, as replicas go to 3 drives in parallel), which isn't terrific, but isn't bad either.
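A test along these lines reproduces that kind of streamed-write load; the mount point, file size and stream count here are illustrative rather than the exact commands I ran:

    # 4 parallel 4GB streams against the gluster mount; sum the rates dd reports
    for i in 1 2 3 4; do
        dd if=/dev/zero of=/mnt/home/ddtest.$i bs=1M count=4096 conv=fdatasync &
    done
    wait
    rm -f /mnt/home/ddtest.*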
The trouble started when we began to actually use gluster for everyday work and some things were just horribly slow. Hanging around #gluster on IRC, I soon found that the main culprit is the stat() syscall, and that the basic ls command is extremely slow because of the --color argument that is aliased in on almost all Linux distributions. To pick a colour, every file in a directory gets a stat() call to determine what it is, and that can take 10 seconds for 100 files, or minutes and minutes on directories containing thousands of files.
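The easiest client-side workaround is to bypass the alias or switch colouring off, so a plain ls can get away with reading the directory instead of stat()'ing every entry (the directory path is just an example):

    # 'ls' is usually aliased to 'ls --color=auto'
    \ls /mnt/home/some/dir               # backslash bypasses the alias
    ls --color=never /mnt/home/some/dir  # or force colouring off explicitly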
I've still not figured out how to speed this up in general, as I'm hitting the stat() speed issue on various fronts, but together with JoeJulian I did figure out one cause of slowness. Namely, CRAB (the toolkit we use for data analysis) is a Python toolkit, and the PYTHONPATH variable was set up such that most imports traversed a good 10-20 filename attempts before finding the module that actually got loaded. On a local disk that hardly takes any time, so you never notice the overhead. On glusterfs with its default config, however, it causes a problem. For more details check this blog post by Joe Julian: DHT misses are expensive.
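If you want to see this for yourself, strace gives a quick picture of how many file lookups a single import triggers, most of them failing with ENOENT before the right path is hit; the module name below is just a stand-in for whatever CRAB actually imports:

    # Summarise file-related syscalls (and their error counts) for one import
    strace -f -c -e trace=file python -c 'import SomeCrabModule'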
So basically, a lookup of a non-existent file means the file's hash points at a brick that doesn't have it, and glusterfs then traverses all the bricks looking for it. Only after none of them has it does it report "not found", and Python proceeds to the next filename possibility, until it finally finds the actual file and continues. This, however, can take a lot of time. So we disabled the brick-wide lookup for files that don't match their hash. We're not sure if it'll come back to bite us in the ass or not, but setting cluster.lookup-unhashed: off on our /home volume seems to have sped up CRAB operations by a good factor of 5-10.
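The change itself is a single volume option; the volume name "home" is an assumption, substitute your own:

    gluster volume set home cluster.lookup-unhashed off
    gluster volume info home    # check that the option took effect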
Now if only I could get stat() in general to perform much faster, I'd already be quite satisfied with glusterfs.