Monday, December 3, 2012


I've been hanging around the #gluster channel for the past week to make sure our Gluster installation is running smoothly, and the one topic that keeps popping up is the underlying brick filesystem. The default recommendation from the Gluster people is XFS, because of an EXT4 bug that causes major mayhem, introduced in recent kernels and backported to the RHEL 6 main branch. Read more about that here:

However, I wouldn't jump to XFS that fast myself. In August-September this year, after we rebuilt the datacenter with a new power grid, a new interconnect and lots of new nodes, we were also planning to expand and change our storage to a more distributed model. The old system was 15 storage nodes that together provided 750TB of RAID5 volumes. Instead, we planned to feed raw disks directly to Hadoop as data directories and have every block in Hadoop replicated at least twice. That would cost us ca 30-40% of usable capacity (we'd regain the parity and hot-spare drives, which alleviated it somewhat), but we'd also add all the worker-node disks (each node had 3x 3TB drives, of which ca 1TB was dedicated to local scratch space and the OS), so the overall raw capacity went up to 2PB, giving us 1PB of usable disk space in a distributed, RAID 10-like environment.

All was good and nice, except the power grid wasn't initially well balanced. So when we cranked up the job counts we sometimes hit a region where a breaker would cut power to a block of 16-25 servers. As all servers were now also part of the storage, that was annoying, especially because those were real breakers, so someone had to go physically on-site, re-enable the breaker and try to rebalance the power grid (luckily we had power-phase usage history, so we could see which phases were over- or under-loaded).

The bad part about the power losses was that XFS, it seems, was not at all safe against them. I don't mind losing files that were actively being written (those are new files that would never get marked as completed, automatically triggering a redownload). The bad part was that every time we lost a block of nodes, Hadoop claimed to have lost anywhere from 300k to 1M blocks, each block being 128MB. Most of the time this was just an annoyance: every block is replicated twice, so it simply meant a lot of inter-node traffic until the under-replicated blocks got re-replicated. But with that many lost blocks, at times BOTH copies of a block were killed. When that happened we had data corruption and lost files.
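To put those loss reports in perspective, a quick sketch of how much data the cluster has to re-replicate after each incident (using the block size and counts mentioned above):

```python
# Rough re-replication volume per power-loss incident.
block_mb = 128  # HDFS block size used on our cluster
for lost_blocks in (300_000, 1_000_000):
    tb = lost_blocks * block_mb / 1_000_000  # MB -> TB (decimal units)
    print(f"{lost_blocks} lost blocks -> ~{tb:.0f} TB of re-replication traffic")
```

So every breaker trip meant roughly 38 to 128 TB of intra-cluster traffic just to get back to the target replication factor.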

Upon closer inspection we found that blocks were marked lost because the datanodes coming back up after the reboot couldn't validate them. The reason was that either the datafile itself or its metadata file had been lost and replaced by a 0-byte file. If either was gone, the block was considered lost. In theory we might have been able to recover some of them: when we lost both replicas of a block, it may well have happened that we lost the datafile of one replica and the metadata file of the other, but it was way too much manual labor to be worth it.
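The symptom is easy to scan for by hand. Here's a small sketch that walks a datanode data directory and reports the 0-byte block and metadata files that made the datanode reject those blocks (the exact directory layout is an assumption; HDFS names block files `blk_<id>` with a matching `.meta` companion):

```python
import os

def find_zero_byte_blocks(datadir):
    """Walk a datanode data directory and return the paths of 0-byte
    blk_* files (both datafiles and their .meta companions) -- the
    symptom we kept seeing after power loss on XFS."""
    bad = []
    for root, _dirs, files in os.walk(datadir):
        for name in files:
            if name.startswith("blk_"):
                path = os.path.join(root, name)
                if os.path.getsize(path) == 0:
                    bad.append(path)
    return bad
```

Running this across the affected nodes at least tells you which replicas were truncated, even if matching up datafile-lost and metadata-lost pairs by hand isn't worth the effort at this scale.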

After about 10 power losses and hundreds of TB of lost data, we made the decision to move away from XFS, even though this problem of 0-byte files is not supposed to exist in newer kernels, and we were running CentOS 6.3 with the OpenVZ kernel (not the newest at the time, but still a 2.6.32-x kernel that should have all the XFS fixes). After about a month of migration, decommissioning one block of datanodes at a time and reformatting to ext4, we now run everything on EXT4. So far we've not been hit by the above-mentioned EXT4 bug in Hadoop-related operations (we think), and it'll be something to think about when operating GlusterFS. Right now it's run from the storage nodes, which still run Scientific Linux 5.7 with kernels that are not affected, but if we plan to move Gluster to a layout similar to Hadoop's, it will mean introducing bricks on ext4 partitions from newer kernels, and that may be an issue.
