in Hadoopery

My Hadoop cluster data needs no RAID!

One of the operational challenges in introducing Hadoop to traditional IT and Enterprise operations is understanding when to break one of our sacred IT mantras:

Thou shalt always RAID your data. Never shalt thou install a system without RAID. One shall be your RAID if thou seekest performance and redundancy sparing no expense. Five shall be your RAID if thou art cheap. Your RAID shall be Six if not so cheap, but still cheap. The zeroeth RAID is for sinners and heathens who do not believe in safety.

… or so it has been told in the annals of Sysadmin-dom. So, RAID is important in ensuring your systems don’t have unnecessary down time. I think we all agree on that. Except when it comes to systems involved in Hadoop.

The following exchange is from a discussion that popped up on the Hadoop Users mailing list. In it, the questioner asked about the utility of RAID in traditional IT environments where Hadoop is taking a foothold. My response covers why one would not use it and some simple steps for helping stalwart proponents of RAID agree to take the risk associated with not using it.

On Wed, Oct 1, 2014 at 4:01 PM, Ulul wrote:
Dear hadoopers,

Has anyone been confronted to deploying a cluster in a traditional IT shop whose admins handle thousands of servers ?
They traditionally use SAN or NAS storage for app data, rely on RAID 1 for system disks and in the few cases where internal disks are used, they configure them with RAID 5 provided by the internal HW controller.

Yes. I’ve been on both sides of this discussion.

The key is to help them understand that you don’t need redundancy within a system because Hadoop provides redundancy across the entire cluster via replication. This then leaves the problem as a performance one, in which case you show them benchmarks on the hardware they provide in both RAID (RAID0, RAID1, and RAID5) and JBOD (Just a Bunch Of Disks) modes.

Using a JBOD setup , as advised in each and every Hadoop doc I ever laid my hands on, means that each HDD failure will imply, on top of the physical replacement of the drive, that an admin performs at least an mkfs.
Added to the fact that these operations will become more frequent since more internal disks will be used, it can be perceived as an annoying disruption in industrial handling of numerous servers.

I fail to see how this is really any different than the process of having to deal with a failed drive in an array. Depending on your array type, you may still have to do things to quiesce the bus before doing any drive operation, such as adding or removing the drive, you may still have to trigger the rebuild yourself, and so on.

I have a few thousand disks in my cluster. We lose about 3-5 a quarter. I don’t find it any more work to re-mkfs the drive after it’s been swapped out and have built tools around the process to make sure it’s consistently done by our DC staff (and yes, I did it before the DC staff was asked to). If you’re concerned about the high-touch aspect of swapping disks out, then you can always configure the datanode to be tolerant of multiple disk failures (something you cannot do with RAID5) and then just take the whole machine out of the cluster to do swaps when you’ve reached a particular threshold of bad disks.

In Tom White’s guide there is a discussion of RAID 0, stating that Yahoo benchmarks showed a 10% loss in performance so we can expect even worse perf with RAID 5 but I found no figures.

I had to re-read that section for reference. It is quoted below for reference.

I’m going to assume that Tom is not talking about single-disk RAID0 volumes, which is a common way of doing JBOD with a RAID controller that doesn’t have JBOD support.

In general, performance will depend upon how many active streams of I/O you have going on the system.

With JBOD, as Tom discusses, every spindle is its own unique snow flake, and if your drive controller can keep up, you can write as fast as that drive can handle reading off the bus. Performance will depend upon how many active reading/writing streams you have accessing each spindle in the systems.

If I had one stream, I would only get the performance of one spindle in the JBOD. If I had twelve spindles, I’m going to get maximum performance with at least twelve streams. With RAID0, you’re taking your one stream, cutting it up into multiple parts and either reading it or writing it to all disks, taking advantage of the performance of all spindles.

The problem arises when you start adding more streams in parallel to the RAID0 environment. Each parallel I/O operation begins competing with each other from the controller’s standpoint. Sometimes things start to stack up as the controller has to wait for competing I/O operations on a single spindle. For example, having to wait for a write to complete before a read can be done.

At a certain point, the performance of RAID0 begins to hit a knee as the number of I/O requests goes up because the controller becomes the bottleneck. RAID0 is going to be the closest in performance, but with the risk that if you lose a single disk, you lose the entire RAID. With JBOD, if you lose a single disk, you only lose the data on that disk.

Now, with RAID5, you’re going to have even worse performance because you’re dealing with not only the parity calculation, but also with the fact that you incur a performance penalty during reads and writes due to how the data is laid out across all disks in the RAID. You ca read more about this here: http://theithollow.com/2012/03/understanding-raid-penalty/

To put this in perspective, I use 12 7200rpm NLSAS disks in a system connected to an LSI9207 SAS controller. This is configured for JBOD. I have benchmarked streaming reads and writes in this environment to be between 1.6 and 1.8GBytes/sec using 1 i/o stream per spindle for a total of 12 i/o streams occurring on the system. By the way, this benchmark has held stable at this rate for at least 3 I/O streams per spindle; I haven’t tested higher yet.

Now, I might get this performance with RAID0, but why should I tolerate the risk of losing all data on the system vs just the data on a single drive? Going with RAID0 means that not only do I have to replace the disk, but now I have to have Hadoop rebalance/redistribute data to the entire system, not just dealing with the small amount of data missing from one spindle. And since Hadoop is already handling my redundancy via replication of data, why should I tolerate the performance penalty associated with RAID5? I don’t need redundancy in a single system, I need redundancy across the entire cluster.

I also found an Hortonworks interview of StackIQ who provides software to automate such failure fix up. But it would be rather painful to go straight to another solution, contract and so on while starting with Hadoop.

Please share your experiences around RAID for redundancy (1, 5 or other) in Hadoop conf.

I can’t see any situation that we would use RAID for the data drives in our Hadoop cluster. We only use RAID1 for the OS drives, simply because we want to reduce the recovery period associated with a system failure. No reason to re-install a system and have to replicate data back onto it if we don’t have to.

Notes

I have quoted Tom’s book here in case you don’t have a copy.

HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for datanode storage (although RAID is recommended for the namenode’s disks to protect against corruption of it’s metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.

Furthermore, RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins HDFS blocks between all disks. This is because RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk. DIsk performance often shows considerable variable in practice, even for disks of the same model. In some benchmarking carried out on a Yahoo! cluster (http://markmail.org/message/xmzc45zi25htr7ry) JBOD performed 10% faster than RAID 0 in one test (Gridmix) and 30% better in another (HDFS write throughput).

Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to become unavailable.

Hadoop: The Definitive Guide, 3rd Edition