in Hadoopery

Improving Hadoop datanode disk fault tolerance

By design, Hadoop is meant to tolerate failures in a responsible manner. One of those failure modes is for an HDFS datanode to go off line because it lost a data disk. By default, the datanode process will not tolerate any disk failures before shutting itself off. When this happens, the HDFS namenode discovers that copies of data blocks are missing from it’s map of data and it begins to replicate those blocks around from known good copies on other servers.

In many cases, only a single disk is having issues within a datanode. Depending on your environment, it might not make sense to take down a whole datanode because of this. For example, in my environment, we have twelve disks per system. If we lose a single disk that then causes the datanode to shutdown, we immediately lose twenty terabytes worth of space from the cluster. This could potentially be a lot. So what do we do in those cases?

Hadoop has a specific datanode option called dfs.datanode.failed.volumes.tolerated that can be set in hdfs-site.xml that tweaks the behavior. In our case, we believe we can tolerate up to two disk failures before the datanode shuts down. The downside to doing this is that you need to have infrastructure in place to monitor your physical hardware and notify you when disks begin failing so you can respond appropriately. If you don’t, you’ll only discover that there is a problem when the datanode shuts down.

You can set this by adding something like the following to your hdfs-site.xml and distributing it to your datanodes. It will require restarting the datanode process.

<property>
    <name<dfs.datanode.failed.volumes.tolerated</name>
    <value>2</value>
</property>

Another downside to using this option is that it can have adverse performance impacts on your datanode if you set this too high. For example, if you only have eight disks in a system, but wish to tolerate seven failures, that means the write performance of your datanode will get slower and slower as you try to shove large amounts of data onto the system. This is caused by not having a sufficient number of available working disk spindles. It’s obvious after the fact, but may be something you’re not aware of the first times you encounter this.

The option should be available in most versions of Hadoop out today.

If you’d like to read more, you can find information at:

Using cobbler with a fast file system creation snippet for Kickstart %post install of Hadoop nodes

I run Hadoop servers with 12 2TB hard drives in them. One of the bottlenecks with this occurs during kickstart when we’re using anaconda to create the filesystems. Previously, I just had a specific partition configuration that was brought in during %pre, but this caused the filesystem formatting section of kickstart to take several hours […]