
Transparent Huge Pages on Hadoop make me sad.

Today I (re)learned that I should pay attention to the details of Linux kernel bugs associated with my favorite distribution. Especially if I’m working on CentOS 6/Red Hat Enterprise Linux (RHEL) 6 nodes running Transparent Huge Pages on Hadoop workloads. I was investigating some performance issues on our largest Hadoop cluster related to Datanode I/O and (re)ran across a discussion about this issue.

Huge Pages … how do they work?

In the Linux kernel, the standard size for a block of addressable memory is four kilobytes. This is called a page. Every system has a finite number of addressable pages, based on how much physical memory the system has installed. A page table is used to keep track of the location of each page in the system. This table can get gigantic.

For a single process with hundreds of thousands of pages allocated, you incur a measurable performance penalty every time the kernel has to walk that table. With Huge Pages, you can take a larger chunk of memory and give it a single address, allowing the page table to be smaller while still addressing the same amount of physical memory. This provides a performance boost because both the kernel and the CPU have fewer memory objects to keep track of.
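
As a rough illustration (exact numbers depend on your hardware and kernel), you can check the two page sizes yourself and do the back-of-the-envelope math:

getconf PAGE_SIZE                  # base page size, typically 4096 bytes
grep Hugepagesize /proc/meminfo    # Huge Page size, typically 2048 kB on x86_64

# Mapping 64 GB of memory:
#   64 GB / 4 KB pages = 16,777,216 page table entries
#   64 GB / 2 MB pages =     32,768 entries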

Huge Pages have some operational overhead. In the initial implementation, they had to be pre-allocated at system boot time, and you had to modify your code to explicitly take advantage of those pre-allocated Huge Pages.
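
For example, the classic Huge Pages workflow looks something like this (the page count here is purely illustrative, and applications still have to explicitly ask for the pages via hugetlbfs or the hugepage mmap/shmget flags):

# Reserve 512 x 2 MB Huge Pages (set vm.nr_hugepages in /etc/sysctl.conf to make it persistent)
sysctl -w vm.nr_hugepages=512

# Mount hugetlbfs so applications can map memory backed by Huge Pages
mkdir -p /mnt/hugepages
mount -t hugetlbfs none /mnt/hugepages

# Verify the reservation
grep HugePages /proc/meminfo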

But what about Transparent Huge Pages?

Transparent Huge Pages are an iteration on Huge Pages. Because of the complexity associated with configuring Huge Pages and using them at run time, an effort was made to hide this complexity and have the kernel attempt to configure and use Huge Pages without any effort on the part of the developer or system administrator. This would give a “free” performance boost to applications utilizing large amounts of memory or large, contiguous chunks of memory.
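
On a CentOS 6/RHEL 6 node you can see whether the kernel is doing this for you (the sysfs path below is the RHEL 6 backport location; upstream kernels use transparent_hugepage instead):

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
# prints something like: [always] madvise never -- the bracketed value is the active setting

grep AnonHugePages /proc/meminfo    # anonymous memory currently backed by Transparent Huge Pages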

Free Performance? Oh yeah!

Hold up there, buddy. TANSTAAFL. One of the downsides to using larger page sizes is that the kernel can only allocate a Huge Page from a contiguous chunk of memory. Over time, memory gets fragmented as Huge Pages are allocated and deallocated. The kernel then needs to issue periodic memory compactions in order to get enough contiguous free memory to create Huge Pages or Transparent Huge Pages.
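
If you want to see how often your nodes are paying for that compaction, the kernel exposes counters in /proc/vmstat (the exact field names vary a bit by kernel version):

egrep 'compact_|thp_' /proc/vmstat
# compact_stall              <-- allocations that had to wait for compaction to finish
# compact_fail / compact_success
# thp_fault_alloc            <-- Transparent Huge Pages handed out at page-fault time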

Unfortunately, when backporting the Transparent Huge Page support to RHEL6, Red Hat introduced some kind of bug. Of course, you can’t actually see the contents of it if you’re not a current Red Hat subscriber.

For reference, you can check it out at:

2015-02-25 Followup: I ran across an additional bug with a more thorough discussion: Red Hat Bug #879801, affecting Fedora. It is additionally discussed on LKML here and here.

Transparent Huge Pages are dead, Jim

So, now that we know that Transparent Huge Pages are likely affecting us, how do we disable them? Red Hat describes it pretty succinctly.

There are two ways.

First, you can do this at boot time by adding transparent_hugepage=never to the kernel command line in /etc/grub.conf. This requires a reboot. If you use Puppet, you can work with Augeas and Puppet to modify grub.conf.
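
The resulting kernel line in /etc/grub.conf ends up looking something like this (the kernel version, root device, and other options are illustrative and will differ on your systems):

title CentOS (2.6.32-431.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root rhgb quiet transparent_hugepage=never
        initrd /initramfs-2.6.32-431.el6.x86_64.img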

Alternatively, if you don’t want to reboot, you can do this at run time. The caveat is that you will need to create an init.d script to reset the state at each boot.

echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled
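
A minimal way to make that stick across reboots is to drop it into /etc/rc.local or a small init.d script; this sketch assumes the RHEL 6 redhat_transparent_hugepage sysfs path shown above:

# /etc/rc.local -- re-disable Transparent Huge Pages on every boot
if [ -w /sys/kernel/mm/redhat_transparent_hugepage/enabled ]; then
    echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled
fi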

Now, if you don’t want to completely disable Transparent Huge Pages, Cloudera Documentation for CDH4 suggests that you only need to disable the defragmentation of memory associated with Transparent Huge Pages. You can do that instead:

echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Visualizing CPU Usage

So, now that we’ve disabled Transparent Huge Pages, what can we expect? When I looked at this originally, I could not find a good description of the performance impact, so we never implemented the change.

Of course, two years later people are now able to find information. Oracle claims they see a 30% degradation in I/O performance when Transparent Huge Pages are enabled with Oracle DB workloads.

I was curious about how it would impact us, so I turned Transparent Huge Pages off on my largest cluster, which was showing very erratic system CPU usage.

Dramatic system CPU drop after disabling Transparent Huge Pages on CentOS 6/RHEL 6 Hadoop nodes

Holy. Crap. Now that is a dramatic drop. Note: as far as I can tell, this affects only sys CPU utilization, not user. I’m still poking through the results to see if there are any other effects.

How did I miss this?

Well, I didn’t exactly miss it. When I originally came across this, the interwebs said it was primarily affecting the kernels in RHEL6.2 and RHEL6.3. We were beyond that. Additionally, CentOS Bug 5716 indicated that the issue had been resolved in CentOS6.4 (and thus resolved by RHEL6.4), so I blew it off as not worth trying to fix. The bulk of our cluster is running 6.5, with a smattering of 6.4 peppered on older systems. Doing a new set of Google searches today, I came across a CentOS mailing list thread about transparent hugepage problems in 6.4 and 6.5. Go figure.

Looks like turning it off confirmed the issue, as is evident in the performance graph above.

Live and learn. And now that we’re aware of it, we’ve begun looking at other places in our environment where we might be running into this issue. You probably should too.

References on Transparent Huge Pages on Hadoop

Now, I’m not the first to encounter this issue. If you’d like to learn more about it, here are a few things that I came across in my work today.
