in Hadoopery

Treat your Hadoop nodes like cattle

Hadoop Nodes: Cattle, not pets

I’ve built compute clusters of various sizes, from hundreds to tens of thousands of systems, for almost two decades now.  One of the things I learned early on is that, for compute clusters, you want to treat each system as cookie cutter as possible.  By that, I mean there should be a minimal set of differences between any pair of systems in the cluster.  Usually this is limited to the networking configuration, IP address, gateway, and so on. It should be just enough to uniquely identify the system so you can find it when needed, but not so unique that you have to give it personal attention every time something happens to it.

Thanks to Bill Baker, who first originated this idea: treat your systems like cattle, not pets. Pets are something you give a name to. You give them personal attention and affection.  They are special to you. Cattle, however … there are so many of them, you can’t name them.  You shouldn’t.  There are so many, you can’t give them personal attention.  In fact, you probably shouldn’t.  You give each of them the same treatment because you want them all to grow up and work in the same continuously healthy way.

Treat your systems like cattle, not pets.

Operations Wild West: A Short History

When I started in High Performance Computing, our team was building compute clusters of hundreds of nodes where each system, while they were effectively the same, were individually named.  We used Texas towns, every word that began with X, plays on words, gods, demigods, minor demons, and so on.  Frankly, finding useful naming themes was exhausting. People get real feisty about naming standards.

When we finally broached the thousand-system mark, we tossed in the naming towel. We began treating things less as pets and more as cattle. Whole groups of systems were given a single name with a serialized part: wombat647, wombat648, wombat649, and so on.  Each major purchase was grouped by this single name.  Worked great.  Somewhat.  We still had the problem of having to gather MAC addresses, labelling all the systems, making sure we didn’t assign things out-of-order in the racks, and so on.  It was still far too hands-on with the systems during provisioning, we just didn’t realize it.

Moving from HPC to Hadoop

When I started building out Hadoop clusters, I had to pay attention to the rack topology to deal with HDFS block placement and data locality in MapReduce. Now, with Hadoop, when people talk about rack topology, what they really mean is switch topology.  You want to keep your data on the local switch as much as possible, minimizing traversal of the switch uplink to your core or spine network because of the “cost” associated with it.  You may have a switch with 48 1gig ports, but only a single 10gig uplink into your core.

In order to work with this topology awareness, you configure the topology.script.file.name in both the Namenode and Jobtracker to define which rack (“switch!”) the Datanode is associated with.  In my case, the easiest way to do that was to figure it out from the hostname.  I could have used the IP address allocations to help slice and dice this, but because the cluster was new (and small), the IP address allocations were within a single /24 range.  It just wasn’t worth it to subnet it beyond that point.

The hostnames looked something like this:  datacenter-rack-nodetypeXX, where node type would be hdn for Hadoop Datanode and hnn would be for Hadoop Namenode.  For example, dc1-a7-hdn01 or dc1-a9-hnn01.

I’ve used this design for several years now and it worked well enough, but still had the troubling aspect of requiring too much personal touch during system provisioning.

  1. Each system would need to be physically labelled, front and back so you could easily find it if you were working in the hot or cold aisle in the datacenter.
  2. I had a 1-to-1 IP mapping requirement.  If hdn01 was 192.168.0.1, then the management interface would be 192.168.16.1. A hostname would need to be assigned for each address, by hand. Having this parity made some of the configuration creation easier and scriptable.
  3. Each system would need to have its IPMI or lights-out management named, configured, and put on the network.  By hand.
  4. Each system would need to have its MAC address gathered, that wouldneedto be tracked so wecouldallocate the MAC to a specific IP address.
    1. We couldn’t get the MAC addresses from the vendor, so this was usually gathered by hand or via the management interface if it was up by this point.
    2. If we were able to get them ahead of time, we would then need to worry about placing the systems correctly, in the right order, in the racks. Not an easy task when provisioning dozens or hundreds of servers at a time.
  5. Each system would need to be configured in cobbler using the MAC address to IP mapping we previously saved.
  6. Each system would then be kickstarted via cobbler.

Understanding Hadoop nodes were cattle

Having done this provisioning process several times, it became clear that I was spending too much time investing in my pets instead of counting my cattle and herding them towards usefulness. I wanted to make changes.

  1. Each system already had a label: the serial number.  Stop labelling systems when the manufacturer already did it for me. Plus, they already place it on the front and back of the system.
  2. Use the system serial number as its hostname.  This required some tweaks to our Puppet code because our non-Hadoop system naming places location metadata in the hostname.
  3. Modify kickstart to pull the serial number off the system and use that to configure the hostname during the installation.
  4. Modify kickstart to take the serial number and issue a DNS update from the node during the post install with nsupdate.
  5. Setup a cobbler system entry that allowed for a network boot default (as opposed to a cobbler-wide default) to choose the correct kickstart configuration for any host within a given subnet.
    Cobbler System Profile for Network Default

    Cobbler System Profile for Network Default

    Cobbler Network Default

    Cobbler Network Default

    1. This appears to only work in cobbler 2.4; I attempted this with cobbler 2.6 and found that the documented functionality was broken.  I need to send a bug report after further testing to confirm it.
    2. This network default exploits a behavior of the PXE boot environment where the loader attempts to download install files based on a particular pattern, starting with system GUID, then MAC address, then IP address in hexadecimal form.  If it could find none of these things on the TFTP server, it would chop off the least significant four bits and look for that filename, repeating this until no bits were left.  Finally it would look for a default boot profile.
  6. Convert from static DHCP reservation to dynamic DHCP pools for the IP address allocation.  Kickstart would then reconfigure the system during post install to make that a static IP address.  This address is what would be registered in DNS.
    1. Each rack would have its own /26 range to work in, giving us an ample supply of addresses for the future if we decided to go from 2U to 1U servers.
  7. Configure kickstart to register this new system in cobbler during post install to force the removal of this IP address from the pool.  This would prevent DHCP from attempting to give it out again in the future.

The biggest benefit to all of this was reducing the manual work necessary during system provisioning.  I no longer needed to spend as much time at the datacenter to set up systems.  I no longer needed to care where a system was physically in a rack. I no longer needed to spend time planning, creating, and crosschecking kickstart configurations, PXE settings, IP address allocations, MAC address lists, and so on.  I could just plug a system into a rack, and as long as the switch in that rack was correctly configured with the right VLAN, IP network, and gateway, the system would just boot up and install.  We’ve even been working with the vendor to pre-configure some settings from the factory on the management interfaces that have previously required me sit in front of racks of systems for hours at a time, doing something that wasn’t easily automatable.

Using Hardware Data as Hostname

The first big step was figuring out how to pull the serial number off the system during kickstart.  I needed this information in both the pre and post install phases.  There are two ways to pull this information off the motherboard:  using dmidecode or walking the sysfs filesystem, both of which are available during pre install.  The following is a cobbler snippet used for finding this.  I call it pre_hostname_query.

#raw
#
# start of pre_hostname_query
#

L="[pre_hostname_query] "

dmi_path=/sys/devices/virtual/dmi/id
dmi_serial="${dmi_path}/chassis_serial"

serialized_hostname=""

if [[ -d "${dmi_path}" ]] ; then
 echo "$L found dmi path in sysfs"

 if [[ -f "${dmi_serial}" ]] ; then
 serialized_hostname=$(cat $dmi_serial)
 echo "$L found ${dmi_serial} = '${serialized_hostname}'"
 else
 echo "$L could not find ${dmi_serial}; fallback to dmidecode"
 serialized_hostname=$(dmidecode -s system-serial-number)
 fi 

else 
 echo "$L dmi path not found, trying dmidecode"
 serialized_hostname=$(dmidecode -s system-serial-number)
fi

if [[ -z "$serialized_hostname" ]] ; then
 exec {STDOUTBACK}>&1
 exec {STDERRBACK}>&2
 exec 1>>/dev/pts/0
 exec 2>>/dev/pts/0
 whiptail --title "Failure to find hostname" --msgbox "$L could not determine serial number of this host; dmidecode and sysfs failed" --ok-button "Reboot" 8 78 
 reboot
fi

real_hostname=$(echo ${serialized_hostname} | awk '{print tolower($0)}')

echo "$L short hostname will be ${real_hostname}"

# save the new hostname for use later.
echo $real_hostname > /tmp/real_hostname

#
# end of pre_hostname_query
#
#end raw

Something to note is my use of the #raw/#end raw tags. Cobbler uses Cheetah as the templating system.  These pragmas make the Cheetah parser bypass that section of the snippet and not parse it at all.  This makes it easier to write shell scripts without having to resort to crazy escaping of your shell variables.

This next snipped is immediately called after the previous one to make kickstart use DHCP during the install, but force a hostname to be set to the retrieved serial number.

# write out the network config so we can import it in the kickstart later.
echo "network --bootproto=dhcp --device=$iname --onboot=on --hostname=\$(cat /tmp/real_hostname)" > /tmp/pre_install_network_config

You can see that I don’t use the #raw/#end raw tags here because I want to use the $iname variable to populate the network line with the default device name I’ve set in the cobbler system profile for this network. You should note that I have to escape the dollar sign used to wrap the call to cat /tmp/real_hostname.

Simple so far.

A side question: why not MAC or IP for hostname?

I could have very-well used the MAC address or IP address converted to hadoop-192-168-0-1.  I chose to not use either. Why not?

  1. I would still have to label the systems if I ever wanted to find them.
  2. If I used MAC address, which one would I use? Systems often have more than one and that address could change if I ever have to replace a damaged NIC. This would just require excessive maintenance later.

System serial number is pretty unique and unchanging for a given piece of hardware.  Even if I change the motherboard out, most manufacturers have a process that you go through to set this information on the new board so they can better track problems under warranty.

Out of nothing comes a DNS record

Now that we have a defined and idempotent way to create a hostname for a piece of hardware, we need to register that in DNS.  Configure your favorite DNS server to allow dynamic DNS updates from clients. You should probably make sure you use DNS Transaction Signatures (TSIG) to authorize your DNS updates from clients.  For the purposes of this discussion, I’m not using TSIG because we haven’t found a great way to protect authorize access to the key during kickstart.

As part of this redesign, our standard now requires

  1. All Hadoop nodes go into the hdp.example.net subdomain.  This is so we don’t pollute other namespaces we have.
  2. AllHadoop nodes for a specific cluster, gointoasubdomain for that cluster.
    1. For example, pachyderm.hdp.example.net might be the archive cluster. dumbo.hdp.example.net might be the QA cluster.
    2. We associate human readable names with clusters so developers have an easier time referring to a set of machines.
    3. Each cluster is associated with a /22 IP network, allowing for approximately 1000 usable nodes.
    4. As previously mentioned, each rack is assigned a /26 IP network in this range so we can manage the rack topology.

We enable this setup automatically during the kickstart post install.  This snippet will create both the forward mapping (fully qualified domain name) and the reverse mapping (in-addr.arpa address).  This is a requirement for Hadoop to not make you cry.

This snippet covers a few things:

  1. It is determining which network it is in so the cluster name can be set.
  2. It deletes and then sets the current fully qualified domain name for this host.
  3. It deletes and then sets the current reverse DNS entry for this host.
  4. It checks to see if the DNS server got the updates and outputs that as logging information.
#raw

#
# network_nsupdate_hostname for hdp.example.net
#

L="[network_nsupdate_hostname]"

while : ; do
 ip=$(ip -4 -o addr show dev em1)

 localip=$(echo ${ip} | sed -e 's/\// /g' | awk '{print $4}')
 
 if [[ -z "${localip}" ]] ; then
   echo "$L localip is not set yet. got '${ip}'"
   sleep 10
   continue
 else
   break
 fi
done


# hdp.example.net networks are in a /22 giving us 16 /26's to work with.
# each rack is a /26. 
network=$(ipcalc -4 -n $localip/22 | sed -e 's/NETWORK=//')

case $network in
 192.168.16.0) cluster="pachyderm"
 ;;

 192.168.20.0) cluster="dumbo"
 ;;

 192.168.24.0) cluster="snuffleupagus"
 ;;

 *) 
  echo "$L Unknown network $network; can't figure out what cluster I'm in. Bailing out"
  exec {STDOUTBACK}>&1
  exec {STDERRBACK}>&2
  exec 1>>/dev/pts/0
  exec 2>>/dev/pts/0
  whiptail --title "Failure to find clustername" \
    --msgbox "$L could not determine cluster for this network $network" \
    --ok-button "Reboot" 8 78 
  reboot

  ;;
esac

if [[ -f /tmp/real_hostname ]] ; then
 # only do this if we've previously found a hostname based on the
 # serialnumber.
 reverseip=$(echo "${localip}." | tac -s. | tr -d "\n" )in-addr.arpa
 real_hostname=$(cat /tmp/real_hostname)

 if [[ $reverseip = ".in-addr.arpa" ]] ; then
   echo "$L could not figure out reverse ip. Did system not get ip?"
   exec {STDOUTBACK}>&1
   exec {STDERRBACK}>&2
   exec 1>>/dev/pts/0
   exec 2>>/dev/pts/0
   whiptail --title "Failure to find reverseip" \
     --msgbox "$L could not determine reverseip for this host '$localip'" \
     --ok-button "Reboot" 8 78 
   reboot
 fi

 if [[ -z "$real_hostname" ]] ; then
   echo "$L could not figure out this host's real hostname."
   exec {STDOUTBACK}>&1
   exec {STDERRBACK}>&2
   exec 1>>/dev/pts/0
   exec 2>>/dev/pts/0
   whiptail --title "Failure to find host's name" \
     --msgbox "$L could not determine hostname for this host '$localip'" \
     --ok-button "Reboot" 8 78 
   reboot
 fi

 echo "$L Found $real_hostname in cluster $cluster"

 echo "update delete $real_hostname.$cluster.hdp.example.net A
 update add $real_hostname.$cluster.hdp.example.net. 86400 A $localip
 show
 answer
 send

 update delete ${reverseip}. 86400 IN PTR ${real_hostname}.${cluster}.hdp.example.net
 update add ${reverseip}. 86400 IN PTR ${real_hostname}.${cluster}.hdp.example.net
 show 
 answer
 send" > /tmp/nsupdate_config

 echo "$L created nsupdate config. Here is the config."
 echo
 cat /tmp/nsupdate_config
 echo
 echo "$L updating dns with nsupdate config."
 nsupdate /tmp/nsupdate_config

 if [[ $? -ne 0 ]] ; then
  echo "$L Unknown nsupdate error. Halting progress"

  exec {STDOUTBACK}>&1
  exec {STDERRBACK}>&2
  exec 1>>/dev/pts/0
  exec 2>>/dev/pts/0
  whiptail --title "Failure to nsupdate" --msgbox "$L nsupdate failed for some reason $(cat /tmp/nsupdate)" --ok-button "Reboot" 8 78 
  reboot
 else 
  echo "$L nsupdate completed without error."

  sleep 30

  dig ${real_hostname}.${cluster}.hdp.example.net
 fi

fi


#
# end network_nsupdate_hostname for hdp.example.net
#

#end raw

Note that this snippet will take the currently assigned DHCP address and register a name for that.  There’s no prevention for doubly assigning different IP addresses to a given name (multiple IPs to an A record) or double assigning the same name to different reverses.  I’ve only seen it do this once and haven’t gone back to investigate further.

Somewhere after this snippet, but before we do our first puppet run, we flip the boot-time network configuration scripts from DHCP to statically assigned.  This is a cheap and fast way to get a basic file in place in case of a failure inducing a reboot.  The puppet runs will manage and configure these files in a later snippet.

if [[ -n $cluster ]] ; then
 echo "We're trying to configure a big data cluster $cluster" 

GATEWAY=`ip route get 1 | grep via | awk '{print $3}'`
IPLIST=$(ip -4 -o addr show dev em1)
LOCALIP=$(echo ${IPLIST} | sed -e 's/\// /g' | awk '{print $4}')
NETMASK=$(ipcalc -4 -n $LOCALIP/22 | sed -e 's/NETWORK=//')
 sed -i -e 's/BOOTPROTO="dhcp"/BOOTPROTO="static"/' /etc/sysconfig/network-scripts/ifcfg-em1
 echo "IPADDR=$LOCALIP" >> /etc/sysconfig/network-scripts/ifcfg-em1
 echo "NETMASK=$NETWORK" >> /etc/sysconfig/network-scripts/ifcfg-em1
 echo "GATEWAY=$GATEWAY" >> /etc/sysconfig/network
fi

Hadoop Profit!

I’m evaluating new hardware utilizing these ideas.  So far, so good.  I’m still finding sticking points that I want to deal with (such as the DNS TSIG keys or the cobbler network default bug in 2.6).  This process is not complete by any means.  The more I can automate the provisioning aspects of this environment, the easier it will be to hand off some of this work to other people on our team or even just spec it out and pay the manufacturer to do it as part of the delivery process.

The next step is to go back and introduce my serverspec work into the system validation process so that we have a good smoke test of the hardware prior to enabling the Hadoop nodes for real work.

Verify Hadoop Cluster node health with Serverspec

One of the biggest challenges I have running Hadoop clusters is constantly validating that the health and well-being of the cluster meets my standards for operation.  Hadoop, like any large software ecosystem, is composed of many layers of technologies, starting from the physical machine, up into the operating system kernel, the distributed filesystem layer, the […]

Transparent Huge Pages on Hadoop makes me sad.

Today I (re)learned that I should pay attention to the details of Linux kernel bugs associated with my favorite distribution. Especially if I’m working on CentOS 6/Red Hat Enterprise Linux (RHEL) 6 nodes running Transparent Huge Pages on Hadoop workloads. I was investigating some performance issues on our largest Hadoop cluster related to Datanode I/O […]

What’s in my datacenter tool kit?

Every Operations person or datacenter (DC) junkie that I know has a datacenter tool kit of some sort, containing their favorite bits of gear for doing work inside the cold, lonely world of the datacenter. Now, one would like to think that each company stocks the right tools for their folks to work, but tools […]

HBase Motel: SPLITS check in but don’t check out

In HBase, the Master process will periodically call for the splitting of a region if it becomes too large. Normally, this happens automatically, though you can manually trigger a split. In our case, we rarely do an explicit region split by hand. A new Master SPLIT behavior: let’s investigate We have an older HBase cluster […]

Restarting HBase Regionservers using JSON and jq

We run HBase as part of our Hadoop cluster. HBase sits on top of HDFS and is split into two parts: the HBase Master and the HBase Regionservers. The master coordinates which regionservers are in control of each specific region. Automating Recovery Responses We periodically have to do some minor maintenance and upkeep, including restarting […]

My Hadoop cluster data needs no RAID!

One of the operational challenges in introducing Hadoop to traditional IT and Enterprise operations is understanding when to break one of our sacred IT mantras: Thou shalt always RAID your data. Never shalt thou install a system without RAID. One shall be your RAID if thou seekest performance and redundancy sparing no expense. Five shall […]

Improving Hadoop datanode disk fault tolerance

By design, Hadoop is meant to tolerate failures in a responsible manner. One of those failure modes is for an HDFS datanode to go off line because it lost a data disk. By default, the datanode process will not tolerate any disk failures before shutting itself off. When this happens, the HDFS namenode discovers that […]

Rebooting Linux temporarily loses (some) limits.conf settings

Like any wildly managed environment, you probably have to create custom-defined settings in your /etc/security/limits.conf because of application specific requirements. Maybe you have to allow for more open files. Maybe you have to reduce the memory allowed to a process. Or maybe you just like being ultra-hardcore in defining exactly what a user can do. […]

Augeas made my grub.conf too quiet!

A reader contacted me after working through the examples in my last previous post on Augeas. He was having a difficult time figuring out how to add a valueless key back into the kernel command line. This was the opposite of what I was doing with quiet and rhgb Many thanks. I have been pounding […]