
Hadoop, facter, and the puppet marionette

I’ve been working with Puppet a lot lately.  A lot.  It’s part of my job.  We’ve been setting up a new Hadoop cluster in our Xen environment.  Nothing big.  It started out with 4 nodes, all configured the same way (3 drives each).  We then added 2 more nodes with 2 drives each.  This, of course, broke how we were managing the Hadoop configuration files.  I spent the last few days figuring out how to create a custom Puppet fact (for our Puppet 2.6.4 installation) that searches for the HDFS drives on each node and configures hdfs-site.xml and mapred-site.xml appropriately.

First, in our hadoop module, I created the modules/hadoop/lib/facter path and added the following Facter code to hdfs_path.rb there.

require 'facter'

paths = []

# We use e2label to read the ext3 labels, so skip everything if it
# isn't installed.
if FileTest.exists?("/sbin/e2label")
    # Candidate HDFS devices: the first partition on each SCSI disk,
    # plus any LVM logical volumes named for hdfs.
    drives = Dir.glob("/dev/sd?1")
    drives += Dir.glob("/dev/*vg/hdfs*_lv")

    drives.each do |drive|
        if FileTest.exists?(drive)
            output = %x{/sbin/e2label #{drive} 2>/dev/null}.chomp

            # Only add a fact if the label looks like hostname_hdfs#,
            # e.g. vhdn01_hdfs1.
            if output =~ /hdfs/ and output =~ /_/
                device = output.split("_")[1]            # "hdfs1"
                devicenumber = device.split("hdfs")[1]   # "1"
                path = "/hdfs/" + devicenumber
                paths.push(path)
                Facter.add("hdfs_#{device}") do
                    setcode do
                        path
                    end
                end
            end
        end
    end

    # Also expose every path we found as one comma-separated fact.
    allpaths = paths.join(",")
    Facter.add("hdfs_all_paths") do
        setcode do
            allpaths
        end
    end
end

This looks for drives on the Hadoop node that carry an ext3 label of the form hostname_hdfs#; for example, vhdn01_hdfs1. That label produces a fact populated with the value /hdfs/1, and the fact code creates one such fact per device found. It also creates a fact called hdfs_all_paths containing a comma-separated list of every path found. These facts are used in the templates for mapred-site.xml and hdfs-site.xml to set up the list of paths we’re using.
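If the drives aren’t labeled yet, e2label can set the label as well as read it (the device name and hostname here are purely illustrative):

/sbin/e2label /dev/sdb1 vhdn01_hdfs1

You can also sanity-check the fact outside of a full Puppet run by pointing Facter at the module’s fact directory. A quick sketch, with illustrative paths and values for a node with two labeled drives:

$ FACTERLIB=/etc/puppet/modules/hadoop/lib/facter facter hdfs_all_paths hdfs_hdfs1
hdfs_all_paths => /hdfs/1,/hdfs/2
hdfs_hdfs1 => /hdfs/1

Note that the per-device fact name comes from the second half of the label, so a vhdn01_hdfs1 label yields a fact named hdfs_hdfs1.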

In our hdfs-site.xml template, we created

<property>
 <name>dfs.data.dir</name>
 <value><%= hdfs_all_paths.split(",").map { |n| n + "/hdfs"}.join(",") -%></value>
 <final>true</final>
</property>

which creates a comma-separated list of paths and tacks /hdfs onto each item. For example, this takes /hdfs/1,/hdfs/2 and turns it into /hdfs/1/hdfs,/hdfs/2/hdfs.
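The mapred-site.xml template can pull the same trick. Here’s a sketch, assuming we keep a per-drive mapred subdirectory alongside each hdfs one:

<property>
 <name>mapred.local.dir</name>
 <value><%= hdfs_all_paths.split(",").map { |n| n + "/mapred" }.join(",") -%></value>
 <final>true</final>
</property>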

Some things to note:  I run in an agent/master setup.  In order for the fact to be shipped down to the agent, we need to enable pluginsync on both ends.  You can do this by adding the option to your puppet.conf on each side; it needs to go in the [main] section.  For example:

[main]
pluginsync = true

The other caveat I ran into is that the fact needs to exist on the client system before you reference it in your catalog (such as in your templates); otherwise your catalog might not compile when you go to run your agent.
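One way to soften that is to guard the template with Puppet’s has_variable? helper so the property only renders once the fact has synced and been gathered. A sketch:

<% if has_variable?("hdfs_all_paths") %>
<property>
 <name>dfs.data.dir</name>
 <value><%= hdfs_all_paths.split(",").map { |n| n + "/hdfs" }.join(",") -%></value>
 <final>true</final>
</property>
<% end %>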

Once I got all these things figured out, I ended up with a much more easily configured Hadoop datanode setup.

Travis Campbell
Staff Systems Engineer at ghostar
Travis Campbell is a seasoned Linux Systems Engineer with nearly two decades of experience, ranging from dozens to tens of thousands of systems in the semiconductor industry, higher education, and high volume sites on the web. His current focus is on High Performance Computing, Big Data environments, and large scale web architectures.