Containing a snakebite with python pex

I’ve used Hadoop for several years now. One of the most frustrating parts of using it is the time it takes to start up the Java HDFS client to run simple tasks. Even listing a directory can take several seconds because of the cost of launching the JVM.

In 2013, Spotify open sourced a pure Python implementation of the HDFS client called snakebite. You can find it on Spotify’s GitHub. Exciting times, for sure. What was a series of slow, rickety (and error-prone) shell scripts wrapped around the hadoop fs commands could be turned into workable Python. The only downside was that the first implementation of snakebite did not support Kerberos. This was a major drawback for those of us running Kerberized Hadoop environments because of security requirements.

The coming of snakebite

A few months ago, Spotify released support for Kerberos. Happy days for Hadoop Operations folks like me who have to deal with hacked-together scripts. One of the first things I did was to figure out how to create a reliable and reproducible install of snakebite on my Cloudera Hadoop cluster.

I could go the simple route and do the pip install thing, but that would require installing all the development libraries and tool chains necessary for compiling the Kerberos backend in snakebite. For a one-time install on each system, this would add a significant amount of unnecessary software to the cluster.

The second route was to use fpm and build RPMS that could be used both on my Hadoop cluster and on other systems in my environment. This was pretty simple (and the basis for my recent fpm and docker post). I added the RPMS to my yum repo, built a quick puppet module to load them as needed and …

… ran into an operational failure when attempting to install them on servers that happened to run the Mesosphere Mesos RPMS.

Mesosphere, we have a problem!

Looking at my snakebite RPMS, I found the following:

file /usr/lib/python2.6/site-packages/google/protobuf/text_format.py 
  from install of python-protobuf-3.0.0b2-1.noarch conflicts with file 
  from package mesos-0.23.1-0.2.61.centos65.x86_64
file /usr/lib/python2.6/site-packages/google/protobuf/text_format.pyc 
  from install of python-protobuf-3.0.0b2-1.noarch conflicts with file 
  from package mesos-0.23.1-0.2.61.centos65.x86_64

It turns out that Mesos includes its own copy of the Python protobuf library, installing it into the Python site-packages directory. But snakebite also wants its own version of protobuf installed (from a separate python-protobuf RPM that I built). Both packages try to own the same files, so the install fails. This presents a significant problem, and not one I want to attempt to resolve with virtualenv.
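
A quick way to see which copy of a module the interpreter would actually pick up is to ask Python itself. The following is a minimal sketch, assuming Python 3 (the hosts in this post ran Python 2.6, where importlib.util.find_spec does not exist); module_origin is a hypothetical helper name, not part of snakebite or Mesos.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be loaded from, or None if it is not importable."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "google") is missing or malformed.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # "google.protobuf" is the package both RPMs tried to own in this post.
    for name in ("google.protobuf", "json"):
        print(name, "->", module_origin(name))
```

Feeding the printed path to rpm -qf then tells you which package owns that file.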

Thankfully, I brought this issue up with one of my coworkers who suggested I try using Python pex.

What is pex?

pex is a tool for Python that implements the Python EXecutable environment. The easiest way to think of these is like a Python virtualenv equivalent of a Java JAR or WAR file: it’s a compressed copy of everything needed to run a self-contained tool or app. Plus, it can be created in a way that is portable across operating systems. Neat, huh?

Why use pex?

One of the driving ideas behind a pex file is the ability to easily deploy a bundle of code with something as simple as a /bin/cp. You get an isolated, executable environment without a lot of dependency fuss to go through.
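
The mechanism underneath is less magical than it sounds: Python can execute a zip archive directly if the archive has a __main__.py at its root, and a pex file builds on that by bundling the dependencies alongside. The sketch below uses only the standard library's zipapp module to illustrate the idea; it is not how pex itself is invoked, and the hello_app directory and hello.pyz file names are made up for the example.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp


def build_and_run():
    """Build a tiny executable zip archive and run it, returning its output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "hello_app"
        src.mkdir()
        # __main__.py at the archive root is what makes the zip runnable.
        (src / "__main__.py").write_text('print("hello from inside the archive")\n')

        target = pathlib.Path(tmp) / "hello.pyz"
        # The interpreter argument becomes a shebang line, so on Unix-like
        # systems the file can be copied anywhere and executed directly.
        zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")

        # Run it through the current interpreter so the example does not
        # depend on python3 being on the PATH.
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run())
```

Unlike this toy archive, a real pex also carries the resolved third-party distributions, which is what makes it useful for something like snakebite.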

If you still don’t understand, there is a great lightning talk from Twitter called WTF is PEX; it’s about 15 minutes long and breaks down the important parts of pex.

Prepping for pex

In an earlier post, I talked about using docker to create clean build environments using fpm. I re-use that docker image here because of how simple it is to spin up and add my build dependencies to it.

In this case, because the pex command is not already installed, I add it into the running docker container, then add in the development libraries necessary for building the snakebite library and its Kerberos dependencies.

$ docker run -ti -v /tmp/fpmbuild:/tmp/fpmbuild fpm-centos6
[root@b508004a7709 fpmbuild]# pip install pex
[root@b508004a7709 fpmbuild]# yum install -y python-devel cyrus-sasl-devel krb5-devel

Running pex on snakebite

Here, I’m going to build a pex-ified version of the snakebite library.  A few things to note:

  1. I need to build snakebite with Kerberos support.
  2. To get Kerberos support, you would normally do pip install snakebite[kerberos].
    1. The [kerberos] string ends up being a build option, or “extra” for snakebite’s setup.py
  3. I want to build a pex-ified version of the snakebite command that comes with the snakebite library, so I need a copy of that script in my working dir.

[root@b508004a7709 fpmbuild]# pex -v 'snakebite[kerberos]' -o snakebite.pex -c snakebite
pex: Building pex
pex: Building pex :: Resolving distributions
pex: Building pex :: Resolving distributions :: Packaging snakebite
pex: Building pex :: Resolving distributions :: Packaging sasl
pex: Building pex :: Resolving distributions :: Packaging python-krbV
  argparse 1.4.0
  snakebite 2.7.2
  six 1.10.0
  sasl 0.1.3
  python-krbV 1.0.90
  protobuf 3.0.0b2
  setuptools 19.2
pex: Building pex: 124009.6ms
pex:   Resolving distributions: 123946.4ms
pex:       Packaging snakebite: 16495.9ms
pex:       Packaging sasl: 17479.8ms
pex:       Packaging python-krbV: 17198.1ms
Saving PEX file to snakebite.pex
[root@b508004a7709 fpmbuild]#

The options and arguments passed to pex do the following:

  • turn on verbosity:  -v
  • specify the requirement to resolve and bundle, including its kerberos extra: snakebite[kerberos]
  • the name of the pex output file: -o snakebite.pex
  • the name of the script to use as the default entry point for the pex file: -c snakebite
    • this is the script that will run when you run snakebite.pex
    • this is called snakebite.py in the working directory, but .py is not required in the command argument.
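
Because a pex file is a zip archive, you can look inside it with ordinary tools. The sketch below reads the PEX-INFO metadata file that pex writes at the root of the archive and prints whatever keys it finds; read_pex_info is a hypothetical helper name, and the snakebite.pex path assumes the build above was run in the current directory.

```python
import json
import os
import zipfile


def read_pex_info(path):
    """Return the PEX-INFO metadata stored in a pex file as a dict."""
    # zipfile tolerates the shebang line that precedes the zip data.
    with zipfile.ZipFile(path) as pex:
        return json.loads(pex.read("PEX-INFO").decode("utf-8"))


if __name__ == "__main__":
    pex_path = "snakebite.pex"  # assumed location of the file built above
    if os.path.exists(pex_path):
        for key, value in sorted(read_pex_info(pex_path).items()):
            print(key, "=", value)
    else:
        print("no", pex_path, "in the current directory")
```

This is a handy sanity check that the entry point and bundled requirements are what you expected before copying the file to other hosts.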

Testing out the pex-ified snakebite

Once the snakebite.pex is built, I tested to see what it was doing. I expected it to do the same thing as the snakebite command that comes in the library. That sets up a Python HDFS client and lets you do file operations. You can see that I did a directory listing below.

[hcoyote@hadoopclient ~]$ export HADOOP_CONF_DIR=/etc/hadoop-confs/cdh5/
[hcoyote@hadoopclient ~]$ ./snakebite.pex ls
Found 4 items
drwx------ - hcoyote users         0 2016-01-12 18:34 /user/hcoyote/.Trash
drwx------ - hcoyote users         0 2016-01-13 01:58 /user/hcoyote/.staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 /user/hcoyote/2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users   5184637 2015-02-20 12:17 /user/hcoyote/SecurityAuth-hdfs.audit.gz

Next, I want to make sure that the snakebite.pex command isn’t lying to me, so I try doing the same operation using the Java HDFS client.

[hcoyote@hadoopclient ~]$ export HDP_DIR=/home/hcoyote/hadoop-2.6.0-cdh5.4.9
[hcoyote@hadoopclient ~]$ export HADOOP_CONF_DIR=/etc/hadoop-confs/cdh5
[hcoyote@hadoopclient ~]$ export CDH_MR2_HOME=${HDP_DIR}
[hcoyote@hadoopclient ~]$ export JAVA_HOME=/usr/java/jdk_x64
[hcoyote@hadoopclient ~]$ hadoop-2.6.0-cdh5.4.9/bin/hadoop fs -ls
Found 4 items
drwx------ - hcoyote users         0 2016-01-12 18:34 .Trash
drwx------ - hcoyote users         0 2016-01-13 02:05 .staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users   5184637 2015-02-20 12:17 SecurityAuth-hdfs.audit.gz

Achievement Unlocked!

Some quick testing shows that the Java client consistently takes 3-4 seconds to return the directory listing, while the pex-ified Python client is an order of magnitude faster. That’s a win for both me and my users when doing simple file system operations. The bigger win is that I now have a mechanism for creating portable tools that use the snakebite library on systems that may also have conflicting dependencies, and I don’t have to mess with building out Python virtual environments to get this working.
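
If you want to reproduce that comparison on your own cluster, a small timing harness is enough. This is a minimal sketch, not the method used for the numbers above: time_command is a hypothetical helper, the two command lines are copied from the sessions earlier in this post, and the environment variables those sessions export are assumed to be set already.

```python
import os
import subprocess
import time


def time_command(argv, runs=5):
    """Run argv several times and return the mean wall-clock seconds per run."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        total += time.perf_counter() - start
    return total / runs


if __name__ == "__main__":
    candidates = {
        "pex": ["./snakebite.pex", "ls"],
        "java": ["hadoop-2.6.0-cdh5.4.9/bin/hadoop", "fs", "-ls"],
    }
    for label, argv in candidates.items():
        # Skip anything that is not present so the script runs anywhere.
        if os.path.exists(argv[0]):
            print(label, "%.2fs" % time_command(argv))
        else:
            print(label, "skipped:", argv[0], "not found")
```

Averaging over a few runs smooths out one-off effects such as a cold page cache on the first invocation.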

 
