
Containing a snakebite with python pex

I’ve used Hadoop for several years now. One of the most frustrating parts of using Hadoop is the time it takes to start up the Java HDFS client to run simple tasks. Even listing a directory can take several seconds because of the startup cost of launching the JVM.

In 2013, Spotify open sourced a pure Python implementation of the HDFS client called snakebite. You can find it on Spotify’s GitHub. Exciting times, for sure. What was a series of slow, rickety (and error-prone) shell scripts wrapped around the hadoop fs commands could be turned into workable Python. The only downside was that the first implementation of snakebite did not support Kerberos. This was a major drawback for those of us whose security requirements mandate Kerberized Hadoop environments.
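As a rough sketch of what that "workable Python" can look like, the snippet below lists a directory through the snakebite library instead of shelling out to hadoop fs. It assumes snakebite's Client class and its ls() generator, which yields one dict per entry; the helper names (format_entry, list_directory, make_client) and the namenode host are illustrative, not taken from any real script.

```python
def format_entry(entry):
    """Render one ls() result dict as a short line.

    Assumes the dict carries 'path', 'length' and 'file_type' keys,
    with 'file_type' set to 'd' for directories.
    """
    kind = "d" if entry.get("file_type") == "d" else "-"
    return "{0} {1} {2}".format(kind, entry.get("length", 0), entry["path"])


def list_directory(client, path):
    """Return formatted lines for everything directly under path.

    `client` is anything with an ls(list_of_paths) method, such as a
    snakebite Client.
    """
    return [format_entry(entry) for entry in client.ls([path])]


def make_client(host, port=8020):
    """Build a snakebite client for the given namenode (hypothetical helper)."""
    # Imported lazily so the helpers above work without snakebite installed.
    from snakebite.client import Client
    return Client(host, port, use_trash=False)
```

With a reachable namenode, something like list_directory(make_client("namenode.example.com"), "/user") would return one line per entry; the hostname here is a placeholder.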

The coming of snakebite

A few months ago, Spotify released Kerberos support for snakebite. Happy days for Hadoop Operations folks like me who have to deal with hacky scripts. One of the first things I did was figure out how to create a reliable and reproducible install of snakebite on my Cloudera Hadoop cluster.

I could go the simple route and do the pip install thing, but that would require installing all the development libraries and tool chains necessary for compiling the Kerberos backend in snakebite. For what amounts to a one-time install on each system, that would leave a significant amount of otherwise unnecessary software on the cluster.

The second route was to use fpm and build RPMs that could be used both on my Hadoop cluster and on other systems in my environment. This was pretty simple (and the basis for my recent fpm and docker post). I added the RPMs to my yum repo, built a quick puppet module to load them as needed and …

… ran into an operational failure when attempting to install them on servers that happened to run the Mesosphere Mesos RPMs.

Mesosphere, we have a problem!

Looking at the failed install of my snakebite RPMs, I found the following:

file /usr/lib/python2.6/site-packages/google/protobuf/text_format.py 
  from install of python-protobuf-3.0.0b2-1.noarch conflicts with file 
  from package mesos-0.23.1-0.2.61.centos65.x86_64
file /usr/lib/python2.6/site-packages/google/protobuf/text_format.pyc 
  from install of python-protobuf-3.0.0b2-1.noarch conflicts with file 
  from package mesos-0.23.1-0.2.61.centos65.x86_64

It turns out that Mesos includes its own copy of the Python protobuf library, installing it into the Python site-packages directory. But snakebite also wants its own version of protobuf installed (from a separate python-protobuf RPM that I built). Two packages cannot own the same files in site-packages, so this presents a significant problem, and not one I want to attempt to resolve with virtualenv.
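One quick way to see which copy of a module a given Python would actually pick up is to ask the import system where it resolves the name. The sketch below uses the standard library's importlib for that. locate_module is a hypothetical helper name, and the code is written for Python 3 rather than the Python 2.6 shown in the paths above.

```python
import importlib.util


def locate_module(name):
    """Return the file a module would be imported from, or None if absent."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package (for example "google") is missing.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Prints the path of the protobuf copy this interpreter sees, or None.
    print(locate_module("google.protobuf.text_format"))
```

Running this on a host with the Mesos package installed would show whether the protobuf on the import path is the one under site-packages that the RPM conflict is about.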

Thankfully, I brought this issue up with one of my coworkers who suggested I try using Python pex.

What is pex?

pex is a tool for Python that implements the Python EXecutable environment. The easiest way to think of these is like a Python virtualenv equivalent of a Java JAR or WAR file: it’s a compressed copy of everything needed to run a self-contained tool or app. Plus, it can be created in a way that is portable across operating systems. Neat, huh?

Why use pex?

One of the driving ideas behind a pex file is the ability to easily deploy a bundle of code with something as simple as a /bin/cp. You get an isolated, executable environment without a lot of dependency fuss to go through.
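pex files build on an interpreter feature: Python can execute a zip archive that contains a __main__.py. As a rough illustration of that mechanism, the sketch below uses the standard library's zipapp module (not pex itself) to pack a directory into a single archive and run it. Unlike pex, zipapp does not resolve or bundle third-party dependencies for you. The file names and the build_and_run_archive helper are made up for the example.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_and_run_archive():
    """Pack a one-file app into a .pyz archive, run it, return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "app"
        source.mkdir()
        # __main__.py is what the interpreter runs when handed the archive.
        (source / "__main__.py").write_text(
            'print("hello from inside the archive")\n'
        )
        target = Path(tmp) / "app.pyz"
        # The interpreter argument becomes the shebang line of the archive.
        zipapp.create_archive(source, target, interpreter="/usr/bin/env python3")
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run_archive())
```

The resulting app.pyz is a single file that can be copied anywhere a compatible Python exists, which is the same deployment story pex extends to full dependency sets.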

If you still don’t understand, there is a great lightning talk from Twitter called WTF is PEX; it’s about 15 minutes long and breaks down the important parts of pex.

Prepping for pex

In an earlier post, I talked about using docker to create clean build environments using fpm. I re-use that docker image here because of how simple it is to spin up and add my build dependencies to it.

In this case, because the pex command is not already installed, I add it into the running docker container, then add in the development libraries necessary for building the snakebite library and its Kerberos dependencies.

$ docker run -ti -v /tmp/fpmbuild:/tmp/fpmbuild fpm-centos6
[root@b508004a7709 fpmbuild]# pip install pex
[root@b508004a7709 fpmbuild]# yum install -y python-devel cyrus-sasl-devel krb5-devel

Running pex on snakebite

Here, I’m going to build a pex-ified version of the snakebite library.  A few things to note:

  1. I need to build snakebite with Kerberos support.
  2. To get Kerberos support, you would normally do pip install snakebite[kerberos].
    1. The [kerberos] string ends up being a build option, or “extra” for snakebite’s setup.py
  3. I want to build a pex-ified version of the snakebite command that comes with the snakebite library, so I tell pex to use that script as the entry point.

[root@b508004a7709 fpmbuild]# pex -v 'snakebite[kerberos]' -o snakebite.pex -c snakebite
pex: Building pex
pex: Building pex :: Resolving distributions
pex: Building pex :: Resolving distributions :: Packaging snakebite
pex: Building pex :: Resolving distributions :: Packaging sasl
pex: Building pex :: Resolving distributions :: Packaging python-krbV
  argparse 1.4.0
  snakebite 2.7.2
  six 1.10.0
  sasl 0.1.3
  python-krbV 1.0.90
  protobuf 3.0.0b2
  setuptools 19.2
pex: Building pex: 124009.6ms
pex:   Resolving distributions: 123946.4ms
pex:       Packaging snakebite: 16495.9ms
pex:       Packaging sasl: 17479.8ms
pex:       Packaging python-krbV: 17198.1ms
Saving PEX file to snakebite.pex
[root@b508004a7709 fpmbuild]#

The options and arguments passed to pex do the following:

  • turn on verbosity:  -v
  • the requirement to resolve and bundle, here snakebite with its kerberos extra: snakebite[kerberos]
  • the name of the pex output file: -o snakebite.pex
  • the name of the script to use as the default entry point for the pex file: -c snakebite
    • this is the script that will run when you run snakebite.pex
    • pex looks this name up among the scripts and console scripts provided by the bundled distributions, so it is the plain command name with no .py suffix.
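To keep the build repeatable, the same invocation can be assembled in a small Python helper and then handed to subprocess or a build script. This is only a sketch: pex_build_command is a hypothetical name, and it uses just the flags shown above (-v, -o, -c).

```python
import shlex


def pex_build_command(requirement, output, script, verbose=True):
    """Return the argv list for building a pex with one entry-point script."""
    argv = ["pex"]
    if verbose:
        argv.append("-v")
    argv += [requirement, "-o", output, "-c", script]
    return argv


command = pex_build_command("snakebite[kerberos]", "snakebite.pex", "snakebite")
# shlex.join quotes the bracketed requirement so the line is safe to paste
# into a shell.
print(shlex.join(command))
```

Passing the list form to subprocess.run(command, check=True) inside the build container avoids shell quoting issues with the square brackets altogether.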

Testing out the pex-ified snakebite

Once snakebite.pex was built, I tested it to see what it was doing. I expected it to behave the same as the snakebite command that comes with the library, which sets up a Python HDFS client and lets you do file operations. You can see that I did a directory listing below.

[hcoyote@hadoopclient ~]$ export HADOOP_CONF_DIR=/etc/hadoop-confs/cdh5/
[hcoyote@hadoopclient ~]$ ./snakebite.pex ls
Found 4 items
drwx------ - hcoyote users         0 2016-01-12 18:34 /user/hcoyote/.Trash
drwx------ - hcoyote users         0 2016-01-13 01:58 /user/hcoyote/.staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 /user/hcoyote/2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users   5184637 2015-02-20 12:17 /user/hcoyote/SecurityAuth-hdfs.audit.gz

Next, I want to make sure that the snakebite.pex command isn’t lying to me, so I try doing the same operation using the Java HDFS client.

[hcoyote@hadoopclient ~]$ export HDP_DIR=/home/hcoyote/hadoop-2.6.0-cdh5.4.9
[hcoyote@hadoopclient ~]$ export HADOOP_CONF_DIR=/etc/hadoop-confs/cdh5
[hcoyote@hadoopclient ~]$ export CDH_MR2_HOME=${HDP_DIR}
[hcoyote@hadoopclient ~]$ export JAVA_HOME=/usr/java/jdk_x64
[hcoyote@hadoopclient ~]$ hadoop-2.6.0-cdh5.4.9/bin/hadoop fs -ls
Found 4 items
drwx------ - hcoyote users         0 2016-01-12 18:34 .Trash
drwx------ - hcoyote users         0 2016-01-13 02:05 .staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users   5184637 2015-02-20 12:17 SecurityAuth-hdfs.audit.gz

Achievement Unlocked!

Some quick testing shows that the Java client consistently takes 3-4 seconds to return the directory listing, while the pex-ified Python client is an order of magnitude faster. That’s a win for both me and my users when doing simple file system operations. The bigger win is that I now have a mechanism for creating portable tools that use the snakebite library on systems that may have conflicting dependencies, without having to build out Python virtual environments to get this working.
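For anyone who wants to repeat that kind of quick comparison, a small timing harness along these lines would do. It is a sketch, not the method used for the numbers above: time_command is a hypothetical helper that reports the best wall-clock time over a few runs, and the commands in the comments are the two invocations from this post.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Return the best wall-clock time in seconds over `runs` executions."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Baseline: how long a bare Python interpreter takes to start and exit.
    print("python startup: {0:.3f}s".format(
        time_command([sys.executable, "-c", "pass"])))
    # On a Hadoop client host, compare for example:
    #   time_command(["./snakebite.pex", "ls"])
    #   time_command(["hadoop-2.6.0-cdh5.4.9/bin/hadoop", "fs", "-ls"])
```

Taking the best of several runs keeps one-off noise, such as a cold disk cache, from skewing the comparison.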

 

Travis Campbell
Staff Systems Engineer at ghostar
Travis Campbell is a seasoned Linux Systems Engineer with nearly two decades of experience, ranging from dozens to tens of thousands of systems in the semiconductor industry, higher education, and high volume sites on the web. His current focus is on High Performance Computing, Big Data environments, and large scale web architectures.