
Docker for Mac Tips for Troubleshooting Container Problems


I’ve used Docker for Mac since the beta release opened to a wider audience. With the rapid prototyping I’m doing on Hadoop environments, it’s great for spinning up quick test environments to try out theories.

Problem: How do you access the Docker for Mac VM?

The problem with a black box is not being able to get inside easily to diagnose weird behavior. Previously, I used boot2docker on VirtualBox to run containers on my Mac. That was pretty easy to get into: you could pop open the VM’s console in the VirtualBox control panel and do whatever you needed to do.

With the Docker for Mac install, it’s a bit more opaque. I ran across the following method while digging through GitHub issues for the second problem below.

Solution: Accessing the VM


$ screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty

Pretty simple. Once I figured this out, I was able to poke around the VM easily.

For example, let’s take a quick look at the output of df (which is relevant for the next problem).

/ # df 
Filesystem           1K-blocks      Used Available Use% Mounted on
tmpfs                  1022944    161836    861108  16% /
tmpfs                   204592       192    204400   0% /run
cgroup_root              10240         0     10240   0% /sys/fs/cgroup
dev                      10240         0     10240   0% /dev
shm                    1022944         0   1022944   0% /dev/shm
/dev/vda1             19049892  17360568    698600  96% /var
/dev/vda1             19049892  17360568    698600  96% /var
osxfs                243941376 181997504  61687872  75% /Users
osxfs                243941376 181997504  61687872  75% /Volumes
osxfs                243941376 181997504  61687872  75% /tmp
osxfs                243941376 181997504  61687872  75% /private
osxfs                243941376 181997504  61687872  75% /host_docker_app
/dev/vda1             19049892  17360568    698600  96% /run/log
osxfs                243941376 181997504  61687872  75% /var/log
/dev/vda1             19049892  17360568    698600  96% /var/lib/docker/aufs

From here, we can see the /dev/vda1 volume, which maps to the underlying Qcow2 VM image. We can also see the osxfs filesystems, which are overlaid volumes mounted from the host to make it easier to pass data into and out of containers. Pretty nice. I’m not sure why we see double mounts for things like /var; I suspect it’s just an oddity in the way this VM works.
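Those osxfs mounts are what make host bind mounts work. As a quick sketch (the host directory here is a made-up example), anything under /Users can be handed straight to a container:

$ docker run --rm -v ~/hadoop-conf:/etc/hadoop/conf centos:7 ls /etc/hadoop/conf

The -v flag bind-mounts the host path into the container; since ~/hadoop-conf lives under /Users, the data rides in over osxfs.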

To exit, kill your screen session by typing Control-A then k and answering “y” at the prompt.
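If you’d rather leave the session running, detach instead with Control-A then d and reattach later:

$ screen -r

Detaching keeps the tty session alive, which is handy if you plan to hop back into the VM.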

This leads me into the second problem I was encountering today.

Problem: Filesystem full?

While building some pretty large containers for a test, I hit docker build failures complaining that a filesystem was full. The error came from yum as it installed the packages required for the container.

Error Summary
-------------
Disk Requirements:
 At least 442MB more space needed on the / filesystem.

My first inclination was to make sure my Mac wasn’t full; it wasn’t. The next step was to clear out old containers that were no longer needed. I know the Qcow2 image is pretty small at 20 gigabytes, so that doesn’t leave much space if I’ve made a few of these large builds. I cleared out a few gigabytes of old containers and their images, but it didn’t help. I kept running into these space issues despite having cleaned out everything but a base OS container.
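For the record, the cleanup was nothing exotic, just the usual commands for removing exited containers and then any dangling images:

$ docker rm $(docker ps -aq -f status=exited)
$ docker rmi $(docker images -q -f dangling=true)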

Going back to that GitHub issue, I was able to use the screen trick to get in, and found that the Docker container cache inside the VM was on a volume that was full.

/var/lib/docker/aufs # du -smx *
16717 diff
1 layers
1 mnt

The bulk of the disk usage was in /var/lib/docker/aufs. This is where the container diffs get cached. Reading through various other GitHub issues, it sounds like these can occasionally orphan themselves, which prevents them from getting cleaned up properly.

Well, that’s likely my problem here, but how do I get that cleaned up?

Solution: Garbage collect the container diffs

This is where the spotify/docker-gc tool comes in. I ran across it in some of the discussions. It looked pretty straightforward, and I was fine running it: if it screwed up some of my existing containers, I could just rebuild them. I ended up using the Docker image for this.

$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v /etc:/etc spotify/docker-gc

It dumped a bunch of output like the following.

Unable to find image 'spotify/docker-gc:latest' locally
latest: Pulling from spotify/docker-gc

2352f4f477e3: Pull complete
be723c23f85f: Pull complete
40827ff2c763: Pull complete
Digest: sha256:0a87fcd289d526ec534d23a7c82f7ea27cecafe60c650eb8b1c877ef3bfc2a88
Status: Downloaded newer image for spotify/docker-gc:latest
Container running 45832fb87d28d920cd3774618e09f1505fd9e5fd40cf1d509e037d6a10932064 /gigantic_colden
Container not running 03476687dad17a0a4ba53d0d46679711dfc43afae891fc43287d65561e3b047a /berserk_banach
Container not running 03d4297dc90d091ea19858eab097e62ab61b1e4c54b2ce707ade76f49df62264 /peaceful_roentgen
Container not running 04bf8b01bdf2f7348f22453ccf8a4b6dc485295477031b2108bd59c20f013c1a /hungry_hawking
Container not running 09cad64371cb1e3cdcfb898d4ac3736e4ee9eae53dc3a3760975ed684c32db3e /determined_lamarr
.
.
[lots of output deleted]
.
.
Removing containers fa23ff819a5bfb66e931079284106f6584c0c2c7e3e208610b8526e8c4a3e703 /pedantic_franklin
Removing containers fd073598306b884c739d2fd99d79056958ac4f4a786b305d0be72061702b7551 /naughty_swartz
Removing containers fe9250b1c4ca2512c1d1b7464ec58d40f2c457315cc39622f4ac12af5abd9c49 /goofy_hypatia
Removing image sha256:04dac6a8672c0d311fa1ccd323e5826d84320c9032761c70f98a88d72238c6cf [ansible-aws:latest]
Removing image sha256:6e5c17caa1307c4a8c5fe2f50d9e24bf3a3ec864a9bca1045571b7bb42b1b546 [xenapi:latest]
Removing image sha256:778a53015523d89bd807dab131cf9b8bb65f661ddaed8e24038817cdca42d576 [centos:latest]
Removing image sha256:bd9d6812fdd0187ffeb1737a3a601c268eeded59454186b1f9ceafd1f00e07df [quay.io/dtan4/terraforming:latest]
Removing image sha256:d0a31e3494fe61ed2ff0387cd0e71e237394c413f7a0dfca9a33b1319bab499c [centos:6]
Removing image sha256:d0e7f81ca65cdd391b6eb3dd3ce2454a575023156cd932ee4a58f188436bc5e0 [centos:7]
Removing image sha256:f753707788c5c100f194ce0a73058faae1a457774efcda6c1469544a114f8644 [ubuntu:latest]

Looking at df again showed that space utilization had dropped from 96% to 17%. Much better. Re-running my docker build showed that it succeeded again.

Clean early, clean often

From what I can tell, this problem comes up often in the Docker community. Spotify’s tool is pretty good and tries to be as safe as possible, and I was happy to use it in my test environment. If you decide to use it in yours, test it before pointing it at your production containers. You never know what might happen.
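If you want an extra guardrail, the spotify/docker-gc README documents a few environment knobs. If I’m reading it right, something like the following does a trial run and spares anything used in the last hour (double-check the variable names against the current README):

$ docker run --rm -e DRY_RUN=1 -e GRACE_PERIOD_SECONDS=3600 -v /var/run/docker.sock:/var/run/docker.sock -v /etc:/etc spotify/docker-gc

With DRY_RUN set, it only reports what it would have removed, which is a cheap sanity check before letting it loose.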
