
5-whys at HubSpot: an introspective response

Ran across “Post Mortems at HubSpot: What I Learned from 250 Whys” today.  It’s a good review of HubSpot’s experience with using 5-whys to facilitate post-mortems.  The part that most caught my eye was the idea that “slow down” probably should not be the initial response to development velocity and mistakes if you don’t also consider the cost of slowing down.

Slow down is optional

In addition to forbidding a future in which we’re less stupid, it’s also useful to try to take “slow down” off the table at the outset. […] That’s not strictly a part of 5 Whys, but people tend to underestimate the economic costs of slowing down, so it’s good to make sure that you don’t retrospective your way into waterfall. HubSpot has a pretty deep cultural commitment to velocity, but even so, I’ve found it useful to say this out loud on occasion.

Considering cost is hard to do when you feel overwhelmed by the sheer amount of change occurring.  In my environment, forward progress can be fast and furious as we scale out the architecture and add people to the organization, both in Engineering and Operations.

I wouldn’t describe it as “move fast and break things.”  It’s more about figuring out the right way to enable Engineering to do its job without making Operations a choke point, inadvertent or otherwise.  Bugs happen.  Mistakes happen.

The question is: when is it proper to stop progress until a fix appears, versus incrementally moving forward while avoiding the failure-mode trigger until a fix appears?  No one likes stepping on land mines.  Let’s avoid them if we can.

In response to Root-cause vs Broadest Fix

When I lead them, I say something like “We’re not really looking for ‘root causes’ so much as identifying the areas where an improvement would prevent the broadest classes of problems going ahead.”

I find that perspective useful — instead of people debating “Which of these two causes is the real one?” you’re asking them “If we made A slightly better or B slightly better, which would prevent the broadest class of future problems?”

The idea of “broadest fix” was also useful to consider, especially from an Ops perspective, where we (for good or bad) tend to fixate on “how do we prevent this specific error/incident from happening again” instead of preventing its general class of failure mode. Eventually we get there, but I think we focus too long on the individual instances of a problem before we figure out there’s a larger or systemic issue at hand.

I’ve always liked the idea of using 5-whys for root-cause analysis.  I had a mentor who was fond of saying that, in almost all cases, you can live through one mistake; it usually takes two or more before the issue becomes cataclysmic.  The combination of multiple minor, sometimes unrelated, mistakes is what leads to a systemic failure.  Any one mistake might have been self-correcting had it occurred alone.  But considering the event as a whole, each additional mistake confers more and more risk, leading to the unhappy moment where you’re updating a status page to announce that your service is down.

It helps me craft more complete responses to incidents when I look at them from the standpoint that there may not be one underlying cause, but rather a set of contributing factors.  The correct solution for preventing the problem may require not focusing on a single mistake, but looking for commonalities among several and solving for those.

Treat your Hadoop nodes like cattle

I’ve built compute clusters of various sizes, from hundreds to tens of thousands of systems, for almost two decades now.  One of the things I learned early on is that, for compute clusters, you want to keep each system as cookie-cutter as possible.  By that, I mean there should be a minimal set of differences […]

Verify Hadoop Cluster node health with Serverspec

One of the biggest challenges I have in running Hadoop clusters is constantly validating that the health and well-being of the cluster meet my standards for operation.  Hadoop, like any large software ecosystem, is composed of many layers of technologies, starting from the physical machine, up into the operating system kernel, the distributed filesystem layer, the […]

Transparent Huge Pages on Hadoop makes me sad.

Today I (re)learned that I should pay attention to the details of Linux kernel bugs associated with my favorite distribution. Especially if I’m working on CentOS 6/Red Hat Enterprise Linux (RHEL) 6 nodes running Transparent Huge Pages on Hadoop workloads. I was investigating some performance issues on our largest Hadoop cluster related to Datanode I/O […]
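The teaser stops before the fix, but as a hedged sketch: THP state lives in sysfs, and on RHEL/CentOS 6 the path carried a vendor prefix that mainline kernels don’t use. Verify the paths on your own nodes before relying on this.

```shell
# Report (and optionally disable) Transparent Huge Pages.
# RHEL/CentOS 6 used a vendor-prefixed sysfs path; mainline kernels do not.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "$f: $(cat "$f")"
        # To disable THP (as root), uncomment:
        # echo never > "$f"
    fi
done
```

Echoing a value into these files does not survive a reboot, so a persistent fix usually also goes into a boot script or kernel command line.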

What’s in my datacenter tool kit?

Every Operations person or datacenter (DC) junkie that I know has a datacenter tool kit of some sort, containing their favorite bits of gear for doing work inside the cold, lonely world of the datacenter. Now, one would like to think that each company stocks the right tools for their folks to work with, but tools […]

HBase Motel: SPLITS check in but don’t check out

In HBase, the Master process will periodically call for the splitting of a region if it becomes too large. Normally, this happens automatically, though you can manually trigger a split. In our case, we rarely do an explicit region split by hand.

A new Master SPLIT behavior: let’s investigate

We have an older HBase cluster […]

Restarting HBase Regionservers using JSON and jq

We run HBase as part of our Hadoop cluster. HBase sits on top of HDFS and is split into two parts: the HBase Master and the HBase Regionservers. The Master coordinates which regionservers are in control of each specific region.

Automating Recovery Responses

We periodically have to do some minor maintenance and upkeep, including restarting […]
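The title hints at the approach: the HBase Master exposes its state as JSON via a JMX servlet, which jq can slice. A minimal sketch — the host, port, bean name, and attribute here are assumptions, so check your own cluster’s /jmx output:

```shell
# Query the HBase Master's JMX servlet (JSON) and extract the live
# regionserver list. Host, port, and attribute names are illustrative.
curl -s "http://hbase-master:60010/jmx" |
    jq -r '.beans[]
           | select(.name | test("Master"))
           | ."tag.liveRegionServers"? // empty'
```

From there, a restart loop can iterate over that list one regionserver at a time instead of hand-copying hostnames.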

My Hadoop cluster data needs no RAID!

One of the operational challenges in introducing Hadoop to traditional IT and Enterprise operations is understanding when to break one of our sacred IT mantras: Thou shalt always RAID your data. Never shalt thou install a system without RAID. One shall be your RAID if thou seekest performance and redundancy sparing no expense. Five shall […]
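For a taste of where that teaser is headed: HDFS datanodes typically use plain JBOD mounts rather than RAID, listing each disk separately so the datanode can spread blocks across them while replication across nodes supplies the redundancy. A sketch of the relevant hdfs-site.xml entry (mount points are illustrative):

```xml
<property>
  <!-- One plain (non-RAID) mount per physical disk; HDFS replication
       across nodes provides the redundancy RAID would otherwise supply. -->
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>
```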

Improving Hadoop datanode disk fault tolerance

By design, Hadoop is meant to tolerate failures in a responsible manner. One of those failure modes is an HDFS datanode going offline because it lost a data disk. By default, the datanode process will not tolerate any disk failures before shutting itself down. When this happens, the HDFS namenode discovers that […]
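The knob the teaser presumably builds toward exists in stock Hadoop: `dfs.datanode.failed.volumes.tolerated` in hdfs-site.xml. A hedged example — the right value depends on how many disks per node you can afford to lose:

```xml
<property>
  <!-- Keep the datanode running after losing up to 2 data disks
       instead of shutting down on the first failure (default is 0). -->
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>
```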

Rebooting Linux temporarily loses (some) limits.conf settings

In any actively managed environment, you probably have to create custom settings in your /etc/security/limits.conf because of application-specific requirements. Maybe you have to allow more open files. Maybe you have to reduce the memory allowed to a process. Or maybe you just like being ultra-hardcore in defining exactly what a user can do. […]
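For context, a typical limits.conf customization looks like the lines below (the user and values are illustrative). One caveat worth knowing, and plausibly related to the behavior in the title: pam_limits applies these settings when a PAM login session opens, so services launched directly at boot may not pick them up.

```
# /etc/security/limits.conf -- raise the open-file limit for the hdfs user
hdfs  soft  nofile  32768
hdfs  hard  nofile  65536
```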