Today I ran across “Post mortems at Hubspot: What I learned from 250 Whys.” It is a good review of HubSpot’s experience using 5-whys to facilitate post-mortems. The part that most caught my eye was the idea that “slow down” probably should not be the first response to mistakes made at high development velocity, at least not without also considering the cost of slowing down.
Slow down is optional
In addition to forbidding a future in which we’re less stupid, it’s also useful to try to take “slow down” off the table at the outset. […] That’s not strictly a part of 5 Whys, but people tend to underestimate the economic costs of slowing down, so it’s good to make sure that you don’t retrospective your way into waterfall. HubSpot has a pretty deep cultural commitment to velocity, but even so, I’ve found it useful to say this out loud on occasion.
Considering cost is hard to do when you feel overwhelmed by the sheer amount of change that occurs. In my environment, forward progress can be fast and furious as we scale out the architecture and add people to the organization, both in Engineering and Operations.
I wouldn’t describe it as “move fast and break things.” It’s more about figuring out the right way to enable Engineering to do its job without making Operations a choke point, inadvertent or otherwise. Bugs happen. Mistakes happen.
The question is when it is proper to stop progress until a fix appears, versus moving forward incrementally and avoiding the trigger for the failure mode until a fix appears. No one likes stepping on land mines. Let’s avoid them if we can.
In response to Root-cause vs Broadest Fix
When I lead them, I say something like “We’re not really looking for ‘root causes’ so much as identifying the areas where an improvement would prevent the broadest classes of problems going ahead.”
I find that perspective useful — instead of people debating “Which of these two causes is the real one?” you’re asking them “If we made A slightly better or B slightly better, which would prevent the broadest class of future problems?”
The idea of “broadest fix” was also useful to consider, especially from an Ops perspective, where we (for good or bad) tend to fixate on “how do we prevent this specific error/incident from happening again” instead of preventing its general class of failure mode. Eventually we get there, but I think we focus too long on the individual instances of a problem before we figure out there’s a larger or systemic issue at hand.
I’ve always liked the idea of using 5-whys for doing root-cause analysis. I had a mentor who was always saying that, in almost all cases, you can live through one mistake, and that it often takes two or more before the issue becomes cataclysmic. The combination of multiple minor, sometimes unrelated, mistakes is what leads to a systemic failure. Any one mistake might have been self-correcting had it occurred alone. But when you consider the event as a whole, each additional mistake adds more risk, leading to the unhappy moment where you’re updating a status page to say that your service is down.
It helps me think of more complete responses to incidents if I can look at them from the standpoint that there may not be one underlying cause, but a set of contributing factors. The correct solution for preventing the problem may require not focusing on a single mistake, but looking for commonalities among the mistakes and solving for those.
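One rough way to look for those commonalities is to tag each incident with its contributing factors and count how many incidents each factor shows up in. The sketch below is only an illustration in Python: the incident names, the factor labels, and the `rank_factors` helper are all made up for this example and do not come from the HubSpot article or from any real post-mortem data.

```python
from collections import Counter

# Hypothetical incidents, each tagged with the contributing factors
# identified in its post-mortem. All names here are invented.
incidents = {
    "incident-a": {"no config validation", "manual deploy step"},
    "incident-b": {"missing alert", "no config validation"},
    "incident-c": {"manual deploy step", "no config validation", "stale runbook"},
}


def rank_factors(incidents):
    """Return (factor, incident_count) pairs, broadest factor first."""
    counts = Counter()
    for factors in incidents.values():
        # Each factor counts once per incident it contributed to.
        counts.update(factors)
    # Sort by count descending, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_factors(incidents)
for factor, count in ranking:
    print(f"{count} incident(s): {factor}")
```

A raw count like this says nothing about severity or about how expensive each fix would be, so at best it is a starting point for the “which improvement prevents the broadest class of problems” conversation, not a replacement for it.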