Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch

“Every time we’re down for an hour, that’s about 2,500 people inconvenienced,” Smit said. “They’re blaming my people for it and [state IT officials] have an obligation to fix it.”

One of the things we’ve been grappling with lately is some unfortunate unplanned outages of services.  You know what those are … random event blips caused by butterflies flapping their wings in the South Pacific that stir up turbulence which creates a small wind, that then turns into a hurricane, which rampages over a submarine cable used by the crucial bit of networking that connects you with the rest of the Internet civilization.

A blip.

Sometimes they’re momentary, sometimes they’re bad.  What all blips have in common is that they affect a class of your customers in a way that inconveniences them in some manner.  The hard part of dealing with an outage is understanding and quantifying what the business impact really is.  When a database server goes out, you implicitly understand that it potentially affects all database users plus all services and users downstream that depend upon the database being up.  So how do you realistically quantify that into a valuable metric?

I bring this up because, as a person in the trenches, I’m able to better understand the impact of something (and therefore, provide a better mitigation plan) if I can understand the size, length, and number of ripples in the fabric that spread out from the blip.

At large companies, this impact may be described as thousands of dollars per minute charged against the bottom line.  Some places, like Virginia, count the number of people per hour that an outage prevented from successfully interacting with the DMV.  Websites may see it as the number of advertising impressions that don’t go out while the site is unavailable.

Whatever metric is used, it needs to be understandable and of an order of magnitude that someone can comprehend.  I understand impacting 2,500 people per hour of downtime.  I understand costing a company $1 million per minute that the factory is unable to reach its control network.  I understand an outage costing an engineering team a day’s worth of work (which can ultimately affect the bottom line due to downstream slippage in timelines).  What that metric comes down to is being able to understand, in measurable terms, how the blip impacts either people or money.
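The arithmetic behind these metrics is simple enough to sketch.  Here’s a minimal example in Python, using the 2,500-people-per-hour figure from the article; the per-user dollar cost is a hypothetical knob you’d fill in from your own business data:

```python
def outage_impact(duration_minutes, users_per_hour, cost_per_user=0.0):
    """Back-of-envelope outage impact: people affected and estimated cost.

    All inputs are assumptions you supply; this just does the multiplication.
    """
    users_affected = users_per_hour * (duration_minutes / 60.0)
    return {
        "users_affected": users_affected,
        "estimated_cost": users_affected * cost_per_user,
    }

# 90 minutes of downtime at 2,500 people inconvenienced per hour:
impact = outage_impact(duration_minutes=90, users_per_hour=2500)
print(impact["users_affected"])  # 3750.0 people affected
```

The point isn’t the code, it’s that once you pick a rate (people per hour, dollars per minute), the impact of any given blip becomes a number you can put in front of management.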

It’s important to understand these things.  Why?  Because it allows you to more adequately assess the risk of (unplanned) outages and design your environment appropriately.  If you can point to a solid metric and show how an outage materially affects people or money, it’s a lot easier to go to management and justify improvements to your environment.  If you can only say, with vague hand waving, that there’s an effect but have no data to back that up, you’re just waffling.

So.  Have you created your appropriately detailed outage impact metrics?

I haven’t.  But I’m working on it.

Travis Campbell
Staff Systems Engineer at ghostar
Travis Campbell is a seasoned Linux Systems Engineer with nearly two decades of experience managing anywhere from dozens to tens of thousands of systems in the semiconductor industry, higher education, and high-volume sites on the web. His current focus is on High Performance Computing, Big Data environments, and large scale web architectures.