Skip to main content

Design your Failure Modes

In his axiom 'Everything will ultimately fail', Michael Nygard writes that in IT systems, one must:
"Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. [....]  If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge."
I'm pretty sure that I've seen a whole bunch of systems and applications where that sort of thinking isn't on the top of the software architects or developers stack. For example, I've seen:
  • apps that spew out spurious error messages to critical logs files at a rate that makes 'tail -f' useless. Do the errors mean anything? Nope - just some unhandled exceptions. Could the app be written to handle the exceptions? Yeh, but we have deadlines...... 
  • apps that log critical application server/database connectivity error messages back to the database that caused the error. Ummm...if the app server can't connect to the database, why would you attempt to log that back to the database? Because that's how our error handler is designed. Doesn't that result in a recursive death spiral of connection errors that generate errors that get logged through the connections that are in an error state? Ummm.. let me think about that.....
  • apps that stop working when there are transient network errors, and need to be restarted to recover. Network errors are normal. Really? We never had that problem with our ISAM files!. Can you build your app to gracefully recover from them? Yeh, but we have deadlines...... 
  • apps that don't start up if there are leftover temp files from when they crashed and left temp files all over the place. Could you clean up old temp files on startup? How would I know which ones are old?

I suspect that mechanical engineers and metallurgists, when designing motorcycles, autos, and things that hurt people, have that sort of axiom embedded into their daily thought processes pretty deeply. I suspect that most software architects do not.

So the interesting question is  - if there are many failure modes, how do you determine which failure modes that you need to engineer around and which ones you can safely ignore?

On the wide area network side of things, we have a pretty good idea what the failure modes are, and it is clearly a long tail sort of problem, something like:

We've seen that circuit failures, mostly due to construction, backhoes, and other human/mechanical problems, are by far the largest cause of failures and are also the slowest to get fixed. Second place, for us, is power failures at sites without building generator/UPS, and a distant third is hardware failure.  In a case like that, if we care about availability, redundant hardware isn't anywhere near as important as redundant circuits and decent power.

Presumably each system has a large set of possible failure modes, and coming up with a rational response to the failure modes that are on the left side of the long tail is critical to building available systems, but it is important to keep in mind that not all failure modes are caused by non-animate things.

In system management, human failures, as in a human pushing the wrong button at the wrong time, are common and need to be engineered around just like mechanical or software failures. I suspect that is why we need things like change management or change control, documentation and the other no-so-fun parts of managing systems. And humans have the interesting property of being able to compound a failure by attempting to repair the problem, perhaps the reason why some form of outage/incident handling is important.

In any case, Nygard's axiom is worth the read.


  1. Michael,

    This is a really good point. In gambling, or investing, it's called "having an exit strategy".

    Like you said, once you acknowledge that your servers aren't going to last forever, you empower yourself.


Post a Comment

Popular posts from this blog

Cargo Cult System Administration

Cargo Cult: …imitate the superficial exterior of a process or system without having any understanding of the underlying substance --Wikipedia During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.
The question is, how amusing are we?We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?Here’s some common system administration failures that might be ‘cargo cult’:
Failing to understand the difference between necessary and sufficient. A daily backup …

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to…

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These …