A Crash Course in Failure

One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.

In A Crash Course in Failure, Craig Stuntz discusses the concept of building crash-only software: software for which a crash and a normal shutdown are functionally equivalent.

“Hardware will fail. Software will crash. Those are facts of life.”
“…if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself.”
“…maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts.”
“…it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.”
Why shouldn't continuous and automatic state saving be the default for all applications? A CAD system I bought in 1984 did exactly that. If the system crashed or terminated abnormally, the post-crash reboot would do a complete 'replay' of every edit since the last normal save. In fact, you'd have to sit and watch every one of your drawing edits in sequence like a VCR on fast forward, a process that was usually pretty amusing in a Keystone Cops sort of way. It can't be that hard to either append serialized changes to the end of the document and rewrite the whole document only when the user explicitly saves, or journal every change to a separate file. That CAD system did it twenty-five years ago on a 4 MHz CPU and 8" floppies. Some applications are at least attempting to recover gracefully after a crash, a step in the right direction. It certainly is not any harder than what Etherpad does, and they are doing it multi-user, in real time, on the Internet.
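To make the journaling idea concrete, here is a minimal sketch of the second approach: every edit goes to a separate journal file before it is applied, and startup simply replays whatever is in the journal. This is not how that CAD system or Etherpad actually works. The class name `JournaledDocument`, the JSON record format, the `.journal` suffix, and the two edit operations are all assumptions made up for illustration, and a real implementation would also need to sync the containing directory and handle concurrent writers.

```python
import json
import os


class JournaledDocument:
    """Hypothetical document whose edits are journaled before being applied."""

    def __init__(self, path):
        self.path = path                        # last explicit save (full snapshot)
        self.journal_path = path + ".journal"   # edits made since that save
        self.lines = []
        self.seq = 0                            # sequence number of the last applied edit
        self._recover()

    def _recover(self):
        # Startup and crash recovery are the same code path.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                snapshot = json.load(f)
            self.lines, self.seq = snapshot["lines"], snapshot["seq"]
        if not os.path.exists(self.journal_path):
            return
        good = 0  # byte length of the intact prefix of the journal
        with open(self.journal_path, "rb") as f:
            for raw in f:
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # torn final record: the crash hit mid-write
                if not raw.endswith(b"\n"):
                    break
                good += len(raw)
                # Skip edits already contained in the snapshot.
                if record["seq"] > self.seq:
                    self._apply(record)
                    self.seq = record["seq"]
        # Drop any torn tail so later appends start on a clean line.
        with open(self.journal_path, "r+b") as f:
            f.truncate(good)

    def _apply(self, record):
        if record["op"] == "append":
            self.lines.append(record["text"])
        elif record["op"] == "delete":
            del self.lines[record["index"]]

    def edit(self, op, **args):
        # Journal first and force it to disk, then change in-memory state.
        record = {"seq": self.seq + 1, "op": op, **args}
        with open(self.journal_path, "ab") as f:
            f.write(json.dumps(record).encode("utf-8") + b"\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(record)
        self.seq = record["seq"]

    def save(self):
        # Explicit save: write a full snapshot, swap it in atomically,
        # then empty the journal.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"seq": self.seq, "lines": self.lines}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        with open(self.journal_path, "wb"):
            pass
```

Opening the same path again after a simulated crash (no `save()` call) replays the journal, so `JournaledDocument(path).lines` comes back with every edit that reached the disk. The sequence numbers are there so that a crash between swapping in the snapshot and emptying the journal does not apply the same edits twice.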
“Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures. … If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge.” -- Michael Nygard

A Crash Course in Failure, Craig Stuntz
Design your Failure Modes, Michael Janke
'Everything will ultimately fail', Michael Nygard


  1. This is the very problem that journaled filesystems and transactional databases were meant to resolve.

    Somewhere along the line, these ends were twisted into crazy ZFS, the transaction-based FS that will utterly and irrevocably lose your data if you accidentally unplug the USB cord.


  2. I followed a long thread on ext4 vs. ext3 and the design decisions behind them. With ext4, it looks like they've clearly favored performance over integrity, justifying it by saying that Linux servers are up for years at a time, so if the file system caches a minute of writes, that's somehow OK.

    Based on what I read - I'd stay away from it.

    We shouldn't have to make choices like that. It's the 21st century. We should be able to have performance without compromising integrity!

  3. It's apocryphal, but I heard in the mid-90s that Irix's new XFS filesystem had some late-stage bugs marked against it for corruption-on-shutdown problems.

    This was reportedly due to the fact that the developers were only testing by yanking the power plug in dev/test, not doing a normal shutdown, and missing some corner cases.

    Take home message, I suppose, is don't design *only* for failure.


