Sunday, July 12, 2009 – A Crash Course in Failure

One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.

In A Crash Course in Failure, Craig Stuntz discusses the concept of building crash-only software – software for which a crash and a normal shutdown are functionally equivalent.

“Hardware will fail. Software will crash. Those are facts of life.”
“…if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself.”
“…maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts.”
“…it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.”
Why shouldn't continuous and automatic state saving be the default for all applications? A CAD system I bought in 1984 did exactly that. If the system crashed or terminated abnormally, the post-crash restart would do a complete 'replay' of every edit since the last normal save. In fact, you'd have to sit and watch every one of your drawing edits in sequence like a VCR on fast forward, a process that was usually pretty amusing in a Keystone Cops sort of way. It can't be that hard to append serialized changes to the end of the document and only re-write the whole document when the user explicitly saves it, or to journal every change to a separate file. That CAD system did it twenty-five years ago on a 4 MHz CPU and 8" floppies. Some applications are at least attempting to gracefully recover after a crash, a step in the right direction. It certainly is not any harder than what Etherpad does, and they are doing it multi-user, real time, on the Internet.
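The journal-every-change scheme is simple enough to sketch in a few dozen lines. This is a hypothetical illustration, not the CAD system's actual design: each edit is appended to a journal file and fsync'd before it's applied, an explicit save atomically rewrites the base file and truncates the journal, and recovery after a crash is just "load base, replay journal" – the same code path as a normal startup.

```python
import json
import os

class JournaledDocument:
    """Sketch of crash-only persistence. Edits are journaled before they
    are applied; recovery and normal startup are the same operation."""

    def __init__(self, base_path, journal_path):
        self.base_path = base_path
        self.journal_path = journal_path
        self.lines = []  # in-memory document state
        self._recover()

    def _recover(self):
        # Load the last explicit save, if any...
        if os.path.exists(self.base_path):
            with open(self.base_path) as f:
                self.lines = f.read().splitlines()
        # ...then replay every journaled edit made since that save.
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for entry in f:
                    self._apply(json.loads(entry))

    def _apply(self, edit):
        if edit["op"] == "append":
            self.lines.append(edit["text"])
        elif edit["op"] == "delete":
            del self.lines[edit["index"]]

    def edit(self, edit):
        # Journal first (append + fsync), then apply in memory. If we
        # crash after the fsync, recovery replays this edit for us.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(edit) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(edit)

    def save(self):
        # Explicit save: rewrite the base file via atomic rename,
        # then truncate the journal.
        tmp = self.base_path + ".tmp"
        with open(tmp, "w") as f:
            f.write("\n".join(self.lines))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.base_path)  # atomic on POSIX
        open(self.journal_path, "w").close()
```

Because `edit()` never touches the base file and `save()` replaces it atomically, killing the process at any point leaves either the old state plus a replayable journal, or the new state – never a half-written document.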
“Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures. … If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge.” -- Michael Nygard

A Crash Course in Failure, Craig Stuntz
Design your Failure Modes, Michael Janke
'Everything will ultimately fail', Michael Nygard


  1. This is the very problem that journaled filesystems and transactional databases were meant to resolve.

    Somewhere along the line, these ends were twisted into crazy ZFS, the transaction-based FS that will utterly and irrevocably lose your data if you accidentally unplug the USB cord.


  2. I followed a long thread on ext4 vs. ext3, and the design decisions. With ext4, it looks like they've clearly favored performance over integrity, justifying it by saying that Linux servers are up for years at a time, so if the file system caches a minute of writes, that's somehow OK.

    Based on what I read - I'd stay away from it.

    We shouldn't have to make choices like that. It's the 21st century. We should be able to have performance without compromising integrity!

  3. It's apocryphal, but I heard in the mid-90s that IRIX's new XFS filesystem had some late-stage bugs marked against it for corruption-on-shutdown problems.

    This was reportedly because the developers were only testing by yanking the power plug in dev/test, never doing a normal shutdown, and so missed some corner cases.

    Take home message, I suppose, is don't design *only* for failure.