Tuesday, August 11, 2009

A Zero Error Policy – Not Just for Backups

In What is a Zero Error Policy?, Preston de Guise articulates the need for aggressive follow-up and resolution of all backup-related errors. It’s a great read.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

and

I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

I agree. This is a great summary of an important philosophy.

Don’t apply this just to backups though. It doesn’t matter what the system is: if you ignore the little warning signs, you’ll eventually end up with a major failure. In system administration, networking and databases, there is no such thing as a ‘transient’ or ‘routine’ error, and ignoring one will not make it go away. Instead, the minor alerts, errors and events will recur as critical events at the worst possible time. If you don’t follow up on ‘routine’ errors, find their root cause and eliminate them, you’ll never have the slightest chance of improving the security, availability and performance of your systems.

I could list an embarrassing number of situations where I failed to follow up on a minor event and watched it cascade into a major, service-affecting failure. Here are a few examples:

  • A strange, undecipherable error when plugging a disk into an IBM DS4800 SAN. IBM didn’t think it was important. A week later I had a DS4800 with a hung mirrored disk set & a 6-hour production outage.
  • A pair of internal disks on a new 16-CPU IBM x460 that didn’t perform consistently in a pre-production test with IOzone. During some tests, the whole server would hang for a minute & then recover. IBM couldn’t replicate the problem. Three months later the drives on that controller started ‘disappearing’ at random intervals. After three more months, a hundred person-hours of messing around, uncounted support calls and a handful of on-site part-swapping fishing expeditions, IBM finally figured out that there was a firmware bug in their OEM’d Adaptec RAID controllers.
  • An unfamiliar-looking error on a DS4800 controller at 2am. Hmmm… doesn’t look serious, let’s call IBM in the morning. At 6am, controller zero dropped all its LUNs and the redundant controller claimed cache consistency errors. That was an 8-hour outage.

Just so you don’t think I’m picking on IBM:

  • An HA pair of Netscaler load balancers that would occasionally fail to sync their configs. During a routine config change a month later, the secondary crashed and the primary stopped passing traffic on one of the three critical apps it was front-ending. That was a two-hour production outage.
  • A production HP file server cluster, fiber channel attached to both a SAN and a tape library, that would routinely kick out tapes and mark them bad. Eventually it happened often enough that I couldn’t reliably back up the cluster. The cluster then wedged itself up a couple of times and caused production outages. The root cause? An improperly seated fiber channel connector. The tape library was trying really, really hard to warn me.

In each case there was plenty of warning of the impending failure and aggressive troubleshooting would have avoided an outage. I ignored the blinking idiot lights on the dashboard and kept driving full speed.

I still occasionally pass over minor errors, but I’m not hiding my head in the sand hoping they won’t return. I do it knowing that the error will return. I’m simply betting that when it does, I’ll have better logging, better instrumentation, and more time for troubleshooting.
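One way to place that bet deliberately is to wrap the known-flaky operation so every recurrence arrives with evidence attached. A minimal sketch; `read_mirror` and its failure are hypothetical stand-ins, not anything from the stories above:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("flaky")

def instrumented(fn):
    """Log elapsed time and full call context whenever fn fails, so the
    next recurrence of a 'minor' error comes with troubleshooting data."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            # logging.exception records the traceback along with our context.
            log.exception(
                "recurrence of known error in %s after %.3fs, args=%r kwargs=%r",
                fn.__name__, time.monotonic() - start, args, kwargs,
            )
            raise
    return wrapper

@instrumented
def read_mirror(set_id):
    # Simulated flaky operation for illustration only.
    raise IOError(f"mirror set {set_id} not responding")
```

The error still propagates; the point is that when it recurs at 2am, the log already holds the timing and arguments you’d otherwise be reconstructing from memory.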

2 comments:

  1. It is a good rule of thumb to follow. I have been in a few situations where I was strapped for resources and had recurring oddities. There were a couple that I was never able to crack, no matter what additional logging I put in place. Eventually I had to settle for predictive identification and triggered service and server reboots.

    I fought that for so long while trying to nail down what was actually causing the errors. If I had had more resources I would have swapped out hardware, but in that case I was stuck between a rock and a hard place.

  2. Good post, Michael, and yes, many times those small, niggling details that only show themselves sometimes are precursors to much larger systemic problems.

    And the hardest part about it is what Nick mentions...intermittent problems are HARD. As soon as you really get to the good part of debugging an issue, it goes away on its own.

    I had one like that last week. Every once in a while, the internet would die. Except it really wouldn’t. Not all of it, anyway. Some stuff was OK, other stuff wasn’t. I just about pulled my hair out troubleshooting, because as soon as it started, it would stop.

    Of course, like all bizarre network problems, it was rooted in DNS...sort of. It turns out that the local DNS server for the office was having hard drive problems and couldn’t read cached entries. It knew the entry was there, but it froze returning it. Eventually, something would time out and the server would return an error. Sometimes this happened before the client queried the 2nd DNS server, and sometimes not.

    It was a lot of fun. Really. :-)
