In What is a Zero Error Policy, Preston de Guise articulates the need for aggressive follow-up and resolution of all backup-related errors. It’s a great read.
> Having a zero error policy requires the following three rules:
>
> - All errors shall be known.
> - All errors shall be resolved.
> - No error shall be allowed to continue to occur indefinitely.

and

> I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.
I agree. This is a great summary of an important philosophy.
Don’t apply this just to backups, though. It doesn’t matter what the system is: if you ignore the little warning signs, you’ll eventually end up with a major failure. In system administration, networks and databases, there is no such thing as a ‘transient’ or ‘routine’ error, and ignoring them will not make them go away. Instead, the minor alerts, errors and events will recur as critical events at the worst possible time. If you don’t follow up on ‘routine’ errors, find their root cause and eliminate them, you’ll never have the slightest chance of improving the security, availability and performance of your systems.
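The first of those rules, making every error known, is the part most shops skip. As a minimal, illustrative sketch (my own, not anything from de Guise’s post), here’s one way to surface recurring log errors so they can’t be quietly ignored. The log path, severity keywords, repeat threshold and the normalize() helper are all assumptions you’d tune for your own environment:

```python
#!/usr/bin/env python3
"""Minimal sketch: surface recurring 'routine' errors from a syslog file.

Hypothetical example -- the log path, severity keywords and threshold
are assumptions, not part of the original post.
"""
from collections import Counter
import re
import sys

LOG_PATH = "/var/log/syslog"            # assumption: adjust for your platform
SEVERITIES = ("error", "crit", "warn")  # assumption: keywords worth tracking
THRESHOLD = 3                           # assumption: repeats before it's "not routine"

def normalize(line: str) -> str:
    """Collapse timestamps, PIDs and hex IDs so repeated messages group together."""
    line = re.sub(r"^\S+\s+\d+\s+[\d:]+\s+", "", line)   # strip leading syslog timestamp
    line = re.sub(r"\[\d+\]", "[pid]", line)              # mask PID in brackets
    line = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", line)      # mask hex addresses
    return line.strip()

def main() -> None:
    counts = Counter()
    with open(LOG_PATH, errors="replace") as log:
        for line in log:
            if any(sev in line.lower() for sev in SEVERITIES):
                counts[normalize(line)] += 1

    repeats = [(msg, n) for msg, n in counts.most_common() if n >= THRESHOLD]
    if not repeats:
        print("No recurring errors above threshold.")
        return
    for msg, n in repeats:
        print(f"{n:5d}  {msg}")
    sys.exit(1)   # non-zero exit so cron/monitoring treats repeats as a failure

if __name__ == "__main__":
    main()
```

The script itself isn’t the point; the point is that anything it flags gets a ticket, a root cause and a fix, instead of being waved off as routine.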
I could list an embarrassing number of situations where I failed to follow up on a minor event and had it cascade into a major, service-affecting event. Here are a few examples:
- A strange, undecipherable error when plugging a disk into an IBM DS4800 SAN. IBM didn’t think it was important. A week later I had a DS4800 with a hung mirrored disk set and a six-hour production outage.
- A pair of internal disks on a new 16-CPU IBM x460 that didn’t perform consistently in a pre-production test with IOzone. During some tests, the whole server would hang for a minute and then recover. IBM couldn’t replicate the problem. Three months later the drives on that controller started ‘disappearing’ at random intervals. After three more months, a hundred person-hours of messing around, uncounted support calls and a handful of on-site part-swapping fishing expeditions, IBM finally figured out that they had a firmware bug in their OEM’d Adaptec RAID controllers.
- An unfamiliar-looking error on a DS4800 controller at 2am. Hmmm… doesn’t look serious, let’s call IBM in the morning. At 6am, controller zero dropped all its LUNs and the redundant controller claimed cache consistency errors. That was an 8-hour outage.
Just so you don’t think I’m picking on IBM:
- An HA pair of Netscaler load balancers that would occasionally fail to sync their configs. During a routine config change a month later, the secondary crashed and the primary stopped passing traffic on one of the three critical apps that it was front-ending. That was a two-hour production outage.
- A production HP file server cluster that was fiber channel attached to both a SAN and a tape library would routinely kick out tapes and mark them bad. Eventually it happened often enough that I couldn’t reliably back up the cluster. The cluster then wedged itself up a couple of times and caused production outages. The root cause? An improperly seated fiber channel connector. The tape library was trying really, really hard to warn me.
In each case there was plenty of warning of the impending failure, and aggressive troubleshooting would have avoided an outage. I ignored the blinking idiot lights on the dashboard and kept driving full speed.
I still occasionally end up passing over a minor error, but I’m not burying my head in the sand hoping it doesn’t return. I do it knowing that the error will return. I’m simply betting that when it does, I’ll have better logging, better instrumentation, and more time for troubleshooting.