Tuesday, August 11, 2009

A Zero Error Policy – Not Just for Backups

In What is a Zero Error Policy, Preston de Guise articulates the need for aggressive follow-up and resolution on all backup-related errors. It’s a great read.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

and

I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

I agree. This is a great summary of an important philosophy.

Don’t apply this just to backups, though. Whatever the system is, if you ignore the little warning signs, you’ll eventually end up with a major failure. In system administration, networks and databases, there is no such thing as a ‘transient’ or ‘routine’ error, and ignoring them will not make them go away. Instead, the minor alerts, errors and events will recur as critical events at the worst possible time. If you don’t follow up on ‘routine’ errors, find their root cause and eliminate them, you’ll never have the slightest chance of improving the security, availability and performance of your systems.
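The three rules quoted above can be sketched as a small ledger. This is only an illustration, not anything from Preston's article: the ErrorLedger class, its method names and the max_repeats threshold (my stand-in for "indefinitely") are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ErrorLedger:
    """Hypothetical tracker for the three zero error rules."""

    max_repeats: int = 3  # assumed cut-off for "continuing indefinitely"
    seen: Counter = field(default_factory=Counter)
    resolved: set = field(default_factory=set)

    def record(self, signature: str) -> None:
        # Rule 1: every error is counted, however minor it looks.
        self.seen[signature] += 1
        # A resolved error that comes back is open again.
        self.resolved.discard(signature)

    def resolve(self, signature: str) -> None:
        # Rule 2: mark an error resolved once its root cause is fixed.
        self.resolved.add(signature)

    def unresolved(self) -> list[str]:
        # Everything seen that has not been resolved since it last occurred.
        return sorted(s for s in self.seen if s not in self.resolved)

    def chronic(self) -> list[str]:
        # Rule 3: open errors that have occurred max_repeats times or more.
        return sorted(
            s for s in self.unresolved() if self.seen[s] >= self.max_repeats
        )


ledger = ErrorLedger(max_repeats=2)
ledger.record("tape marked bad")
ledger.record("config sync failed")
ledger.record("tape marked bad")
ledger.resolve("config sync failed")
print(ledger.unresolved())
print(ledger.chronic())
```

The point of the design is that nothing drops off the list by being ignored: an error leaves the unresolved list only when someone resolves it, and it comes straight back if it recurs.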

I could list an embarrassing number of situations where I failed to follow up on a minor event and had it cascade to a major, service-affecting event. Here are a few examples:

  • A strange, undecipherable error when plugging a disk into an IBM DS4800 SAN. IBM didn’t think it was important. A week later I had a DS4800 with a hung mirrored disk set and a six-hour production outage.
  • A pair of internal disks on a new IBM 16 CPU x460 that didn’t perform consistently in a pre-production test with IoZone. During some tests, the whole server would hang for a minute and then recover. IBM couldn’t replicate the problem. Three months later the drives on that controller started ‘disappearing’ at random intervals. After three more months, a hundred person-hours of messing around, uncounted support calls and a handful of on-site part-swapping fishing expeditions, IBM finally figured out that they had a firmware bug in their OEM’d Adaptec RAID controllers.
  • An unfamiliar-looking error on a DS4800 controller at 2am. Hmmm… doesn’t look serious, let’s call IBM in the morning. At 6am, controller zero dropped all its LUNs and the redundant controller claimed cache consistency errors. That was an eight-hour outage.

Just so you don’t think I’m picking on IBM:

  • An HA pair of NetScaler load balancers that occasionally would fail to sync their configs. During a routine config change a month later, the secondary crashed and the primary stopped passing traffic on one of the three critical apps that it was front-ending. That was a two-hour production outage.
  • A production HP file server cluster that was Fibre Channel attached to both a SAN and a tape library would routinely kick out tapes and mark them bad. Eventually it happened often enough that I couldn’t reliably back up the cluster. The cluster then wedged itself up a couple of times and caused production outages. The root cause? An improperly seated Fibre Channel connector. The tape library was trying really, really hard to warn me.

In each case there was plenty of warning of the impending failure, and aggressive troubleshooting would have avoided an outage. I ignored the blinking idiot lights on the dashboard and kept driving full speed.

I still end up occasionally passing over minor errors, but I’m not hiding my head in the sand hoping they don’t return. I do it knowing that the errors will return. I’m simply betting that when they do, I’ll have better logging, better instrumentation, and more time for troubleshooting.

Tuesday, August 4, 2009

Content vs. Style - modern document editing

On Ars Technica, Jeremy Reimer shares some great thoughts on how we use word processing.

His description of modern document editing:

Go into any office today and you'll find people using Word to write documents. Some people still print them out and file them in big metal cabinets to be lost forever, but again this is simply an old habit, like a phantom itch on a severed limb. Instead of printing them, most people will email them to their boss or another coworker, who is then expected to download the email attachment and edit the document, then return it to them in the same manner. At some point the document is considered "finished", at which point it gets dropped off on a network share somewhere and is then summarily forgotten...
We use an application that was optimized to format printed documents in a world where printing is irrelevant, and our ‘document versioning’ is managed by the timestamps on the e-mail messages that we used to ‘collaborate’ on writing the document. What a mess, yet it's our perverse idea of what technology should be in the 21st century.

I'm sold on the idea of:
  • online collaborative editing of documents
  • minimal formatting
  • continuous versioning
In other words, I like wikis. Some of my wiki docs are a decade old. I can find them. I can revert them back a decade if I want. I can rely on them in a DR event. I know who changed them & when they changed. I know what they contained before they were changed. They have bold, italics and headline fonts. I'm happy.

I'm even happier after I delete the hundred-odd useless fonts that come with my computers. I figure one or two each of serif, sans-serif and monospace is more than adequate. If I see more than a handful in the drop-down font menu, I'm annoyed enough to start deleting them. We can thank Apple for that mess. The really cool people who bought early Macs needed to show off their GUI text editors by printing docs with six different fonts on a page (on a really crappy dot-matrix printer). It took them a while to figure out that it’s the content, not the style.

I'm really amused when archaic processes are updated by superficially skinning them over with technology.

True story, happens all the time:
  1. Senior manager with long title dictates memo to clerical staff.
  2. Clerical staff types memo in word processing software.
  3. Clerical staff prints memo.
  4. Senior manager signs memo.
  5. Clerical staff scans signed memo and saves as a PDF.
  6. Clerical staff e-mails memo to staff with subject line 'Please read attached memo from senior manager with long title'.
Someone isn't getting this whole technology thing. If the message from the senior manager with long title was really important, I'd have thought that it'd be in the opening paragraph of an e-mail from the senior manager with long title directly to the interested parties. If it were, I'd have read it instead of deleting it. It's the content that matters, not the container.

Equally amusing are the vast resources that we spend making web sites look pretty. It seems to me that the focus on a web site should be something like:
  1. world class content
  2. decent writing style and readability
  3. make it look pretty
Instead we do something like:
  1. make it look pretty
  2. game the search engines
  3. optimize for ad revenue
  4. generate content (optional)
If you want me to read your content, don't waste your time making your site look pretty. I'll likely use a formatting tool to strip all that prettiness out anyway. That is – of course – if you have any interesting content amid all that prettiness.