Data Loss, Speculation & Rumors

My head wants to explode. Not because of Microsoft's catastrophic data loss. That's a major failure that should precipitate a significant number of RGE's (resume generating events) at Microsoft.

My head wants to explode because of the speculation, rumors and misinformation surrounding the failure. An example: the rumors reported by Daniel Eran Dilger in "Microsoft's Sidekick/Pink problems blamed on dogfooding and sabotage", which point to intentional data loss or sabotage.

"Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure."

and

"the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a disgruntled employee."

and

"someone with access to the servers at the datacenter must have inserted a time bomb to wipe out not just all of the data, but also all of the backup tapes, and finally, I suspect, reformatting the server hard drives so that the service itself could not be restarted with a simple reboot (and to erase any traces of the time bomb itself)."

Intentional sabotage?

How about some rational speculation, without black helicopters & thermite charges? How about simple mismanagement? How about failed backup jobs followed by a failed SAN upgrade? I've had both backup and SAN failures in the past & fully expect to have both again before I retire. Failed backups are common. Failed SAN's and failed SAN upgrades are less common, but likely enough that they must be accounted for in planning.

Let's assume, as Dan speculates, that it's an Oracle RAC cluster. If all the RAC data is on a single SAN, with no replication or backups to separate media on separate controllers, then a simple human error configuring the SAN can easily result in spectacular failure. If there is no copy of the Oracle data on separate media & separate controllers on a separate server, you CAN lose data in a RAC cluster. RAC doesn't magically protect you from SAN failure. RAC doesn't magically protect you from logical database corruption or DBA fat fingering. All the bits are still stored on a SAN, and if the SAN fails or the DBA fails, the bits are gone. Anyone who doesn't think that's true hasn't driven a SAN or a database for a living.
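To make that concrete, here's a minimal sketch of the "separate media, separate controllers, separate server" check. The file names and inventory data are invented for illustration; in practice you'd build the list from your SAN management and backup reporting tools.

```python
# Hypothetical sketch: flag database files whose every known copy lives on the
# same storage array. The inventory below is invented for illustration.

# file -> list of (array, controller, media) tuples for every copy we know about
copies = {
    "users01.dbf":  [("san-A", "ctrl-0", "disk"), ("san-A", "ctrl-1", "disk")],
    "system01.dbf": [("san-A", "ctrl-0", "disk"), ("tape-lib-1", "drive-2", "tape")],
}

def single_points_of_failure(copies):
    """Return the files that have no copy outside their primary array."""
    risky = []
    for fname, locations in copies.items():
        arrays = {array for array, _ctrl, _media in locations}
        if len(arrays) < 2:   # every copy depends on a single array
            risky.append(fname)
    return risky

for f in single_points_of_failure(copies):
    print(f"{f}: every copy lives on one array - a single SAN failure loses this file")
```

RAC protects you from losing a node; only a copy that lives on different hardware protects you from losing the array.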

We've owned two IBM DS4800's for the last four years and have had three controller-related failures that could have resulted in data loss had we not taken the right steps at the right time (with IBM advanced support looking over our shoulders). A simple thing like mismatched firmware somewhere between the individual drives and the controllers, or the HBA's and the operating system, has the potential to cause data loss. Heck - because SAN configs can be stored on the individual drives themselves, plugging a used drive with someone else's SAN config into your SAN can cause SAN failure - or at least catastrophic, unintentional SAN reconfiguration.

I've got an IBM doc (sg246363) that says:

"Prior to physically installing new hardware, refer to the instructions in IBM TotalStorage DS4000 hard drive and Storage Expansion Enclosure Installation and Migration Guide, GC26-7849, available at:  [...snip...] Failure to consult this documentation may result in data loss, corruption, or loss of availability to your storage."

Does that imply that plugging the wrong thing into the wrong place at the wrong time can 'f up an entire SAN? Yep. It does, and from what the service manager for one of our vendors told me, it did happen recently – to one of the local Fortune 500’s.

If you don't buy that speculation, how about a simple misunderstanding between the DBA and the SAN team?

DBA: "So these two LUN's are on separate controllers, right?

SAN: "Yep."

Does anyone work at a place where that doesn't happen?
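One way to take the guesswork out of that conversation is to verify the mapping instead of asking. A purely hypothetical sketch, with an invented LUN-to-controller map; the real data would come from your storage vendor's CLI or management interface.

```python
# Hypothetical sketch: confirm that two LUNs really sit behind different
# controllers instead of trusting a verbal "Yep." The map below is invented.
lun_map = {
    "oracle_redo_a": "controller-0",
    "oracle_redo_b": "controller-0",   # oops - both LUNs behind one controller
}

def on_separate_controllers(lun_map, lun_x, lun_y):
    """True only if the two LUNs map to different controllers."""
    return lun_map[lun_x] != lun_map[lun_y]

if not on_separate_controllers(lun_map, "oracle_redo_a", "oracle_redo_b"):
    print("Both LUNs are behind controller-0 - not the redundancy the DBA was promised.")
```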

As for the 'insider' quote: "I don't understand why they would be spending any time upgrading stuff", a simple explanation would be that somewhere on the SAN, out of the hundreds of attached servers, a high-profile project needed the latest SAN controller firmware or feature. Let's say, for example, that you wanted to plug a shiny new Windows 2008 server into the fabric and present a LUN from an older SAN. You'd likely have to upgrade the SAN firmware. To have a supported configuration, there is a fair chance that you'd have to upgrade HBA firmware and HBA drivers on all SAN attached servers. The newest SAN controller firmware that's required for 'Project 2008' then forces an across-the-board upgrade of all SAN attached servers, whether they are related to 'Project 2008' or not. It's not like that doesn't happen once a year or so, and it's the reason that SAN vendors publish 'compatibility matrices'.
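Here's a rough sketch of how one project's requirement fans out across every attached host. The matrix entries, host names and version numbers are all invented for illustration; the real ones would come from the vendor's published compatibility matrix and your own inventory.

```python
# Hypothetical sketch: check every SAN-attached host's HBA firmware and driver
# against the versions the new controller firmware supports. All versions and
# host names below are made up.
supported = {
    "hba_firmware": {"2.72", "2.80"},
    "hba_driver":   {"9.1.0", "9.2.1"},
}

hosts = {
    "win2008-new":   {"hba_firmware": "2.80", "hba_driver": "9.2.1"},
    "erp-db-prod":   {"hba_firmware": "2.10", "hba_driver": "8.0.3"},  # unrelated to 'Project 2008'
    "file-cluster1": {"hba_firmware": "2.72", "hba_driver": "9.1.0"},
}

def out_of_matrix(hosts, supported):
    """Map each non-compliant host to the components it must upgrade first."""
    result = {}
    for host, versions in hosts.items():
        bad = [part for part, ver in versions.items() if ver not in supported[part]]
        if bad:
            result[host] = bad
    return result

for host, parts in out_of_matrix(hosts, supported).items():
    print(f"{host}: upgrade {', '.join(parts)} before the new controller firmware goes in")
```

One new server on the fabric, and suddenly an unrelated production database host is on the upgrade list.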

And upgrades sometimes go bad.

We had an HP engineer end up hospitalized a couple days before a major SAN upgrade. A hard deadline prevented us from delaying the upgrade. The engineer's manager flew in at the last minute, reviewed the docs, did the upgrade - badly. Among other f'ups, he presented a VMS LUN to a Windows server. The Windows server touched the LUN & scrambled the VMS file system. A simple error, a catastrophic result. Had that been a RAC LUN, the database would have been scrambled. It happened to be a VMS LUN that was recoverable from backups, so we survived.

Many claim this is a cloud failure. I don't. As far as I can see, it's a service failure, plain and simple, independent of how it happens to be hosted. If the data was stored on an Amazon, Google or Azure cloud, and if the Amazon, Google or Azure cloud operating system or storage software scrambled and/or lost the data, then it'd be a cloud failure. The data appears to have been on ordinary servers in an ordinary database on an ordinary SAN.

That makes this an ordinary failure.

A Zero Error Policy – Not Just for Backups

In "What is a Zero Error Policy", Preston de Guise articulates the need for aggressive follow-up and resolution of all backup-related errors. It's a great read.

"Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely."

and

"I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection."

I agree. This is a great summary of an important philosophy.

Don't apply this just to backups though. It doesn't matter what the system is: if you ignore the little warning signs, you'll eventually end up with a major failure. In system administration, networks and databases, there are no 'transient' or 'routine' errors, and ignoring them will not make them go away. Instead, the minor alerts, errors and events will recur as critical events at the worst possible time. If you don't follow up on 'routine' errors, find their root cause and eliminate them, you'll never have the slightest chance of improving the security, availability and performance of your systems.
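As a rough illustration of what the three rules above might look like in practice, here's a minimal, hypothetical sketch that treats every unresolved error as an open item and escalates anything that keeps recurring. The event source, error signatures and threshold are all invented.

```python
# Hypothetical sketch of a zero-error policy applied to routine job results:
# every error stays open until resolved, and any signature that keeps recurring
# without a root cause gets escalated. Data and threshold are invented.
from collections import Counter

ESCALATE_AFTER = 3  # recurrences of the same unresolved error before escalating

events = [
    {"source": "backup", "signature": "tape I/O error drive 2", "resolved": False},
    {"source": "san",    "signature": "controller 0 cache warning", "resolved": False},
    {"source": "backup", "signature": "tape I/O error drive 2", "resolved": False},
    {"source": "backup", "signature": "tape I/O error drive 2", "resolved": False},
]

unresolved = Counter(e["signature"] for e in events if not e["resolved"])

for signature, count in unresolved.items():
    # Rule 1: the error is known. Rule 2: it stays open until someone resolves it.
    print(f"OPEN: {signature} ({count} occurrences)")
    # Rule 3: no error is allowed to keep occurring indefinitely.
    if count >= ESCALATE_AFTER:
        print(f"  -> escalate: '{signature}' keeps recurring without a root cause")
```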

I could list an embarrassing number of situations where I failed to follow up on a minor event and had it cascade into a major, service-affecting event. Here are a few examples:

  • A strange, undecipherable error when plugging a disk into an IBM DS4800 SAN. IBM didn't think it was important. A week later I had a DS4800 with a hung mirrored disk set & a 6 hour production outage.
  • A pair of internal disks on a new IBM 16 CPU x460 that didn't perform consistently in a pre-production test with IOzone. During some tests, the whole server would hang for a minute & then recover. IBM couldn't replicate the problem. Three months later the drives on that controller started 'disappearing' at random intervals. After three more months, a hundred person-hours of messing around, uncounted support calls and a handful of on-site part-swapping fishing expeditions, IBM finally figured out that they had a firmware bug in their OEM'd Adaptec RAID controllers.
  • An unfamiliar looking error on a DS4800 controller at 2am. Hmmm… doesn't look serious, let's call IBM in the morning. At 6am, controller zero dropped all its LUN's and the redundant controller claimed cache consistency errors. That was an 8 hour outage.

Just so you don’t think I’m picking on IBM:

  • An HA pair of Netscaler load balancers that occasionally would fail to sync their configs. During a routine config change a month later, the secondary crashed and the primary stopped passing traffic on one of the three critical apps that it was front-ending. That was a two hour production outage.
  • A production HP file server cluster that was fiber channel attached to both a SAN and a tape library would routinely kick out tapes and mark them bad. Eventually it happened often enough that I couldn't reliably back up the cluster. The cluster then wedged itself up a couple of times and caused production outages. The root cause? An improperly seated fiber channel connector. The tape library was trying really, really hard to warn me.

In each case there was plenty of warning of the impending failure and aggressive troubleshooting would have avoided an outage. I ignored the blinking idiot lights on the dashboard and kept driving full speed.

I still end up occasionally passing over minor errors, but I'm not hiding my head in the sand hoping they don't return. I do it knowing that the errors will return. I'm simply betting that when they do, I'll have better logging, better instrumentation, and more time for troubleshooting.

Content vs. Style - modern document editing

On Ars Technica, Jeremy Reimer shares some great thoughts on how we use word processors.

His description of modern document editing:
"Go into any office today and you'll find people using Word to write documents. Some people still print them out and file them in big metal cabinets to be lost forever, but again this is simply an old habit, like a phantom itch on a severed limb. Instead of printing them, most people will email them to their boss or another coworker, who is then expected to download the email attachment and edit the document, then return it to them in the same manner. At some point the document is considered "finished", at which point it gets dropped off on a network share somewhere and is then summarily forgotten..."