Data loss events, where data is deleted, destroyed, or corrupted, are the DBA's and sysadmin's nightmare. Compare the results of these events:
- A firewall or IPS has a hardware or software failure and throws away a few packets of good data.
- A router gets overloaded and tosses a few packets in the bit bucket.
- A SAN fabric has a hardware or software failure and throws away a few frames of data.
Here's an example:
The reason that we suffered data loss (about 2.5 days) is because the data transfer issues with the SAN switch caused data corruption in both the Oracle data files and the archive log files. We had tape backups of the data files and archive log files, but they were also corrupt. Unfortunately, we could only recover the database to the last point that we had clean archive log files.

A SAN fabric scrambled a few bits. Data files and archive logs got toasted. Redundancy didn't help (the redundancy built into the SAN stored the scrambled bits - redundantly). The backup system either backed up the corrupt data or failed because the data was corrupt. In either case the data is not recoverable. That scenario happens more often than anyone will admit.
Imagine a network where a few missing bits on Tuesday causes a loss of all data transferred across the network any time from Tuesday through Thursday.
So far, my efforts to recover Ma.gnolia's data store have been unsuccessful. While I'm continuing to work at it, both from the data store and other sources on the web, I don't want to raise expectations about our prospects. While certainly unanticipated, I do take responsibility and apologize for this widespread loss of data.
As of this writing, the recovery method appears to involve searching Google for cached copies of your missing data. That's a good trick to remember. Someday I might lose my SSN or banking credentials and need to recover them.
Networks were designed from the ground up assuming there would be missing bits. And just to make sure that network applications are always aware that they need to be tolerant of network data loss, network engineers intentionally build low-level data loss into their designs. We wouldn't want network users to have expectations that are too high, would we? Seriously though, lost packets have been a part of networking since day one, and as a result, any network protocol or application that couldn't tolerate loss quit working the day it got deployed.
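That tolerance is baked into the protocols themselves. A toy stop-and-wait sketch of the idea (`lossy_send` and `reliable_send` are illustrative names of my own, not a real networking API): the sender assumes loss and retransmits until delivery succeeds, so the application above it never sees the dropped packets.

```python
import random

def lossy_send(payload, loss_rate=0.3):
    """A channel that silently drops packets some of the time."""
    return payload if random.random() > loss_rate else None

def reliable_send(payload, max_retries=50):
    """Stop-and-wait: retransmit until delivery succeeds.

    Delivery here stands in for receiving an ACK; the point is that
    the protocol expects loss and recovers from it automatically.
    """
    for attempt in range(1, max_retries + 1):
        if lossy_send(payload) is not None:  # delivered, "ack" received
            return attempt
    raise TimeoutError("link is down, not just lossy")

print("delivered after", reliable_send(b"query"), "attempt(s)")
```

With a 30% loss rate the payload still gets through almost every time; the cost is retransmissions, not lost data.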
Storage isn't designed to tolerate missing bits (though Sun is trying to fix that with ZFS). We've learned that we need to be extremely paranoid about storage-related errors and events. There can be no tolerance for frame, CRC, port or other errors on a SAN fabric. Unfortunately, SAN switches are often represented as simple, low-maintenance devices. They are not.
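The ZFS idea, roughly, is end-to-end checksumming: every block is verified against its checksum on read, so scrambled bits are detected instead of silently served back. A toy sketch of the concept (`write_block` and `read_block` are my own illustrative names, not the ZFS API):

```python
import hashlib

def write_block(store, block_id, data):
    """Store a data block alongside a SHA-256 checksum of its contents."""
    store[block_id] = (data, hashlib.sha256(data).hexdigest())

def read_block(store, block_id):
    """Return the block only if it still matches its checksum."""
    data, checksum = store[block_id]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError(f"checksum mismatch on block {block_id}: silent corruption")
    return data

store = {}
write_block(store, 0, b"archive log segment")
assert read_block(store, 0) == b"archive log segment"

# Simulate a SAN fabric flipping one bit in the stored data.
data, checksum = store[0]
store[0] = (bytes([data[0] ^ 0x10]) + data[1:], checksum)
try:
    read_block(store, 0)
except IOError as e:
    print("detected:", e)
```

Without the checksum, the flipped bit would have been returned, backed up, and replicated as if it were good data.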
A quote from the post I wrote for Matt:
One of the things I've done to drive home the importance of backups is to walk up to a sysadmin's cube and ask them to delete their home directory. I'm the boss, I can do that. Trust me, it's fun. If they hesitate, I know right away that they don't have confidence in their backups. That's bad – for them, for me, and for our customers.
That covers the simple case. The files on that server are backed up and recoverable. Database backup and recovery is much more complex. Failure to recover a single incremental backup (archive log, transaction log) prevents recovery of the database past the point in time of the failed incremental. If that happens, it will be ugly.
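The chain dependency is mechanical and easy to check for: a single missing sequence number caps how far forward you can roll. A sketch of the check (`verify_log_chain` is a hypothetical helper, not a real DBA tool):

```python
def verify_log_chain(sequence_numbers):
    """Check an archive-log sequence for gaps.

    Returns the last sequence number to which point-in-time recovery
    is possible; everything after the first gap is unreachable.
    Returns None if no logs are recoverable at all.
    """
    recoverable = None
    expected = min(sequence_numbers) if sequence_numbers else 0
    for seq in sorted(sequence_numbers):
        if seq != expected:
            break  # gap: the chain is broken here
        recoverable = seq
        expected += 1
    return recoverable

# Logs 100-105 were taken, but 103 failed to restore from tape.
print(verify_log_chain([100, 101, 102, 104, 105]))  # -> 102
```

Logs 104 and 105 are intact, but without 103 they are useless: recovery stops at 102.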
The DBAs that I know don’t look out at the hackers on the Internet and think ‘they are out to get me…I’ve got to be prepared…’. They are too busy looking down at the controllers and disks and thinking ‘they are out to get me…I’ve got to be prepared…’.
I’ve personally been faced with critical data loss incidents a handful of times. In one case, a network card would occasionally flip bits in transmitted packets before the checksums and such that keep packets intact were calculated. The result was a situation where a ‘1’ at the client would end up as a ‘!’ at the server (and in the database), along with other single-bit anomalies. In another case, the cache in a high-end RAID controller was scrambling bits and corrupting volumes. With the cache enabled, the volumes would error and dismount. With the cache disabled, the server worked fine. The worst, though, was a human-initiated logical failure of a 270 million row, 1000 table OLTP database. When I got called a couple of minutes after the failure, it was a zero table, zero row database. A point-in-time recovery to the minute prior to the incident brought us back to where we were, minus a few seconds of data.
In each case, the backups worked.
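That first failure mode is worth seeing up close. ASCII ‘1’ (0x31) and ‘!’ (0x21) differ in exactly one bit, so a single flipped bit in flight quietly turns one into the other (a minimal sketch; `flip_bit` is my own illustrative helper, not the actual NIC fault):

```python
# '1' is 0x31 and '!' is 0x21: they differ in exactly one bit (0x10),
# so one flipped bit in transit turns one character into the other.
assert ord('1') ^ ord('!') == 0x10

def flip_bit(byte, bit):
    """Flip one bit of a byte, as a faulty NIC might."""
    return byte ^ (1 << bit)

corrupted = flip_bit(ord('1'), 4)
print(chr(corrupted))  # -> !
```

Because the flip happened before the packet checksum was computed, the checksum happily validated the corrupt byte all the way into the database.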
With apologies to Ms Browning:
How do I love thee? Let me count the ways
How do I fail thee? Let me count the ways
I fail thee to the controllers and drivers and ports
My pain can reach, when feeling out of sight
For the ends of the fabric and LUN
I fail thee to the level of everyday’s
Most critical data….