Data loss events, where data is deleted, destroyed, or corrupted, are the DBA's and sysadmin's nightmare. Compare the results of these events:
- A firewall or IPS has a hardware or software failure and throws away a few packets of good data.
- A router gets overloaded and tosses a few packets in the bit bucket.
- A SAN fabric has a hardware or software failure and throws away a few frames of data.
Here's an example:
The reason that we suffered data loss (about 2.5 days) is because the data transfer issues with the SAN switch caused data corruption in both the Oracle data files and the archive log files. We had tape backups of the data files and archive log files, but they were also corrupt. Unfortunately, we could only recover the database to the last point that we had clean archive log files.

A SAN fabric scrambled a few bits. Data files and archive logs got toasted. Redundancy didn't help (the redundancy built into the SAN stored the scrambled bits - redundantly). The backup system either backed up the corrupt data or failed because the data was corrupt. In either case the data is not recoverable. That scenario happens more often than anyone will admit.
Imagine a network where a few missing bits on Tuesday causes a loss of all data transferred across the network any time from Tuesday through Thursday.
So far, my efforts to recover Ma.gnolia's data store have been unsuccessful. While I'm continuing to work at it, both from the data store and other sources on the web, I don't want to raise expectations about our prospects. While certainly unanticipated, I do take responsibility and apologize for this widespread loss of data.
As of this writing, the recovery method appears to involve searching Google for cached copies of your missing data. That's a good trick to remember. Someday I might lose my SSN or banking credentials and need to recover them.
Networks were designed from the ground up assuming there would be missing bits. And just to make sure that network applications are always aware that they need to be tolerant of network data loss, network engineers intentionally build low-level data loss into their designs. We wouldn't want network users to have expectations that are too high, would we? Seriously though, lost packets have been a part of networking since day one, and as a result, any network protocol or application that couldn't tolerate loss quit working the day it got deployed.
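That tolerance is baked into the protocols themselves. A toy stop-and-wait sketch of the idea (`lossy_send` and `reliable_send` are illustrative names of my own, not a real networking API): the sender assumes loss and retransmits until delivery succeeds, so the application above it never sees the dropped packets.

```python
import random

def lossy_send(payload, loss_rate=0.3):
    """A channel that silently drops packets some of the time."""
    return payload if random.random() > loss_rate else None

def reliable_send(payload, max_retries=50):
    """Stop-and-wait: retransmit until delivery succeeds.

    Delivery here stands in for receiving an ACK; the point is that
    the protocol expects loss and recovers from it automatically.
    """
    for attempt in range(1, max_retries + 1):
        if lossy_send(payload) is not None:  # delivered, "ack" received
            return attempt
    raise TimeoutError("link is down, not just lossy")

print("delivered after", reliable_send(b"query"), "attempt(s)")
```

With a 30% loss rate the payload still gets through almost every time; the cost is retransmissions, not lost data.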
Storage isn't designed to tolerate missing bits (though Sun is trying to fix that with ZFS). We've learned that we need to be extremely paranoid about storage-related errors and events. There can be no tolerance for frame, CRC, port or other errors on a SAN fabric. Unfortunately, SAN switches are often represented as simple, low-maintenance devices. They are not.
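The ZFS idea, roughly, is end-to-end checksumming: every block is verified against its checksum on read, so scrambled bits are detected instead of silently served back. A toy sketch of the concept (`write_block` and `read_block` are my own illustrative names, not the ZFS API):

```python
import hashlib

def write_block(store, block_id, data):
    """Store a data block alongside a SHA-256 checksum of its contents."""
    store[block_id] = (data, hashlib.sha256(data).hexdigest())

def read_block(store, block_id):
    """Return the block only if it still matches its checksum."""
    data, checksum = store[block_id]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError(f"checksum mismatch on block {block_id}: silent corruption")
    return data

store = {}
write_block(store, 0, b"archive log segment")
assert read_block(store, 0) == b"archive log segment"

# Simulate a SAN fabric flipping one bit in the stored data.
data, checksum = store[0]
store[0] = (bytes([data[0] ^ 0x10]) + data[1:], checksum)
try:
    read_block(store, 0)
except IOError as e:
    print("detected:", e)
```

Without the checksum, the flipped bit would have been returned, backed up, and replicated as if it were good data.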
A quote from the post I wrote for Matt:
One of the things I've done to drive home the importance of backups is to walk up to a sysadmin's cube and ask them to delete their home directory. I'm the boss, I can do that. Trust me, it's fun. If they hesitate, I know right away that they don't have confidence in their backups. That's bad – for them, for me, and for our customers.
That covers the simple case. The files on that server are backed up and recoverable. Database backup and recovery is much more complex. Failure to recover a single incremental backup (archive log, transaction log) prevents recovery of the database past the point in time of the failed incremental. If that happens, it will be ugly.
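The chain dependency is mechanical and easy to check for: a single missing sequence number caps how far forward you can roll. A sketch of the check (`verify_log_chain` is a hypothetical helper, not a real DBA tool):

```python
def verify_log_chain(sequence_numbers):
    """Check an archive-log sequence for gaps.

    Returns the last sequence number to which point-in-time recovery
    is possible; everything after the first gap is unreachable.
    Returns None if no logs are recoverable at all.
    """
    recoverable = None
    expected = min(sequence_numbers) if sequence_numbers else 0
    for seq in sorted(sequence_numbers):
        if seq != expected:
            break  # gap: the chain is broken here
        recoverable = seq
        expected += 1
    return recoverable

# Logs 100-105 were taken, but 103 failed to restore from tape.
print(verify_log_chain([100, 101, 102, 104, 105]))  # -> 102
```

Logs 104 and 105 are intact, but without 103 they are useless: recovery stops at 102.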
The DBAs that I know don’t look out at the hackers on the Internet and think ‘they are out to get me…I’ve got to be prepared…’. They are too busy looking down at the controllers and disks and thinking ‘they are out to get me…I’ve got to be prepared…’.
I’ve personally been faced with critical data loss incidents a handful of times. In one case, a network card would occasionally flip bits in transmitted packets before the checksums and such that keep packets intact were calculated. The result was a situation where a ‘1’ at the client would end up as a ‘!’ at the server (and in the database), along with other single-bit anomalies. In another case, the cache in a high-end RAID controller was scrambling bits and corrupting volumes. With the cache enabled, the volumes would error and dismount. With the cache disabled, the server worked fine. The worst, though, was a human-initiated logical failure of a 270 million row, 1000 table OLTP database. When I got called a couple of minutes after the failure, it was a zero table, zero row database. A point-in-time recovery to the minute prior to the incident brought us back to where we were, minus a few seconds of data.
In each case, the backups worked.
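That first failure mode is worth seeing up close. ASCII ‘1’ (0x31) and ‘!’ (0x21) differ in exactly one bit, so a single flipped bit in flight quietly turns one into the other (a minimal sketch; `flip_bit` is my own illustrative helper, not the actual NIC fault):

```python
# '1' is 0x31 and '!' is 0x21: they differ in exactly one bit (0x10),
# so one flipped bit in transit turns one character into the other.
assert ord('1') ^ ord('!') == 0x10

def flip_bit(byte, bit):
    """Flip one bit of a byte, as a faulty NIC might."""
    return byte ^ (1 << bit)

corrupted = flip_bit(ord('1'), 4)
print(chr(corrupted))  # -> !
```

Because the flip happened before the packet checksum was computed, the checksum happily validated the corrupt byte all the way into the database.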
With apologies to Ms Browning:
How do I love thee? Let me count the ways
How do I fail thee? Let me count the ways
I fail thee to the controllers and drivers and ports
My pain can reach, when feeling out of sight
For the ends of the fabric and LUN
I fail thee to the level of everyday’s
Most critical data….