
Not all Data Loss is Security Related

Matt invited me to guest author a post on his Standalone Sysadmin blog. One of the topics that I've had in the To-Blog pile is to dump out some thoughts on system backups. Head over to Matt's blog and read them.

Data loss events that leave data deleted, destroyed or corrupted are the DBA's and sysadmin's nightmare. Compare the results of these events:
  • A firewall or IPS has a hardware or software failure and throws away a few packets of good data.
  • A router gets overloaded and tosses a few packets in the bit bucket.
  • A SAN fabric has a hardware or software failure and throws away a few frames of data.
The last of these is going to be a far, far more serious problem. Databases and file systems are extremely intolerant of missing bits.

Here's an example:
The reason that we suffered data loss (about 2.5 days) is because the data transfer issues with the SAN switch caused data corruption in both the Oracle data files and the archive log files. We had tape backups of the data files and archive log files, but they were also corrupt. Unfortunately, we could only recover the database to the last point that we had clean archive log files.
A SAN fabric scrambled a few bits. Data files and archive logs got toasted. Redundancy didn't help (the redundancy built into the SAN stored the scrambled bits - redundantly). The backup system either backed up the corrupt data, or failed because the data was corrupt. In either case the data is not recoverable. That scenario happens more often than anyone will admit.

Imagine a network where a few missing bits on Tuesday cause the loss of all data transferred across the network any time from Tuesday through Thursday.

Or worse:
So far, my efforts to recover Ma.gnolia's data store have been unsuccessful. While I'm continuing to work at it, both from the data store and other sources on the web, I don't want to raise expectations about our prospects. While certainly unanticipated, I do take responsibility and apologize for this widespread loss of data.

As of this writing, the recovery method appears to involve searching Google for cached copies of your missing data.  That's a good trick to remember. Someday I might lose my SSN or banking credentials and need to recover them.

Networks were designed from the ground up assuming there would be missing bits. And just to make sure that network applications are always aware that they need to be tolerant of network data loss, network engineers intentionally build low-level data loss into their designs. We wouldn't want network users to have expectations that are too high, would we? :-) Seriously though, lost packets have been a part of networking since day one, and as a result, any network protocol or application that couldn't tolerate loss quit working the day it got deployed.

Storage isn't designed to tolerate missing bits (though Sun is trying to fix that with ZFS). We've learned that we need to be extremely paranoid about storage related errors and events. There can be no tolerance for frame, CRC, port or other errors on a SAN fabric. Unfortunately, SAN switches are often represented as simple, low maintenance devices. They are not.

A quote from the post I wrote for Matt:

One of the things I've done to drive home the importance of backups is to walk up to a sysadmin's cube and ask them to delete their home directory. I'm the boss, I can do that. Trust me, it's fun. ;-) If they hesitate, I know right away that they don't have confidence in their backups. That's bad – for them, for me, and for our customers.

That covers the simple case. The files on that server are backed up and recoverable. Database backup and recovery is much more complex. Failure to recover a single incremental backup (archive log, transaction log) prevents recovery of the database past the point in time of the failed incremental. If that happens, it will be ugly.
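To make the "one bad incremental" problem concrete, here is a minimal sketch in Python. It is not how Oracle or any other database tracks its logs; the `LogSegment` class, the sequence numbers and the `intact` flag are all hypothetical names made up for illustration. The only point is that recovery replays the logs in order and stops at the first one that is missing or corrupt.

```python
from dataclasses import dataclass

@dataclass
class LogSegment:
    sequence: int   # hypothetical, monotonically increasing log sequence number
    intact: bool    # True if the segment passed whatever verification you run

def last_recoverable_sequence(base_sequence, segments):
    """Replay segments in order from the base backup; stop at the first gap or bad segment."""
    by_sequence = {s.sequence: s for s in segments}
    current = base_sequence
    while True:
        nxt = by_sequence.get(current + 1)
        if nxt is None or not nxt.intact:
            return current
        current = nxt.sequence

# Base backup taken at sequence 100; logs 101-110 exist, but 104 is corrupt.
segments = [LogSegment(n, intact=(n != 104)) for n in range(101, 111)]
print(last_recoverable_sequence(100, segments))
```

Logs 105 through 110 are perfectly good in this example, and they are also useless. Everything after the first bad segment is unreachable.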

The DBAs that I know don’t look out at the hackers on the Internet and think ‘they are out to get me…I’ve got to be prepared…’. They are too busy looking down at the controllers and disks and thinking ‘they are out to get me…I’ve got to be prepared…’.

I’ve personally been faced with critical data loss incidents a handful of times. In one case, a network card decided to occasionally flip bits in transmitted packets before the various checksums & such that keep packets intact were calculated. The result was a situation where a ‘1’ at the client would end up as a ‘!’ at the server - and in the database - along with other single-bit anomalies. In another case, the cache in a high-end RAID controller was scrambling bits and corrupting volumes. With the cache enabled, the volumes would error & dismount. With the cache disabled, the server worked fine. The worst, though, was a human-initiated logical failure of a 270 million row, 1000 table OLTP database. When I got called a couple of minutes after the failure, it was a zero table, zero row database. A point-in-time recovery to the minute prior to the incident brought us back to where we were, minus a few seconds of data.
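For the curious, the ‘1’ to ‘!’ case really is a single bit, and a small Python sketch shows why the network's own checks never noticed. The payload and variable names below are made up, and `zlib.crc32` is only a stand-in for whatever checksums the real network stack computed. The assumption is simply that the bit was flipped before the checksum was calculated, so the checksum faithfully protected the wrong data.

```python
import zlib

def flip_bit(data, byte_index, bit):
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)

def differing_bits(a, b):
    """List the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)
    return [i for i in range(8) if (diff >> i) & 1]

# '1' is 0x31 and '!' is 0x21: they differ in exactly one bit.
print(differing_bits("1", "!"))

original = b"qty=1"                       # hypothetical payload at the client
app_checksum = zlib.crc32(original)       # computed by the application, before the bad card
corrupted = flip_bit(original, 4, 4)      # the card flips one bit in the last byte
wire_checksum = zlib.crc32(corrupted)     # computed after the flip, over the bad data

# At the server: the checksum that travelled with the packet matches the corrupt data,
# while a checksum computed end to end by the application does not.
print(zlib.crc32(corrupted) == wire_checksum)
print(zlib.crc32(corrupted) == app_checksum)
```

A check can only vouch for the data as it was when the check was computed. That is the argument for verifying data end to end rather than trusting each hop.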

In each case, the backups worked. 

With apologies to Ms Browning:

How do I love thee? Let me count the ways
How do I fail thee? Let me count the ways
I fail thee to the controllers and drivers and ports
My pain can reach, when feeling out of sight
For the ends of the fabric and LUN
I fail thee to the level of everyday’s
Most critical data….

