
Data Loss, Speculation & Rumors

My head wants to explode. Not because of Microsoft's catastrophic data loss. That's a major failure that should precipitate a significant number of RGEs at Microsoft.

My head wants to explode because of the speculation, rumors and misinformation surrounding the failure. An example: the rumors reported by Daniel Eran Dilger in "Microsoft’s Sidekick/Pink problems blamed on dogfooding and sabotage", which point to intentional data loss or sabotage.

"Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure."


"the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a disgruntled employee."


"someone with access to the servers at the datacenter must have inserted a time bomb to wipe out not just all of the data, but also all of the backup tapes, and finally, I suspect, reformatting the server hard drives so that the service itself could not be restarted with a simple reboot (and to erase any traces of the time bomb itself)."

Intentional sabotage?

How about some rational speculation, without black helicopters & thermite charges? How about simple mismanagement? How about failed backup jobs followed by a failed SAN upgrade? I've had both backup and SAN failures in the past & fully expect to have both types of failures again before I retire. Failed backups are common. Failed SANs and failed SAN upgrades are less common, but likely enough that they must be accounted for in planning.

Let's assume, as Dan speculates, that it’s an Oracle RAC cluster. If all the RAC data is on a single SAN, with no replication or backups to separate media on separate controllers, then a simple human error configuring the SAN can easily result in spectacular failure. If there is no copy of the Oracle data on separate media & separate controllers on a separate server, you CAN lose data in a RAC cluster. RAC doesn't magically protect you from SAN failure. RAC doesn't magically protect you from logical database corruption or DBA fat fingering. All the bits are still stored on a SAN, and if the SAN fails or the DBA fails, the bits are gone. Anyone who doesn't think that's true hasn't driven a SAN or a database for a living.

We've owned two IBM DS-4800s for the last four years and have had three controller-related failures that could have resulted in data loss had we not taken the right steps at the right time (with IBM advanced support looking over our shoulders). A simple thing like mismatched firmware somewhere between the individual drives and the controllers, or between the HBAs and the operating system, has the potential to cause data loss. Heck - because SAN configs can be stored on the individual drives themselves, plugging a used drive with someone else's SAN config into your SAN can cause SAN failure - or at least catastrophic, unintentional SAN reconfiguration.

I've got an IBM doc (sg246363) that says:

"Prior to physically installing new hardware, refer to the instructions in IBM TotalStorage DS4000 hard drive and Storage Expansion Enclosure Installation and Migration Guide, GC26-7849, available at:  [...snip...] Failure to consult this documentation may result in data loss, corruption, or loss of availability to your storage."

Does that imply that plugging the wrong thing into the wrong place at the wrong time can 'f up an entire SAN? Yep. It does, and from what the service manager for one of our vendors told me, it did happen recently – to one of the local Fortune 500s.

If you don't buy that speculation, how about a simple misunderstanding between the DBA and the SAN team?

DBA: "So these two LUNs are on separate controllers, right?"

SAN: "Yep."

Does anyone work at a place where that doesn't happen?
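That "Yep." is cheap to double-check. Here is a minimal sketch, assuming you can get a LUN-to-controller inventory out of whatever tooling your SAN team uses. The LUN names, controller names and the `shared_controllers` helper are all made up for illustration; they are not from any vendor's API.

```python
# Hypothetical inventory: LUN name -> controller that owns it.
# In practice this would be exported from the SAN team's own tooling.
lun_to_controller = {
    "rac_data": "ctrl_a",
    "rac_backup": "ctrl_a",  # the "Yep." that turned out to be wrong
    "vms_data": "ctrl_b",
}


def shared_controllers(luns, inventory):
    """Return {controller: [luns]} for each controller owning more than one of the given LUNs."""
    by_controller = {}
    for lun in luns:
        by_controller.setdefault(inventory[lun], []).append(lun)
    return {
        controller: sorted(names)
        for controller, names in by_controller.items()
        if len(names) > 1
    }


# The DBA believes these two LUNs are independent of each other.
conflicts = shared_controllers(["rac_data", "rac_backup"], lun_to_controller)
print(conflicts)
```

An empty result means the LUNs you asked about really are on separate controllers. Anything else is a conversation to have before the upgrade, not after.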

As for the ‘insider’ quote: "I don't understand why they would be spending any time upgrading stuff", a simple explanation would be that somewhere on the SAN, out of the hundreds of attached servers, a high profile project needed the latest SAN controller firmware or feature. Let's say, for example, that you wanted to plug a shiny new Windows 2008 server into the fabric and present a LUN from an older SAN. You'd likely have to upgrade the SAN firmware. To have a supported configuration, there is a fair chance that you'd have to upgrade HBA firmware and HBA drivers on all SAN attached servers. The newest SAN controller firmware that's required for 'Project 2008' then forces an across the board upgrade of all SAN attached servers, whether they are related to ‘Project 2008’ or not. It's not like that doesn't happen once a year or so, and it’s the reason that SAN vendors publish ‘compatibility matrixes’.
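To make the cascade concrete, here is a rough sketch of how one firmware requirement fans out across unrelated servers. Everything in it is an assumption for illustration: the version numbers, the server names and the shape of the matrix are invented, and a real vendor compatibility matrix is far larger and messier.

```python
# Hypothetical compatibility matrix: controller firmware version ->
# minimum supported HBA firmware and HBA driver versions.
# Versions are tuples so that they compare correctly.
COMPAT = {
    (6, 0): {"hba_firmware": (1, 0), "hba_driver": (2, 0)},
    (7, 0): {"hba_firmware": (1, 5), "hba_driver": (3, 0)},
}

# Hypothetical SAN attached servers and what they currently run.
servers = {
    "project2008": {"hba_firmware": (1, 5), "hba_driver": (3, 0)},
    "payroll": {"hba_firmware": (1, 0), "hba_driver": (2, 0)},
    "mail": {"hba_firmware": (1, 5), "hba_driver": (2, 0)},
}


def servers_needing_upgrade(target_firmware, servers, compat):
    """Return {server: [components below the minimum]} for the target controller firmware."""
    required = compat[target_firmware]
    needs_work = {}
    for name, versions in servers.items():
        stale = sorted(
            component
            for component, minimum in required.items()
            if versions[component] < minimum
        )
        if stale:
            needs_work[name] = stale
    return needs_work


# 'Project 2008' needs controller firmware 7.0; see who else gets dragged along.
forced = servers_needing_upgrade((7, 0), servers, COMPAT)
print(forced)
```

In this made-up inventory the only server that needs nothing is the one that asked for the upgrade. Payroll and mail get touched anyway, which is the point.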

And upgrades sometimes go bad.

We had an HP engineer end up hospitalized a couple days before a major SAN upgrade. A hard deadline prevented us from delaying the upgrade. The engineer's manager flew in at the last minute, reviewed the docs, did the upgrade - badly. Among other f'ups, he presented a VMS LUN to a Windows server. The Windows server touched the LUN & scrambled the VMS file system. A simple error, a catastrophic result. Had that been a RAC LUN, the database would have been scrambled. It happened to be a VMS LUN that was recoverable from backups, so we survived.

Many claim this is a cloud failure. I don't. As far as I can see, it's a service failure, plain and simple, independent of how it happens to be hosted. If the data was stored on an Amazon, Google or Azure cloud, and if the Amazon, Google or Azure cloud operating system or storage software scrambled and/or lost the data, then it'd be a cloud failure. The data appears to have been on ordinary servers in an ordinary database on an ordinary SAN.

That makes this an ordinary failure.


  1. Likewise, I'm getting cheesed off with the coverage. It's one serious screw up on the side of multiple parties, certainly ought to be an RGE.
    Backup, Backup, Backup, and definitely make sure to have good backups before doing any form of upgrade, be it hardware or firmware!

    Under advisement from a support engineer I've transferred a HDD from an unused SAN in as a replacement for a broken drive on a different SAN, and watched as the SAN took the entire shelf down for the count. Thankfully it was smart enough to realise that the configuration data wasn't consistent across the drives and stopped it from turning into a disaster. Unplugging the drive meant we could restore the shelf to operation but sadly did involve a service outage.

  2. RoughlyDrafted has many fine points, but I salt heavily when it talks about security issues. And apparently should for operational issues as well.

  3. Garp - One of the IBM SAN incidents was caused by following an IBM engineer's advice on how to move disks around between shelves. While doing that, I managed to get a RAID set in an 'undetermined' state, such that IBM's SAN firmware development support wasn't sure how it got in that state or how to recover it (other than to re-boot the whole SAN).

    We took an outage, copied all critical data off the SAN, reset the controllers and everything came back up. But I had to announce a 6 hour unscheduled outage to 'ensure the integrity of the data'.

    Anon -

    It seems like there are lots of people who use their experience running a mission critical laptop to become experts in running mission critical ERP applications.

  4. You're right, in principle - this is an explainable service failure. What makes it newsworthy from a cloud perspective is that even in very transparent clouds, such as Amazon, we are pretty much taking their word about their operational procedures and assuming the competence of their staff to execute those procedures. Things are worse when transparency goes down, as is the case with Sidekick. Users reasonably assumed (on an intuitive level, at least, even if they couldn't explain it) that their recovery plan could recover data from an offline backup at least. Why this is a failure of clouds overall is that nobody is saying, hey cloud operators, let's make a guarantee to users about our operational procedures that their data is safe, secure, exportable and available. Instead, Joe Manager/Developer/IT has to make a decision based on essentially the good faith promises (or total lack thereof) about data in cloud applications. It's hard to deny that as a whole, online data storage is a big messy question mark right now.

  5. I agree, DeD's coverage is often lamentably harsh when it comes to anything Microsoft.

    However, like it or not, Sidekick was Cloud – it was Cloud for one very important reason: it was marketed as being Cloud. As Cloud it was safe.

    It can't later, when it fails, be explained away as "oh well they didn't really architect it according to the latest cloud hype from vendors and therefore it wasn't really cloud".

  6. The whole conspiracy thing has really gotten out of hand. Googling for "+Microsoft +sidekick +sabotage" gets a couple hundred thousand hits.


  7. An ex-colleague had a saying that was akin to Occam's razor - "Never attribute to malice what can be attributed to stupidity".

    In this case I'd rate the chance of there being a deliberate conspiracy at somewhere on the order of less than 0.05 percent.

    That doesn't preclude Sidekick going down either as a result of ineptness or budget slashing – but there's a big leap from either of those two conclusions (I favour ineptness, personally) to conspiracy.

    Then again, these days there's a conspiracy theory for everything. :(

  8. Preston -

    I can't argue with that.

    I also tend to blame management (me) for failures like this, at least until proven otherwise.


