
The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:

In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual servers), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes those bits to cross the provider’s network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These differences make certain failure modes much more difficult to design around.

This brings up the bet-your-career question: If you’ve outsourced your bits to a cloud provider, how do you handle the very real possibility that your cloud provider will cease to exist due to either technical failure, such as accidentally deleting customer data[2], or financial failure, such as sudden bankruptcy[3]? Or - how do you handle what I’ll call Permanent Provider Failure?

In the aforementioned post I brought up three ‘lessons’ that could be carried through from routine network outsourcing.

  • The provider matters
  • Design for failure
  • Deploy a standard technology

These three concepts also serve to protect an outsourced network from the failure of a provider.

Here’s how:

The provider: The more you care about your network connectivity, the more likely you are to choose a tier 1 provider, presumably one with assets sufficient to ensure that its network survives corporate failure (as happened more than once during the 2001 dot-bomb). If you can’t use a tier 1, at the very minimum you’ll pick a provider that owns actual assets that will survive bankruptcy (fiber in the ground), and one with a long track record of providing service at the scale and availability that your organization needs. The provider matters.

The design: Because even tier 1’s occasionally bork their backbone[4], you’ve probably already built critical parts of your network using redundancy from some other provider. Provider redundancy has the nice property of insulating you both from a technical failure of the provider’s network and from permanent failure of one of your providers. Design for failure.

The technology: If you want seamless redundancy across providers, you have no choice but to build your network using a standard technology. Once you’ve done that, it doesn’t matter which provider transports which bits. The bits are provider neutral. Changing providers is not only possible, but often routine. Deploy a standard technology.

Rolling these concepts up the stack to cloud provided CPU/memory/disk is interesting. I’ll try to map the network examples into the higher layers.

The provider: You’d presumably want to pick a cloud provider that has a low probability of technical or financial failure. Track record and history matter, as does asset ownership. A provider that has real assets (data centers, hardware) has a chance of getting picked up by another provider during the next dot-bomb. As has often been the case with network provider bankruptcies, ideally there would be little or no disruption. A cloud provider that doesn’t actually own anything doesn’t have much to market when shopping around for someone to bail them out and take over their customers. The provider matters.

The design: The provider will fail. How will you design around it? I’m not talking about an occasional minor provider outage of a few hours’ duration, but rather a large scale permanent failure, such as the ones referenced in the links above. For networking, you protect yourself with a second link from another provider. In our case, for critical circuits we try to ensure that the two providers don’t share any physical infrastructure (fiber paths) and that they don’t buy fiber from each other. To build the equivalent at the upper layers, one would have to have functionally identical CPU/memory/disk at a second, completely independent provider. That’s easy to do on the network side; I suspect it’s harder at the upper layers. Design for failure.
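The dual-provider idea can be sketched in a few lines of code. This is a minimal, hypothetical illustration (plain dictionaries stand in for two independent providers’ object stores; none of the names come from any real SDK): every write goes to both providers, so a read still succeeds after one provider permanently disappears.

```python
class DualWriteStore:
    """Write every object to two independent providers; read from
    whichever one is still alive. A sketch, not production code -
    real replication would also need retries and consistency checks."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def put(self, key, data):
        # Both providers get every write.
        self.primary[key] = data
        self.secondary[key] = data

    def get(self, key):
        # Fall back to the secondary if the primary has lost the data.
        try:
            return self.primary[key]
        except KeyError:
            return self.secondary[key]


# Usage: simulate the permanent failure of provider A.
provider_a, provider_b = {}, {}
store = DualWriteStore(provider_a, provider_b)
store.put("user/42", b"profile data")
provider_a.clear()          # provider A joins the deadpool
assert store.get("user/42") == b"profile data"
```

The hard part, of course, is not the dual write itself but keeping the two copies consistent at cloud scale - which is exactly why this is harder at the upper layers than on the network side.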

The technology: Building a cloud hosted app that can survive permanent provider failure is going to require that the technology used is standardized enough that it can be hosted by more than one provider. You’ll have to replicate application code and user data between providers, and that can only happen if both providers present the same technology stack to your application. This seems to favor off-the-shelf and/or platform neutral environments over single-vendor proprietary environments.  In other words, the app needs to be portable across cloud vendors. Deploy a standard technology.
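One common way to get that portability is to code the application against a small, provider-neutral interface and keep each vendor’s SDK behind a thin adapter. The sketch below is hypothetical (the interface, adapter, and function names are mine, not any vendor’s API); the point is that migrating providers then means writing one new adapter, not rewriting the application.

```python
class BlobStore:
    """The only storage interface the application ever sees."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class ProviderXAdapter(BlobStore):
    """Adapter for a hypothetical provider; a dict stands in for
    the calls you'd make through that provider's native SDK."""
    def __init__(self):
        self._native = {}

    def put(self, key, data):
        self._native[key] = data

    def get(self, key):
        return self._native[key]


def save_document(store: BlobStore, doc_id: str, body: bytes) -> None:
    # Application code depends on BlobStore, never on a vendor SDK.
    store.put(f"docs/{doc_id}", body)


store = ProviderXAdapter()
save_document(store, "42", b"hello")
```

Swapping `ProviderXAdapter` for a hypothetical `ProviderYAdapter` is then a one-line change at the point where the store is constructed - the rest of the application, and the replicated data behind it, never notices.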

The bottom line is that to survive the permanent failure of a cloud provider, the app has to be provider neutral, not only to provide protection against provider failure, but also to permit migration of the application to a new cloud provider when your primary provider changes their price schedule, loses their revenue stream, or unexpectedly joins the deadpool[5].

[1] The Cloud - Outsourcing Moved up the Stack, Last In - First Out
[2] Serious Cloud Storage Stumble for Flexscale, Datacenter Knowledge, 08/28/2008
[3] MediaMax/The Linkup Closes Its Doors, TechCrunch, 07/10/2008
[4] AT&T Frame Relay Network goes down for the count, Network World, 04/20/1998
[5] TechCrunch deadpool


  1. Hi Michael, long time no talk. Hope you had a good holiday weekend!

    I think these sorts of vulnerabilities (I don't think they could be classified as anything else) really need to be taken into account whenever you decide to outsource a critical piece of infrastructure. The effects are just more...apparent, maybe, when you're outsourcing large segments of infrastructure to "the cloud".

    Unfortunately, cloud computing doesn't yet have the pedigree that outsourced mail, web, DNS and other individual services do. Each of these services can hold potentially sensitive data, so you go with well respected companies that have sufficient infrastructure and backup policies to meet your standards. Cloud services just aren't there yet, unless you're an early adopter. And there's just no way I'd risk my company's data on it yet.

    I'm very grateful to the people who are taking the arrows and working out the kinks, though.


