The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:

In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual servers), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability, and control to a provider. There are differences, though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes those bits to cross the provider’s network, and the loss of a few bits is not catastrophic. For providers of higher-layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These differences make certain failure modes much more difficult to design around.

This brings up the bet-your-career question: If you’ve outsourced your bits to a cloud provider, how do you handle the very real possibility that your cloud provider will cease to exist, due to either technical failure, such as accidentally deleting customer data[2], or financial failure, such as sudden bankruptcy[3]? Or: how do you handle what I’ll call Permanent Provider Failure?

In the aforementioned post I brought up three ‘lessons’ that could be carried through from routine network outsourcing. 
  • The provider matters
  • Design for failure
  • Deploy a standard technology 
These three concepts also serve to protect an outsourced network from the failure of a provider.

Here’s how:

The provider: The more you care about your network connectivity, the more likely you are to choose a tier 1 provider, presumably one that has assets sufficient to ensure that their network survives corporate failure (as happened more than once during the 2001 dot-bomb). If you can’t use a tier 1, at the very minimum you’ll pick a provider that has actual assets that will survive bankruptcy (fiber in the ground) and a long track record of providing service at the scale and availability your organization needs. The provider matters.

The design: Because even tier 1s occasionally bork their backbones[4], you’ve probably already built critical parts of your network using redundancy from some other provider. Provider redundancy has the nice property of insulating you both from a technical failure of a provider’s network and from the permanent failure of one of your providers. Design for failure.

The technology: If you want seamless redundancy across providers, you have no choice but to build your network using a standard technology. Once you’ve done that, it doesn’t matter which provider transports which bits. The bits are provider neutral. Changing providers is not only possible, but often routine. Deploy a standard technology.

Rolling these concepts up the stack to cloud-provided CPU/memory/disk is interesting. I’ll try to map the network examples onto the higher layers.

The provider: You’d presumably want to pick a cloud provider that has a low probability of technical or financial failure. Track record and history matter, as does asset ownership. A provider that has real assets (data centers, hardware) has a chance of getting picked up by another provider during the next dot-bomb; ideally, as has often been the case with network provider bankruptcies, there would be little or no disruption. A cloud provider that doesn’t actually own anything doesn’t have much to market when shopping around for someone to bail them out and take over their customers. The provider matters.

The design: The provider will fail. How will you design around it? I’m not talking about an occasional minor provider outage of a few hours’ duration, but rather a large-scale permanent failure, such as the ones referenced in the links above. For networking, you protect yourself with a second link from another provider; in our case, for critical circuits we try to ensure that the two providers don’t share any physical infrastructure (fiber paths) and don’t buy fiber from each other. To build the equivalent at the upper layers, you would need functionally identical CPU/memory/disk at a second, completely independent provider. That’s easy to do on the network side; I suspect it’s harder at the upper layers. Design for failure.
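As a rough illustration of what functionally identical deployments at two independent providers might look like in practice, here’s a minimal sketch: a watcher that health-checks the primary deployment and fails over to the secondary only after repeated failures. The endpoint URLs and the promote_secondary() step are hypothetical placeholders, not any particular provider’s API.

```python
import time
import urllib.request

# Hypothetical health-check endpoints for two functionally identical
# deployments at completely independent providers (placeholder names).
PRIMARY = "https://app.provider-a.example/healthz"
SECONDARY = "https://app.provider-b.example/healthz"

def healthy(url, timeout=5):
    """Return True if the deployment answers its health check with a 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def promote_secondary():
    """Placeholder: repoint DNS or the load balancer at the secondary."""
    print("Primary unreachable; promoting secondary deployment")

def watch(interval=30, failures_needed=3):
    """Fail over only after several consecutive failed checks, so a brief
    outage doesn't trigger a permanent switch to the other provider."""
    failures = 0
    while True:
        if healthy(PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= failures_needed and healthy(SECONDARY):
                promote_secondary()
                break
        time.sleep(interval)

if __name__ == "__main__":
    watch()
```

The failover step is the easy part; as noted above, the hard part is keeping the two deployments functionally identical, and truly independent, in the first place.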

The technology: Building a cloud-hosted app that can survive permanent provider failure is going to require that the technology used is standardized enough that it can be hosted by more than one provider. You’ll have to replicate application code and user data between providers, and that can only happen if both providers present the same technology stack to your application. This seems to favor off-the-shelf and/or platform-neutral environments over single-vendor proprietary environments. In other words, the app needs to be portable across cloud vendors. Deploy a standard technology.
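To make that portability concrete, here’s a minimal sketch of one way to do it: the application codes against a provider-neutral storage interface, and each vendor’s storage sits behind it as a pluggable backend. The BlobStore interface and the local-directory stand-in below are hypothetical; a real deployment would wrap each provider’s object-storage API behind the same interface, which is what makes cross-provider replication (and later migration) possible.

```python
from abc import ABC, abstractmethod
from pathlib import Path

class BlobStore(ABC):
    """Provider-neutral storage interface: the app codes against this,
    never against a single vendor's proprietary API."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def keys(self) -> list[str]: ...

class LocalDirStore(BlobStore):
    """Stand-in backend that writes to a local directory; a real backend
    would wrap a specific provider's storage API behind the same interface."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def keys(self) -> list[str]:
        return [p.name for p in self.root.iterdir() if p.is_file()]

def replicate(src: BlobStore, dst: BlobStore) -> None:
    """Copy every object to the second provider, so either copy survives
    the permanent failure of the other."""
    for key in src.keys():
        dst.put(key, src.get(key))

if __name__ == "__main__":
    provider_a = LocalDirStore("/tmp/provider_a")  # stands in for provider A
    provider_b = LocalDirStore("/tmp/provider_b")  # stands in for provider B
    provider_a.put("user-42.json", b'{"name": "example user"}')
    replicate(provider_a, provider_b)
    print(provider_b.get("user-42.json"))
```

Swapping providers then becomes a matter of adding another backend implementation, not rewriting the application.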

The bottom line: to survive the permanent failure of a cloud provider, the app has to be provider-neutral, not only to protect against outright provider failure, but also to permit migration to a new cloud provider when your primary provider changes its price schedule, loses its revenue stream, or unexpectedly joins the deadpool[5].

[1] The Cloud - Outsourcing Moved up the Stack
[2] Serious Cloud Storage Stumble for Flexscale, Datacenter Knowledge, 08/28/2008
[3] MediaMax/The Linkup Closes Its Doors, TechCrunch, 07/10/2008
[4] AT&T Frame Relay Network goes down for the count, Network World, 04/20/1998
[5] TechCrunch deadpool