Saturday, November 29, 2008

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:
In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual servers), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider.
There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes those bits to cross the provider’s network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These differences make certain failure modes much more difficult to design around.

This brings up the bet-your-career question: If you’ve outsourced your bits to a cloud provider, how do you handle the very real possibility that your cloud provider will cease to exist due to either technical failure, such as accidentally deleting customer data[2], or financial failure, such as sudden bankruptcy[3]? Or - how do you handle what I’ll call Permanent Provider Failure?

In the aforementioned post I brought up three ‘lessons’ that could be carried through from routine network outsourcing.
  • The provider matters
  • Design for failure
  • Deploy a standard technology

These three concepts also serve to protect an outsourced network from the failure of a provider.

Here’s how:

  • The provider: The more you care about your network connectivity, the more likely you are to choose a tier 1 provider, presumably one that has assets sufficient to ensure that their network survives corporate failure (as happened more than once during the 2001 dot-bomb).  If you can’t use a tier 1, at the very minimum you’ll pick a provider that has actual assets that will survive bankruptcy (fiber in the ground), and you’ll pick a provider that has a long track record of providing service at the scale and availability that you need for your organization. The provider matters.
  • The design: Because even tier 1’s occasionally bork their backbone[4], you’ve probably already built critical parts of your network using redundancy from some other provider. Provider redundancy has the nice property of insulating you from both a technical failure of the provider’s network and a permanent failure of one of your providers. Design for failure.
  • The technology: If you want seamless redundancy across providers, you have no choice but to build your network using a standard technology. Once you’ve done that, it doesn’t matter which provider transports which bits. The bits are provider neutral. Changing providers is not only possible, but often routine. Deploy a standard technology.
Rolling these concepts up the stack to cloud provided CPU/memory/disk is interesting. I’ll try to map the network examples into the higher layers.
  • The provider: You’d presumably want to pick a cloud provider that has a low probability of technical or financial failure. Track record and history matter, as does asset ownership. A provider that has real assets (data centers, hardware) has a chance of getting picked up by another provider during the next dot-bomb. As has often been the case with network provider bankruptcies, ideally there would be little or no disruption. A cloud provider that doesn’t actually own anything doesn’t have much to market when shopping around for someone to bail them out and take over their customers. The provider matters.
  • The design: The provider will fail. How will you design around it? I’m not talking about an occasional minor provider outage of a few hours’ duration, but rather a large scale permanent failure, such as the ones referenced in the links above. For networking, you protect yourself with a second link from another provider. In our case, for critical circuits we try to ensure that the two providers don’t share any physical infrastructure (fiber paths) and that they don’t buy fiber from each other. To build the equivalent at the upper layers, one would have to have functionally identical CPU/memory/disk at a second, completely independent provider. It’s easy to do on the network side. I suspect it’s harder to do at the upper layers. Design for failure.
  • The technology: Building a cloud hosted app that can survive permanent provider failure is going to require that the technology used is standardized enough that it can be hosted by more than one provider. You’ll have to replicate application code and user data between providers, and that can only happen if both providers present the same technology stack to your application. This seems to favor off-the-shelf and/or platform neutral environments over single-vendor proprietary environments.  In other words, the app needs to be portable across cloud vendors. Deploy a standard technology.
The bottom line is that to survive the permanent failure of a cloud provider, the app has to be provider neutral, not only to provide protection against provider failure, but also to permit migration of the application to a new cloud provider when your primary provider changes their price schedule, loses their revenue stream, or unexpectedly joins the deadpool[5].
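To make the provider-neutral idea concrete, here is a minimal sketch of what an application-side storage abstraction might look like. All names here (`BlobStore`, `MirroredStore`, the in-memory backend standing in for a real provider) are illustrative assumptions, not any real provider’s API; the point is only that writes fan out to independent backends and reads survive the loss of one.

```python
# Hypothetical sketch of a provider-neutral storage layer.
# The backends are illustrative stand-ins, not a real provider API.
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The provider-neutral interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(BlobStore):
    """Stand-in for one provider's storage backend."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class MirroredStore(BlobStore):
    """Writes go to every provider; reads fall back if one has failed."""

    def __init__(self, *backends: BlobStore) -> None:
        self.backends = list(backends)

    def put(self, key: str, data: bytes) -> None:
        for backend in self.backends:
            backend.put(key, data)

    def get(self, key: str) -> bytes:
        for backend in self.backends:
            try:
                return backend.get(key)
            except KeyError:
                continue  # this provider lost the bits; try the next
        raise KeyError(key)


# The application only ever sees the neutral interface.
store = MirroredStore(InMemoryStore(), InMemoryStore())
store.put("user/42", b"data")
```

Because the application depends only on the interface, swapping one backend for another provider’s implementation is a configuration change, not a rewrite.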


[1] The Cloud - Outsourcing Moved up the Stack, Last In - First Out
[2] Serious Cloud Storage Stumble for Flexscale, Datacenter Knowledge, 08/28/2008
[3] MediaMax/The Linkup Closes Its Doors, TechCrunch, 07/10/2008
[4] ATT Frame Relay Network goes down for the count, Network World, 04/20/1998
[5] TechCrunch deadpool

Tuesday, November 25, 2008

Janke’s Official 2009 Technology Predictions


I’ll take Anton’s bait.

Here they are:

Prediction 1: The rate of adoption of IPv6 will greatly accelerate. Estimates of the final shutdown date for the last v4 global route will be moved up from ‘when hell freezes over’ to ‘long after I’m retired’, placing the problem right next to the Year 2038 Unix timestamp problem on CTOs’ priority lists.
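For anyone who hasn’t met the Year 2038 problem: a signed 32-bit `time_t` counts seconds from the Unix epoch and runs out early on January 19, 2038, at which point it wraps around to 1901. A quick sketch in Python (using `struct` to emulate 32-bit wraparound, since Python’s own integers don’t overflow):

```python
import datetime
import struct

# A signed 32-bit time_t counts seconds from 1970-01-01 UTC
# and tops out at 2**31 - 1.
epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
max_time_t = 2**31 - 1
rollover = epoch + datetime.timedelta(seconds=max_time_t)
print(rollover)  # 2038-01-19 03:14:07+00:00

# One second later the value wraps to -2**31, and the clock
# jumps back to December 1901.
wrapped = struct.unpack('<i', struct.pack('<I', max_time_t + 1))[0]
print(epoch + datetime.timedelta(seconds=wrapped))  # 1901-12-13 20:45:52+00:00
```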

Prediction 2: Gadget freaks will continue to search for the holy grail of multifunction all-in-one gadgets. They will continue to be disappointed.

Prediction 3: Apple will announce a new product. The product will generate a media frenzy. Apple fans will crash servers looking for the latest product leaks or fuzzy prototype pics, and will argue via blog comments over the merits of the features the product may or may not have. Unfortunately the product will be missing cut and paste.

Prediction 4: Hardware and network vendors will continue making faster and cheaper bits at a rate that matches Moore's law. Software will continue to bloat at a rate just slightly faster than Moore's law, ensuring that state of the art software running on new hardware will be slightly slower than last year.

Prediction 5: Disks will double in capacity. The average file size will double. The number of files stored will also double. All hard drives on the planet will continue to be 95% full. No progress will be made toward identifying the owner, data classification, or destruction date of the files.

Prediction 6: There will be a major security panic over some widely used but inherently insecure Internet protocol. The problem will not get resolved.

Prediction 7: Touch screen devices will continue to collect fingerprints.

Prediction 8: Sun Microsystems will rename two of their core technologies, ensuring that their loyal customers will remain confused.

Prediction 9: Web Apps will continue to be deployed with a 1:1 ratio of new web applications to applications that are vulnerable to SQL injection, XSS or XSRF. A few new applications will not be vulnerable. The rest will make up for those few with multiple vulnerabilities, keeping the overall ratio constant.

Prediction 10: Virtualization will explode, replacing hundreds of thousands of real servers with virtual servers. Unfortunately, the number of virtual servers will grow so fast that the number of physical servers will not decrease, and all datacenters on the planet will continue to have cooling and power problems.

Prediction 11: Endless e-mail threads will continue to replace mindless meetings as the preferred venue for designing, building and maintaining complex systems. After-hours meetings at local brew pubs will continue to be the actual venue for designing, building and maintaining complex systems.

And – For the bonus prediction – Someone, somewhere will figure out how to define cloud computing. The rest of us will argue over the definition for at least another year.

Notice how I didn’t stick my neck out on any of these predictions?

Sunday, November 23, 2008

The Power Consumption of Home Electronics

I learned something last week. Xbox and PlayStation game consoles are pathetically bad at energy consumption. The Wii doesn’t suck (power) quite as badly.

The Data:

The Natural Resources Defense Council did an interesting study[1] of game consoles and attempted to estimate annual energy usage and cost.

The good part:
[Chart: game console power consumption]
Ouch. Unlike half-watt wall warts, a hundred and some odd watts might actually show up on your monthly electric bill. And from what the NRDC can tell, game consoles are not very good at powering themselves off when unused, which makes the problem worse.

This is really discouraging. The idea that energy consuming devices should automatically drop themselves down into a low-power state when idle isn’t new, yet we continue to build (and buy) devices with poor power management. I suspect that part of the problem is that there isn’t sufficient information available to consumers at the time of purchase to make a rational ‘green’ decision. Unlike refrigerators, clothes washers, and automobiles (here in the USA), energy consumption isn’t part of the marketing propaganda of most home electronics. 

It should be.

Someday smart retailers will figure out how to market energy costs on home electronics, much like they already do for large home appliances. For my last clothes washer/dryer (tumbler, to those on the wrong side of the pond) the sales dude tried to push me up to a higher cost model based on features. When I explained that for clothing related appliances, my feature requirements were a step above a rock in a river, he wisely and quickly pointed me to an expensive but efficient washer & dryer model.

Sold.

As for the report as a whole, I’m skeptical of the annual gross energy costs and savings shown in the report, mostly because the estimates are highly dependent on user actions. I suspect that we really don’t know how many game consoles are left on continuously versus powered down after each use, and more importantly, the NRDC doesn’t consider the cost of cooling the heat generated by the consoles in those parts of the country where air conditioning is normally used.

So if you are like my neighbors and you leave your air conditioner running all summer, your summertime gaming costs will be much higher. The hundred-plus watts of heat needs more than a hundred-plus watts of cooling. But if, like me, you live in a climate where heating is the norm for more than half the year, the waste heat generated by the console gets subtracted from the heat your furnace needs to supply, making the cost of gaming somewhat less.
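A back-of-the-envelope estimate shows why a console left on continuously matters while a wall wart doesn’t. Every number below is an assumption of mine (console draw, electricity price, A/C efficiency, months of cooling), not a figure from the NRDC report:

```python
# Back-of-the-envelope estimate. All numbers are illustrative
# assumptions, not figures from the NRDC report.
watts = 150                # assumed console draw when left on
price_per_kwh = 0.11       # assumed electricity price, $/kWh
hours_per_year = 24 * 365

kwh_per_year = watts / 1000 * hours_per_year  # 1314 kWh if never powered off
base_cost = kwh_per_year * price_per_kwh      # ~$145/year

# Air conditioning penalty: removing 1 W of waste heat takes roughly
# 1/COP W of additional electricity (COP ~3 for a typical A/C unit).
cop = 3.0
ac_fraction = 0.5          # assumed fraction of the year the A/C runs
ac_cost = base_cost * ac_fraction / cop       # ~$24/year extra
print(round(base_cost, 2), round(ac_cost, 2))
```

By comparison, a half-watt wall wart run the same way costs well under a dollar a year, which is why the advice at the end of this post targets the hundred-watt loads.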

In any case, don’t sweat the wall warts. Look around for things that suck up a hundred or more watts and unplug those.

12/17/2010: Scientific American published a similar article.

[1]Lowering the Cost of Play, Natural Resources Defense Council