Skip to main content

Failure Tolerance - the Google way

Another 'gotta think about this' article, this time via Stephen Shankland's's summary of a Jeff Dean Google I/O presentation.

The first thing that piqued my interest is the raw failure rate of a new Google cluster. Jeff Dean is claiming that on new clusters, they have half as many failures as servers. (1800 servers, 1000 failures.) That seems high to me. Working with an order of magnitude fewer servers than a single Google cluster, we don't see anything like that kind of failure rate. Either there really is a difference in the reliability of cheap hardware, or I'm not comparing like numbers.

It's hard for me to imagine that failure rate working at a several orders of magnitude smaller scale. The incremental cost of 'normal' hardware over cheap hardware at the hundreds of servers scale isn't that large of a number. But Google has a cheap hardware, smart software mantra, so presumably the economics works out in their favor.

The second interesting bit is taking the Google-like server failure assumptions comparing them to network protocol failure assumptions. The TCP world assumes that losses will occur and the protocols need to gracefully and unobtrusively handle the loss. Wide area networks have always been like that. Wide area network engineers assume that there will be loss. The TCP protocol assumes that there will be loss. The protocols that route data and the end points of the session are smart. In network land, things work even with data loss and errors as high as 1%, and in HA network designs, we often assume that 50% of our hardware can fail and we'll still have a functional system.

In the case of routed wide area networks, it isn't necessarily cheap hardware that leads us to smart software (TCP, OSPF, BGP), it's unreliable circuits (the backhoe factor). In router and network land, smart software and redundancy compensates for unreliable WAN circuits. In theory the smart software should let us run cheap hardware. In reality, we tend to have both smart software and expensive hardware. But a high hardware failure rate isn't really an option for network redundancy. If you've only got a single pair of HA routers 300 or 3000 miles away, and one dies, you've used up the +1 on your n+1 design. And it would be hard to build a remote routed node with more than a pair of redundant routers, mostly because the incoming telco circuit count would have to increase, and those things are expensive.

Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software.

There probably are a whole slew of places where we could take the engineering assumptions of one discipline apply them to an marginally related or unrelated discipline, with game-changing results.
(via Data Center Knowledge)


  1. I suppose it depends on what you consider a "failure". If I judged everything that went wrong on my network as a failure, well, lets just say I wouldn't want to report that number.

    Also, they're counting a PDU failure as 1 failure for each machine that it takes down. If your PDUs are powering 1000 machines, then yes, I can see that number being statistically likelier.

    Anyway, that kind of scale is just a different kind of computing from what most people do.

  2. "Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software."

    What? SAN fabrics are fabrics in that they have multiple paths and can withstand any single device failing. Fault tolerant file systems and databases have been around for years. You just pay for it all in speed.

  3. @Anonymous -

    I am comparing TCP-like network protocols, where .1% or even 1% packet loss is assumed, and the protocol and applications are built to work around the loss - to SCSI, Fiber Channel and the various file systems, where packet and/or data loss is an exceptional event that causes great problems.

    When encountering data loss, network protocols simply retransmit and continue on. File systems panic, crash and dismount, and kernels panic.

    But as you indicate, because file storage protocols are generally not loss or failure tolerant, we've built them on top of redundant (and expensive) highly available components and technologies.

    Is there a database or file system that when encountering a bad checksum or missing block of data, simply asks for it again? Or do they tend to roll over and die?


Post a Comment

Popular posts from this blog

Cargo Cult System Administration

Cargo Cult: …imitate the superficial exterior of a process or system without having any understanding of the underlying substance --Wikipedia During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.
The question is, how amusing are we?We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?Here’s some common system administration failures that might be ‘cargo cult’:
Failing to understand the difference between necessary and sufficient. A daily backup …

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to…

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These …