Friday, May 30, 2008

Failure Tolerance - the Google way

Another 'gotta think about this' article, this time via Stephen Shankland's News.com summary of a Jeff Dean Google I/O presentation.

The first thing that piqued my interest is the raw failure rate of a new Google cluster. Jeff Dean claims that a new cluster sees roughly half as many failures as it has servers (1,800 servers, 1,000 failures). That seems high to me. Working with an order of magnitude fewer servers than a single Google cluster, we don't see anything like that failure rate. Either there really is a difference in the reliability of cheap hardware, or I'm not comparing like numbers.

It's hard for me to imagine that failure rate working at a scale several orders of magnitude smaller. The incremental cost of 'normal' hardware over cheap hardware at the hundreds-of-servers scale isn't that large. But Google has a cheap hardware, smart software mantra, so presumably the economics work out in their favor.


The second interesting bit is taking Google-like server failure assumptions and comparing them to network protocol failure assumptions. The TCP world assumes that losses will occur and that the protocols need to handle the loss gracefully and unobtrusively. Wide area networks have always been like that: wide area network engineers assume that there will be loss, and the TCP protocol assumes that there will be loss. The protocols that route data and the endpoints of the session are smart. In network land, things keep working with data loss and error rates as high as 1%, and in HA network designs we often assume that 50% of our hardware can fail and we'll still have a functional system.
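
To make that mindset concrete, here's a minimal sketch of the retransmit-until-acknowledged loop that TCP-style protocols are built around. The 1% drop rate mirrors the loss figure above, but the channel and helper functions are made up for illustration; this isn't a real network API.

```python
import random
import time

# Hypothetical unreliable channel: drops roughly 1% of packets,
# about the loss rate the post says network designs tolerate.
def send_packet(seq, payload):
    return random.random() > 0.01   # True if the packet "arrived"

def wait_for_ack(seq, arrived):
    return arrived                  # stand-in for hearing an ACK back

def reliable_send(seq, payload, max_retries=5, timeout=0.2):
    """Loss is expected, not exceptional: retransmit until acknowledged."""
    for attempt in range(max_retries):
        if wait_for_ack(seq, send_packet(seq, payload)):
            return True             # the loss, if any, was absorbed silently
        time.sleep(timeout * (2 ** attempt))   # back off and try again
    raise RuntimeError("packet %d undeliverable after %d tries" % (seq, max_retries))
```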

In the case of routed wide area networks, it isn't necessarily cheap hardware that leads us to smart software (TCP, OSPF, BGP); it's unreliable circuits (the backhoe factor). In router and network land, smart software and redundancy compensate for unreliable WAN circuits. In theory the smart software should let us run cheap hardware. In reality, we tend to have both smart software and expensive hardware, because a high hardware failure rate isn't really an option for network redundancy. If you've only got a single pair of HA routers 300 or 3000 miles away and one dies, you've used up the +1 in your n+1 design. And it would be hard to build a remote routed node with more than a pair of redundant routers, mostly because the incoming telco circuit count would have to increase, and those circuits are expensive.
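
To put rough numbers on that +1 argument, a back-of-the-envelope sketch; the 99% and 99.9% availability figures are assumptions for illustration, not vendor numbers, and independent failures are assumed.

```python
# A redundant pair is down only if both boxes are down at once.
def pair_availability(a):
    return 1 - (1 - a) ** 2

cheap_router = 0.99    # assumed availability of a cheap, failure-prone box
good_router = 0.999    # assumed availability of a pricier box

print("single cheap box : %.4f" % cheap_router)                       # 0.9900
print("pair of cheap    : %.4f" % pair_availability(cheap_router))    # 0.9999
print("single good box  : %.4f" % good_router)                        # 0.9990
print("pair of good     : %.6f" % pair_availability(good_router))     # 0.999999

# The catch from the paragraph above: once one of the pair dies,
# the +1 is used up and the remote site runs at the single-box
# number until someone ships a replacement out.
```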

Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software.
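
By way of contrast with the retransmit sketch above, here's roughly what that fail-fast posture looks like; the block size, checksum scheme and on-disk layout are invented for the example, not taken from any real file system.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, just for the example

def read_block(device, block_no, expected_crc):
    """Storage-stack posture: a bad checksum is an error, not a retry."""
    device.seek(block_no * BLOCK_SIZE)
    data = device.read(BLOCK_SIZE)
    if zlib.crc32(data) != expected_crc:
        # There is no 'ask for it again' step here; the error propagates
        # up, and the layers above treat it as an exceptional event.
        raise IOError("checksum mismatch on block %d" % block_no)
    return data
```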

There is probably a whole slew of places where we could take the engineering assumptions of one discipline and apply them to a marginally related or unrelated discipline, with game-changing results.
(via Data Center Knowledge)

3 comments:

  1. I suppose it depends on what you consider a "failure". If I judged everything that went wrong on my network as a failure, well, let's just say I wouldn't want to report that number.

    Also, they're counting a PDU failure as 1 failure for each machine that it takes down. If your PDUs are powering 1000 machines, then yes, I can see that number being statistically likelier.

    Anyway, that kind of scale is just a different kind of computing from what most people do.

  2. "Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software."

    What? SAN fabrics are fabrics in that they have multiple paths and can withstand any single device failing. Fault tolerant file systems and databases have been around for years. You just pay for it all in speed.

  3. @Anonymous -

    I am comparing TCP-like network protocols, where 0.1% or even 1% packet loss is assumed and the protocols and applications are built to work around the loss, to SCSI, Fibre Channel and the various file systems, where packet and/or data loss is an exceptional event that causes great problems.

    When encountering data loss, network protocols simply retransmit and continue on. File systems crash and dismount, and kernels panic.

    But as you indicate, because file storage protocols are generally not loss or failure tolerant, we've built them on top of redundant (and expensive) highly available components and technologies.

    Is there a database or file system that, when it encounters a bad checksum or a missing block of data, simply asks for it again? Or do they tend to roll over and die?
