Failure Tolerance - the Google way

Another 'gotta think about this' article, this time via Stephen Shankland's news.com's summary of a Jeff Dean Google I/O presentation.

The first thing that piqued my interest is the raw failure rate of a new Google cluster. Jeff Dean is claiming that on new clusters, they have half as many failures as servers. (1800 servers, 1000 failures.) That seems high to me. Working with an order of magnitude fewer servers than a single Google cluster, we don't see anything like that kind of failure rate. Either there really is a difference in the reliability of cheap hardware, or I'm not comparing like numbers.

It's hard for me to imagine that failure rate working at a several orders of magnitude smaller scale. The incremental cost of 'normal' hardware over cheap hardware at the hundreds of servers scale isn't that large of a number. But Google has a cheap hardware, smart software mantra, so presumably the economics works out in their favor.


The second interesting bit is taking the Google-like server failure assumptions comparing them to network protocol failure assumptions. The TCP world assumes that losses will occur and the protocols need to gracefully and unobtrusively handle the loss. Wide area networks have always been like that. Wide area network engineers assume that there will be loss. The TCP protocol assumes that there will be loss. The protocols that route data and the end points of the session are smart. In network land, things work even with data loss and errors as high as 1%, and in HA network designs, we often assume that 50% of our hardware can fail and we'll still have a functional system.

In the case of routed wide area networks, it isn't necessarily cheap hardware that leads us to smart software (TCP, OSPF, BGP), it's unreliable circuits (the backhoe factor). In router and network land, smart software and redundancy compensates for unreliable WAN circuits. In theory the smart software should let us run cheap hardware. In reality, we tend to have both smart software and expensive hardware. But a high hardware failure rate isn't really an option for network redundancy. If you've only got a single pair of HA routers 300 or 3000 miles away, and one dies, you've used up the +1 on your n+1 design. And it would be hard to build a remote routed node with more than a pair of redundant routers, mostly because the incoming telco circuit count would have to increase, and those things are expensive.

Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software.

There probably are a whole slew of places where we could take the engineering assumptions of one discipline apply them to an marginally related or unrelated discipline, with game-changing results.
(via Data Center Knowledge)