Availability - MTBF, MTTR and the Human Factor

We've got a couple of data centers. By most measures, they are small, at just over 100 servers between the two of them. We've been working for the last year to build the second data center with the specific design goal of being able to fail over mission critical applications to the secondary data center during extended outages. Since the goal of the second data center is specifically to improve application availability, I decided to rough in some calculations on what kind of design we'd need to meet our availability and recovery goals.

Start with MTBF (Mean Time Between Failures). In theory, each component that makes up the infrastructure supporting an application has an MTBF, typically measured in thousands of hours. The second factor in calculating availability is MTTR (Mean Time to Recovery or Repair), measured in hours or minutes. Knowing the average time between failures, and the average time to repair or recover from a failure, one should be able to make an approximate prediction of the availability of a particular component.

Systems, no matter how complex, are made up of components. In theory, one can calculate the MTBF and MTTR of each component and derive the availability of an entire system. This can get to be a fairly involved calculation, and I'm sure somewhere there are people who make a living doing MTBF/MTTR calculations.

If this is at all interesting, read eventhelix.com's description of MTBF/MTTR and availability. Then follow along with my calculations, and amuse yourself at the conclusion. Remember, all the calculations are really rough. I was trying to get an order-of-magnitude estimate based on easily obtainable data, not an exact number.

Raw MTBF Data:

Single Hard Drive, high end manufacturer: 1.4 million hours
Single DIMM, moderately priced manufacturer: 4 million hours
Single CPU: 1 million hours
Single network device (switch): 200,000 hours
Single person, non-redundant: 2000 hours

OK - the Single Person, non-redundant number I just made up. But I figure that each employee, if left alone with no constraints (change review, change management), will probably screw up about once a year.

MTTR is simply an estimate of how long it takes to restore service in the event of a failure. For example, in the case of a failed non-redundant HDD that has a server OS and lots of data on it, the MTTR would be the time it takes to call the vendor, have a spare delivered, rebuild the server and recover everything from tape backup. Figure 8 hours for a simple server. Or, in the case of a memory DIMM, I figure one hour to get through the tech support queue at the manufacturer, one hour to argue with them about your contract terms & conditions, 4 hours to deliver the DIMM, and an hour to install and re-boot, or about 7 hours.

In the case of redundant or clustered devices, the MTTR is effectively the time it takes for the clustering or high availability software to figure out what has failed and take the failed device offline or route around it. Figure somewhere between 1 and 3 minutes.

To convert MTBF and MTTR to availability, some math is needed.

If a system has more than one component, and if all components are required for the system to operate, then the availability of the system is calculated by multiplying the availabilities of the components together. If the components are redundant, and if each of the redundant components is fully functional alone, then the availability of the system is 1 - (1 - A)^n. (Or - use an online calculator.)

Obviously, in the case of redundant components, the system only fails when all of the redundant components are down at the same time. That tends to be rare, so redundant systems can have much better availability.
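
Here's that math as a rough Python sketch - a sketch only, using the usual steady-state approximation A = MTBF / (MTBF + MTTR) for a single component, plus the series and redundant rules above; the numbers are the disk figures from the list above.

    # Availability of a single component, using the common approximation
    # A = MTBF / (MTBF + MTTR).
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Series: every component is required, so multiply availabilities together.
    def series(availabilities):
        result = 1.0
        for a in availabilities:
            result *= a
        return result

    # Redundant: any one of n identical components is sufficient: 1 - (1 - A)^n.
    def redundant(a, n):
        return 1 - (1 - a) ** n

    # Example: one disk (1.4 million hour MTBF, 8 hour MTTR) vs. a mirrored pair.
    disk = availability(1_400_000, 8)
    print(f"single disk:   {disk:.5%}")            # ~99.99943%
    print(f"mirrored pair: {redundant(disk, 2):.8%}")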

How does this work out using some rough numbers?

First of all, an interesting problem immediately becomes evident. For maximum performance, I'd build a server with more CPUs, more memory DIMMs, more network cards, and more disk controllers. The component count goes up. If I build the server without redundancy, the server is less reliable. But faster. Most server vendors compensate for this by building some components with error correction capability, so that even though a component fails, the system still runs. But more components, in theory, will result in a less reliable server. So in rough numbers, using only memory DIMMs in the calculation, and assuming no error correction or redundancy:

DIMM Count   MTBF (Hours)   MTTR (Hours)   Availability
 1           4 million      8              99.9998%
 4           4 million      8              99.9992%
 8           4 million      8              99.9984%
16           4 million      8              99.9968%
32           4 million      8              99.9936%
64           4 million      8              99.9872%

Adding non-redundant components doesn't help availability. A high-end server with lots of DIMMs is already down to roughly four 9's, without even considering all the other components in the system.
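
For what it's worth, the table above is easy to reproduce - a rough sketch, assuming A = MTBF / (MTBF + MTTR) per DIMM and all DIMMs in series:

    MTBF = 4_000_000   # hours, per DIMM
    MTTR = 8           # hours to get a replacement delivered and installed

    a_dimm = MTBF / (MTBF + MTTR)
    for n in (1, 4, 8, 16, 32, 64):
        print(f"{n:2d} DIMMs: {a_dimm ** n:.4%}")
    # 64 DIMMs comes out around 99.987% - a shade under four 9's,
    # before counting any other component in the server.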

In clustered or high-availability systems, the availability calculation changes dramatically. In the simple case of active/passive failover, like Cisco firewalls or Microsoft clustering, the MTBF doesn't change, but the MTTR does. The MTTR is essentially the cluster failover time, usually a minute or so. Now the time that it takes the tech to show up with parts is no longer relevant.

Take the same numbers above, but make the MTTR 3 minutes (cluster failover time) and the theoretical availability goes way up.

DIMM Count   MTBF (Hours)   MTTR (Hours)   Availability
 1           4 million      0.05           99.9999%
 4           4 million      0.05           99.9999%
 8           4 million      0.05           99.9999%
16           4 million      0.05           99.9999%
32           4 million      0.05           99.9999%
64           4 million      0.05           99.9999%


Active/passive clustering doesn't change MTBF, but it does change MTTR, and therefore availability. I ran through a handful of calculations and figured out that if all we have is a reasonable level of active/passive redundancy on devices and servers, and a reasonably well-designed network and storage, we should be able to meet our availability goals.
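
A quick downtime comparison makes the point - again a rough sketch with the same assumptions, 64 DIMMs in series, comparing an 8-hour repair against a 3-minute failover:

    MTBF = 4_000_000        # hours, per DIMM
    HOURS_PER_YEAR = 8760
    n = 64

    for label, mttr in (("non-redundant, 8 hour repair  ", 8),
                        ("active/passive, 3 min failover", 0.05)):
        a = (MTBF / (MTBF + mttr)) ** n
        minutes_down = (1 - a) * HOURS_PER_YEAR * 60
        print(f"{label}: {a:.5%}  (~{minutes_down:.1f} minutes/year)")
    # Roughly 67 minutes/year of expected downtime versus well under a minute.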

Except......

The Human Factor:

So I was well on my way down this path, trying to determine if Layer-3 WAN redundancy was adequate to meet availability goals, and if active/passive clusters combined with dual switch fabrics and HA load balancers and firewalls would result in MTBFs and MTTRs that make sense, when I decided to figure in the human factor.

I figure that the more complex the system, the higher the probability of a human-induced failure, and that humans have a failure rate that is based not only on skill and experience, but also on the structure of the processes that are used to manage the systems.

Assume, as above, that a person makes one error per year, and that the error results in 30 minutes of downtime. That alone drops you to roughly 99.99% availability. That's without considering any hardware, software, power or other outages. If you figure that you have a handful of people who are in a position to make an error that results in downtime, you've got a problem.
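
Putting rough numbers on that - a sketch assuming one 30-minute, human-induced outage per person per year:

    HOURS_PER_YEAR = 8760
    OUTAGE_HOURS = 0.5      # one 30-minute mistake per person per year

    for people in (1, 3, 5):
        downtime = people * OUTAGE_HOURS
        a = 1 - downtime / HOURS_PER_YEAR
        print(f"{people} people: {a:.4%}  ({downtime * 60:.0f} minutes/year)")
    # One person is already at roughly 99.994%; a handful of people puts you
    # around 99.97% before any hardware failure is counted.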

Fortunately, persons can be made redundant, and processes (Change Management) can be designed to enforce person-level redundancy.

So now I'm thinking about Change Management, QA environments, test labs.

Related Posts:
Estimating Availability of Simple Systems – Introduction
Estimating Availability of Simple Systems - Non-redundant
Estimating Availability of Simple Systems - Redundant
Availability, Complexity and the Person-Factor
Availability, Longer MTBF and shorter MTTR

(2008-07-19 - Updated links, minor edits)