Estimating the Availability of Simple Systems - Redundant

 
This is a continuation of a series of posts that attempt to provide the basics of estimating the availability of various simple systems. The Introduction covered the fundamentals, Part One covered estimating the availability of non-redundant systems. This post will attempt to cover simple redundant systems.
 
Let's go back to the failure estimate chart from the introductory post, but this time modify it for active/passive redundancy. Remember that for redundant components, the number of failures stays the same (the MTBF doesn't change), but the time to recover (MTTR) is shortened dramatically. The MTTR is no longer the time it takes to determine the cause of the failure and replace the failed part; it is the time it takes to fail over to the redundant component.
 
Component          Failures/Year   Hours to Recover   Hours to Failover   Hours Failed/Year
                                   (component)        (redundant)
WAN Link                .5               8                  .05                .025
Routers, Devices        .2               4                  .05                .01
SAN Fabric              .2               4                  .01                .002
SAN LUN                 .1              12                 1.2                 .12
Server                  .5               8                  .05                .025
Power/Cooling           .2               2                  0                  0

Notice the nice small numbers in the far right column. Redundant systems tend to do a nice job of reducing MTTR.
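The arithmetic behind that column is nothing fancy: hours failed per year is just failures per year multiplied by the recovery (or failover) time. Here's a minimal sketch of the calculation, using the guesstimates from the table above (these are this post's estimates, not measured data):

```python
# Rough downtime estimate: hours failed per year = failures/year * hours to recover.
# The numbers are the guesstimates from the table above, not measured data.

components = {
    # name: (failures_per_year, hours_to_recover, hours_to_failover)
    "WAN Link":         (0.5, 8,  0.05),
    "Routers, Devices": (0.2, 4,  0.05),
    "SAN Fabric":       (0.2, 4,  0.01),
    "SAN LUN":          (0.1, 12, 1.2),
    "Server":           (0.5, 8,  0.05),
    "Power/Cooling":    (0.2, 2,  0.0),
}

for name, (failures, repair_hours, failover_hours) in components.items():
    non_redundant = failures * repair_hours    # wait for diagnosis and repair
    redundant = failures * failover_hours      # just fail over to the spare
    print(f"{name:18s} non-redundant: {non_redundant:5.2f} h/yr   "
          f"redundant: {redundant:6.3f} h/yr")
```

For most components the redundant column comes out one to two orders of magnitude smaller than the non-redundant column, which is the whole point of the exercise.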

Note that if you believe in complexity and the human factor, you might argue that redundant systems, because they are more complex, have more failures. I'm sure this is true, but I haven't factored that in for this post (yet). Note also that I consider SAN LUN failure to be the same as in the non-redundant case: LUNs are always configured with some form of redundancy, even in the non-redundant scenario.

[Diagram: redundant application stack]

Now apply it to a typical redundant application stack. The stack mixes active/active and active/passive pairs. There are differences in the failure rates of active/active and active/passive HA pairs, but for this post the difference is minor enough to ignore. (Half the time an active/passive pair has a failure, the passive device is the one that failed, so service is unaffected. Active/active pairs therefore have a service-affecting failure twice as often.)

Under normal operation, the green devices are active, and failure of any active device causes an application outage equal to the failover time of the high availability device pair.

The estimates for failure frequency and recovery time are:

Component          Failures/Year   Hours to Failover (MTTR)   Hours Failed/Year
WAN Link                .5                  .05                    .025
Router                  .2                  .05                    .01
Firewall                .2                  .05                    .01
Load Balancers          .2                  .05                    .01
Switch                  .2                  .05                    .01
Web Server              .5                  .05                    .025
Switch                  .2                  .05                    .01
Database Server         .5                  .05                    .025
SAN Fabric              .2                  .01                    .002
SAN LUN                 .1                 1.2                     .12
Power/Cooling          1                    0                      0
Total                                                              .25 (≈ 15 minutes)
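To turn that total into an availability number, divide the expected downtime by the hours in a year (8,760) and subtract from one. Here's a quick sketch of the whole-stack estimate, using the components and guesstimates from the table above and assuming the stack is serial (a failover of any component interrupts the application):

```python
# Estimate annual downtime for a serial stack of redundant components:
# the application is down whenever any single component is failing over.

HOURS_PER_YEAR = 8760

stack = [
    # (component, failures_per_year, hours_to_failover)
    ("WAN Link",        0.5, 0.05),
    ("Router",          0.2, 0.05),
    ("Firewall",        0.2, 0.05),
    ("Load Balancers",  0.2, 0.05),
    ("Switch",          0.2, 0.05),
    ("Web Server",      0.5, 0.05),
    ("Switch",          0.2, 0.05),
    ("Database Server", 0.5, 0.05),
    ("SAN Fabric",      0.2, 0.01),
    ("SAN LUN",         0.1, 1.2),
    ("Power/Cooling",   1.0, 0.0),
]

downtime = sum(failures * failover for _, failures, failover in stack)
availability = 1 - downtime / HOURS_PER_YEAR

print(f"Estimated downtime:     {downtime:.2f} hours/year (~{downtime * 60:.0f} minutes)")
print(f"Estimated availability: {availability:.5%}")
```

On paper, that quarter hour per year is better than four nines.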

These numbers imply some assumptions about the components. For example, in this trivial case, I'm assuming that:

  • The WAN links must not be in the same conduit or terminate on the same device at the upstream ISP; otherwise they would not be considered redundant. Also, the WAN links will likely be active/active, so the probability of a service-affecting failure will double (illustrated in the sketch after this list).
  • Networks, Layer 3 versus Layer 2. I'll go out on a limb here. In spite of what vendors promise, under most conditions layer 2 redundancy (i.e. spanning tree managed link redundancy) does not have higher availability than a similarly designed network with layer 3 redundancy (routing protocols). My experience is that the phrase 'friends don't let friends run spanning tree' is true, and that the advantages gained by the link redundancy of layer 2 designs are outweighed by the increased probability of a spanning tree loop related network failure.
  • Power failures are assumed to be covered by battery/generator, but many hosting facilities still have service-affecting power/cooling failures. If I were presenting these numbers to a client, I'd factor in power and cooling somehow, perhaps by guesstimating an hour or so per year, depending on the facility.
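To illustrate how much those assumptions matter, here's a hedged sketch that adjusts the base estimate for two of them: doubling the service-affecting WAN failure rate (active/active links) and adding a guesstimated hour per year of power/cooling downtime. The adjustment values are illustrative guesses, not measurements:

```python
# Adjust the base estimate for two of the assumptions above:
#  - active/active WAN links: a failure of either link is service affecting,
#    so roughly double the service-affecting failure rate
#  - power/cooling: guesstimate an hour or so per year, depending on the facility
# The adjustment values are illustrative guesses.

base_downtime = 0.25                   # hours/year, from the stack table above

wan_original = 0.5 * 0.05              # original WAN contribution (hours/year)
wan_active_active = (0.5 * 2) * 0.05   # doubled service-affecting failure rate

power_cooling_guess = 1.0              # guesstimated hours/year, facility dependent

adjusted = base_downtime - wan_original + wan_active_active + power_cooling_guess
print(f"Adjusted downtime estimate: {adjusted:.2f} hours/year")
```

Notice that a single hour of power/cooling trouble swamps everything else in the table, which is why the facility matters as much as the architecture.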

These numbers look pretty good, but don't start jumping up and down yet.

Humans are in the loop, but not accounted for in this calculation. As I indicated in the human factor post, the human can, in the case where redundant systems are properly designed and deployed, be the largest cause of downtime (the keyboard-chair interface is non-redundant). There is also no consideration for the application itself (bugs, bad or failed deployments) or for the database (performance problems, bugs). As indicated in a previous post, a poorly designed application that is down every week because of performance issues or bugs isn't going to magically have fewer failures because it is moved to redundant systems. It will just be a poorly designed, redundant application.

Coupled Dependencies

 
[Diagram: redundant application stack with a failed component]

A quick note on coupled dependencies. In the example above, the design is such that the load balancer, firewall and router are coupled. (In this design they are; in other designs they are not.) A hypothetical failure of the active firewall would result in a firewall failover, a load balancer failover, and perhaps a router HSRP failover. The MTTR would be the time it takes for all three devices to figure out which one is active.
 
Coupled dependencies tend to cause unexpected outages themselves. Typically, when designing systems with coupled dependencies, thorough testing is needed to uncover unexpected interactions between the coupled devices. (In the case shown here, the interaction between HSRP, the routing protocol, the active/passive firewall, and layer 2 redundancy at the switch layer is complex enough to be worth a day in the lab.)
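One way to reason about the coupled failover time: if the devices converge independently and in parallel, the outage is roughly the longest of the individual failover times; if one device's failover has to complete before the next one converges, it approaches the sum. A hypothetical sketch (the convergence times below are made-up illustrations, not vendor figures):

```python
# Hypothetical convergence times (in seconds) for the coupled devices.
# These are made-up illustrations, not vendor numbers.
failover_seconds = {
    "firewall (active/passive)": 10,
    "load balancer":             15,
    "router (HSRP)":              5,
}

parallel = max(failover_seconds.values())    # devices converge independently
sequential = sum(failover_seconds.values())  # each failover triggers the next

print(f"Best case (parallel convergence): ~{parallel} seconds")
print(f"Worst case (serial convergence):  ~{sequential} seconds")
```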

Conclusions

  • Structured System Management has the greatest influence on availability.
  • With non-redundant but well managed systems, the human factor will be significant, but should not be the major cause of outages.
  • With redundant, well managed systems, the human factor may be the largest cause of outages.
  • A poorly designed or written application will not be improved by redundancy.
Keep in mind that the combination of application, human and database outages, not considered in this calculation, will far outweigh simple hardware and operating system failures. For your estimates, you will have to add failures and recovery time for human error, application failure and database failure. (Hint: figure a couple of hours each per year.)
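Taking that hint literally, here's a back-of-the-envelope sketch of what the non-hardware failure modes do to the estimate (the "couple hours each" figures are guesstimates, per the hint):

```python
# Add guesstimated human, application, and database downtime to the
# hardware/OS estimate. The figures are rough, per the hint above.

hardware_downtime = 0.25    # hours/year, from the redundant stack estimate

other_failures = {
    "human error":         2.0,   # hours/year, guesstimate
    "application failure": 2.0,
    "database failure":    2.0,
}

total = hardware_downtime + sum(other_failures.values())
availability = 1 - total / 8760

print(f"Total estimated downtime: {total:.2f} hours/year")
print(f"Estimated availability:   {availability:.4%}")
```

Six and a quarter hours a year is a very different number than fifteen minutes, and it is dominated by the parts of the system that redundancy doesn't fix.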
 
As indicated in the introductory post, I tried.
 
Back to the Introduction (or the previous post).