| Component | Failures per Year | Hours to Recover (component) | Hours to Failover (redundant) | Hours Failed per Year |
|---|---|---|---|---|
| WAN Link | .5 | 8 | .05 | .025 |
| Routers, Devices | .2 | 4 | .05 | .01 |
| SAN Fabric | .2 | 4 | .01 | .002 |
| SAN LUN | .1 | 12 | 1.2 | .12 |
| Server | .5 | 8 | .05 | .025 |
| Power/Cooling | .2 | 2 | 0 | 0 |
Notice the nice small numbers in the far right column. Redundant systems tend to do a nice job of reducing MTTR.
Note that if you believe in complexity and the human factor, you might argue that redundant systems, being more complex, have more failures. I'm sure this is true, but I haven't considered that for this post (yet). Note also that I consider SAN LUN failure to be the same as in the non-redundant case: I've assumed that LUNs are always configured with some form of redundancy, even in the non-redundant scenario.
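The arithmetic behind the table is simply failures per year multiplied by the hours needed to restore service. A minimal sketch in Python, using the estimates above (for the SAN LUN I use a 1.2 hour failover time, which is what the .12 hours-failed-per-year figure implies):

```python
# Annual downtime ≈ failures per year × hours to restore service.
# Figures are the estimates from the table above.
components = {
    # name: (failures_per_year, hours_to_recover, hours_to_failover)
    "WAN Link":         (0.5, 8,  0.05),
    "Routers, Devices": (0.2, 4,  0.05),
    "SAN Fabric":       (0.2, 4,  0.01),
    "SAN LUN":          (0.1, 12, 1.2),
    "Server":           (0.5, 8,  0.05),
    "Power/Cooling":    (0.2, 2,  0),
}

for name, (rate, recover, failover) in components.items():
    non_redundant = rate * recover   # rebuild/repair the failed component
    redundant = rate * failover      # fail over to the surviving peer
    print(f"{name:18s} {non_redundant:5.2f} vs {redundant:6.3f} hours/year")
```

Summing the two cases shows why the far right column looks so good: roughly 11 hours per year of downtime without redundancy versus a fraction of an hour with it.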
Now apply it to a typical redundant application stack. The stack mixes active/active and active/passive pairs. There are differences in the failure rates of active/active and active/passive HA pairs, but for this post the difference is minor enough to ignore. (Half the time, when an active/passive pair has a failure, the passive device is the one that failed, so service is unaffected. Active/active pairs therefore have a service-affecting failure twice as often.)
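The parenthetical reasoning can be sketched out; this is a toy model under the assumption that both devices in a pair fail at the same per-device rate:

```python
# Each device in an HA pair fails at `rate_per_device` times per year.
# Active/passive: roughly half of all pair failures hit the passive
# device, so only half are service-affecting.
# Active/active: every device carries traffic, so every failure is
# service-affecting.
def service_affecting_failures(rate_per_device, mode):
    pair_failures = 2 * rate_per_device  # two devices in the pair
    if mode == "active/passive":
        return pair_failures / 2         # passive-side failures don't affect service
    return pair_failures                 # active/active: all failures affect service

aa = service_affecting_failures(0.2, "active/active")
ap = service_affecting_failures(0.2, "active/passive")
print(aa, ap)  # active/active sees twice as many service-affecting failures
```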
Under normal operation, the green devices are active, and failure of any active device causes an application outage equal to the failover time of the high availability device pair.
The estimates for failure frequency and recovery time are:
| Component | Failures per Year | Hours to Failover (MTTR) | Hours Failed per Year |
|---|---|---|---|
| WAN Link | .5 | .05 | .025 |
| Router | .2 | .05 | .01 |
| Firewall | .2 | .05 | .01 |
| Load Balancers | .2 | .05 | .01 |
| Switch | .2 | .05 | .01 |
| Web Server | .5 | .05 | .025 |
| Switch | .2 | .05 | .01 |
| Database Server | .5 | .05 | .025 |
| SAN Fabric | .2 | .01 | .002 |
| SAN LUN | .1 | 1.2 | .12 |
| Power/Cooling | 1 | 0 | 0 |
| **Total** | | | .25 (≈15 min) |
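Summing the far right column and converting to an availability percentage (a quick sketch; the two switches are labeled "front" and "back" only to keep the dictionary keys unique):

```python
# Per-component annual downtime for the redundant stack,
# taken from the table above (hours failed per year).
hours_failed = {
    "WAN Link": 0.025, "Router": 0.01, "Firewall": 0.01,
    "Load Balancers": 0.01, "Switch (front)": 0.01,
    "Web Server": 0.025, "Switch (back)": 0.01,
    "Database Server": 0.025, "SAN Fabric": 0.002,
    "SAN LUN": 0.12, "Power/Cooling": 0,
}

total = sum(hours_failed.values())     # ≈ 0.25 hours/year
availability = 1 - total / (365 * 24)  # fraction of the year in service
print(f"{total * 60:.0f} minutes/year down, {availability:.5%} available")
```

About fifteen minutes of downtime per year, or roughly four nines of availability, assuming the failure estimates hold.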
These numbers imply some assumptions about some of the components. For example, in this trivial case, I'm assuming that:
- The WAN links must not be in the same conduit, nor terminate on the same device at the upstream ISP; otherwise they would not be considered redundant. Also, the WAN links will likely be active/active, so the probability of a service-affecting failure doubles.
- Networks, Layer 3 versus Layer 2. I'll go out on a limb here. In spite of what vendors promise, under most conditions layer 2 redundancy (i.e. spanning tree managed link redundancy) does not have higher availability than a similarly designed network with layer 3 redundancy (routing protocols). My experience is that the phrase 'friends don't let friends run spanning tree' is true, and that the advantages gained by the link redundancy of layer 2 designs are outweighed by the increased probability of spanning tree loop related network failures.
- Power failures are assumed to be covered by battery/generator, but many hosting facilities still have service-affecting power/cooling failures. If I were presenting these numbers to a client, I'd factor in power and cooling somehow, perhaps by guesstimating an hour or so per year, depending on the facility.
These numbers look pretty good, but don't start jumping up and down yet.
Humans are in the loop, but are not accounted for in this calculation. As I indicated in the human factor, where redundant systems are properly designed and deployed, the human can be the largest cause of downtime (the keyboard-chair interface is non-redundant). There is also no consideration for the application itself (bugs, bad or failed deployments) or for the database (performance problems, bugs). As indicated in a previous post, a poorly designed application that is down every week because of performance issues or bugs isn't going to magically have fewer failures because it is moved to redundant systems. It will just be a poorly designed, redundant application.
Coupled Dependencies
Conclusions
- Structured System Management has the greatest influence on availability.
- With non-redundant but well managed systems, the human factor will be significant, but should not be the major cause of outages.
- With redundant, well managed systems, the human factor may be the largest cause of outages.
- A poorly designed or written application will not be improved by redundancy.