
Estimating the Availability of Simple Systems - Redundant

 
This is a continuation of a series of posts that attempt to provide the basics of estimating the availability of various simple systems. The Introduction covered the fundamentals, and Part One covered estimating the availability of non-redundant systems. This post will attempt to cover simple redundant systems.
 
Let's go back to the failure estimate chart from the introductory post, but this time modify it for active/passive redundancy. Remember that for redundant components, the number of failures is the same (the MTBF doesn't change), but the time to recover (MTTR) is shortened dramatically. The MTTR is no longer the time it takes to determine the cause of the failure and replace the failed part; rather, it is the time it takes to fail over to the redundant component.
 
Component           Failures per Year   Hours to Recover (component)   Hours to Failover (redundant)   Hours Failed per Year
WAN Link            .5                  8                              .05                             .025
Routers, Devices    .2                  4                              .05                             .01
SAN Fabric          .2                  4                              .01                             .002
SAN LUN             .1                  12                             .12                             .12
Server              .5                  8                              .05                             .025
Power/Cooling       .2                  2                              0                               0

Notice the nice small numbers in the far right column. Redundant systems tend to do a nice job of reducing MTTR.

Note that if you believe in complexity and the human factor, you might argue that because they are more complex, redundant systems have more failures. I'm sure this is true, but I haven't considered that for this post (yet). Note also that I consider SAN LUN failure to be the same as in the non-redundant case; I've assumed that LUNs are always configured with some form of redundancy, even in the non-redundant scenario.
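
To make the arithmetic behind the right-hand column explicit, here is a minimal sketch (in Python, my choice of notation, not anything from the original posts) of the calculation for a single component, using the WAN link row as the example: expected annual downtime is failures per year multiplied by the hours lost per failure, and the hours lost per failure shrink from the full repair time to the failover time once the component is redundant.

```python
def hours_failed_per_year(failures_per_year: float, hours_per_failure: float) -> float:
    """Expected annual downtime for one component: failure rate times time lost per failure."""
    return failures_per_year * hours_per_failure

# WAN link row from the table above.
wan_failures_per_year = 0.5
wan_hours_to_recover = 8      # non-redundant: diagnose and repair the failed link
wan_hours_to_failover = 0.05  # redundant: time to fail over to the second link

print(hours_failed_per_year(wan_failures_per_year, wan_hours_to_recover))   # 4.0 hours/year
print(hours_failed_per_year(wan_failures_per_year, wan_hours_to_failover))  # 0.025 hours/year
```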

Now apply it to a typical redundant application stack. The stack mixes active/active and active/passive pairs. There are differences in the failure rates of active/active and active/passive HA pairs, but for this post the difference is minor enough to ignore. (Half the time, when an active/passive pair has a failure, the passive device is the one that failed, so service is unaffected. Active/active pairs therefore have a service-affecting failure twice as often.)

Under normal operation, the green devices are active, and failure of any active device causes an application outage equal to the failover time of the high availability device pair.

The estimates for failure frequency and recovery time are:

Component           Failures per Year   Hours to Failover (MTTR)   Hours Failed per Year
WAN Link            .5                  .05                        .025
Router              .2                  .05                        .01
Firewall            .2                  .05                        .01
Load Balancers      .2                  .05                        .01
Switch              .2                  .05                        .01
Web Server          .5                  .05                        .025
Switch              .2                  .05                        .01
Database Server     .5                  .05                        .025
SAN Fabric          .2                  .01                        .002
SAN LUN             .1                  .12                        .12
Power/Cooling       1                   0                          0
Total                                                              .25 (= 15 minutes)
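
As a quick sanity check, the total downtime can be converted into an availability percentage by dividing by the hours in a year. That conversion isn't in the table above; it's the usual availability arithmetic, sketched below. The per-component figures come straight from the table; the two switches are labeled by tier only to keep the keys unique.

```python
# Sum the per-component downtime estimates from the table and convert to availability.
hours_failed = {
    "WAN Link": 0.025, "Router": 0.01, "Firewall": 0.01, "Load Balancers": 0.01,
    "Switch (web tier)": 0.01, "Web Server": 0.025, "Switch (db tier)": 0.01,
    "Database Server": 0.025, "SAN Fabric": 0.002, "SAN LUN": 0.12, "Power/Cooling": 0.0,
}

total_downtime = sum(hours_failed.values())   # ~0.25 hours, about 15 minutes
hours_per_year = 365 * 24                     # 8760
availability = 1 - total_downtime / hours_per_year

print(f"Estimated downtime: {total_downtime:.3f} hours/year")
print(f"Estimated availability: {availability:.5%}")  # roughly 99.997%
```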

These numbers imply some assumptions about some of the components. For example, in this trivial case, I'm assuming that:

  • The WAN links must not be in the same conduit or terminate on the same device at the upstream ISP; otherwise they would not be considered redundant. Also, the WAN links will likely be active/active, so the probability of a service-affecting failure will double.
  • Networks, layer 3 versus layer 2. I'll go out on a limb here. In spite of what vendors promise, under most conditions layer 2 redundancy (i.e. spanning tree managed link redundancy) does not have higher availability than a similarly designed network with layer 3 redundancy (routing protocols). My experience is that the phrase 'friends don't let friends run spanning tree' is true, and that the advantages gained by the link redundancy provided by layer 2 designs are outweighed by the increased probability of a spanning tree loop related network failure.
  • Power failures are assumed to be covered by battery/generator, but many hosting facilities still have service-affecting power/cooling failures. If I were presenting these numbers to a client, I'd factor in power and cooling somehow, perhaps by guesstimating an hour or so per year, depending on the facility.

These numbers look pretty good, but don't start jumping up and down yet.

Humans are in the loop, but not accounted for in this calculation. As I indicated in the human factor post, when redundant systems are properly designed and deployed, the human can be the largest cause of downtime (the keyboard-chair interface is non-redundant). Also, there is no consideration for the application itself (bugs, bad or failed deployments) or for the database (performance problems, bugs). As indicated in a previous post, a poorly designed application that is down every week because of performance issues or bugs isn't going to magically have fewer failures because it is moved to redundant systems. It will just be a poorly designed, redundant application.

Coupled Dependencies

 
A quick note on coupled dependencies. In the example above, the design is such that the load balancer, firewall and router are coupled. (In this design they are; in other designs they are not.) A hypothetical failure of the active firewall would result in a firewall failover, a load balancer failover, and perhaps a router HSRP failover. The MTTR would be the time it takes for all three devices to figure out who is active.
 
Coupled dependencies tend to cause unexpected outages themselves. Typically, when designing systems with coupled dependencies, thorough testing is needed to uncover unexpected interactions between the coupled devices. (In the case shown here, the interaction between HSRP, the routing protocol, the active/passive firewall, and layer 2 redundancy at the switch layer is complex enough to be worth a day in the lab.)
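
One way to fold this into the estimate: when devices are coupled, a single failure triggers a failover on each of them, and the outage lasts at least until the slowest one has converged. The sketch below models that as the maximum of the individual failover times; the convergence numbers are made up for illustration, and real interactions between the devices can stretch the outage well past this simple bound.

```python
# Coupled failover: one failure triggers failovers on several devices at once,
# and service returns only when the slowest device has converged.
# The convergence times below are illustrative, not measurements.
coupled_failover_hours = {
    "firewall (active/passive)": 0.02,
    "load balancer":             0.01,
    "router (HSRP)":             0.005,
}

# Simple model: the devices converge in parallel, so the outage is at least as
# long as the slowest convergence. Unexpected interactions can make it longer.
effective_mttr = max(coupled_failover_hours.values())
print(f"Effective failover time: {effective_mttr * 3600:.0f} seconds")
```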

Conclusions

  • Structured System Management has the greatest influence on availability.
  • With non-redundant but well managed systems, the human factor will be significant, but should not be the major cause of outages.
  • With redundant, well managed systems, the human factor may be the largest cause of outages.
  • A poorly designed or written application will not be improved by redundancy.
Keep in mind that the combination of application, human and database outages, not considered in the calculation, will far outweigh simple hardware and operating system failures. For your estimations, you will have to add failures and recovery time for human failure, application failure and database failure. (Hint - figure a couple hours each per year.)
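
To see how much those non-hardware failures dominate, here is a rough illustration that takes the hint literally: a couple of hours each per year for human, application and database failures, added to the quarter hour of hardware and operating system downtime estimated above. The two-hour figures are just the hint from this post, not measurements.

```python
# Rough illustration: add the hinted "couple hours each per year" for the
# failure modes the hardware calculation ignores. These are guesses, not data.
hardware_and_os = 0.25   # hours/year, from the redundant stack estimate above
human = 2.0              # hours/year (hint from this post)
application = 2.0        # hours/year (hint from this post)
database = 2.0           # hours/year (hint from this post)

total = hardware_and_os + human + application + database
availability = 1 - total / (365 * 24)

print(f"Total estimated downtime: {total:.2f} hours/year")
print(f"Estimated availability:   {availability:.4%}")  # about 99.93%
```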
 
As indicated in the introductory post, I tried.
 
Back to the Introduction (or the previous post).
