
Estimating the Availability of Simple Systems - Non-redundant

In the Introductory post to this series, I outlined the basics for estimating the availability of simple systems. This post picks up where the first post left off and looks at availability estimates for non-redundant systems.

Let's go back to the failure estimate chart from the introductory post:

Component          Failures per Year   Hours to Recover   Hours Failed per Year
WAN Link           0.5                 8                  4.0
Routers, Devices   0.2                 4                  0.8
SAN Fabric         0.2                 4                  0.8
SAN LUN            0.1                 12                 1.2
Server             0.5                 8                  4.0
Power/Cooling      1.0                 2                  2.0

And apply it to a simple stack of three non-redundant devices in series. Assuming that the devices all boot from flash and have no hard drives, we apply the estimated failures per year and hours to recover from the Routers/Devices row of the table to each device. For series dependencies, where the availability of the stack depends on every device in the stack, simply adding the unavailability (failures per year times hours to recover) of each device gives us an estimate for the entire stack.

For each device:

0.2 failures/year * 4 hours to recover = 0.8 hours/year of unavailability.

For three devices in series, each with approximately the same failure rate and recovery time, the unavailability estimate is the sum of the unavailability of each component: 0.8 + 0.8 + 0.8 = 2.4 hours/year.
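
As a minimal sketch of that arithmetic in Python (the rates and recovery times are the rough estimates from the chart above, nothing more):

    # Series unavailability: sum of (failures/year x hours to recover)
    # for each non-redundant component the service depends on.
    def series_unavailability(components):
        return sum(rate * hours for rate, hours in components)

    # Three identical devices from the Routers/Devices row of the chart.
    three_devices = [(0.2, 4)] * 3
    print(f"{series_unavailability(three_devices):.1f} hours/year")  # 2.4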

Notice a critical consideration. The more non-redundant devices you stack up in series, the lower your availability. (Notice I made that sentence in bold and italics. I did that 'cause it's a really important concept.)

The non-redundant series dependencies also show up in other interesting places in a technology stack. For example, if I want my really big server to go faster, I add more memory modules so that the memory bus can stripe memory access across more modules and spend less time waiting for memory access. Those memory modules are effectively serial and non-redundant. So for a fast server, we'd rather have 16 x 1GB DIMMs than 8 x 2GB DIMMs or 4 x 4GB DIMMs. The server with 16 x 1GB DIMMs will likely go faster than the server with 4 x 4GB DIMMs, but it will be roughly 4 times as likely to have a memory failure.
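
A rough illustration of that trade-off, assuming (hypothetically) that every DIMM has the same small annual failure rate regardless of capacity:

    # Hypothetical per-module failure rate; the point is the scaling
    # with module count, not the specific number.
    dimm_failure_rate = 0.01  # assumed failures per module per year

    for modules, size_gb in [(16, 1), (8, 2), (4, 4)]:
        expected = modules * dimm_failure_rate
        print(f"{modules} x {size_gb}GB DIMMs: {expected:.2f} expected memory failures/year")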

Let's look at a more interesting application stack, again a series of non-redundant components.

We'll assume the stack consists of a T1 from a provider, a router, a firewall, a switch, an application/web server, a database server, and an attached RAID array. The green path in the figure shows the dependencies for a successful user experience. The availability calculation is a simple sum, over all components, of failure frequency times recovery time.

[Figure: non-redundant application stack]
Component          Failures per Year   Hours to Recover   Hours Failed per Year
WAN Link           0.5                 8                  4.0
Router             0.2                 4                  0.8
Firewall           0.2                 4                  0.8
Switch             0.1                 12                 1.2
Web Server         0.5                 8                  4.0
Database Server    0.5                 8                  4.0
RAID Array         0.1                 12                 1.2
Power/Cooling      1.0                 2                  2.0
Total                                                     18.0 hours
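
The same series arithmetic, sketched in Python using the numbers from the table (again, rough guesses rather than measured data):

    # Non-redundant application stack: total unavailability is the sum
    # of (failures/year x hours to recover) across every component.
    stack = {
        "WAN Link":        (0.5, 8),
        "Router":          (0.2, 4),
        "Firewall":        (0.2, 4),
        "Switch":          (0.1, 12),
        "Web Server":      (0.5, 8),
        "Database Server": (0.5, 8),
        "RAID Array":      (0.1, 12),
        "Power/Cooling":   (1.0, 2),
    }

    total = sum(rate * hours for rate, hours in stack.values())
    print(f"Estimated downtime: {total:.1f} hours/year")  # 18.0 hours/year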

The estimate for this simple example works out to about 18 hours of down time per year, not including any application issues such as performance problems or scalability limitations. A few other caveats:

  • The estimate also doesn't consider the human factor.
  • Because the numbers we put into the chart are rough, with plenty of room for error, the final estimate is also just a guesstimate.
  • The estimate is the average hours of outage over a number of years, not the number of hours of outage for each year. You could have considerable variation from year to year.

Applying the Estimate to the Real World

To apply this to the real world and estimate an availability number for the entire application, you'd have to know more about the application, the organization, and the persons managing the systems.

For example, assume that the application is secure and well written, that there are no scalability issues, and that the application has version control, test/dev/QA environments, and a rigorous change management process. That application might suffer few if any application-related outages in a typical year - figure one bad deployment per year that causes 2 hours of down time. On the other hand, assume that it is poorly designed, that there is no source code control or structured deployment methodology, no test/QA/dev environments, and no change control. I've seen applications like that have a couple of hours a week of down time.
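
In round numbers, the well-managed case adds roughly 2 hours/year of application-related downtime to the 18-hour infrastructure estimate, for about 20 hours total; a couple of hours per week of application downtime works out to something on the order of 100 hours/year, which swamps the infrastructure estimate entirely.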

And if you consider the human factor, that the humans in the loop (the keyboard-chair interface) will eventually mis-configure a device, reboot the wrong server, or fail to complete a change within the change window, then you need to pad this number to take the humans into consideration.

On to Part Two (or back to the Introduction?)
