
Availability, Complexity and the Person-Factor

I am trying out a new hypothesis:

When person-resources are constrained, highest availability is achieved when the system is designed with the minimum complexity necessary to meet availability requirements.
My hypothesis, that minimizing complexity maximizes availability, assumes that when the number of persons is constrained or fixed, the human factors in system failure and resolution become more important than the technology factors as systems become more complex.

This hypothesis also assumes that increasing system availability generally requires an increase in complexity. I am basing this on a simple analysis of availability combined with extensive experience managing technology.

Person Resources vs Complexity

  • As availability requirements are increased, the technology required becomes more complex.
  • As the technology gets more complex, the person-resources needed to manage the technology increase.
  • Resources are generally constrained, so the ideal resource allocation is unlikely to occur in the real world.

In an ideal organization, as system availability requirements go up, both person-resources and technology resources would increase as necessary to support the increased availability requirement. The relationship between availability and person-resources should look something like the first chart. In theory, the initial investment in structured system management and simple redundancy will result in a large improvement in availability relative to the resources spent.

Moving from ad-hoc system management toward structured system management will result in fewer unplanned downtimes and less time spent troubleshooting problems, so both MTBF and MTTR should improve.

Moving from non-redundant systems to simple redundancy (load-balanced app servers, active/passive firewall and network failover, active/passive clustering, etc.) will result in faster recovery from failures. Even though MTBF will not improve, MTTR will, and therefore availability will improve.
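The MTBF and MTTR argument above can be checked with a few lines of arithmetic. The sketch below assumes the usual steady-state definition, availability = MTBF / (MTBF + MTTR). The function name and all of the hour values are made up for illustration; they are not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: the component fails just as often in both cases
# (same MTBF), but failover to a passive node shortens recovery (lower MTTR).
MTBF_HOURS = 1000.0
non_redundant = availability(MTBF_HOURS, mttr_hours=4.0)   # manual repair
active_passive = availability(MTBF_HOURS, mttr_hours=0.1)  # automatic failover

print(f"non-redundant:  {non_redundant:.5f}")
print(f"active/passive: {active_passive:.5f}")
```

With MTBF held constant, the only thing that changes between the two cases is the recovery time, which is the effect that simple redundancy is expected to have.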

When simple active/passive redundancy is no longer adequate to achieve the required availability, system complexity increases greatly. Availability targets that require active/active clustering, multi-homed servers, redundant data centers, and layer-2 network redundancy with sub-minute recovery times require more person-resources relative to the resulting increase in availability. If person-resources are added along with the necessary technology resources, availability will continue to increase.

If, however, person-resources are not available to support the increased complexity brought on by the increased availability requirements, the availability curve will look something like the second chart. The systems will be complex to manage, and the existing person-resources, if not supplemented, will be unable to adequately design, test, and deploy the more complex environment. Most importantly, in the event of a failure in the more complex environment, more time will be spent troubleshooting and resolving problems, potentially increasing MTTR and decreasing availability.
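To make the trade-off concrete, here is a rough sketch that compares three scenarios using the same steady-state formula, availability = MTBF / (MTBF + MTTR). Every number is an assumption chosen only to illustrate the shape of the argument: the complex design is assumed to double MTBF, and the understaffed team is assumed to take much longer to troubleshoot it.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year at the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical (MTBF, MTTR) pairs in hours; none of these are measured values.
scenarios = {
    "simple active/passive": (2000.0, 0.25),
    "complex, staffed to match": (4000.0, 0.25),
    "complex, same staff": (4000.0, 2.0),  # longer troubleshooting per failure
}

results = {name: availability(mtbf, mttr) for name, (mtbf, mttr) in scenarios.items()}

for name, (mtbf, mttr) in scenarios.items():
    downtime = annual_downtime_hours(mtbf, mttr)
    print(f"{name:28s} availability={results[name]:.6f} downtime={downtime:.2f} h/yr")
```

Under these assumed numbers, the complex design with unchanged staffing ends up less available than the simple design, because the longer MTTR outweighs the gain in MTBF. Different assumptions would move the crossover point, so treat this as an illustration of the hypothesis and not as evidence for it.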

This may be nothing more than a restatement of the K.I.S.S. principle.

Related Posts:

Estimating Availability of Simple Systems – Introduction

Estimating Availability of Simple Systems - Non-redundant

Estimating Availability of Simple Systems – Redundant

Availability, Longer MTBF and shorter MTTR

Availability - MTBF, MTTR and the Human Factor

(2008-07-19 - Updated links, minor edits)

