
Simple Steps to Improving Availability - Five Essential Transitions

A poorly managed high availability cluster will have lower availability than a properly managed non-redundant system.

That's a bold statement, but I'm pretty sure it's true. The bottom line is that the path to improving system availability begins with the fundamentals of system management, not with redundant or HA systems. Only after you have executed the fundamentals will clustering or high availability make a positive contribution to system availability.

Here are the five transitions that are the critical steps on the path to improved availability:

Transition #1: From ad-hoc system management to structured system management.

Structured system management implies that you understand the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications, that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. Ad hoc system management doesn't cut it.

Transition #2: From ad-hoc changes to simple change management.

Simple change management means that you have controls around changes sufficient to determine who/what/when/why on any change to any system or application critical file. Changes are predicted. Changes are documented. Changes are not random, and they do not 'just happen'. A text file 'changes.txt' edited with notepad.exe and stored in c:/changelog/ is not as comprehensive as a million dollar consultant-driven enterprise CMDB that takes years to implement, but is a huge step in the right direction, and certainly provides more incremental value at less cost than the big solution.
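To show how little tooling the 'changes.txt' approach needs, here is a minimal sketch in Python. The function names, the tab-separated layout and the example path are my own assumptions for illustration, not a standard; the only point is that every entry answers who/what/when/why on a single line.

```python
import getpass
from datetime import datetime, timezone


def format_change(who, what, why, when=None):
    """Return one tab-separated log line: when, who, what, why."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    fields = [when.strftime("%Y-%m-%dT%H:%M:%SZ"), who, what, why]
    # Collapse tabs and newlines so each change stays on exactly one line.
    return "\t".join(" ".join(str(field).split()) for field in fields)


def log_change(path, what, why, who=None):
    """Append one change record to the plain-text log at `path`."""
    line = format_change(who or getpass.getuser(), what, why)
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

A call such as log_change("changes.txt", "raised JVM heap on app01", "ticket: out-of-memory restarts") is all it takes (the host name and reason are made up). Because the file is append-only text, it can be searched with whatever you already have when you need to know what changed just before a failure.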

Transition #3: From 'I dunno....maybe.....' to root cause analysis.

Failures have a cause. All of them. The 'cosmic ray did it' excuse is bullshit. Find the root cause. Fix the core problem. You need to be able to determine that 'the event was caused by .... and can be resolved by ... and can be prevented from ever happening again by ...'. If you cannot find the cause and you have to resort to killing a process or rebooting a server to restore service, then you must add instrumentation, monitoring or debugging to your system sufficient so that the next time the event happens, you will find the cause.
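One cheap form of that instrumentation is to capture the state of the system before you kill the process or reboot, so the evidence survives the fix. The sketch below is only an illustration under my own assumptions: the function name is hypothetical, and which diagnostic commands are worth running depends entirely on your platform.

```python
import subprocess
from datetime import datetime, timezone


def capture_evidence(commands, timeout=30):
    """Run each diagnostic command and return a text report of its output."""
    report = [f"captured: {datetime.now(timezone.utc).isoformat()}"]
    for cmd in commands:
        report.append(f"--- {' '.join(cmd)}")
        try:
            done = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            report.append(f"exit code: {done.returncode}")
            report.append(done.stdout + done.stderr)
        except (OSError, subprocess.TimeoutExpired) as err:
            # A failing diagnostic must not block restoring service.
            report.append(f"failed: {err}")
    return "\n".join(report)
```

Pass it a list of argument lists for whatever tools you trust (process listings, open connections, disk usage), write the returned text to a timestamped file, and only then restart. The timeout matters: on a sick server the diagnostics themselves can hang.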

Transition #4: From 'try it...I think it will work..' to 'my tests show that......'

Comprehensive pre-production testing ensures that the systems that you build and the changes that you make will work as expected. You know that they will work because you tested them, and in the rare case that they do not work as expected, you will be able to determine the variation between test and production and devise a test that accommodates the differences.

Transition #5: From non-redundant systems to simple redundancy.

Finally, after you've made transitions one through four, you are ready for implementation of basic active/passive redundancy. Skipping ahead to transition #5 isn't going to get you to your availability goals any sooner.
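Some back-of-the-envelope arithmetic shows why. The sketch below uses the common steady-state approximation availability = MTBF / (MTBF + MTTR), and it flatters the cluster by assuming the two nodes fail independently and that failover always works. The MTBF and MTTR figures are invented purely for illustration; they are not measurements.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one system: uptime / total time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def pair_availability(node_availability):
    """Availability of two nodes where either one can carry the service.

    Assumes independent failures and perfect failover, which is optimistic.
    """
    return 1 - (1 - node_availability) ** 2


if __name__ == "__main__":
    # Hypothetical well-managed single server: rare failures, fast repair.
    single = availability(mtbf_hours=2000, mttr_hours=1)
    # Hypothetical poorly managed node: frequent failures, slow diagnosis.
    sloppy_pair = pair_availability(availability(mtbf_hours=200, mttr_hours=8))
    print(f"well-managed single server: {single:.4%}")
    print(f"poorly managed pair:        {sloppy_pair:.4%}")
```

With these made-up inputs the single server comes out near 99.95% and the pair near 99.85%, so the cluster loses even with every assumption in its favor. In practice the failures of a poorly managed pair are rarely independent (the same undocumented change tends to land on both nodes), which only widens the gap.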

Remember though, keep it simple. Complexity doesn't necessarily increase availability.


