Tuesday, November 24, 2009

Cargo Cult System Administration

“imitate the superficial exterior of a process or system without having any understanding of the underlying substance” --Wikipedia

During and after WWII, some native South Pacific islanders erroneously associated the presence of war-related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers and wait for cargo to fall from the sky.

The question is, how amusing are we?

We have cargo cult science[1], cargo cult management[2], and cargo cult programming[3]. How about cargo cult system administration?

Here are some common system administration failures that might be ‘cargo cult’:

Failing to understand the difference between necessary and sufficient. A daily backup is necessary, but it may not be sufficient to meet RPO and RTO requirements.
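The necessary-versus-sufficient gap can be put in one line of arithmetic: the worst-case data loss from a backup schedule is the interval between backups. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def meets_rpo(backup_interval_hours, rpo_hours):
    """A backup schedule can satisfy a Recovery Point Objective only
    if the worst-case data loss (the interval between backups) fits
    inside it. Backups are necessary; a given schedule may still
    not be sufficient."""
    return backup_interval_hours <= rpo_hours

# A daily backup is necessary, but it cannot meet a 4-hour RPO:
# meets_rpo(24, 4) is False, while meets_rpo(1, 4) is True.
```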

Failing to understand the difference between causation and correlation.[4] Event A may have caused Event B, or some third event may have caused A and B, or the two events may be unrelated and coincidental.

Failing to understand the difference between cause and effect.

Following a security recipe without understanding the risks you are addressing.  If you don't understand how hackers infiltrate your systems and exfiltrate your data, then your DLP, Firewalls, IDS, SIEM, etc. are cargo cult. You've built the superficial exterior of a system without understanding the underlying substance. If you do understand how your systems get infiltrated, then you'll probably consider simple controls like database and file system permissions and auditing as important as expensive, complex packaged products.
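As a sketch of what a "simple control" can look like next to a packaged product, here's an illustrative check for world-writable files, one of the plain file-system permissions mentioned above (the function is invented for illustration, not from any product):

```python
import os
import stat

def world_writable(path):
    """True if any local user can modify this file -- the kind of
    basic file-system permission check that is easy to audit and
    understand, unlike the internals of an expensive appliance."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IWOTH)
```

Walk a directory tree with `os.walk` and feed each file through it, and you have a crude but fully understandable audit.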

Asserting that [Technology O] or [Platform L] or [Methodology A] is inherently superior to all others and blindly applying it to all problems. When you make such claims, are you applying science or religion?

Systematic troubleshooting is one of the hardest parts of system management and often the first to 'go cargo'. Here are some examples:

Treating symptoms, not causes. A reboot will not solve your problem. It may make the problem go away for a while, but your problem still exists. You've addressed the symptom of the problem (memory fragmentation, for example), not the cause of the problem (a memory leak, for example).

Troubleshooting without a working hypothesis.

Changing more than one thing at a time while troubleshooting. If you make six changes and the problem goes away, how will you determine the root cause? Or worse, which of the six changes will cause new problems at a future date?
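The discipline of changing one thing at a time is also what makes bisection possible. A sketch of the idea, assuming the changes can be applied as an ordered prefix and the problem reproduces deterministically (the names are invented for illustration):

```python
def first_fixing_change(changes, problem_present_after):
    """Binary-search an ordered list of changes for the single one
    that made the problem go away. problem_present_after(k) must
    re-run the test with only the first k changes applied."""
    lo, hi = 0, len(changes)
    # Invariant: problem still present after `lo` changes,
    # gone after `hi` changes.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if problem_present_after(mid):
            lo = mid
        else:
            hi = mid
    return changes[hi - 1]
```

This is the same idea behind tools like `git bisect`: a handful of controlled re-tests instead of guessing, and at the end you know the root cause instead of merely observing that six changes made the problem disappear.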

Making random changes while troubleshooting. Suppose you have a problem with an (application|operating system|database) and you hypothesize that changing a parameter will resolve the problem, so you change the parameter. If the problem recurs, your hypothesis was wrong, right?

Troubleshooting without measurements or data.

Troubleshooting without being able to recreate the problem.

Troubleshooting application performance without a benchmark to compare performance against. If you don’t know what’s normal, how do you know what’s not normal?
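A baseline doesn't have to be elaborate. A minimal sketch: record the median of a few timed runs while things are healthy, then compare against that number when someone says the application is slow (the 25% tolerance here is an arbitrary choice for illustration):

```python
import time
import statistics

def measure(op, runs=10):
    """Median of several timed runs of `op`; the median resists the
    occasional outlier that would skew a single measurement."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def is_abnormal(current, baseline, tolerance=0.25):
    """'Slow' only means something relative to a recorded baseline."""
    return current > baseline * (1 + tolerance)
```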

Blaming the (network|firewall|storage) without analysis or a hypothesis that points to either. One of our application vendors insisted that the 10 Mbps of traffic on a 100 Mbps interface was the cause of the slow application, and that we needed to upgrade to GigE. We upgraded it (overnight), just to shut them up. Of course it didn't help. Their app was broken.
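The arithmetic the vendor skipped is one line: 10 Mbps of traffic on a 100 Mbps interface is 10% utilization, nowhere near saturation. As a trivial sketch:

```python
def utilization_percent(traffic_mbps, capacity_mbps):
    """Offered traffic as a percentage of link capacity."""
    return traffic_mbps / capacity_mbps * 100

# The vendor's own numbers: 10 Mbps on a 100 Mbps link.
# utilization_percent(10, 100) -> 10.0, not a saturated link.
```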

Blaming the user or the customer without an analysis or hypothesis that points to them as the root cause. A better plan would be to actually find the problem and fix it.

Declaring that the problem is fixed without determining the root cause. If you don't know the root cause, but the problem appears to have gone away, you haven't solved the problem, you've only observed that the problem went away. Don't worry, it'll come back, just after you’ve written an e-mail to management describing how you’ve “fixed” the problem.

It's easy to fall into cargo cult mode.

Just reboot it, it'll be fine.


[1] Richard Feynman, Cargo Cult Science: http://www.lhup.edu/~DSIMANEK/cargocul.htm
[2] Mike Speiser, Cargo Cult Management: http://gigaom.com/2009/06/21/cargo-cult-management/
[3] Wikipedia, Cargo Cult Programming: http://en.wikipedia.org/wiki/Cargo_cult_programming
[4] L. Kip Wheeler, Correlation and Causation: http://cnweb.cn.edu/kwheeler/logic_causation.html

5 comments:

  1. This is a good entry, and it brings up good points. I specifically asked Tom Limoncelli if he thought that his new concept of Design Patterns for System Administrators might lead to Cargo Cult System Administration, when I interviewed him before the LISA conference.

    He said "I look forward to that kind of problem."

  2. I feel these are symptoms of a lack of professionalism, often as a result of being young, or young in the profession, or sometimes, I imagine (though I haven't seen it), laziness.

    I have a question too. What are your strategies for telling your boss that the problem's gone away, you don't know what it was, and it may come back? There are times when entropy strikes, and the problem doesn't come back again until months later. There are some legitimate reasons for not finding the root cause, mostly non-reproducibility. What does one do in that situation? What would you do?

  3. Bruno -

    I'd agree, both inexperience and laziness are contributors.

    I'll use language similar to what you just did, making sure that it is understood that the problem went away, but has not been resolved.

    I have some Oracle support incidents open for over a year, because of the infrequent occurrence of the bug and the difficulty of gathering data and troubleshooting.
