Skip to main content

Cargo Cult System Administration

“imitate the superficial exterior of a process or system without having any understanding of the underlying substance” --Wikipedia

cargo_cult During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.

The question is, how amusing are we?

We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?

Here’s some common system administration failures that might be ‘cargo cult’:

Failing to understand the difference between necessary and sufficient. A daily backup is necessary, but it may not be sufficient to meet RPO and RTO requirements.

Failing to understand the difference between causation and correlation.[4] Event A may have caused Event B, or some third event may have caused A and B, or the two events may be unrelated and coincidental.

Failing to understand the difference between cause and effect.

Following a security recipe without understanding the risks you are addressing.  If you don't understand how hackers infiltrate your systems and ex-filtrate your data, then your DLP, Firewalls, IDS, SEIM, etc. are cargo cult. You've built the superficial exterior of a system without understanding the underlying substance. If you do understand how your systems get infiltrated, then you'll probably consider simple controls like database and file system permissions and auditing as important as expensive, complex packaged products.

Asserting that [Technology O] or [Platform L] or [Methodology A] is inherently superior to all others and blindly applying it to all problems. When you make such claims, are you applying science or religion?

Systematic troubleshooting is one of the hardest parts of system management and often the first to 'go cargo'. Here’s some examples:

Treating symptoms, not causes. A reboot will not solve your problem. It may make the problem go away for a while, but your problem still exists. You've addressed the symptom of the problem (memory fragmentation, for example), not the cause of the problem (a memory leak, for example).

Troubleshooting without a working hypothesis.

Changing more than one thing at a time while troubleshooting. If you make six changes and the problem went away, how will you determine root cause? Or worse, which of the six changes will cause new problems at a future date?

Making random changes while troubleshooting. Suppose you have a problem with an (application|operating system|database) and you hypothesize that changing a parameter will resolve the problem, so you change the parameter. If the problem reoccurs your hypothesis was wrong, right?

Troubleshooting without measurements or data.

Troubleshooting without being able to recreate the problem.

Troubleshooting application performance without a benchmark to compare performance against. If you don’t know what’s normal, how do you know what’s not normal?

Blaming the (network|firewall|storage) without analysis or hypothesis that points to either. One of our application vendors insisted that the 10mbps of traffic on a 100mbps interface was the cause of the slow application, and we needed to upgrade to GigE. We upgraded it (overnight), just to shut them up. Of course it didn't help. Their app was broke.

Blaming the user or the customer, without an analysis or hypothesis that points to them as the root cause. A better plan would be actually find the problem and fix it.

Declaring that the problem is fixed without determining the root cause. If you don't know the root cause, but the problem appears to have gone away, you haven't solved the problem, you've only observed that the problem went away. Don't worry, it'll come back, just after you’ve written an e-mail to management describing how you’ve “fixed” the problem.

It's easy to fall into cargo cult mode.

Just re-boot it, it'll be fine.


[1] Richard Fenymen, CARGO CULT SCIENCE: http://www.lhup.edu/~DSIMANEK/cargocul.htm
[2]
Mike Speiser, Cargo Cult Managment: http://gigaom.com/2009/06/21/cargo-cult-management/
[3] Wikipedia: Cargo Cult Programming: http://en.wikipedia.org/wiki/Cargo_cult_programming
[4] L. KIP WHEELER, Correlation and Causation: http://cnweb.cn.edu/kwheeler/logic_causation.html

Comments

  1. This is a good entry, and it brings up good points. I specifically asked Tom Limoncelli if he thought that his new concept of Design Patterns for System Administrators might lead to Cargo Cult System Administration, when I interviewed him before the LISA conference.

    He said "I look forward to that kind of problem."

    ReplyDelete
  2. I feel these are symptoms of a lack of professionalism. Often as a result of being young, or young in the profession, or sometimes I imagine, I haven't seen it, is laziness.

    I have a question too. What are your strategies for telling your boss that,the problems gone away, you don't know what it was and it may come back. There are times when entropy strikes, and the problem doesn't come back again until months later. There are some legitimate reasons for not finding the root cause, mostly non-reproducibility. What does one do in that situation, what would you do?

    ReplyDelete
  3. Bruno -

    I'd agree, both inexperience and laziness are contributors.

    I'll use language similar to what you just did, making sure that it is understood that the problem went away, but has not been resolved.

    I have some Oracle support incidents open for over a year, because of the infrequent occurrence of the bug and the difficulty of gathering data and troubleshooting.

    ReplyDelete

Post a Comment

Popular posts from this blog

Ad-Hoc Verses Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to …

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:
In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. The…