Tuesday, November 24, 2009

Cargo Cult System Administration

“imitate the superficial exterior of a process or system without having any understanding of the underlying substance” --Wikipedia

During and after WWII, some South Pacific islanders erroneously associated the presence of war-related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of the technology itself caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers and wait for cargo to fall from the sky.

The question is, how amusing are we?

We have cargo cult science[1], cargo cult management[2], and cargo cult programming[3]. How about cargo cult system administration?

Here are some common system administration failures that might be ‘cargo cult’:

Failing to understand the difference between necessary and sufficient. A daily backup is necessary, but it may not be sufficient to meet RPO and RTO requirements.

Failing to understand the difference between causation and correlation.[4] Event A may have caused Event B, or some third event may have caused A and B, or the two events may be unrelated and coincidental.

Failing to understand the difference between cause and effect.

Following a security recipe without understanding the risks you are addressing. If you don't understand how hackers infiltrate your systems and exfiltrate your data, then your DLP, firewalls, IDS, SIEM, etc. are cargo cult. You've built the superficial exterior of a system without understanding the underlying substance. If you do understand how your systems get infiltrated, then you'll probably consider simple controls like database and file system permissions and auditing as important as expensive, complex packaged products.

Asserting that [Technology O] or [Platform L] or [Methodology A] is inherently superior to all others and blindly applying it to all problems. When you make such claims, are you applying science or religion?

Systematic troubleshooting is one of the hardest parts of system management and often the first to 'go cargo'. Here are some examples:

Treating symptoms, not causes. A reboot will not solve your problem. It may make the problem go away for a while, but your problem still exists. You've addressed the symptom of the problem (memory fragmentation, for example), not the cause of the problem (a memory leak, for example).

Troubleshooting without a working hypothesis.

Changing more than one thing at a time while troubleshooting. If you make six changes and the problem goes away, how will you determine the root cause? Or worse, which of the six changes will cause new problems at a future date?

Making random changes while troubleshooting. Suppose you have a problem with an (application|operating system|database) and you hypothesize that changing a parameter will resolve the problem, so you change the parameter. If the problem recurs, your hypothesis was wrong, right?

Troubleshooting without measurements or data.

Troubleshooting without being able to recreate the problem.

Troubleshooting application performance without a benchmark to compare performance against. If you don’t know what’s normal, how do you know what’s not normal?

Blaming the (network|firewall|storage) without analysis or a hypothesis that points to either. One of our application vendors insisted that the 10 Mbps of traffic on a 100 Mbps interface was the cause of the slow application, and that we needed to upgrade to GigE. We upgraded it (overnight), just to shut them up. Of course it didn't help. Their app was broken.

Blaming the user or the customer, without an analysis or hypothesis that points to them as the root cause. A better plan would be to actually find the problem and fix it.

Declaring that the problem is fixed without determining the root cause. If you don't know the root cause, but the problem appears to have gone away, you haven't solved the problem, you've only observed that the problem went away. Don't worry, it'll come back, just after you’ve written an e-mail to management describing how you’ve “fixed” the problem.

It's easy to fall into cargo cult mode.

Just re-boot it, it'll be fine.


[1] Richard Feynman, Cargo Cult Science: http://www.lhup.edu/~DSIMANEK/cargocul.htm
[2] Mike Speiser, Cargo Cult Management: http://gigaom.com/2009/06/21/cargo-cult-management/
[3] Wikipedia, Cargo Cult Programming: http://en.wikipedia.org/wiki/Cargo_cult_programming
[4] L. Kip Wheeler, Correlation and Causation: http://cnweb.cn.edu/kwheeler/logic_causation.html

Saturday, November 14, 2009

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:
“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures. Very few do and I see related down-time in the news every month or so. ... We want all systems to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”

I've had high-visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan is something as simple as a manual intervention that disables features in a controlled manner rather than letting them fail in an uncontrolled manner.

On some applications we've been able to plan and execute graceful service degradation by disabling non-critical features. In one case, we disabled a scheduling widget in order to maintain sufficient headroom for more important functions like quizzing and exams. In other cases, we have the ability to limit the size of shopping carts or restrict financial aid and grade re-calcs during peak load.
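For the simplest cases, the kill switch doesn't need to be anything fancier than a flag file that the application (or a wrapper) checks before serving a non-critical feature. A minimal sketch; the directory and feature names are hypothetical, not our actual setup:

```shell
# Degraded-mode kill switch sketch. Paths and feature names are
# illustrative; a real deployment would use whatever flag store the
# application can already read.
DEGRADE_DIR=/tmp/degrade
mkdir -p "$DEGRADE_DIR"

# Shed load: disable the scheduling widget ahead of an exam peak.
touch "$DEGRADE_DIR/scheduling_widget.disabled"

# The application (or a wrapper script) checks the flag before
# rendering the feature.
if [ -f "$DEGRADE_DIR/scheduling_widget.disabled" ]; then
    echo "scheduling widget: off (degraded mode)"
else
    echo "scheduling widget: on"
fi

# Re-enable the feature once the peak has passed.
rm -f "$DEGRADE_DIR/scheduling_widget.disabled"
```

The mechanism isn't the point; the point is that the feature gets turned off deliberately, in a controlled manner, instead of failing on its own.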

Degraded operations isn't just an application layer concept. Network engineers routinely build forms of degraded operations into their designs. Networks have been congested since the day they were invented, and as you'd expect, the technology available for handling degraded operations is very mature. On a typical network, QOS (Quality of Service) policy and configuration is used to maintain critical network traffic and shed non-critical traffic.

As an example, on our shared statewide backbone, we assume that we'll periodically end up in some sort of degraded mode, either because a primary circuit has failed and the backup paths don't have adequate bandwidth, because we experience inbound DOS attacks, or perhaps because we simply don't have adequate bandwidth. In our case, the backbone is shared by all state agencies, public colleges and universities, including state and local law enforcement, so inter-agency collaboration is necessary when determining what needs to get routed during a degraded state.

A simplified version of the traffic priority on the backbone is:

Highest Priority:  Router Traffic (BGP, OSPF, etc.)
                   Law Enforcement
                   Voice
                   Interactive Video
                   Intra-State Data
Lowest Priority:   Internet Data

When the network is degraded, we presume that law enforcement traffic should be near the head of the queue. We consider interactive video conferencing to be business critical (i.e. we have to cancel classes when interactive classroom video conferencing is broken), so we keep it higher in the priority order than ordinary data. We have also decided that commodity Internet traffic should be the first to be discarded when the network is degraded.
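On Linux, the same idea can be sketched with tc(8). The fragment below only generates the commands into a file for review rather than applying them (they would need root and a real interface); the interface name and TOS value are illustrative, not our production configuration:

```shell
# Generate a reviewable QOS sketch: three-band priority queueing, with
# EF-marked (voice) traffic steered into the highest band.
IF=eth0  # illustrative interface name

cat > /tmp/qos-sketch.sh <<EOF
# Three-band priority qdisc on $IF.
tc qdisc add dev $IF root handle 1: prio bands 3
# Steer EF-marked packets (TOS byte 0xb8) into the top band.
tc filter add dev $IF parent 1: protocol ip prio 1 u32 match ip tos 0xb8 0xff flowid 1:1
# Unmatched traffic falls through to the lower bands via the priomap.
EOF

cat /tmp/qos-sketch.sh
```

A production backbone policy is far richer than this (per-class bandwidth guarantees, policing, marking at the edge), but the shape is the same: classify, then prioritize or shed.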

Unfortunately, in the part of the application stack that's hardest to scale, the database, there is no equivalent to network QOS or traffic engineering. As far as I know, I don't have the ability to tag a query or stored procedure with a few extra bits that tell the database engine to place the query at the head of the work queue, discarding other less important work if necessary. It's not hard to imagine a 'discard eligible' bit that could be set on certain types of database processes or on work submitted by certain clients. The database, if necessary, would discard that work, or place the work in a 'best effort' scheduling class and run it if and when it has free CPU cycles.
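What you can do today is push a whole batch job, rather than an individual query, into a best-effort class at the operating system level. A sketch using POSIX nice(1); the "job" here is a stand-in echo, and the ionice(1) variant in the comment is Linux-specific:

```shell
# Run a low-priority batch job at the lowest CPU scheduling priority.
# The job itself is a stand-in; substitute the real recalc script.
nice -n 19 sh -c 'echo "grade recalc: running best-effort"' > /tmp/recalc.log
cat /tmp/recalc.log

# On Linux, ionice can do the same for disk I/O (idle class):
#   ionice -c3 nice -n 19 /opt/batch/grade_recalc.sh
```

It's a blunt instrument compared to a per-query 'discard eligible' bit, since it only deprioritizes work you can isolate into its own process, and it does nothing about lock contention inside the database engine.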

If the engineers at the major database vendors would Google 'Weighted Fair Queuing' or 'Weighted Random Early Detect' we might someday see interesting new ways of managing degraded databases.

Creative Server Installs - WAN Boot on Solaris (SPARC)

Sun's SPARC servers have the ability to boot a kernel and run an installer across a routed network using only HTTP or HTTPS. On SPARC platforms, the (BIOS|Firmware|Boot PROM) can download a bootable kernel and mini root file system via HTTP/HTTPS, boot from the mini root, and then download and install Solaris. This allows booting a server across a local or wide area network without having any bootable media attached to the chassis. All you need is a serial console, a network connection, an IP address, a default gateway and a web server that's accessible from the bare SPARC server. You set a few variables, then tell it to boot. Yep, it's cool.

From the Boot PROM prompt (the SPARC equivalent of the BIOS)
OK> setenv network-boot-arguments host-ip=client-IP,
router-ip=router-ip,subnet-mask=mask-value,
hostname=client-name,http-proxy=proxy-ip:port,
file=wanbootCGI-URL

OK> boot net -v install

Our base Solaris install is fairly small - on the order of a few hundred megabytes - so booting across a WAN through a proxy or an SSH tunnel works pretty well. We usually build a temporary SSH tunnel from our management infrastructure out to another server in the same security container and point the new server at the tunnel end point.
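The tunnel itself is ordinary SSH port forwarding. A sketch, echoed rather than executed so nothing actually connects; all host names here are hypothetical:

```shell
# Run on a helper host in the same security container as the new server.
# The management host can reach the wanboot web server; all names are
# hypothetical.
MGMT=mgmt.example.edu
WEB=installserver.example.edu

# Listen on helper:8080 and forward, via the management host, to the
# wanboot web server (-f: background, -N: no remote command).
TUNNEL_CMD="ssh -f -N -L 8080:${WEB}:80 ${MGMT}"
echo "$TUNNEL_CMD" > /tmp/tunnel-cmd.txt
cat /tmp/tunnel-cmd.txt

# The bare SPARC client's file= argument then points at the tunnel
# endpoint, e.g.:
echo "file=http://helper-host:8080/cgi-bin/wanboot-cgi"
```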

PXE is an attempt to provide similar functionality, but it has a dependency on having DHCP available on the deployed subnet, something I absolutely do not want to enable on non-desktop networks, and it's based on UDP, which makes it slightly less suitable for booting across WANs where packet loss might be an issue. In any case, we've had enough issues with network boots on x86/x64 platforms that we've pretty much defaulted to using bootable USB sticks or CDs/DVDs for remote installs. That makes an x86/x64 deploy significantly more work, as we have to arrange for bootable media to be delivered on site, or we need to leave bootable media installed in production servers.

Linux has 'BKO' (boot.kernel.org), but as far as I can tell, it's still dependent on having either bootable media or PXE.

SPARC's WAN boot is pretty slick, but not as slick as Cisco's AutoInstall. AutoInstall allows you to drop-ship an unconfigured router to a remote site. The router will learn its IP address from its upstream router via either SLARP or BOOTP, automatically download a configuration file, and reboot with a valid configuration.

A couple of closing thoughts:
  • If the SPARC platform ever goes away, I'll miss it.
  • If router engineers ever decide to build application servers, they'd probably come up with radically new ways of solving old problems. 

Tuesday, November 3, 2009

Pandemic Planning – The Dilbert Way

I normally don’t embed things in this blog, but this one is too good to pass up:

[Embedded comic from Dilbert.com]

Deciding who is important is interesting.

Senior management wants to see a plan. The middle manager needs to decide who is important. If the middle manager says only 8 of 20 are critical, what does that say about the other 12? The only answer that most managers offer is ‘all my employees are critical to the enterprise’.

I’m assuming that many or most readers have been a part of some sort of pandemic planning. In our EDU system, the plan isn’t interesting because of the criticality of anything that we do. In a major pandemic, deadlines can be extended, semester start and end dates can be changed, faculty can adapt. It’s interesting because of what our facilities can do. In the rural towns served by many of our colleges, the campus is the best-connected building in town. In many cases, our college serves as the local or regional backbone connection point for T1s from other state agencies, some of which have critical public health, safety or law enforcement roles. I suspect some of those agencies are more important than an exam, lecture or quiz. It’s possible that for us, the critical resources in a pandemic might not have anything to do with education. HVAC, power, and routers might be the top priority.

Then there’s payroll. You’ve got to keep that going no matter what. Sick employees don’t have the energy to mess with bounced checks and overdrawn accounts.