

Showing posts from November, 2009

Cargo Cult System Administration

“imitate the superficial exterior of a process or system without having any understanding of the underlying substance” --Wikipedia

During and after WWII, some native South Pacific islanders erroneously associated the presence of war-related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers and wait for cargo to fall from the sky.

The question is, how amusing are we?

We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?

Here are some common system administration failures that might be ‘cargo cult’:

Failing to understand the difference between necessary and sufficient. A daily backup is necessary, b…

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:
“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures.  Very few do and I see related down-time in the news every month or so.....We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”
I've had high visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan is something as simple as a manual intervention to disable features in a controlled manner rather than letting them fail in an uncontrolled manner.
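As a rough illustration of what "disable features in a controlled manner" could look like, here is a minimal Python sketch. Everything in it is hypothetical and not taken from any of our applications: the `Feature` and `DegradePlan` names, the priority numbers, and the example feature list are assumptions made up for the example. The idea is only that each feature carries a priority, and an operator (or a monitor) picks a level and sheds everything less essential than that level, least essential first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    name: str
    priority: int          # hypothetical scale: 0 = essential, larger = more expendable
    enabled: bool = True


@dataclass
class DegradePlan:
    features: List[Feature] = field(default_factory=list)

    def shed(self, level: int) -> List[str]:
        """Disable every enabled feature whose priority is greater than `level`.

        Features are switched off least-essential first. Returns the names
        that were newly disabled, in the order they were turned off.
        """
        disabled = []
        for feature in sorted(self.features, key=lambda f: f.priority, reverse=True):
            if feature.priority > level and feature.enabled:
                feature.enabled = False
                disabled.append(feature.name)
        return disabled

    def restore(self) -> None:
        """Re-enable everything once the load or failure has passed."""
        for feature in self.features:
            feature.enabled = True

    def active(self) -> List[str]:
        """Names of the features still being served."""
        return [f.name for f in self.features if f.enabled]


def example_plan() -> DegradePlan:
    # Made-up feature list for illustration only.
    return DegradePlan([
        Feature("login", 0),
        Feature("search", 1),
        Feature("recommendations", 2),
        Feature("reports", 3),
    ])
```

Calling `example_plan().shed(1)` would turn off `reports` and then `recommendations` while leaving `login` and `search` running. The point of writing the order down ahead of time is that the decision about what to sacrifice is made calmly, before the ugly mess, rather than during it.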

On some applications we've been able to plan and execute…

Creative Server Installs - WAN Boot on Solaris (SPARC)

Sun's SPARC servers have the ability to boot a kernel and run an installer across a routed network using only HTTP or HTTPS. On SPARC platforms, the (BIOS|Firmware|Boot PROM) can download a bootable kernel and mini root file system via HTTP/HTTPS, boot from the mini root, and then download and install Solaris. This allows booting a server across a local or wide area network without having any bootable media attached to the chassis. All you need is a serial console, a network connection, an IP address, a default gateway and a web server that's accessible from the bare SPARC server. You set a few variables, then tell it to boot. Yep, it's cool.

From the Boot PROM prompt (the SPARC equivalent of the BIOS)
OK> setenv network-boot-arguments host-ip=client-IP,

OK> boot net -v install

Our base Solaris install is fairly small - on the order of a few hundred mega…

Pandemic Planning – The Dilbert Way

I normally don’t embed things in this blog, but this one is too good to pass up: Deciding who is important is interesting. Senior management wants to see a plan. The middle manager needs to decide who is important. If the middle manager says only 8 of 20 are critical, what does that say about the other 12? The only answer that most managers offer is ‘all my employees are critical to the enterprise’.

I’m assuming that many or most readers have been a part of some sort of pandemic planning. In our EDU system, the plan isn’t interesting because of the criticality of anything that we do. In a major pandemic, deadlines can be extended, semester start and end dates can be changed, faculty can adapt. It’s interesting because of what our facilities can do. In the rural towns served by many of our colleges, the campus is the best connected building in town. In many cases, our college serves as the local or regional backbone connection point for T1s from other state agencies, some of which have crit…