Skip to main content


Showing posts from 2009


The future: Cameras will be ubiquitous. Storage will be effectively infinite. CPU processing power will be effectively infinite. Cameras will detect a broad range of the electromagnetic spectrum. The combination of cameras everywhere and infinite storage will inevitably result in all persons being under surveillance all the time. When combined with infinite processor power and recognition software, it will be impossible for persons to move about society without being observed by their government.All governments eventually are corrupted and when corrupted will misuse the surveillance data. There is no particular reason to think that this is political party or left/right specific. Although it currently is fashionable to think that the right is evil and the left is good, there is no reason to think this will be the case in the future. The only certainty is that party in power will misuse the data to attempt to control their ‘enemies’, whomever they might perceive them be at the time. Sur…

Cargo Cult System Administration

Cargo Cult: …imitate the superficial exterior of a process or system without having any understanding of the underlying substance --Wikipedia During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.
The question is, how amusing are we?We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?Here’s some common system administration failures that might be ‘cargo cult’:
Failing to understand the difference between necessary and sufficient. A daily backup …

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:
“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures.  Very few do and I see related down-time in the news every month or so.....We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”
I've had high visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan something as simple as a manual intervention to disable features in a controlled manner rather than letting then fail in an uncontrolled manner.

On some applications we've been able to plan and execute…

Creative Server Installs - WAN Boot on Solaris (SPARC)

Sun's SPARC servers have the ability to boot a kernel and run an installer across a routed network using only HTTP or HTTPS. On SPARC platforms, the (BIOS|Firmware|Boot PROM) can download a bootable kernel and mini root file system via HTTP/HTTPS, boot from the mini root, and then download and install Solaris. This allows booting a server across a local or wide area network without having any bootable media attached to the chassis. All you need is a serial console, a network connection, an IP address, a default gateway and a web server that's accessible from the bare SPARC server. You set a few variables, then tell it to boot. Yep, it's cool.

From the Boot PROM prompt (the SPARC equivalent of the BIOS)
OK> setenv network-boot-arguments host-ip=client-IP,

OK> boot net -v install

Our base Solaris install is fairly small - on the order of a few hundred mega…

Pandemic Planning – The Dilbert Way

I normally don’t embed things in this blog, but this one is too good to pass up: Deciding who is important is interesting. Senior management wants to see a plan. Middle manager needs to decide who is important. If Middle Manager says only 8 of 20 are critical, what does that say about the other 12?  The only answer that most managers offer is ‘all my employees are critical to the enterprise’.I’m assuming that many or most readers have been a part of some sort of pandemic planning. In our EDU system, the plan isn’t interesting because of the criticality of anything that we do. In a major pandemic, deadlines can be extended, semester start and end dates can be changed, faculty can adapt. It’s interesting because of what our facilities can do. In the rural towns served by many of our colleges, the campus is the best connected building in town. In many cases, our college serves as the local or regional backbone connection point for T1’s from other state agencies, some of which have crit…

No, we are not running out of bandwidth.

The sky is falling! the sky is falling!

Actually, we’re running out of bandwidth (PDF) (again).

Supposedly all the workers who stay home during the pandemic will use up all the bandwidth in the neighborhood. Let me guess, instead of surfing p0rn and hanging out on Reddit all day at work, they’ll be surfing p0rn and hanging out on Reddit all day from home.

The meat of the study:
“Specifically, at the 40 percent absenteeism level, the study predicted that most users within residential neighborhoods would likely experience congestion when attempting to use the Internet” So what’s the problem?
“..under a cable architecture, 200 to 500 individual cable modems may be connected to a provider’s CMTS, depending on average usage in an area. Although each of these individual modems may be capable of receiving up to 7 or 8 megabits per second (Mbps) of incoming information, the CMTS can transmit a  maximum of only about 38 Mbps.” Ooops – someone is oversubscribed just a tad. At least now we know…

Maintenance, Downtime and Outages

Via Data Center Knowledge - Maintenance, Downtime and Outages a quote from Ken Brill of The Uptime Institute:"The No. 1 reason for catastrophic facility failure is lack of electrical maintenance,” Brill writes. “Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.”Of course the obvious solution is to have redundant power in the data center & perform the power maintenance on one leg of the power at a time. One of our leased spaces does that. The other of our leased spaces has cooling and power shutdowns often enough that we have a very well rehearsed shutdown & startup plan. The point is well taken though. If you don’t do routine maintenance, you can complain about …

Data Loss, Speculation & Rumors

My head wants to explode. Not because of Microsoft's catastrophic data loss. That's a major failure that should precipitate a significant numbers of RGE's at Microsoft.My head wants to explode because of the speculation, rumors and mis-information surrounding the failure. An example: rumors reported by Daniel Eran Dilger in Microsoft’s Sidekick/Pink problems blamed on dogfooding and sabotage that point to intentional data loss or sabotage."Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure."and"the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a dis…

A Zero Error Policy – Not Just for Backups

In What is a Zero Error Policy, Preston de Guise articulates the need for aggressive follow up and resolution on all backup related errors. It’s a great read.Having a zero error policy requires the following three rules:All errors shall be known. All errors shall be resolved. No error shall be allowed to continue to occur indefinitely. andI personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.I agree. This is a great summary of an important philosophy. Don’t apply this just to backups though. It doesn’t matter what the system is, if you ignore the little warning signs, you’ll eventually end up with a major failure. In system administration, networks and databases, there is no such thing as a ‘transient’ or ‘routine’ error, and ignoring them will not make them go away. Instead, the minor alerts, errors and events will re-occur as critical events at…

Content vs. Style - modern document editing

On ars technica,  Jeremy Reimer writes great thoughts on how we use word processing.

His description of modern document editing:

Go into any office today and you'll find people using Word to write documents. Some people still print them out and file them in big metal cabinets to be lost forever, but again this is simply an old habit, like a phantom itch on a severed limb. Instead of printing them, most people will email them to their boss or another coworker, who is then expected to download the email attachment and edit the document, then return it to them in the same manner. At some point the document is considered "finished", at which point it gets dropped off on a network share somewhere and is then summarily forgotten... We use an application that was optimized to format printed documents in a world where printing is irrelevant, and our ‘document versioning’ is managed by the timestamps on the e-mail messages that we used to ‘collaborate’ on writing the document. Wh…

Infrastructure – Security and Patching

An MRI machine hosting Confliker:“The manufacturer of the devices told them none of the machines were supposed to be connected to the Internet and yet they were […] the device manufacturer said rules from the U.S. Food and Drug Administration required that a 90-day notice be given before the machines could be patched.”Finding an unexpected open firewall hole or a a device that isn’t supposed to be on the Internet is nothing new or unusual. If someone asked “what’s the probability that a firewall has too many holes” or “how likely is it that something got attached to the network that wasn’t supposed to be”, in both cases I’d say the probability is one.Patching a machine that can’t be patched for 90 days after the patch is released is a pain. It’s an exception, and exceptions cost time an money. Patching a machine that isn’t supposed to be connected to the Internet is a pain. I’m assuming that one would need to build a separate ‘dark net’ for the machines. I can’t imagine walking around…

Off Topic: Stadium Construction Resumes (Update)

UPDATE: The Italian authorities have responded to the exposure of the corruption surrounding the construction of the stadium. The Minister of Stadiums has investigated the mismanagement and corruption, issued a report, and taken corrective action. The corrupt and incompetent officials have been identified, prosecuted and punished. Additionally, the authorities have agreed to resolve other outstanding issues that have prevented the completion of the project. PunishmentIn an unusual turn of events, the authorities have issued severe consequences for the perpetrators of the scandal. Here you can see the  gruesome results of this most effective punishment. The statues of the wives of the corrupt officials have been decapitated (NSFW).Shocking though it may be, this extreme punishment, reserved exclusively for the most severe offenses, has a long tradition in Italian culture. Over the millennia, many statues of spouses of famous officials have been similarly mutilated and placed for exhibi…

Off Topic: Stadium Construction Scandal

In the center of Rome are the remnants of large stadium. Tradition tells that the stadium was completed during the period of the Roman Empire and allowed to decay, unmaintained, during the centuries since construction.This photograph, taken with a special filter and timed exactly as the planets Venus and Mars intersected a polyline bounded by the vertices of the tops of the arches and extending in to space, clearly shows that we have been mislead about the true origins of the facility. Careful analysis of the photo shows that the stadium is not a decaying ghost of a once great stadium, but rather it is a uncompleted, scandal ridden construction project gone bad.By normal Italian standards, construction projects of this nature typically take decades. Corruption and mis-management are assumed, delays are inevitable.  But even by those standards, after nearly two millennia the stadium should have been completed. To give a comparative example, the construction on the shopping center in th… – A Crash Course in Failure

One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.

In A Crash Course in Failure, Craig Stuntz discuses the concept of building crash only software – or software for which a crash and a normal shutdown are functionally equivalent.

“Hardware will fail. Software will crash. Those are facts of life.”
"…if you believe yo…

Error Handling – an Anecdote

A long time ago, shortly after the University I was attending migrated students off of punch cards, I had an assignment to write a batch based hotel room reservation program. We were on top of the world - we had dumb terminals instead of punch cards. The 9600 baud terminals were reserved for professors, but if you got lucky, [WooHoo!] you could get one of the 4800 baud terminals instead of a 2400 or 1200 baud DECwriters. The instructors mantra - I'll never forget - is that students need to learn how to write programs that gracefully handle errors. 'You don't want an operator calling at 2am telling you your program failed. That sucks.' He was a part time instructor and full time programmer who got tired of getting woke up, and he figured that we needed our sleep, so he made robustness part of his grading criteria.Here's how he made that stick in my mind for 30 years:  When the assignment was handed to us, the instructor gave us the location of sample input data file…

Sometimes Hardware is Cheaper than Programmers

In Hardware is Expensive, Programmers are Cheap II I promised that I’d give an example of a case where hardware is cheap compared to designing and building a more efficient application. That post pointed out a case where a relatively small investment in program optimization would have paid itself back by dramatic hardware savings across a small number of the software vendors customers.Here’s an example of the opposite.Circa 2000/2001 we started hosting an ASP application running on x86 app servers with a SQL server backend. The hardware was roughly 1Ghz/1GB per app server. Web page response time was a consistent 2000ms. Each app server could handle no more than a handful of page views per second.By 2004 or so, application utilization grew enough that the page response time and the scalability (page views per server per second) were both considered unacceptable. We did a significant amount of investigation into the application, focusing first on the database, and then on the app server…

Cisco IOS hints and tricks: What went wrong: end-to-end ATM

I enjoy reading Ivan Pepelnjak's Cisco IOS hints and tricks blog. Having been a partner in a state wide ATM wide area network that implemented end to end RSVP, his thoughts on What went wrong: end-to-end ATM are interesting.

I can' figure out how to leave a comment on his blog though, so I'll comment here:
I'd add a couple more reasons for ATM's failure.

(1) Cost. Host adapters, switches and router interfaces were more expensive. ATM adapters used more CPU, so larger routers were needed for a given bandwidth.

(2) Complexity, especially on the LAN side. (On a WAN, ATM isn't necessarily more complex than MPLS for a given functionality. It might even be simpler).

(3) 'Good enough' QOS on ethernet and IP routing. Inferior to ATM? Yes. Good enough? Considering the cost and complexity of ATM, yes.

Ironically, core IP routers maintain a form of session state anyway (CEF). On an ATM wide are a network, H.323 video endpoints would connect to a gatekeeper and re…

No, I Don’t Want iTunes Installed. You can quit asking.

I don’t like software vendors that try to sneak software onto my computers. I really don’t like software vendors that don’t pay attention to my requests to not run in the background at startup. This evening I came home and saw the Apple Software Update popped up on my Vista desktop:Problem one: iTunes is check marked by default. I don’t want iTunes. I don’t need iTunes. And I don’t like having software vendors try to sneak  software onto my computers.  This isn’t unique to Vista. Apple does the same thing on OS X. It’s annoying enough that I’ll probably uninstall Quicktime and throw away the $29 that I paid for it.Problem two: I specifically instructed Apple’s Quicktime to not automatically update, and I specifically have disabled the Quicktime service from running at startup, but somehow it ran anyway.I’ve also checked the Software Explorer in Windows Defender and the ‘Run’ registry keys for Apple related startup programs & didn’t find any. I’d sure like to know what’s triggering…

Availability & SLA’s – Simple Rules

From theDailyWtf, a story about availability & SLA’s that’s worth a read about an impossible availability/SLA conundrum. It’s a good lead in to a couple of my rules of thumb. “If you add a nine to the availability requirement, you’ll add a zero to the price.”In other words, to go from 99.9% to 99.99% (adding a nine to the availability requirement), you’ll increase the cost of the project by a factor of 10 (adding a zero to the cost).There is a certain symmetry to this. Assume that it’ll cost 20,000 to build the system to support three nines, then:99.9 = 20,000
99.99 = 200,000
99.999 = 2,000,000
The other rule of thumb that this brings up is Each technology in the stack must be designed for one nine more than the overall system availability.This one is simple in concept. If the whole system must have three nines, then each technology in the stack (DNS, WAN, firewalls, load balancers, switches, routers, servers, databases, storage, power, cooling, etc.) must be designed for four nines.…

NAC or Rootkit - How Would I know?

I show up for a meeting, flip open my netbook and start looking around for a wireless connection. The meeting host suggests an SSID. I attach to the network and get directed to a captive portal with an ‘I agree’ button. I press the  magic button an get a security warning dialogue. It looks like the network is NAC’d. You can’t tell that from the dialogue though. ‘Impluse Point LLC’ could be a NAC vendor or a malware vendor. How would I know? If I were running a rouge access point and wanted to install a root kit, what would it take to get people to run the installer?  Probably not much. We encourage users  to ignore security warnings.Anyway – it was amusing. After I switched to my admin account and installed the ‘root kit’ service agent and switched back to my normal user, I got blocked anyway. I’m running Windows 7 RC without anti-virus. I guess NAC did what it was supposed to do. It kept my anti-virus free computer off the network.I’d like someone to build a shim that fakes NAC into …

Consulting Fail, or How to Get Removed from my Address Book

Here’s some things that consultants do that annoy me. Some consultants brag about who is backing their company or whom they claim as their customers. I’ve never figured that rich people are any smarter than poor people so I’m not impressed by consultants who brag about who is backing them or who founded their company. Recent ponzi and hedge fund implosions confirm my thinking. And it seems like the really smart people who invented technology 1.0 and made a billion are not reliably repeating their success with technology 2.0. It happens, but not predictably, so mentioning that [insert famous web 1.0 person here] founded or is backing your company is a waste of a slide IMHO. I’m also not impressed by consultants who list [insert Fortune 500 here] as their clients. Perhaps [insert Fortune 500 here] has a world class IT operation and the consultant was instrumental in making them world class. Perhaps not. I have no way of knowing. It’s possible that some tiny corner of [insert Fortune 500…

Your Application is a Rotting Old Shack, Now What?

In response to A Shack in the Woods, Crumbling at the Core, colleague Jim Graves commented:
“…it only works if application owners are like long-term homeowners, not house flippers.”Good point. Who cares if the shack gets a cheap paint job instead of a foundation and a comprehensive re-modeling? Will the business owner know or care? Do the contractors you hired care? Are you going to be around long enough to care? Are you and your employees, managers and consultants acting as house job flippers, painting over the flaws so you can update your resumes, take the profits and move on?
Jim asks:
“Are long-term employees more likely to care about problems that may happen five years from now? Are Highly Paid Consultants much less likely to?” Good question. Suppose that I want to fix the shack. Maybe I’m tired of having to empty the buckets that catch the drips from the roof (or restart the J2EE app that runs itself out of database connections a couple times a week). If this repair is to be any…

Resume Driven Design

Sam Buchanan, a long time colleague, commenting on a consultants design for a small web application:“I'm telling you: this app reeks of resume-driven design” In ‘Your Application is a Rotting old Shack’ I whined mused about applications that get face lifts while core problems get ignored. Let’s assume for a moment that business units finally figure out that their apps have a crumbling foundation and need structural overhauls. Assuming that internal resources don’t exist, how do we know that the consultants and contractors that we hire to design and build our systems aren’t more interested in building their resumes than our applications?I’d like to think that I would be able to tell if a consultant tried to recommend an architecture or design that exists more to pad their resume than solve my problems. It’s probably not that straight forward though. Consultants have motivations that may intersect with your needs, or they may have motivations that significantly deviate from what you…

A Mime in a Box

Picking up on a thread by Andy the IT Guy, Which of these things is not like the other?A developer who doesn’t understand the databases, networks or firewalls. A system manager or DBA who doesn’t understand applications, networks and firewalls. A firewall or network administrator who doesn’t understand operating systems and applications. A mime in a box. Trick question. They’re the same. The mime’s box is imaginary, as are the cross disciplinary restrictions that we place on developers, system and network administrators. In the example from Andy’s post, the developer didn’t understand the difference between an app installed on a desktop and an app installed on a server. Similarly, non-network people often don’t understand the critical difference between source and destination when an app server connects to a database. For example, I often see this diagram:showing an application updating a database, when from a network point of view, what we really need to see is:showing the applicatio…