Last In - First Out: 2009

Surveillance

The future:

Cameras will be ubiquitous. Storage will be effectively infinite. CPU processing power will be effectively infinite. Cameras will detect a broad range of the electromagnetic spectrum. The combination of cameras everywhere and infinite storage will inevitably result in all persons being under surveillance all the time. When combined with infinite processor power and recognition software, it will be impossible for persons to move about society without being observed by their government.

All governments eventually are corrupted and when corrupted will misuse the surveillance data. There is no particular reason to think that this is political party or left/right specific. Although it currently is fashionable to think that the right is evil and the left is good, there is no reason to think this will be the case in the future. The only certainty is that the party in power will misuse the data to attempt to control their ‘enemies’, whoever they perceive them to be at the time.

Surveillance advocates claim that cameras are simply an extension of law enforcement's eyes and therefore are not a significant new impingement on personal freedom.

I disagree.

Here’s how I’d build a surveillance system that allows the use of technology to maximize law enforcement effectiveness yet provide reasonable controls on the use of the surveillance against the population as a whole.

The cameras are directly connected to a control room of some sort. The control room is monitored by sworn, trained law enforcement officers. The officers watch the monitors.

So far this is ordinary surveillance. Here’s how I’d protect individual privacy:

The locations of the cameras are well known.
All cameras record to volatile memory only. The capacity of the volatile memory is small, on the order of one hour or so. Unless specific action is taken, all recorded data more than one hour old is automatically and irretrievably lost. A ring buffer of some sort.
If a sworn officer sees a crime, the sworn officer may switch specific cameras to non-volatile storage. The action to switch a camera from volatile to non-volatile storage is deliberate and only taken when an officer sees specific events that constitute probable cause that a crime is being committed, or when a crime has been reported to law enforcement. Each instance of the use of non-volatile storage is recorded, documented and discoverable by the general public using some well defined process.
Once a camera is switched to non-volatile storage, it automatically reverts to volatile storage after a fixed time period (one hour, for example), unless the sworn officer repeatedly toggles the non-volatile switch on the camera.
The non-volatile storage automatically expires after a fixed amount of time (24 hours, for example). If law enforcement believes that a crime has occurred and that the video will be evidence in the crime, law enforcement obtains a court order to retain the video evidence and move it to permanent storage. The court order must be for a specific crime and must name specific cameras and times.
When a court so orders, the video is moved from non-volatile storage to whatever method law enforcement uses for retaining and managing evidence. If the court order is not obtained within the non-volatile expiration period, the video is irretrievably deleted. If the court order is obtained, the video becomes subject to whatever rules govern evidence in the legal jurisdiction of the cameras.
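The retention rules above can be sketched in a few lines. This is only an illustration under my own assumptions: the `Camera` class, its method names and the two constants are hypothetical, timestamps are plain seconds, and a real system would enforce the expiry in sealed hardware rather than in application code.

```python
from collections import deque
from dataclasses import dataclass, field

VOLATILE_SECONDS = 3600           # "one hour or so" ring buffer (assumed value)
NONVOLATILE_SECONDS = 24 * 3600   # "24 hours, for example" (assumed value)


@dataclass
class Camera:
    camera_id: str
    volatile: deque = field(default_factory=deque)   # (timestamp, frame), oldest first
    retained: dict = field(default_factory=dict)     # timestamp -> (expires_at, frame)
    audit_log: list = field(default_factory=list)    # discoverable record of every retain

    def record(self, now, frame):
        """Store a frame in volatile memory and drop anything that has aged out."""
        self.volatile.append((now, frame))
        self._expire(now)

    def retain(self, now, officer, reason):
        """Deliberate officer action: copy the current buffer to non-volatile storage."""
        self._expire(now)
        self.audit_log.append((now, self.camera_id, officer, reason))
        for ts, frame in self.volatile:
            self.retained.setdefault(ts, (now + NONVOLATILE_SECONDS, frame))

    def _expire(self, now):
        # Volatile frames older than the ring buffer window are lost.
        while self.volatile and now - self.volatile[0][0] > VOLATILE_SECONDS:
            self.volatile.popleft()
        # Retained frames vanish unless a court order moved them elsewhere in time.
        self.retained = {ts: v for ts, v in self.retained.items() if v[0] > now}
```

The point of the sketch is that deletion is the default path. Keeping anything requires an explicit call that always leaves an audit entry behind.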

In the case of a 9/11 or 7/7 type of event, the officer would simply toggle all cameras to non-volatile mode and would continue to re-enable non-volatile mode every hour for as long as necessary (days, if necessary). The action of toggling the cameras would again be recorded, documented and discoverable.

To prevent the system from being subverted by corrupt law enforcement (think J. Edgar Hoover and massive illegal surveillance), the systems would be physically sealed, and the software and storage for both volatile and non-volatile recording would be inaccessible to law enforcement.

There would be some form of crypto/hash/signing that enables tracking the recordings back to a specific camera and assures that the recordings have not been altered by law enforcement.
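One way to picture the 'crypto/hash/signing' idea is a keyed hash chain, where each segment's tag covers the camera identity, the segment itself and the previous tag. The sketch below uses Python's standard `hmac` and `hashlib` modules; the function names and the per-camera key are my assumptions. A real design would more likely use an asymmetric signature with the private key sealed inside the camera, since anyone holding a shared HMAC key could forge tags.

```python
import hashlib
import hmac


def sign_segment(key: bytes, camera_id: str, prev_tag: bytes, segment: bytes) -> bytes:
    """Tag one video segment, binding it to the camera and to the previous tag."""
    camera_digest = hashlib.sha256(camera_id.encode()).digest()  # fixed-length camera identity
    return hmac.new(key, prev_tag + camera_digest + segment, hashlib.sha256).digest()


def sign_recording(key: bytes, camera_id: str, segments: list) -> list:
    """Return one tag per segment; each tag depends on every earlier segment."""
    tags, prev = [], b"\x00" * 32
    for segment in segments:
        prev = sign_segment(key, camera_id, prev, segment)
        tags.append(prev)
    return tags


def verify_recording(key: bytes, camera_id: str, segments: list, tags: list) -> bool:
    """Recompute the chain and compare it with the stored tags in constant time."""
    expected = sign_recording(key, camera_id, segments)
    return len(expected) == len(tags) and all(
        hmac.compare_digest(e, t) for e, t in zip(expected, tags)
    )
```

Because the tags are chained, editing, dropping or reordering a segment invalidates every tag from that point on, and a recording attributed to the wrong camera fails verification outright.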

The key concepts are:

the system defaults to automatically destroying all recordings.
a sworn officer of the law must observe an event before triggering non-volatile storage.
specific actions are required to store the recordings
those actions are logged, documented and discoverable.
a court action of some sort is required for storage of any recording beyond a short period of time.
the system would be tamper-proof. The act of law enforcement tampering with the systems to defeat the privacy controls would be a felony.
the system would maintain the integrity of the recordings for as long as the video exists.

And most importantly, the software would be open source.

Cargo Cult System Administration

Cargo Cult:

…imitate the superficial exterior of a process or system without having any understanding of the underlying substance

--Wikipedia

During and after WWII, some native South Pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers and wait for cargo to fall from the sky.
The question is, how amusing are we?

We have cargo cult science[1], cargo cult management[2], and cargo cult programming[3]. How about cargo cult system management?

Here are some common system administration failures that might be ‘cargo cult’:

Failing to understand the difference between necessary and sufficient. A daily backup is necessary, but it may not be sufficient to meet RPO and RTO requirements.
Failing to understand the difference between causation and correlation.[4] Event A may have caused Event B, or some third event may have caused A and B, or the two events may be unrelated and coincidental.
Failing to understand the difference between cause and effect.
Following a security recipe without understanding the risks you are addressing. If you don't understand how hackers infiltrate your systems and exfiltrate your data, then your DLP, Firewalls, IDS, SIEM, etc. are cargo cult. You've built the superficial exterior of a system without understanding the underlying substance. If you do understand how your systems get infiltrated, then you'll probably consider simple controls like database and file system permissions and auditing as important as expensive, complex packaged products.
Asserting that [Technology O] or [Platform L] or [Methodology A] is inherently superior to all others and blindly applying it to all problems. When you make such claims, are you applying science or religion?

Systematic troubleshooting is one of the hardest parts of system management and often the first to 'go cargo'. Here are some examples:

Treating symptoms, not causes. A reboot will not solve your problem. It may make the problem go away for a while, but your problem still exists. You've addressed the symptom of the problem (memory fragmentation, for example), not the cause of the problem (a memory leak, for example).
Troubleshooting without a working hypothesis.
Changing more than one thing at a time while troubleshooting. If you make six changes and the problem went away, how will you determine root cause? Or worse, which of the six changes will cause new problems at a future date?
Making random changes while troubleshooting. Suppose you have a problem with an (application|operating system|database) and you hypothesize that changing a parameter will resolve the problem, so you change the parameter. If the problem reoccurs your hypothesis was wrong, right?
Troubleshooting without measurements or data.
Troubleshooting without being able to recreate the problem.
Troubleshooting application performance without a benchmark to compare performance against. If you don’t know what’s normal, how do you know what’s not normal?
Blaming the (network|firewall|storage) without analysis or hypothesis that points to either. One of our application vendors insisted that the 10 Mbps of traffic on a 100 Mbps interface was the cause of the slow application, and we needed to upgrade to GigE. We upgraded it (overnight), just to shut them up. Of course it didn't help. Their app was broke.
Blaming the user or the customer, without an analysis or hypothesis that points to them as the root cause. A better plan would be to actually find the problem and fix it.
Declaring that the problem is fixed without determining the root cause. If you don't know the root cause, but the problem appears to have gone away, you haven't solved the problem, you've only observed that the problem went away. Don't worry, it'll come back, just after you’ve written an e-mail to management describing how you’ve “fixed” the problem.
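The 'what's normal' question from the list above can be made concrete with a baseline. This is a minimal sketch assuming you already collect response times during known-good operation; the function name, the three-sigma threshold and the sample numbers are mine, not a recommendation for any particular monitoring product.

```python
from statistics import mean, stdev


def is_abnormal(baseline_ms, observed_ms, threshold=3.0):
    """Flag a measurement that sits more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    return abs(observed_ms - mu) > threshold * sigma


# Hypothetical response times (ms) captured while the application was healthy.
baseline = [120, 118, 125, 122, 119, 121]

print(is_abnormal(baseline, 123))  # within normal variation
print(is_abnormal(baseline, 400))  # far outside the baseline
```

Without the baseline list there is nothing to compare 400 ms against, which is the whole problem with troubleshooting performance by feel.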

It's easy to fall into cargo cult mode.
Just re-boot it, it'll be fine.

[1] Richard Feynman, Cargo Cult Science: http://www.lhup.edu/~DSIMANEK/cargocul.htm
[2] Mike Speiser, Cargo Cult Management: http://gigaom.com/2009/06/21/cargo-cult-management/
[3] Wikipedia: Cargo Cult Programming: http://en.wikipedia.org/wiki/Cargo_cult_programming
[4] L. Kip Wheeler, Correlation and Causation: http://cnweb.cn.edu/kwheeler/logic_causation.html

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:

“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures. Very few do and I see related down-time in the news every month or so.....We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”

I've had high visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan is something as simple as a manual intervention to disable features in a controlled manner rather than letting them fail in an uncontrolled manner.

On some applications we've been able to plan and execute graceful service degradation by disabling non-critical features. In one case, we disabled a scheduling widget in order to maintain sufficient headroom for more important functions like quizzing and exams. In other cases, we have the ability to limit the size of shopping carts or restrict financial aid and grade re-calcs during peak load.

Degraded operations isn't just an application layer concept. Network engineers routinely build forms of degraded operations into their designs. Networks have been congested since the day they were invented, and as you'd expect, the technology available for handling degraded operations is very mature. On a typical network, QOS (Quality of Service) policy and configuration is used to maintain critical network traffic and shed non-critical traffic.

As an example, on our shared statewide backbone, we assume that we'll periodically end up in some sort of degraded mode, either because a primary circuit has failed and the backup paths don't have adequate bandwidth, because we experience inbound DoS attacks, or perhaps because we simply don't have adequate bandwidth. In our case, the backbone is shared by all state agencies, public colleges and universities, including state and local law enforcement, so inter-agency collaboration is necessary when determining what needs to get routed during a degraded state.

A simplified version of the traffic priority on the backbone is:

Highest Priority	Router Traffic (BGP, OSPF, etc.)
	Law Enforcement
	Voice
	Interactive Video
	Intra-State Data
Lowest Priority	Internet Data

When the network is degraded, we presume that law enforcement traffic should be near the head of the queue. We consider interactive video conferencing to be business critical (i.e. we have to cancel classes when interactive classroom video conferencing is broke), so we keep it higher in the priority order than ordinary data. We have also decided that commodity Internet should be the first traffic to be discarded when the network is degraded.
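To make the priority order concrete, here is a toy strict-priority allocation. It is a simplification under my own assumptions: the class names mirror the table above, the demand figures are invented, and real routers shed traffic per packet with queues and schedulers rather than by dividing up megabits in one pass.

```python
# Traffic classes from highest to lowest priority, following the table above.
PRIORITY = [
    "routing",
    "law_enforcement",
    "voice",
    "interactive_video",
    "intra_state_data",
    "internet_data",
]


def allocate(demand_mbps, capacity_mbps):
    """Serve classes in priority order until the degraded link is full."""
    granted = {}
    remaining = capacity_mbps
    for traffic_class in PRIORITY:
        share = min(demand_mbps.get(traffic_class, 0), remaining)
        granted[traffic_class] = share
        remaining -= share
    return granted


# Hypothetical demand on a backup path that can carry only 300 Mbps.
demand = {
    "routing": 5,
    "law_enforcement": 20,
    "voice": 30,
    "interactive_video": 100,
    "intra_state_data": 200,
    "internet_data": 400,
}
print(allocate(demand, 300))
```

With these made-up numbers the top four classes are served in full, intra-state data is squeezed to 145 Mbps, and commodity Internet gets nothing, which matches the stated intent.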

Unfortunately on the part of the application stack that's hardest to scale, the database, there is no equivalent to network QOS or traffic engineering. As far as I know, I don't have the ability to tag a query or stored procedure with a few extra bits that tell the database engine to place the query at the head of the work queue, discarding other less important work if necessary. It's not hard to imagine a 'discard eligible' bit that could be set on certain types of database processes or on work submitted by certain clients. The database, if necessary, would discard that work, or place the work in a 'best effort' scheduling class and run it if and when it has free CPU cycles.
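Here is roughly what I imagine a 'discard eligible' bit doing, written as a toy work queue rather than anything a real database engine exposes. The `WorkQueue` class, its capacity limit and the job names are all hypothetical.

```python
import heapq
import itertools


class WorkQueue:
    """Bounded queue: lower priority number runs first; marked work may be shed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                  # (priority, sequence, name, discard_eligible)
        self._seq = itertools.count()    # tie-breaker so equal priorities stay FIFO
        self.discarded = []

    def submit(self, name, priority, discard_eligible=False):
        heapq.heappush(self._heap, (priority, next(self._seq), name, discard_eligible))
        if len(self._heap) > self.capacity:
            self._shed()

    def _shed(self):
        # Only work carrying the discard-eligible bit may be dropped.
        eligible = [item for item in self._heap if item[3]]
        if not eligible:
            return  # nothing sheddable; the queue is allowed to grow
        victim = max(eligible)  # least important, most recently submitted
        self._heap.remove(victim)
        heapq.heapify(self._heap)
        self.discarded.append(victim[2])

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A short usage example: with room for only two jobs, the report is shed and the exam submission runs first.

```python
q = WorkQueue(capacity=2)
q.submit("nightly_report", priority=5, discard_eligible=True)
q.submit("grade_recalc", priority=3, discard_eligible=True)
q.submit("submit_exam", priority=0)
print(q.discarded, q.next_job())
```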

If the engineers at the major database vendors would Google 'Weighted Fair Queuing' or 'Weighted Random Early Detect' we might someday see interesting new ways of managing degraded databases.

Creative Server Installs - WAN Boot on Solaris (SPARC)

Sun's SPARC servers have the ability to boot a kernel and run an installer across a routed network using only HTTP or HTTPS. On SPARC platforms, the (BIOS|Firmware|Boot PROM) can download a bootable kernel and mini root file system via HTTP/HTTPS, boot from the mini root, and then download and install Solaris. This allows booting a server across a local or wide area network without having any bootable media attached to the chassis. All you need is a serial console, a network connection, an IP address, a default gateway and a web server that's accessible from the bare SPARC server. You set a few variables, then tell it to boot. Yep, it's cool.

Pandemic Planning – The Dilbert Way

This one is too good to pass up:

https://dilbert.com/strip/2009-10-24

Deciding who is important is interesting.

Senior management wants to see a plan. Middle manager needs to decide who is important. If Middle Manager says only 8 of 20 are critical, what does that say about the other 12? The only answer that most managers offer is ‘all my employees are critical to the enterprise’.

No, we are not running out of bandwidth.

The sky is falling! The sky is falling!

Actually, we’re running out of bandwidth (PDF) (again).

Supposedly all the workers who stay home during the pandemic will use up all the bandwidth in the neighborhood. Let me guess, instead of surfing p0rn and hanging out on Reddit all day at work, they’ll be surfing p0rn and hanging out on Reddit all day from home.

The meat of the study:

“Specifically, at the 40 percent absenteeism level, the study predicted that most users within residential neighborhoods would likely experience congestion when attempting to use the Internet”

So what’s the problem?

“..under a cable architecture, 200 to 500 individual cable modems may be connected to a provider’s CMTS, depending on average usage in an area. Although each of these individual modems may be capable of receiving up to 7 or 8 megabits per second (Mbps) of incoming information, the CMTS can transmit a maximum of only about 38 Mbps.”

Ooops – someone is oversubscribed just a tad. At least now we know how much.
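The arithmetic behind 'just a tad', using only the figures quoted from the report (200 to 500 modems, 7 to 8 Mbps each, about 38 Mbps per CMTS). The function names are mine.

```python
CMTS_MBPS = 38  # maximum the CMTS can transmit, per the quoted report


def oversubscription(modems, modem_mbps, cmts_mbps=CMTS_MBPS):
    """Ratio of what the modems could pull to what the CMTS can deliver."""
    return modems * modem_mbps / cmts_mbps


def fair_share_mbps(modems, cmts_mbps=CMTS_MBPS):
    """Per-modem bandwidth if every modem were busy at the same moment."""
    return cmts_mbps / modems


low = oversubscription(200, 7)    # best case in the quoted range
high = oversubscription(500, 8)   # worst case in the quoted range

print(f"oversubscription: {low:.1f}:1 to {high:.1f}:1")
print(f"fair share: {fair_share_mbps(500) * 1000:.0f} to {fair_share_mbps(200) * 1000:.0f} kbps")
```

That works out to somewhere between roughly 37:1 and 105:1, or 76 to 190 kbps per modem if everyone downloads at once.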

Wait – 40% of the working population is at home, working or sick, bored to death, surfing the web. And they’ll be transferring large documents! Isn’t that what we call the weekend? So is the Internet broke on weekends? If so, I never noticed.

What about evenings? We have a secondary utilization peak on our 24/7 apps around 10pm local time. That peak is almost exclusively people at home, working. Presumably this new daytime peak will dwarf the late evening peak?

Here’s a reason to panic. If it gets bad enough, the clowns who threw away ten trillion dollars of other people’s money on math they didn’t understand will not be able to throw away other people’s money while telecommuting:

“If several of these large firms were unable or unwilling to operate, the markets might not have sufficient trading volume to function in an orderly or fair way.”

My thought? Slow them down. When they flew at Mach 2, they smacked into a wall and took us with them.

Got to love this:

“Providers identified one technically feasible alternative that has the potential to reduce Internet congestion during a pandemic, but raised concerns that it could violate customer service agreements and thus would require a directive from the government to implement.”

Provider: “Yah know? It’d be cool if we could get the government to make us throttle that bandwidth. Yep, that’d be cool.”

How about Plan B – shut off streaming video:

“Shutting down specific Internet sites would also reduce congestion, although many we spoke with expressed concerns about the feasibility of such an approach.”

Wait – isn’t that what CDNs are for? The Akamai cache at the local ISP has the content, all that matters is the last mile, right? For reference, with a few hundred thousand students working hard at surfing the web all day, we slurp up about 1/3 of our Internet bandwidth from the local Akamai rack directly attached to the Internet POP (settlement free), and another 1/3 by peering directly with big content providers (also settlement free).

I’m not worried about bandwidth. If any of this were serious, we’d have been able to detect the effect of 10% unemployment on home bandwidth. Or the Internet would have broken during the 2008 election. Or what’s-his-name’s sudden death.

A more interesting potential outcome of a significant pandemic would be the gradual degradation of services as the technical people get sick and/or stay home with their families. I’d expect a significantly longer MTTR on routine outages during a real pandemic.

Would the cable tech show up at my house today, with two people flu’d out? Not if she’s smart.

Note:

The report is oriented toward the financial sector. The trades must go on. There are quarterly bonuses to be had.
The DHS commented on the draft in the appendices. They’ve attempted to inject a bit of rationality into the report.

Maintenance, Downtime and Outages

Via Data Center Knowledge - Maintenance, Downtime and Outages a quote from Ken Brill of The Uptime Institute:

"The No. 1 reason for catastrophic facility failure is lack of electrical maintenance,” Brill writes. “Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.”

Data Loss, Speculation & Rumors

My head wants to explode. Not because of Microsoft's catastrophic data loss. That's a major failure that should precipitate a significant number of RGEs at Microsoft.

My head wants to explode because of the speculation, rumors and misinformation surrounding the failure. An example: rumors reported by Daniel Eran Dilger in Microsoft’s Sidekick/Pink problems blamed on dogfooding and sabotage that point to intentional data loss or sabotage.

"Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure."

and

"the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a disgruntled employee."

and

"someone with access to the servers at the datacenter must have inserted a time bomb to wipe out not just all of the data, but also all of the backup tapes, and finally, I suspect, reformatting the server hard drives so that the service itself could not be restarted with a simple reboot (and to erase any traces of the time bomb itself)."

Intentional sabotage?

How about some rational speculation, without black helicopters & thermite charges? How about simple mis-management? How about failed backup jobs followed by a failed SAN upgrade? I've had both backup and SAN failures in the past & fully expect to have both types of failures again before I retire. Failed backups are common. Failed SANs and failed SAN upgrades are less common, but likely enough that they must be accounted for in planning.

Let's assume, as Dan speculates, it’s an Oracle RAC cluster. If all the RAC data is on a single SAN, with no replication or backups to separate media on separate controllers, then a simple human error configuring the SAN can easily result in spectacular failure. If there is no copy of the Oracle data on separate media & separate controllers on a separate server, you CAN lose data in a RAC cluster. RAC doesn't magically protect you from SAN failure. RAC doesn't magically protect you from logical database corruption or DBA fat fingering. All the bits are still stored on a SAN, and if the SAN fails or the DBA fails, the bits are gone. Anyone who doesn't think that's true hasn't driven a SAN or a database for a living.

We've owned two IBM DS4800s for the last four years and have had three controller related failures that could have resulted in data loss had we not taken the right steps at the right time (with IBM advanced support looking over our shoulders). A simple thing like a mismatched firmware somewhere between the individual drives and the controllers or the HBAs and the operating system has the potential to cause data loss. Heck - because SAN configs can be stored on the individual drives themselves, plugging a used drive with someone else's SAN config into your SAN can cause SAN failure - or at least catastrophic, unintentional SAN reconfiguration.

I've got an IBM doc (sg246363) that says:

"Prior to physically installing new hardware, refer to the instructions in IBM TotalStorage DS4000 hard drive and Storage Expansion Enclosure Installation and Migration Guide, GC26-7849, available at: [...snip...] Failure to consult this documentation may result in data loss, corruption, or loss of availability to your storage."

Does that imply that plugging the wrong thing into the wrong place at the wrong time can 'f up an entire SAN? Yep. It does, and from what the service manager for one of our vendors told me, it did happen recently – to one of the local Fortune 500’s.

If you don't buy that speculation, how about a simple misunderstanding between the DBA and the SAN team?

DBA: "So these two LUNs are on separate controllers, right?"

SAN: "Yep."

Does anyone work at a place where that doesn't happen?

As for the ‘insider’ quote: "I don't understand why they would be spending any time upgrading stuff", a simple explanation would be that somewhere on the SAN, out of the hundreds of attached servers, a high-profile project needed the latest SAN controller firmware or feature. Let's say, for example, that you wanted to plug a shiny new Windows 2008 server into the fabric and present a LUN from an older SAN. You'd likely have to upgrade the SAN firmware. To have a supported configuration, there is a fair chance that you'd have to upgrade HBA firmware and HBA drivers on all SAN attached servers. The newest SAN controller firmware that's required for 'Project 2008' then forces an across the board upgrade of all SAN attached servers, whether they are related to ‘Project 2008’ or not. It's not like that doesn't happen once a year or so, and it’s the reason that SAN vendors publish ‘compatibility matrixes’.

And upgrades sometimes go bad.

We had an HP engineer end up hospitalized a couple days before a major SAN upgrade. A hard deadline prevented us from delaying the upgrade. The engineer's manager flew in at the last minute, reviewed the docs, did the upgrade - badly. Among other f'ups, he presented a VMS LUN to a Windows server. The Windows server touched the LUN & scrambled the VMS file system. A simple error, a catastrophic result. Had that been a RAC LUN, the database would have been scrambled. It happened to be a VMS LUN that was recoverable from backups, so we survived.

Many claim this is a cloud failure. I don't. As far as I can see, it's a service failure, plain and simple, independent of how it happens to be hosted. If the data was stored on an Amazon, Google or Azure cloud, and if the Amazon, Google or Azure cloud operating system or storage software scrambled and/or lost the data, then it'd be a cloud failure. The data appears to have been on ordinary servers in an ordinary database on an ordinary SAN.

That makes this an ordinary failure.

A Zero Error Policy – Not Just for Backups

In What is a Zero Error Policy, Preston de Guise articulates the need for aggressive follow up and resolution on all backup related errors. It’s a great read.

Having a zero error policy requires the following three rules:

All errors shall be known.

All errors shall be resolved.

No error shall be allowed to continue to occur indefinitely.

and

I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

I agree. This is a great summary of an important philosophy.

Don’t apply this just to backups though. It doesn’t matter what the system is, if you ignore the little warning signs, you’ll eventually end up with a major failure. In system administration, networks and databases, there is no such thing as a ‘transient’ or ‘routine’ error, and ignoring them will not make them go away. Instead, the minor alerts, errors and events will re-occur as critical events at the worst possible time. If you don’t follow up on ‘routine’ errors, find their root cause and eliminate them, you’ll never have the slightest chance of improving the security, availability and performance of your systems.

I could list an embarrassing number of situations where I failed to follow up on a minor event and had it cascade to a major, service affecting event. Here’s a few examples:

A strange undecipherable error when plugging a disk into an IBM DS4800 SAN. IBM didn’t think it was important. A week later I had a DS4800 with a hung mirrored disk set & a 6 hour production outage.
A pair of internal disks on a new IBM 16 CPU x460 that didn’t perform consistently in a pre-production test with IoZone. During some tests, the whole server would hang for a minute & then recover. IBM couldn’t replicate the problem. Three months later the drives on that controller started ‘disappearing’ at random intervals. After three more months, a hundred person-hours of messing around, uncounted support calls and a handful of on site part-swapping fishing expeditions, IBM finally figured out that they had a firmware bug in their OEM’d Adaptec RAID controllers.
An unfamiliar looking error on a DS4800 controller at 2am. Hmmm… doesn’t look serious, let’s call IBM in the morning. At 6am, controller zero dropped all its LUNs and the redundant controller claimed cache consistency errors. That was an 8 hour outage.

Just so you don’t think I’m picking on IBM:

An HA pair of Netscaler load balancers that occasionally would fail to sync their configs. During a routine config change a month later, the secondary crashed and the primary stopped passing traffic on one of the three critical apps that it was front-ending. That was a two hour production outage.
A production HP file server cluster that was fiber channel attached to both a SAN and a tape library would routinely kick out tapes and mark them bad. Eventually it happened often enough that I couldn’t reliably back up the cluster. The cluster then wedged itself up a couple times and caused production outages. The root cause? An improperly seated fiber channel connector. The tape library was trying really, really hard to warn me.

In each case there was plenty of warning of the impending failure and aggressive troubleshooting would have avoided an outage. I ignored the blinking idiot lights on the dashboard and kept driving full speed.

I still end up occasionally passing over minor errors, but I’m not hiding my head in the sand hoping it doesn’t return. I do it knowing that the error will return. I’m simply betting that when it does, I’ll have better logging, better instrumentation, and more time for troubleshooting.

Content vs. Style - modern document editing

On Ars Technica, Jeremy Reimer writes great thoughts on how we use word processing.

His description of modern document editing:

Go into any office today and you'll find people using Word to write documents. Some people still print them out and file them in big metal cabinets to be lost forever, but again this is simply an old habit, like a phantom itch on a severed limb. Instead of printing them, most people will email them to their boss or another coworker, who is then expected to download the email attachment and edit the document, then return it to them in the same manner. At some point the document is considered "finished", at which point it gets dropped off on a network share somewhere and is then summarily forgotten...

"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."

Personal observations on the reliability of the Shuttle

by R. P. Feynman

Infrastructure – Security and Patching

An MRI machine hosting Conficker:

“The manufacturer of the devices told them none of the machines were supposed to be connected to the Internet and yet they were […] the device manufacturer said rules from the U.S. Food and Drug Administration required that a 90-day notice be given before the machines could be patched.”

Finding an unexpected open firewall hole or a device that isn’t supposed to be on the Internet is nothing new or unusual. If someone asked “what’s the probability that a firewall has too many holes” or “how likely is it that something got attached to the network that wasn’t supposed to be”, in both cases I’d say the probability is one.

Patching a machine that can’t be patched for 90 days after the patch is released is a pain. It’s an exception, and exceptions cost time and money.

Patching a machine that isn’t supposed to be connected to the Internet is a pain. I’m assuming that one would need to build a separate ‘dark net’ for the machines. I can’t imagine walking around with a CD and patching them.

Locating and identifying every operating system instance in a large enterprise is difficult, especially when the operating systems are packaged as a unit with an infrastructure device of some sort. Assuring that they all are patched is non-trivial. When vendors package an operating system (Linux, Windows) in with a device, they rarely acknowledge that you or they need to harden, patch, and update that operating system.

Major vendors have Linux and Windows devices that they refer to as ‘SAN Management Appliances’, ‘Enterprise Tape Libraries’, and ‘Management Consoles’. They rarely acknowledge that the underlying OS needs to be hardened and patched, and sometimes even prohibit customer hardening and patching. The vendor supplies a ‘turnkey system’ or ‘appliance’ and fails to manage the patches on the same schedule as the OS that they embedded into their ‘appliance’.

This isn’t a Microsoft problem. Long before Windows was considered fit to be used for infrastructure devices (building controls, IVR, HVAC, etc) hackers were routinely root kitting the Solaris and Linux devices that were running the infrastructure. We tend to forget that though.

Off Topic: Stadium Construction Resumes (Update)

UPDATE: The Italian authorities have responded to the exposure of the corruption surrounding the construction of the stadium. The Minister of Stadiums has investigated the mismanagement and corruption, issued a report, and taken corrective action. The corrupt and incompetent officials have been identified, prosecuted and punished. Additionally, the authorities have agreed to resolve other outstanding issues that have prevented the completion of the project.

Punishment

In an unusual turn of events, the authorities have issued severe consequences for the perpetrators of the scandal. Here you can see the gruesome results of this most effective punishment. The statues of the wives of the corrupt officials have been decapitated (NSFW).

Shocking though it may be, this extreme punishment, reserved exclusively for the most severe offenses, has a long tradition in Italian culture. Over the millennia, many statues of spouses of famous officials have been similarly mutilated and placed for exhibition in great air conditioned halls. In many cases admission is charged for viewing the mutilated statues.

Improved Security

To prevent a reoccurrence of the material thefts, guards have been hired to prevent the misappropriation of construction materials. In this exclusive undercover photo, taken at great risk by staff reporters of this publication, guards in uniform are secretly training for their role in the protection of the project. The Minister of Stadiums has assured us that the guards have been trained to defend the construction materials using all means available. Unnamed officials have confirmed that in extreme cases, guards may be authorized to defend the construction materials with realistic plastic swords.

To supplement the guards, the Ministry has installed special road materials on all paths leading from the construction site. These roads are cleverly designed to impart a vibration of a specific destructive frequency and amplitude to any ~~Chrysler~~ Fiat that passes over the roads. The ~~Chrysler~~ Fiats are not expected to traverse more than half way up the path before requiring the intervention of a mechanic, giving the authorities sufficient time to apprehend the thieves.

Resumption of Construction

The security measures have been deemed sufficient to permit the resumption of construction on the stadium. New groups of highly skilled workers are restarting the project. These workers have been specially recruited from all parts of the world, arriving daily in great numbers by airplane, bus, car and train. As you can see in the photograph to the right, the Italian authorities have spared no expense in this area.

As shown in the photo below, the new workers have made great progress in the very short amount of time since the resumption of construction. Scaffolding has already been erected and workers have restarted construction on the western facade.

Authorities have assured this reporter that the problems have been resolved and are optimistic that the new workers will make rapid progress on the stadium.

Enhancements to the Original Design

Additionally, according to stadium officials, significant improvements will be made to the original design. The changes are specifically designed to improve the usability and assure a steady revenue stream. Assurances have been given that the additions to the project scope will not increase the cost of the project nor delay its completion.

In the first enhancement, a new steel playing field is being built to replace the original wood and masonry field. The new field is a significant improvement to the original field and is expected to last as long as the construction project. The new playing field, shown here, is already partially constructed. Authorities claim the new field will result in less severe injuries and fewer weather related event cancellations. Additionally, the new playing field will permit the display of banner ads directly on the surface of the field.

As with modern stadiums, special seating will be built for sponsors and executives. A premium will be charged for the seating and the proceeds will be used to finance the completion of the stadium. The special seating is expected to generate significant revenue toward the completion of the stadium. Pictured here is the entrance to one of the reserved sections. Notice that the executives and sponsors will be separated from the ordinary visitors by modern, unobtrusive security systems. Studies have shown that separate seating for sponsors and executives will result in higher revenue for the normal seating sections, as the plebeian attendance will not be adversely affected by the antics of the upper classes.

In the finest of classical Roman tradition, the perimeter of the stadium will be ringed with great works of art specially commissioned for this project. Authorities have commissioned a statue for each entry gate IV through XXIX.

One such work, shown here, represents a French interpretation of a Roman copy of a Greek original of the goddess Giunone, despondent over the transgressions of her consort Jupiter. The statue is completed and is ready to be moved to its place near gate XXIV.

Other great works of art are ready for installation. Here you see a French copy of a Roman interpretation of a Greek original of the same goddess Giunone, after Jupiter came home from work and discovered her in a compromising position with an unnamed consort. This statue is scheduled to be installed near gate XVIII.

To demonstrate the sincerity of the officials, the Ministry of Stadiums has permitted an exclusive inside tour of the factory that produces the great works of art. Below is a photograph of the nursery where the great statues are grown. Viewing from right to left, you can clearly see the maturation of the statues from larvae through pupae, nymph and bewilderment stages.

Completion of the Stadium

Unnamed officials of the Ministry of Stadiums have confirmed that the newly revised schedule for the completion of construction is uncharacteristically aggressive. The officials have assured this reporter that although they have specified 64 bit time_t structs for the project management software, the project will be completed before they are required. Project specifications listed Unix 2038 time compatibility as a requirement only because of EU regulations.

Conclusion

There has been much written about the demise of mainstream newspapers and the effect on the future of investigative reporting. By the example of this report, one should rest assured that the new media stand ready to expose and document the corruption, inefficiencies and transgressions of governments throughout the world.

Oh – and if you haven’t figured it out yet, it’s a joke.

Off Topic: Stadium Construction Scandal

In the center of Rome are the remnants of a large stadium. Tradition tells that the stadium was completed during the period of the Roman Empire and allowed to decay, unmaintained, during the centuries since construction.

This photograph, taken with a special filter and timed exactly as the planets Venus and Mars intersected a polyline bounded by the vertices of the tops of the arches and extending into space, clearly shows that we have been misled about the true origins of the facility. Careful analysis of the photo shows that the stadium is not a decaying ghost of a once great stadium, but rather it is an uncompleted, scandal-ridden construction project gone bad.

By normal Italian standards, construction projects of this nature typically take decades. Corruption and mis-management are assumed, delays are inevitable. But even by those standards, after nearly two millennia the stadium should have been completed. To give a comparative example, the construction on the shopping center in the photograph shown here was started at about the same time as the stadium, and as you can see, today it is nearly 90% complete.

How could it be that, after nearly two millennia of construction, the stadium is not yet complete? Through careful translation of long lost documents and inscriptions on pottery shards, the story of the stadium can finally be told.

The Scandal

The Emperor Flaviodius started building the stadium in 72 AD. Over time it became clear that the Emperor was not particularly adapted to managing large projects. Little did the emperor know that this would someday be the largest construction scandal in the western world.

Shortly after construction began, the ever generous Emperor Flaviodius made provisions for the entertainment of the construction workers. A large circus (Circus Maximus) was built near the site of the stadium. In the premier act of the circus, specially trained Christians performed great feats of daring with hungry lions. The entertainment was such that the workers spent much time in amusement and very little time working on the stadium. After a period of time, Flaviodius discontinued the entertainment, unfortunately without formal consultation with the workers’ bargaining units. The workers responded with a decades-long work slowdown, causing delays and cost overruns. Flaviodius eventually compensated the workers for the missing entertainment and work resumed.

Scandal again disrupted the schedule shortly after construction resumed. Emperor Flaviodius had to be removed from the project after a late night altercation with visiting Goths (Visigoths). The Emperor, shown above just after the altercation, suffered a broken nose and a bruised chin. The visiting Goths appear to the right in an undated photo, apparently unaware that what seemed to be a minor incident has delayed the largest construction project in Rome. The presence of the Goths, who are culturally averse to large structures, appears to have caused a work stoppage lasting several centuries.

As with many large projects, revisions to the plans were frequent and seemingly random. Ancient sources indicate that later project managers authorized changes to the shape, orientation, color and number of tiers of seating. The scope changes resulted in a series of project extensions, forcing significant re-work and lost time. Additionally, the project documentation requirements were such that handwritten documentation was impossible to maintain, thereby bringing the project to a standstill until the printing press could be invented.

A major problem throughout the construction was the theft of building materials. When one walks through Rome today, one sees fragments of brick and marble originally purchased for the stadium randomly incorporated into the foundations of other, newer buildings. While theft is common in building projects, in this case it appears to have been on a grand scale. The building material theft caused the construction to stall for much of the period that we call the middle ages. The photograph above shows an example, even today, of construction materials lying about completely unguarded.

Further delays were apparently caused by a miscommunication between the powerful Italian construction worker unions and the authorities. Unnamed sources indicated that although the union declared a strike, the authorities failed to receive the notification due to the fact that the postal union was also on strike. Because the work on the construction had not noticeably slowed during the strike, it was several hundred years before the authorities noticed the walkout. Negotiations have yet to be restarted.

The investigation continued with calls and e-mails to unnamed officials. When presented with incontrovertible evidence of the scandal, few officials were willing to acknowledge the corruption, mismanagement, tangente and incompetence, fearing that the resulting scandal would jeopardize their pension benefits.

Stay tuned for more information as the scandal of the stadium unfolds.

nplus1.org – A Crash Course in Failure

One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.

In A Crash Course in Failure, Craig Stuntz discusses the concept of building crash only software – or software for which a crash and a normal shutdown are functionally equivalent.

Highlights:

“Hardware will fail. Software will crash. Those are facts of life.”
"…if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself."
"…maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts."
“…it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.”

Why shouldn't continuous and automatic state saving be the default for any/all applications? A CAD system I bought in 1984 did exactly that. If the system crashed or terminated abnormally, the post-crash reboot would do a complete 'replay' of every edit since the last normal save. In fact you'd have to sit and watch every one of your drawing edits in sequence like a VCR on fast forward, a process that was usually pretty amusing in a Keystone Cops sort of way. It can't be that hard to either append serialized changes to the end of the document and only re-write the whole doc when the user explicitly saves, or journal every change to another file. That CAD system did it twenty-five years ago on a 4 MHz CPU and 8" floppies. Some applications are at least attempting to gracefully recover after a crash, a step in the right direction. It certainly is not any harder than what Etherpad does, and they are doing it multi-user, real time, on the Internet.
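To make the 'journal every change' idea concrete, here is a minimal Python sketch. It is not how that CAD system or Etherpad works; the class name, the file layout (a JSON snapshot plus a `.journal` file) and the two toy edit operations are all assumptions made up for illustration. The point is only that startup and crash recovery are the same code path: load the last explicit save, then replay whatever the journal holds.

```python
import json
import os


class JournaledDocument:
    """Toy document (hypothetical) that journals each edit before applying it."""

    def __init__(self, path):
        self.path = path                       # snapshot from the last explicit save
        self.journal_path = path + ".journal"  # edits made since that snapshot
        self.lines = []
        self.seq = 0                           # sequence number of the last applied edit
        self._recover()

    def _recover(self):
        # Normal startup and post-crash startup are the same thing.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                snapshot = json.load(f)
            self.lines, self.seq = snapshot["lines"], snapshot["seq"]
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, "rb+") as f:
            good = 0
            for raw in f:
                if not raw.endswith(b"\n"):
                    break                      # torn final record from the crash
                try:
                    record = json.loads(raw)
                except ValueError:
                    break                      # garbage; stop replaying here
                if record["seq"] > self.seq:   # skip edits already in the snapshot
                    self._apply(record)
                    self.seq = record["seq"]
                good += len(raw)
            f.truncate(good)                   # drop the torn tail so appends stay clean

    def _apply(self, record):
        if record["op"] == "append":
            self.lines.append(record["text"])
        elif record["op"] == "delete_last" and self.lines:
            self.lines.pop()

    def edit(self, op, text=""):
        record = {"seq": self.seq + 1, "op": op, "text": text}
        with open(self.journal_path, "ab") as f:
            f.write(json.dumps(record).encode("utf-8") + b"\n")
            f.flush()
            os.fsync(f.fileno())               # on disk before the edit is acknowledged
        self._apply(record)
        self.seq = record["seq"]

    def save(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"seq": self.seq, "lines": self.lines}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)             # atomic swap to the new snapshot
        if os.path.exists(self.journal_path):
            os.remove(self.journal_path)       # journal is now redundant
```

Killing the process at any point and constructing `JournaledDocument(path)` again should bring back every acknowledged edit. The sequence numbers are there so that a crash between the snapshot swap and the journal delete does not replay edits twice.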

“Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures. … If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge.” -- Michael Nygard

References:
A Crash Course in Failure, Craig Stuntz
Design your Failure Modes, Michael Janke
'Everything will ultimately fail', Michael Nygard

Error Handling – an Anecdote

A long time ago, shortly after the University I was attending migrated students off of punch cards, I had an assignment to write a batch based hotel room reservation program. We were on top of the world - we had dumb terminals instead of punch cards. The 9600 baud terminals were reserved for professors, but if you got lucky, [WooHoo!] you could get one of the 4800 baud terminals instead of a 2400 or 1200 baud DECwriters.

The instructor's mantra - I'll never forget - is that students need to learn how to write programs that gracefully handle errors. 'You don't want an operator calling at 2am telling you your program failed. That sucks.' He was a part time instructor and full time programmer who got tired of getting woken up, and he figured that we needed our sleep, so he made robustness part of his grading criteria.

Here's how he made that stick in my mind for 30 years: When the assignment was handed to us, the instructor gave us the location of sample input data files to use to test our programs. The files were usually laced with data errors. Things like short records, missing fields and random ASCII characters in integer fields were routine, and we got graded on our error handling, so students quickly learned to program with a healthy bit of paranoia and lots of error checking.

That was a great idea and we learned fast. But here's how he caught us all: A few hours before the assignment was due, the instructor gave us a new input file that we had to process with our programs, the results of which would determine our grade.

What was in the final data file?

……[insert drum roll here]……

Nothing. It was a zero byte file.

Try to picture this - the data wasn’t available until a couple hours before the deadline, it was a frantic dash to get a terminal (long lines of students on most days, especially at the end of the semester), edit the source file to gracefully handle the error and exit (think ‘edlin’ or ‘ed’ ), submit it into the batch queue for the compiler (sometimes that queue was backed up for an hour or more) and re-run it against the broken data file, all by the deadline.

How many students caught that error the first time? Not many, certainly not me. My program crashed and I did the frantic thing. The rest of the semester? We all had so damned many paranoid if-thens in our code you'd probably laugh if you saw it.

He was teaching us to think about building robust programs - to code for what goes wrong, not just what goes right. For him this was an availability problem, not a security problem. But what he taught is relevant today, except the bad guys are feeding your programs the data, not your instructor. That makes it a security problem.
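For what it's worth, here is roughly what those paranoid if-thens look like in Python today. This is a sketch, not the original assignment: the comma-separated `guest,room,nights` record layout and the function name are assumptions invented for the example. The zero byte file is just one more case that gets reported instead of crashing the run.

```python
def load_reservations(path):
    """Return (reservations, errors). Bad input is reported, never raised."""
    reservations, errors = [], []
    try:
        # errors="replace" keeps stray non-ASCII bytes from aborting the read.
        with open(path, encoding="ascii", errors="replace") as f:
            lines = f.read().splitlines()
    except OSError as exc:
        return reservations, [f"cannot read {path}: {exc}"]

    if not lines:
        # The zero byte file: a condition to report, not a crash.
        return reservations, ["input file is empty"]

    for lineno, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            errors.append(f"line {lineno}: expected 3 fields, got {len(fields)}")
            continue
        guest, room, nights = fields
        if not guest:
            errors.append(f"line {lineno}: missing guest name")
            continue
        if not (room.isdigit() and nights.isdigit()):
            errors.append(f"line {lineno}: room and nights must be integers")
            continue
        room_no, night_count = int(room), int(nights)
        if room_no == 0 or night_count == 0:
            errors.append(f"line {lineno}: room and nights must be positive")
            continue
        reservations.append((guest, room_no, night_count))
    return reservations, errors
```

The caller decides what to do with the error list (log it, reject the batch, page nobody at 2am). The same checks are what stand between the program and hostile input when the data comes from someone other than an instructor.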

I can't remember the operating system or platform (PDP-something?), I can't remember the language (Pascal, I think, but we learned SNOBOL and FORTH in that class too, so it could have been one of those), but I'll never forget that !@$%^# zero byte file!

Sometimes Hardware is Cheaper than Programmers

In Hardware is Expensive, Programmers are Cheap II I promised that I’d give an example of a case where hardware is cheap compared to designing and building a more efficient application. That post pointed out a case where a relatively small investment in program optimization would have paid itself back by dramatic hardware savings across a small number of the software vendor’s customers.

Here’s an example of the opposite.

Circa 2000/2001 we started hosting an ASP application running on x86 app servers with a SQL server backend. The hardware was roughly 1Ghz/1GB per app server. Web page response time was a consistent 2000ms. Each app server could handle no more than a handful of page views per second.

By 2004 or so, application utilization grew enough that the page response time and the scalability (page views per server per second) were both considered unacceptable. We did a significant amount of investigation into the application, focusing first on the database, and then on the app servers. After a week or so of data gathering, we determined that the only significant bottleneck was a call to an XSLT/XML transformation function. The details escape me – and aren’t really relevant anyway, but what I remember is that most of the page response time was buried in that library call, and that call used most of the app server CPU. Figuring out how to make the app go faster was pretty straightforward.

The app servers were CPU bound on a single library call.
The library wasn’t going to get re-written or optimized with any reasonable work effort. (If I remember correctly, it was a Microsoft provided library; the software developers’ only option would have been a major re-write.)
The servers were somewhere around 4 years old and due for a routine replacement.
The new servers would clock 3x as fast, have better memory bandwidth and larger caches. The CPU bound library call would likely scale with processor clock speed, and if it fit in the processor cache might scale better than clock.

Conclusion: Buy hardware. In this case, two new app servers replaced four old app servers, the page response time improved dramatically, and the page views per server per second went up enough to handle normal application growth. It was clear that throwing hardware at the problem was the simplest, cheapest way to make it go away.

In The Quarter Million Dollar Query I outlined how we attached an approximate dollar cost to a specific poorly performing query. “The developers - who are faced with having to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - likely will favor the former over the latter.”

Unless of course they have data comparing hardware, software licenses and hosting costs to their development costs. My preference is to express the operational cost of solving a performance problem in ‘programmer-salaries’ or ‘programmer-months’. Using units like that helps bridge the communication gap.

My conclusion in that post: “To properly prioritize the development work effort, some rational measurement must be made of the cost of re-working existing functionality to reduce [server or database] load versus the value of using that same work effort to add user requested features.”

The Quarter Million Dollar Query
Hardware is Expensive, Programmers are Cheap
Hardware is Expensive, Programmers are Cheap II

Cisco IOS hints and tricks: What went wrong: end-to-end ATM

I enjoy reading Ivan Pepelnjak's Cisco IOS hints and tricks blog. Having been a partner in a state wide ATM wide area network that implemented end to end RSVP, his thoughts on What went wrong: end-to-end ATM are interesting.

I can't figure out how to leave a comment on his blog though, so I'll comment here:

I'd add a couple more reasons for ATM's failure.

(1) Cost. Host adapters, switches and router interfaces were more expensive. ATM adapters used more CPU, so larger routers were needed for a given bandwidth.

(2) Complexity, especially on the LAN side. (On a WAN, ATM isn't necessarily more complex than MPLS for a given functionality. It might even be simpler).

(3) 'Good enough' QOS on ethernet and IP routing. Inferior to ATM? Yes. Good enough? Considering the cost and complexity of ATM, yes.

Ironically, core IP routers maintain a form of session state anyway (CEF).

On an ATM wide area network, H.323 video endpoints would connect to a gatekeeper and request a bandwidth allocation for a video call to another endpoint (384kbps for example). The ATM network would provision a virtual circuit and guarantee the bandwidth and latency end to end. There was no 'best effort'. If bandwidth wasn't available, rather than allowing new calls to overrun the circuit and degrade existing calls, the new call attempt would fail. If a link failed, the circuit would get re-routed at layer 2, not layer 3. Rather than band-aid-add-on QoS like DSCP and priority queuing, ATM provided reservations and guarantees.
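The reserve-or-refuse behavior is easy to sketch in a few lines of Python. This is only the bookkeeping idea, not ATM signalling or anything a real switch or gatekeeper runs; the `Link` class, the function names and the link capacities are hypothetical.

```python
class Link:
    """One hop in the path, with a fixed capacity and a running reservation total."""

    def __init__(self, capacity_kbps):
        self.capacity_kbps = capacity_kbps
        self.reserved_kbps = 0


def admit_call(path, kbps):
    """Reserve kbps on every link in the path, or reserve nothing at all."""
    if any(link.reserved_kbps + kbps > link.capacity_kbps for link in path):
        return False                 # refuse the new call; existing calls are untouched
    for link in path:
        link.reserved_kbps += kbps
    return True


def release_call(path, kbps):
    """Give the bandwidth back when the call ends."""
    for link in path:
        link.reserved_kbps -= kbps


# Hypothetical path: a fast access link feeding a 1536 kbps trunk.
access, trunk = Link(10_000), Link(1536)
path = [access, trunk]
results = [admit_call(path, 384) for _ in range(5)]
print(results)   # four 384 kbps calls fit on the trunk, the fifth is refused
```

Contrast that with best effort, where the fifth call would be accepted and all five would degrade together.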

It was a different way of thinking about the network.

No, I Don’t Want iTunes Installed. You can quit asking.

I don’t like software vendors that try to sneak software onto my computers. I really don’t like software vendors that don’t pay attention to my requests to not run in the background at startup.

This evening I came home and saw the Apple Software Update popped up on my Vista desktop:

Problem one: iTunes is check marked by default. I don’t want iTunes. I don’t need iTunes. And I don’t like having software vendors try to sneak software onto my computers. This isn’t unique to Vista. Apple does the same thing on OS X. It’s annoying enough that I’ll probably uninstall Quicktime and throw away the $29 that I paid for it.

Problem two: I specifically instructed Apple’s Quicktime to not automatically update, and I specifically have disabled the Quicktime service from running at startup, but somehow it ran anyway.

I’ve also checked the Software Explorer in Windows Defender and the ‘Run’ registry keys for Apple related startup programs & didn’t find any. I’d sure like to know what’s triggering the Apple updater so I can nuke it.

Something makes me think that the only way I’ll get rid of this malware infestation is to search and destroy all Apple related registry keys.

Availability & SLA’s – Simple Rules

From theDailyWtf, a story about availability & SLA’s that’s worth a read about an impossible availability/SLA conundrum. It’s a good lead in to a couple of my rules of thumb.

“If you add a nine to the availability requirement, you’ll add a zero to the price.”

In other words, to go from 99.9% to 99.99% (adding a nine to the availability requirement), you’ll increase the cost of the project by a factor of 10 (adding a zero to the cost).

There is a certain symmetry to this. Assume that it’ll cost 20,000 to build the system to support three nines, then:

99.9 = 20,000
99.99 = 200,000
99.999 = 2,000,000
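The same rule as a few lines of Python, next to the downtime each level allows per year. The 20,000 baseline is just the assumed figure from above and the function names are made up; like the rule itself, this is for rough conversation, not for quoting.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def estimated_cost(nines, base_cost=20_000, base_nines=3):
    """Rule of thumb: every added nine multiplies the cost by ten."""
    return base_cost * 10 ** (nines - base_nines)


def downtime_minutes_per_year(nines):
    """Allowed downtime for an availability of n nines (3 -> 99.9%)."""
    return MINUTES_PER_YEAR / 10 ** nines


for nines in (3, 4, 5):
    print(nines, estimated_cost(nines), round(downtime_minutes_per_year(nines), 1))
```

Three nines leaves a bit under nine hours of outage a year, four nines leaves under an hour, and five nines leaves about five minutes.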

The other rule of thumb that this brings up is

Each technology in the stack must be designed for one nine more than the overall system availability.

This one is simple in concept. If the whole system must have three nines, then each technology in the stack (DNS, WAN, firewalls, load balancers, switches, routers, servers, databases, storage, power, cooling, etc.) must be designed for four nines. Why? ‘cause your stack has about 10 technologies in a serial dependency chain, and each one of them contributes to the overall MTBF/MTTR. Of course you can over-design some layers of the stack and ‘reserve’ some outage time for other layers of the stack, but in the end, it all has to add up.
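The arithmetic behind that is just multiplication. A quick Python check, assuming the layers fail independently and sit in a strict serial chain (both simplifications), with a made-up helper name:

```python
import math


def serial_availability(layers):
    """Availability of a chain where every layer must be up at once."""
    return math.prod(layers)


ten_layers_four_nines = serial_availability([0.9999] * 10)
ten_layers_three_nines = serial_availability([0.999] * 10)

print(f"{ten_layers_four_nines:.4%}")    # roughly three nines overall
print(f"{ten_layers_three_nines:.4%}")   # roughly two nines overall
```

Ten layers at four nines each multiply out to about 99.9% for the whole stack. Build each layer to only three nines and the stack lands near 99%.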

Obviously these are really, really, really rough estimates, but for a simple rule of thumb to use to get business units and IT work groups thinking about the cost and complexity of providing high availability, it’s close enough. When it comes time to sign the SLA, you will have to have real numbers.

Via The Networker Blog

More thoughts on availability, MTTR and MTBF:

NAC or Rootkit - How Would I know?

I show up for a meeting, flip open my netbook and start looking around for a wireless connection. The meeting host suggests an SSID. I attach to the network and get directed to a captive portal with an ‘I agree’ button. I press the magic button and get a security warning dialogue.

It looks like the network is NAC’d. You can’t tell that from the dialogue though. ‘Impulse Point LLC’ could be a NAC vendor or a malware vendor. How would I know? If I were running a rogue access point and wanted to install a root kit, what would it take to get people to run the installer? Probably not much. We encourage users to ignore security warnings.

Anyway – it was amusing. After I switched to my admin account and installed the ~~‘root kit’~~ service agent and switched back to my normal user, I got blocked anyway. I’m running Windows 7 RC without anti-virus. I guess NAC did what it was supposed to do. It kept my anti-virus free computer off the network.

I’d like someone to build a shim that fakes NAC into thinking I’ve got AV installed. That’d be useful.

Consulting Fail, or How to Get Removed from my Address Book

Here are some things that consultants do that annoy me.

Some consultants brag about who is backing their company or whom they claim as their customers. I’ve never figured that rich people are any smarter than poor people so I’m not impressed by consultants who brag about who is backing them or who founded their company. Recent ponzi and hedge fund implosions confirm my thinking. And it seems like the really smart people who invented technology 1.0 and made a billion are not reliably repeating their success with technology 2.0. It happens, but not predictably, so mentioning that [insert famous web 1.0 person here] founded or is backing your company is a waste of a slide IMHO.

I’m also not impressed by consultants who list [insert Fortune 500 here] as their clients. Perhaps [insert Fortune 500 here] has a world class IT operation and the consultant was instrumental in making them world class. Perhaps not. I have no way of knowing. It’s possible that some tiny corner of [insert Fortune 500 here] hired them to do [insert tiny project here] and they screwed it up, but that’s all they needed to brag about how they have [insert Fortune 500 here] as their customer and add another logo to their power point.

I’m really unimpressed when consultants tell me that they are the only ones who are competent enough to solve my problems or that I’m not competent enough to solve my own problems. One consulting house tried that on me years ago, claiming that firewalling fifty campuses was beyond the capability of ordinary mortals, and that if we did it ourselves, we’d botch it up. That got them a lifetime ban from my address book. They didn’t know that we had already ACL’d fifty campuses, and that inserting a firewall in line with a router was a trivial network problem, and that converting the router ACL’s to firewall rules was scriptable, and that I had already written the script.

I’ve also had consultants ‘accidentally’ show me ‘secret’ topologies for the security perimeters of [insert fortune 500 here] on their conference room white board. Either they are incompetent for disclosing customer information to a third party, or they drew up a bogus whiteboard to try to impress me. Either way I’m not impressed. Another lifetime ban.

Consultants who attempt to implement technology or projects or processes that the organization can’t support or maintain are another annoyance. I’ve seen people come in and try to implement processes or technologies that, although they might be what the book says or what everyone else is doing, aren’t going to fit the organization, for whatever reason. If the organization can’t manage the project, application or technology after the consultant leaves, a perceptive consultant will steer the client towards a solution that is manageable and maintainable. In some cases, the consultant obtained the necessary perception only after significant effort on my part with the verbal equivalent of a blunt object.

Recent experiences with a SaaS vendor annoyed me pretty badly when they insisted on pointing out how well their whole suite of products integrates, even after I repeatedly and clearly told them I was only interested in one small product, and they were on site to tell me about that product, and nothing else. “I want to integrate your CMDB with MY existing management infrastructure, not YOUR whole suite. Next slide please. <dammit!>”. Then it went downhill. I asked them what protocols they use to integrate their product with other products in their suite. The reply: a VPN. Technically they weren’t consultants though. They were pre-sales.

That’s not to say that I’m anti consultant. I’ve seen many very competent consultants who have done an excellent job. At times I’ve been extremely impressed.

Obviously I’ve also been disappointed.

Your Application is a Rotting Old Shack, Now What?

In response to A Shack in the Woods, Crumbling at the Core, colleague Jim Graves commented:

“…it only works if application owners are like long-term homeowners, not house flippers.”

Good point. Who cares if the shack gets a cheap paint job instead of a foundation and a comprehensive re-modeling? Will the business owner know or care? Do the contractors you hired care? Are you going to be around long enough to care? Are you and your employees, managers and consultants acting as house job flippers, painting over the flaws so you can update your resumes, take the profits and move on?

Jim asks:

“Are long-term employees more likely to care about problems that may happen five years from now? Are Highly Paid Consultants much less likely to?”

Good question. Suppose that I want to fix the shack. Maybe I’m tired of having to empty the buckets that catch the drips from the roof (or restart the J2EE app that runs itself out of database connections a couple times a week). If this repair is to be anything other than a paint-over, at least one element in the business owner-->employee-->consultant-->contractor chain will have to care enough about the application to ensure that the remodeling is oriented toward long term structural repairs and not just a paint-over. The other elements in the chain need to concur.

For much of what I host and/or manage, I’m on the second or third remodeling cycle. I’ve seen the consultants parachute in, slap a coat of paint on the turd and walk off with a half decade of my salary. I don’t like it. This puts me squarely in the camp of putting the effort into fixing foundations instead of slapping on paint and shingles. I’ve seen apps that have been around for ten years and two refresh cycles, have had ten million dollars spent on them and still have mold, rot and leaks from a decade ago. But they have a shiny new skin and a state of the art UI. For wireframes, UI models, usability studies and really nice PowerPoints? Spare no expense. Do they have a sane data model and even trivial application security? Not even close. Granite countertops are in scope, fixing the foundation is out of scope.

Things like that make me crabby.

For now I’ll assume that the blame lies with IT. Somewhere along the line the non-technical business owners have been led to believe that the shiny face and the UI are the application, and that the foundational elements (back end code, databases, servers, networks and security) are invisible, unimportant, or otherwise non-essential.

Resume Driven Design

Sam Buchanan, a long-time colleague, commenting on a consultant’s design for a small web application:

“I'm telling you: this app reeks of resume-driven design”

In ‘Your Application is a Rotting Old Shack’ I ~~whined~~ mused about applications that get face lifts while core problems get ignored. Let’s assume for a moment that business units finally figure out that their apps have a crumbling foundation and need structural overhauls. Assuming that internal resources don’t exist, how do we know that the consultants and contractors that we hire to design and build our systems aren’t more interested in building their resumes than our applications?

I’d like to think that I would be able to tell if a consultant tried to recommend an architecture or design that exists more to pad their resume than to solve my problems. It’s probably not that straightforward though. Consultants have motivations that may intersect with your needs, or they may have motivations that significantly deviate from what you need, and if their motivations are resume driven, there is a chance that you’ll end up with a design that helps someone’s resume more than it helps you.

Short term employees may share some of the same motivations. If they are using you to fill out their resume, you’d better have needs that line up with the holes in their resume. I’m pretty sure that ‘slogged through a decade old poorly written application, identified unused code and database objects’ or ‘documented and cleaned up an ad hoc, poorly organized, data model’ isn’t the first thing people want on their resume.

They probably want something shiny.

A Mime in a Box

Picking up on a thread by Andy the IT Guy, which of these things is not like the other?

A developer who doesn’t understand databases, networks or firewalls.
A system manager or DBA who doesn’t understand applications, networks and firewalls.
A firewall or network administrator who doesn’t understand operating systems and applications.
A mime in a box.

Trick question. They’re the same. The mime’s box is imaginary, as are the cross-disciplinary restrictions that we place on developers, system and network administrators.

In the example from Andy’s post, the developer didn’t understand the difference between an app installed on a desktop and an app installed on a server. Similarly, non-network people often don’t understand the critical difference between source and destination when an app server connects to a database.

For example, I often see this diagram:

showing an application updating a database, when from a network point of view, what we really need to see is:

showing the application making a network connection to the database. But that subtle difference doesn’t mean much unless the person understands firewalls. They’ll need to understand them though, because I’m going to do this:

If they don’t understand the difference between TCP and UDP, between Inbound and Outbound and between Source and Destination, that firewall is probably going to break things.
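To make the source/destination point concrete, here is a minimal sketch of how a firewall rule decides what gets through. Everything specific in it is an assumption made up for illustration: the addresses, the subnets, port 1521 as the database listener, and the `Rule`, `Connection` and `allowed` names. It is not the rule syntax of any real firewall product. It only shows that a rule matches on protocol, source, destination and destination port together, and that anything not explicitly permitted is dropped.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One 'permit' rule. All four fields must match."""
    protocol: str      # "tcp" or "udp"
    source: str        # CIDR block the connection may originate from
    destination: str   # CIDR block the connection may go to
    dest_port: int     # port the destination is listening on


@dataclass(frozen=True)
class Connection:
    """A new connection attempt, described from the initiator's side."""
    protocol: str
    source: str
    destination: str
    dest_port: int


def allowed(conn: Connection, rules: list[Rule]) -> bool:
    """Default deny: permit only if some rule matches every field.

    This models who may *initiate* a connection. Reply traffic on an
    established connection is assumed to be handled statefully.
    """
    for rule in rules:
        if (conn.protocol == rule.protocol
                and ip_address(conn.source) in ip_network(rule.source)
                and ip_address(conn.destination) in ip_network(rule.destination)
                and conn.dest_port == rule.dest_port):
            return True
    return False


# Hypothetical addresses: app servers on 10.0.1.0/24, one database host.
APP_SERVER = "10.0.1.10"
DATABASE = "10.0.2.20"

# The only thing permitted: app subnet -> database, TCP, listener port.
RULES = [Rule("tcp", "10.0.1.0/24", "10.0.2.20/32", 1521)]

app_to_db = Connection("tcp", APP_SERVER, DATABASE, 1521)
db_to_app = Connection("tcp", DATABASE, APP_SERVER, 1521)      # direction reversed
app_to_db_udp = Connection("udp", APP_SERVER, DATABASE, 1521)  # wrong protocol

print(allowed(app_to_db, RULES))      # True
print(allowed(db_to_app, RULES))      # False
print(allowed(app_to_db_udp, RULES))  # False
```

Same two hosts, same port, and two of the three attempts fail. If the application design quietly assumed the database would open a connection back to the app server, or that something ran over UDP, the firewall breaks it.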

This problem seems to occur nearly universally.

Let's call this System Management Principle #5:

Each technology specialist must understand enough about the adjoining technologies to design and build systems that make maximum use of those technologies.

(If you’ve got a better way of phrasing that, let me know.)

I’ve got System Management Principle #6, and now, with Andy’s help, principle #5. Someday we’ll dream up the rest of them.
Photo by B. Tse, found here.

Home Server Energy Consumption

I'm moving toward 'less is more', where 'less' is measured in watts. Right now my entire home entertainment and technology stack uses about 150 watts total (server + network + storage + Sun Rays + laptops + wall warts). I no longer use the stereo or television - that stack is unplugged and consuming zero energy, and I don’t have any watt sucking game consoles. My next iteration of home entertainment & technology should use about 25 watts for all servers and storage and about 20 watts for each user end point (laptop). The server and network should be the only devices that run continuously. End points should suspend and resume quickly and reliably so that no more than one is normally running at a time, so the net of all server, network and user devices should be under 50 watts.
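The arithmetic behind that target, as a quick sketch. The wattages are the ones quoted above; the function names are made up for illustration, and since I haven't itemized the network gear separately, the sketch treats it as whatever headroom is left under the 50 watt ceiling.

```python
SERVER_AND_STORAGE_WATTS = 25  # target for all servers and storage
ENDPOINT_WATTS = 20            # target for each laptop that is awake
BUDGET_WATTS = 50              # ceiling for the whole stack


def total_watts(endpoints_awake: int) -> int:
    """Draw of the always-on gear plus the endpoints currently awake."""
    return SERVER_AND_STORAGE_WATTS + ENDPOINT_WATTS * endpoints_awake


def headroom_watts(endpoints_awake: int) -> int:
    """What is left under the budget (network gear has to fit in here)."""
    return BUDGET_WATTS - total_watts(endpoints_awake)


for awake in (0, 1, 2):
    print(awake, total_watts(awake), headroom_watts(awake))
# 0 25 25
# 1 45 5
# 2 65 -15
```

With one laptop awake the stack comes in at 45 watts. A second laptop left running blows the budget, which is why reliable suspend and resume matters as much as the hardware choice.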

Expecting Stewardship Without Understanding

What are the consequences of building a society where we rely on technology that we don’t understand? Is lack of stewardship one of those consequences?

From Wayne Porter:

Most people no longer understand anything about the technology they use everyday and because of this ignorance many people use it without good stewardship. We drive cars we cannot fix, eat food we cannot make or produce, and many operate in an environment they do not understand with a false sense of security. We run and gun this technology with fuel that has probably reached its peak point.

Can we expect people who don’t understand a technology to be good stewards of the technology?
Should we expect application developers, who largely don’t understand relational databases, database security, firewalls or networks, to write applications that rationally utilize or properly protect those resources? Should we expect ordinary computer users, who understand almost nothing of how their computers work, to operate their computers in a manner that protects them and us from themselves and the Internet?

For some technologies (automobiles for example) we’ve almost completely given up on users understanding the technology well enough to make rational decisions and exhibit good stewardship. Drivers will never understand tire contact patches and slip angles, so we give them speed limits, ABS brakes, stability control, crumple zones and air bags. Drivers don’t understand engines and engine maintenance, so we give them idiot lights and dashboard messages. Drivers don’t understand the consequences of fossil fuel consumption, so we legislate minimum mileage and emission standards. We force drivers to be good stewards whether they like it or not.

Homeowners don’t understand strength of beams, dynamic wind loads and electricity’s propensity to escape to the ground via the path of least resistance, so we have building codes and permits, building inspectors, fire inspectors, licensed contractors and tradesmen to force homeowners into reasonable stewardship of their property.

On the other hand, most computer users don’t have even a basic understanding of how their computer works, yet we give them administrative access, allow them to install random software from the Internet, and then somehow expect them to keep their computer secure and functional. We expect them to be good stewards of the technology and not allow their home computers to be malware infested botnet nodes without them having even a vague understanding of how their computer works.

That’s probably not going to work.