Last In - First Out: July 2009

Quote Worth Remembering

"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."

Personal observations on the reliability of the Shuttle

by R. P. Feynman

Infrastructure – Security and Patching

An MRI machine hosting Conficker:

“The manufacturer of the devices told them none of the machines were supposed to be connected to the Internet and yet they were […] the device manufacturer said rules from the U.S. Food and Drug Administration required that a 90-day notice be given before the machines could be patched.”

Finding an unexpected open firewall hole or a device that isn’t supposed to be on the Internet is nothing new or unusual. If someone asked “what’s the probability that a firewall has too many holes” or “how likely is it that something got attached to the network that wasn’t supposed to be”, in both cases I’d say the probability is one.

Patching a machine that can’t be patched for 90 days after the patch is released is a pain. It’s an exception, and exceptions cost time and money.

Patching a machine that isn’t supposed to be connected to the Internet is a pain. I’m assuming that one would need to build a separate ‘dark net’ for the machines. I can’t imagine walking around with a CD and patching them.

Locating and identifying every operating system instance in a large enterprise is difficult, especially when the operating systems are packaged as a unit with an infrastructure device of some sort. Assuring that they all are patched is non-trivial. When vendors package an operating system (Linux, Windows) in with a device, they rarely acknowledge that you or they need to harden, patch, and update that operating system.

Major vendors have Linux and Windows devices that they refer to as ‘SAN Management Appliances’, ‘Enterprise Tape Libraries’, and ‘Management Consoles’. They rarely acknowledge that the underlying OS needs to be hardened and patched, and sometimes even prohibit customer hardening and patching. The vendor supplies a ‘turnkey system’ or ‘appliance’ and fails to manage the patches on the same schedule as the OS that they embedded into their ‘appliance’.

This isn’t a Microsoft problem. Long before Windows was considered fit to be used for infrastructure devices (building controls, IVR, HVAC, etc) hackers were routinely root kitting the Solaris and Linux devices that were running the infrastructure. We tend to forget that though.

Off Topic: Stadium Construction Resumes (Update)

UPDATE: The Italian authorities have responded to the exposure of the corruption surrounding the construction of the stadium. The Minister of Stadiums has investigated the mismanagement and corruption, issued a report, and taken corrective action. The corrupt and incompetent officials have been identified, prosecuted and punished. Additionally, the authorities have agreed to resolve other outstanding issues that have prevented the completion of the project.

Punishment

In an unusual turn of events, the authorities have issued severe consequences for the perpetrators of the scandal. Here you can see the gruesome results of this most effective punishment. The statues of the wives of the corrupt officials have been decapitated (NSFW).

Shocking though it may be, this extreme punishment, reserved exclusively for the most severe offenses, has a long tradition in Italian culture. Over the millennia, many statues of spouses of famous officials have been similarly mutilated and placed for exhibition in great air conditioned halls. In many cases admission is charged for viewing the mutilated statues.

Improved Security

To prevent a reoccurrence of the material thefts, guards have been hired to prevent the misappropriation of construction materials. In this exclusive undercover photo, taken at great risk by staff reporters of this publication, guards in uniform are secretly training for their role in the protection of the project. The Minister of Stadiums has assured us that the guards have been trained to defend the construction materials using all means available. Unnamed officials have confirmed that in extreme cases, guards may be authorized to defend the construction materials with realistic plastic swords.

To supplement the guards, the Ministry has installed special road materials on all paths leading from the construction site. These roads are cleverly designed to impart a vibration of a specific destructive frequency and amplitude to any ~~Chrysler~~ Fiat that passes over the roads. The ~~Chrysler~~ Fiats are not expected to traverse more than half way up the path before requiring the intervention of a mechanic, giving the authorities sufficient time to apprehend the thieves.

Resumption of Construction

The security measures have been deemed sufficient to permit the resumption of construction on the stadium. New groups of highly skilled workers are restarting the project. These workers have been specially recruited from all parts of the world, arriving daily in great numbers by airplane, bus, car and train. As you can see in the photograph to the right, the Italian authorities have spared no expense in this area.

As shown in the photo below, the new workers have made great progress in the very short amount of time since the resumption of construction. Scaffolding has already been erected and workers have restarted construction on the western facade.

Authorities have assured this reporter that the problems have been resolved and are optimistic that the new workers will make rapid progress on the stadium.

Enhancements to the Original Design

Additionally, according to stadium officials, significant improvements will be made to the original design. The changes are specifically designed to improve the usability and assure a steady revenue stream. Assurances have been given that the additions to the project scope will not increase the cost of the project nor delay its completion.

In the first enhancement, a new steel playing field is being built to replace the original wood and masonry field. The new field is a significant improvement to the original field and is expected to last as long as the construction project. The new playing field, shown here, is already partially constructed. Authorities claim the new field will result in less severe injuries and fewer weather related event cancellations. Additionally, the new playing field will permit the display of banner ads directly on the surface of the field.

As with modern stadiums, special seating will be built for sponsors and executives. A premium will be charged for the seating and the proceeds will be used to finance the completion of the stadium. The special seating is expected to generate significant revenue toward the completion of the stadium. Pictured here is the entrance to one of the reserved sections. Notice that the executives and sponsors will be separated from the ordinary visitors by modern, unobtrusive security systems. Studies have shown that separate seating for sponsors and executives will result in higher revenue for the normal seating sections, as the plebeian attendance will not be adversely affected by the antics of the upper classes.

In the finest of classical Roman tradition, the perimeter of the stadium will be ringed with great works of art specially commissioned for this project. Authorities have commissioned a statue for each entry gate IV through XXIX.

One such work, shown here, represents a French interpretation of a Roman copy of a Greek original of the goddess Giunone, despondent over the transgressions of her consort Jupiter. The statue is completed and is ready to be moved to its place near gate XXIV.

Other great works of art are ready for installation. Here you see a French copy of a Roman interpretation of a Greek original of the same goddess Giunone, after Jupiter came home from work and discovered her in a compromising position with an unnamed consort. This statue is scheduled to be installed near gate XVIII.

To demonstrate the sincerity of the officials, the Ministry of Stadiums has permitted an exclusive inside tour of the factory that produces the great works of art. Below is a photograph of the nursery where the great statues are grown. Viewing from right to left, you can clearly see the maturation of the statues from larvae through pupae, nymph and bewilderment stages.

Completion of the Stadium

Unnamed officials of the Ministry of Stadiums have confirmed that the newly revised schedule for the completion of construction is uncharacteristically aggressive. The officials have assured this reporter that although they have specified 64 bit time_t structs for the project management software, the project will be completed before they are required. Project specifications listed Unix 2038 time compatibility as a requirement only because of EU regulations.

Conclusion

There has been much written about the demise of mainstream newspapers and the effect on the future of investigative reporting. By the example of this report, one should rest assured that the new media stand ready to expose and document the corruption, inefficiencies and transgressions of governments throughout the world.

Oh – and if you haven’t figured it out yet, it’s a joke.

Off Topic: Stadium Construction Scandal

In the center of Rome are the remnants of a large stadium. Tradition tells that the stadium was completed during the period of the Roman Empire and allowed to decay, unmaintained, during the centuries since construction.

This photograph, taken with a special filter and timed exactly as the planets Venus and Mars intersected a polyline bounded by the vertices of the tops of the arches and extending into space, clearly shows that we have been misled about the true origins of the facility. Careful analysis of the photo shows that the stadium is not a decaying ghost of a once great stadium, but rather it is an uncompleted, scandal-ridden construction project gone bad.

By normal Italian standards, construction projects of this nature typically take decades. Corruption and mis-management are assumed, delays are inevitable. But even by those standards, after nearly two millennia the stadium should have been completed. To give a comparative example, the construction on the shopping center in the photograph shown here was started at about the same time as the stadium, and as you can see, today it is nearly 90% complete.

How could it be that after nearly two millennia of construction, the stadium is not yet complete? Through careful translation of long lost documents and inscriptions on pottery shards, the story of the stadium can finally be told.

The Scandal

The Emperor Flaviodius started building the stadium in 72 AD. Over time it became clear that the Emperor was not particularly adapted to managing large projects. Little did the emperor know that this would someday be the largest construction scandal in the western world.

Shortly after construction began, the ever generous Emperor Flaviodius made provisions for the entertainment of the construction workers. A large circus (Circus Maximus) was built near the site of the stadium. In the premier act of the circus, specially trained Christians performed great feats of daring with hungry lions. The entertainment was such that the workers spent much time in amusement and very little time working on the stadium. After a period of time, Flaviodius discontinued the entertainment, unfortunately without formal consultation with the workers’ bargaining units. The workers responded with a decades-long work slowdown, causing delays and cost overruns. Flaviodius eventually compensated the workers for the missing entertainment and work resumed.

Scandal again disrupted the schedule shortly after construction resumed. Emperor Flaviodius had to be removed from the project after a late night altercation with visiting Goths (Visigoths). The Emperor, shown above just after the altercation, suffered a broken nose and a bruised chin. The visiting Goths appear to the right in an undated photo, apparently unaware that what seemed to be a minor incident has delayed the largest construction project in Rome. The presence of the Goths, who are culturally averse to large structures, appears to have caused a work stoppage lasting several centuries.

As with many large projects, revisions to the plans were frequent and seemingly random. Ancient sources indicate that later project managers authorized changes to the shape, orientation, color and number of tiers of seating. The scope changes resulted in a series of project extensions, forcing significant re-work and lost time. Additionally, the project documentation requirements were such that handwritten documentation was impossible to maintain, thereby bringing the project to a standstill until the printing press could be invented.

A major problem throughout the construction was the theft of building materials. When one walks through Rome today, one sees fragments of brick and marble originally purchased for the stadium randomly incorporated into the foundations of other, newer buildings. While theft is common in building projects, in this case it appears to have been on a grand scale. The building material theft caused the construction to stall for much of the period that we call the middle ages. The photograph above shows an example, even today, of construction materials lying about completely unguarded.

Further delays were apparently caused by a miscommunication between the powerful Italian construction worker unions and the authorities. Unnamed sources indicated that although the union declared a strike, the authorities failed to receive the notification due to the fact that the postal union was also on strike. Because the work on the construction had not noticeably slowed during the strike, it was several hundred years before the authorities noticed the walkout. Negotiations have yet to be restarted.

The investigation continued with calls and e-mails to unnamed officials. When presented with incontrovertible evidence of the scandal, few officials were willing to acknowledge the corruption, mismanagement, tangente and incompetence, fearing that the resulting scandal would jeopardize their pension benefits.

Stay tuned for more information as the scandal of the stadium unfolds.

nplus1.org – A Crash Course in Failure

One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.

In A Crash Course in Failure, Craig Stuntz discusses the concept of building crash only software – or software for which a crash and a normal shutdown are functionally equivalent.

Highlights:

“Hardware will fail. Software will crash. Those are facts of life.”
"…if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself."
"…maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts."
“…it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.”

Why shouldn't continuous and automatic state saving be the default for any/all applications? A CAD system I bought in 1984 did exactly that. If the system crashed or terminated abnormally, the post-crash reboot would do a complete 'replay' of every edit since the last normal save. In fact you'd have to sit and watch every one of your drawing edits in sequence like a VCR on fast forward, a process that was usually pretty amusing in a Keystone Cops sort of way. It can't be that hard to write serialized changes to the end of the document & only re-write the whole doc when the user explicitly saves the doc, or to journal every change to another file. That CAD system did it twenty-five years ago on a 4 MHz CPU and 8" floppies. Some applications are at least attempting to gracefully recover after a crash, a step in the right direction. It certainly is not any harder than what Etherpad does, and they are doing it multi-user, real time, on the Internet.
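To make the 'journal every change to another file' idea concrete, here is a minimal Python sketch. It is not how that CAD system or Etherpad works; the `JournaledDocument` class, the JSON-lines record format and the insert/delete edit model are all assumptions made up for illustration. Every edit is appended and synced to the journal before it is applied in memory, and startup always replays the journal, so a crash and a normal shutdown look the same to the next start.

```python
import json
import os


class JournaledDocument:
    """A list of text lines whose every edit is journaled to disk (hypothetical design)."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.lines = []
        self._replay()  # same code path after a crash or a clean exit
        self._journal = open(journal_path, "ab")

    def _replay(self):
        """Rebuild state from the journal, dropping a torn final record."""
        if not os.path.exists(self.journal_path):
            return
        good_bytes = 0
        with open(self.journal_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # record was cut off mid-write
                try:
                    record = json.loads(raw.decode("utf-8"))
                except ValueError:
                    break  # unreadable record; stop replaying here
                self._apply(record)
                good_bytes += len(raw)
        # Cut off anything after the last good record so new appends stay clean.
        with open(self.journal_path, "r+b") as f:
            f.truncate(good_bytes)

    def _apply(self, record):
        if record["op"] == "insert":
            self.lines.insert(record["index"], record["text"])
        elif record["op"] == "delete":
            del self.lines[record["index"]]

    def edit(self, op, index, text=None):
        """Validate, journal, sync to disk, then apply the edit in memory."""
        limit = len(self.lines) + (1 if op == "insert" else 0)
        if op not in ("insert", "delete") or not 0 <= index < limit:
            raise ValueError(f"bad edit: {op} at {index}")
        record = {"op": op, "index": index, "text": text}
        self._journal.write(json.dumps(record).encode("utf-8") + b"\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())
        self._apply(record)

    def close(self):
        self._journal.close()
```

An explicit 'save' in this scheme would just write out the full document and start a fresh journal; nothing the user typed depends on it.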

“Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures. … If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge.” -- Michael Nygard

References:
A Crash Course in Failure, Craig Stuntz
Design your Failure Modes, Michael Janke
'Everything will ultimately fail', Michael Nygard

Error Handling – an Anecdote

A long time ago, shortly after the University I was attending migrated students off of punch cards, I had an assignment to write a batch based hotel room reservation program. We were on top of the world - we had dumb terminals instead of punch cards. The 9600 baud terminals were reserved for professors, but if you got lucky, [WooHoo!] you could get one of the 4800 baud terminals instead of a 2400 or 1200 baud DECwriters.

The instructor's mantra - I'll never forget - is that students need to learn how to write programs that gracefully handle errors. 'You don't want an operator calling at 2am telling you your program failed. That sucks.' He was a part time instructor and full time programmer who got tired of getting woken up, and he figured that we needed our sleep, so he made robustness part of his grading criteria.

Here's how he made that stick in my mind for 30 years: When the assignment was handed to us, the instructor gave us the location of sample input data files to use to test our programs. The files were usually laced with data errors. Things like short records, missing fields and random ASCII characters in integer fields were routine, and we got graded on our error handling, so students quickly learned to program with a healthy bit of paranoia and lots of error checking.

That was a great idea and we learned fast. But here's how he caught us all: A few hours before the assignment was due, the instructor gave us a new input file that we had to process with our programs, the results of which would determine our grade.

What was in the final data file?

……[insert drum roll here]……

Nothing. It was a zero byte file.

Try to picture this - the data wasn’t available until a couple hours before the deadline, it was a frantic dash to get a terminal (long lines of students on most days, especially at the end of the semester), edit the source file to gracefully handle the error and exit (think ‘edlin’ or ‘ed’ ), submit it into the batch queue for the compiler (sometimes that queue was backed up for an hour or more) and re-run it against the broken data file, all by the deadline.

How many students caught that error the first time? Not many, certainly not me. My program crashed and I did the frantic thing. The rest of the semester? We all had so damned many paranoid if-thens in our code you'd probably laugh if you saw it.

He was teaching us to think about building robust programs - to code for what goes wrong, not just what goes right. For him this was an availability problem, not a security problem. But what he taught is relevant today, except the bad guys are feeding your programs the data, not your instructor. That makes it a security problem.
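For anyone who wants to see what those paranoid if-thens look like today, here is a minimal Python sketch. It is not the original assignment: the `load_reservations` name and the three-field `room,nights,guest` record layout are assumptions for illustration. The zero byte file gets its own check, and short records, junk in integer fields and stray non-ASCII bytes are reported instead of crashing the run.

```python
import os


def load_reservations(path):
    """Return (records, errors) instead of crashing on bad input.

    Assumed record layout (hypothetical): room,nights,guest
    """
    records, errors = [], []
    try:
        size = os.path.getsize(path)
    except OSError as exc:
        return records, [f"{path}: cannot read file ({exc})"]
    if size == 0:
        # The instructor's final test case.
        return records, [f"{path}: file is empty (zero bytes)"]

    # Non-ASCII bytes become a replacement character and fail the checks below.
    with open(path, "r", encoding="ascii", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            fields = [x.strip() for x in line.split(",")]
            if len(fields) != 3:
                errors.append(f"line {lineno}: expected 3 fields, got {len(fields)}")
                continue
            room, nights, guest = fields
            if not (room.isdigit() and nights.isdigit()):
                errors.append(f"line {lineno}: room and nights must be integers")
                continue
            if int(nights) == 0 or not guest:
                errors.append(f"line {lineno}: nights must be positive and guest non-empty")
                continue
            records.append((int(room), int(nights), guest))
    return records, errors
```

The caller decides what to do with the error list: log it, exit with a non-zero status, or carry on with the good records. The point is that the decision is made on purpose.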

I can't remember the operating system or platform (PDP-something?), I can't remember the language (Pascal, I think, but we learned SNOBOL and FORTH in that class too, so it could have been one of those), but I'll never forget that !@$%^# zero byte file!

Sometimes Hardware is Cheaper than Programmers

In Hardware is Expensive, Programmers are Cheap II I promised that I’d give an example of a case where hardware is cheap compared to designing and building a more efficient application. That post pointed out a case where a relatively small investment in program optimization would have paid itself back by dramatic hardware savings across a small number of the software vendor’s customers.

Here’s an example of the opposite.

Circa 2000/2001 we started hosting an ASP application running on x86 app servers with a SQL server backend. The hardware was roughly 1GHz/1GB per app server. Web page response time was a consistent 2000ms. Each app server could handle no more than a handful of page views per second.

By 2004 or so, application utilization grew enough that the page response time and the scalability (page views per server per second) were both considered unacceptable. We did a significant amount of investigation into the application, focusing first on the database, and then on the app servers. After a week or so of data gathering, we determined that the only significant bottleneck was a call to an XSLT/XML transformation function. The details escape me – and aren’t really relevant anyway, but what I remember is that most of the page response time was buried in that library call, and that call used most of the app server CPU. Figuring out how to make the app go faster was pretty straightforward.

The app servers were CPU bound on a single library call.
The library wasn’t going to get re-written or optimized with any reasonable work effort. (If I remember correctly, it was a Microsoft provided library; the software developers’ only option would have been a major re-write.)
The servers were somewhere around 4 years old and due for a routine replacement.
The new servers would clock 3x as fast, have better memory bandwidth and larger caches. The CPU bound library call would likely scale with processor clock speed, and if it fit in the processor cache might scale better than clock.

Conclusion: Buy hardware. In this case, two new app servers replaced four old app servers, the page response time improved dramatically, and the page views per server per second went up enough to handle normal application growth. It was clear that throwing hardware at the problem was the simplest, cheapest way to make it go away.

In The Quarter Million Dollar Query I outlined how we attached an approximate dollar cost to a specific poorly performing query. “The developers - who are faced with having to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - likely will favor the former over the latter.”

Unless of course they have data comparing hardware, software licenses and hosting costs to their development costs. My preference is to express the operational cost of solving a performance problem in ‘programmer-salaries’ or ‘programmer-months’. Using units like that helps bridge the communication gap.

My conclusion in that post: “To properly prioritize the development work effort, some rational measurement must be made of the cost of re-working existing functionality to reduce [server or database] load versus the value of using that same work effort to add user requested features.”
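The 'programmer-months' unit is simple enough to sketch in a few lines of Python. Everything here is an assumption for illustration: the function names are made up, and the dollar figures in the usage note are placeholders, not numbers from either of the cases described in these posts.

```python
def programmer_months(dollars, loaded_monthly_cost):
    """Express a dollar cost in programmer-months of salary and overhead."""
    return dollars / loaded_monthly_cost


def cheaper_option(hardware_dollars, rework_months, loaded_monthly_cost):
    """Compare buying hardware against re-working the code, in the same unit.

    Returns the cheaper option and the hardware cost in programmer-months.
    """
    hardware_months = programmer_months(hardware_dollars, loaded_monthly_cost)
    choice = "hardware" if hardware_months < rework_months else "rework"
    return choice, hardware_months
```

With placeholder numbers, `cheaper_option(20000, 6, 10000)` says the hardware costs 2 programmer-months against a 6 month re-write, so buy hardware. `cheaper_option(250000, 3, 10000)` says the hardware, licenses and hosting cost 25 programmer-months against a 3 month fix, so fix the code.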

The Quarter Million Dollar Query
Hardware is Expensive, Programmers are Cheap
Hardware is Expensive, Programmers are Cheap II

Cisco IOS hints and tricks: What went wrong: end-to-end ATM

I enjoy reading Ivan Pepelnjak's Cisco IOS hints and tricks blog. Having been a partner in a state wide ATM wide area network that implemented end to end RSVP, his thoughts on What went wrong: end-to-end ATM are interesting.

I can't figure out how to leave a comment on his blog though, so I'll comment here:

I'd add a couple more reasons for ATM's failure.

(1) Cost. Host adapters, switches and router interfaces were more expensive. ATM adapters used more CPU, so larger routers were needed for a given bandwidth.

(2) Complexity, especially on the LAN side. (On a WAN, ATM isn't necessarily more complex than MPLS for a given functionality. It might even be simpler).

(3) 'Good enough' QOS on ethernet and IP routing. Inferior to ATM? Yes. Good enough? Considering the cost and complexity of ATM, yes.

Ironically, core IP routers maintain a form of session state anyway (CEF).

On an ATM wide area network, H.323 video endpoints would connect to a gatekeeper and request a bandwidth allocation for a video call to another endpoint (384kbps for example). The ATM network would provision a virtual circuit and guarantee the bandwidth and latency end to end. There was no 'best effort'. If bandwidth wasn't available, rather than allowing new calls to overrun the circuit and degrade existing calls, the new call attempt would fail. If a link failed, the circuit would get re-routed at layer 2, not layer 3. Rather than band-aid-add-on QoS like DSCP and priority queuing, ATM provided reservations and guarantees.
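The 'new call fails rather than degrading existing calls' behavior is easy to show as a toy model in Python. This is only a sketch of the admission idea, not ATM signaling or anything a real switch or gatekeeper runs; the `Link` class, the function names and the link capacities in the usage note are all assumptions for illustration.

```python
class Link:
    """One hop in the path, with a fixed capacity and per-call reservations."""

    def __init__(self, name, capacity_kbps):
        self.name = name
        self.capacity_kbps = capacity_kbps
        self.reserved = {}  # call_id -> reserved kbps

    def available_kbps(self):
        return self.capacity_kbps - sum(self.reserved.values())


def admit_call(call_id, path, kbps):
    """Reserve kbps on every link in the path, or reserve nothing at all."""
    if any(link.available_kbps() < kbps for link in path):
        return False  # call setup fails; existing reservations are untouched
    for link in path:
        link.reserved[call_id] = kbps
    return True


def release_call(call_id, path):
    """Give the bandwidth back when the call ends."""
    for link in path:
        link.reserved.pop(call_id, None)
```

With a hypothetical path of a 1000kbps link and a 768kbps link, two 384kbps calls are admitted and the third is refused until one of the first two hangs up. Compare that with best effort, where the third call would be let in and all three would suffer.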

It was a different way of thinking about the network.