Thursday, December 31, 2009


The future:

Cameras will be ubiquitous. Storage will be effectively infinite. CPU processing power will be effectively infinite. Cameras will detect a broad range of the electromagnetic spectrum. The combination of cameras everywhere and infinite storage will inevitably result in all persons being under surveillance all the time. When combined with infinite processor power and recognition software, it will be impossible for persons to move about society without being observed by their government.

All governments eventually are corrupted, and when corrupted will misuse the surveillance data. There is no particular reason to think that this is political-party or left/right specific. Although it is currently fashionable to think that the right is evil and the left is good, there is no reason to think this will be the case in the future. The only certainty is that the party in power will misuse the data to attempt to control their ‘enemies’, whomever they might perceive them to be at the time.

Surveillance advocates claim that cameras are simply an extension of law enforcement’s eyes and therefore are not a significant new impingement on personal freedom.

I disagree.

Here’s how I’d build a surveillance system that uses technology to maximize law enforcement effectiveness, yet provides reasonable controls on the use of surveillance against the population as a whole.

  • The cameras are directly connected to a control room of some sort. The control room is monitored by sworn, trained law enforcement officers. The officers watch the monitors.
So far this is ordinary surveillance. Here’s how I’d protect individual privacy:
  • The locations of the cameras are well known.
  • All cameras record to volatile memory only. The capacity of the volatile memory is small, on the order of one hour or so. Unless specific action is taken, all recorded data more than one hour old is automatically and irretrievably lost. A ring buffer of some sort.
  • If a sworn officer sees a crime, the sworn officer may switch specific cameras to non-volatile storage. The action to switch a camera from volatile to non-volatile storage is deliberate and only taken when an officer sees specific events that constitute probable cause that a crime is being committed, or when a crime has been reported to law enforcement. Each instance of the use of non-volatile storage is recorded, documented and discoverable by the general public using some well-defined process.
  • Once a camera is switched to non-volatile storage, it automatically reverts to volatile storage after a fixed time period (one hour, for example), unless the sworn officer repeatedly toggles the non-volatile switch on the camera.
  • The non-volatile storage automatically expires after a fixed amount of time (24 hours, for example). If law enforcement believes that a crime has occurred and that the video will be evidence in the crime, law enforcement obtains a court order to retain the video evidence and move it to permanent storage. The court order must be for a specific crime and must name specific cameras and times.
  • When a court so orders, the video is moved from non-volatile storage to whatever method law enforcement uses for retaining and managing evidence. If the court order is not obtained within the non-volatile expiration period, the video is irretrievably deleted. If the court order is obtained, the video becomes subject to whatever rules govern evidence in the legal jurisdiction of the cameras.
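The volatile/non-volatile scheme above can be sketched as a small state machine. This is a minimal illustration only; the class, constant names and time windows are hypothetical, not a real system:

```python
import time
from collections import deque

VOLATILE_WINDOW = 3600      # ring buffer holds roughly one hour of frames
NONVOLATILE_TOGGLE = 3600   # an officer's toggle expires after one hour
NONVOLATILE_EXPIRY = 86400  # recordings expire after 24h without a court order

class Camera:
    def __init__(self, camera_id):
        self.camera_id = camera_id
        self.ring = deque()       # (timestamp, frame) pairs, volatile
        self.nonvolatile = []     # (timestamp, frame) pairs, retained
        self.retain_until = 0     # end of the current officer-initiated window
        self.audit_log = []       # every toggle is recorded and discoverable

    def record(self, frame, now=None):
        now = now if now is not None else time.time()
        if now < self.retain_until:
            self.nonvolatile.append((now, frame))
        else:
            self.ring.append((now, frame))
            # anything older than the window is irretrievably dropped
            while self.ring and self.ring[0][0] < now - VOLATILE_WINDOW:
                self.ring.popleft()

    def toggle_nonvolatile(self, officer, now=None):
        """Deliberate action by a sworn officer; logged and discoverable."""
        now = now if now is not None else time.time()
        self.retain_until = now + NONVOLATILE_TOGGLE
        self.audit_log.append((now, officer, self.camera_id))
        # the current ring buffer contents are preserved as well
        self.nonvolatile.extend(self.ring)
        self.ring.clear()

    def expire(self, now=None):
        """Without a court order, non-volatile recordings expire after 24h."""
        now = now if now is not None else time.time()
        self.nonvolatile = [(t, f) for (t, f) in self.nonvolatile
                            if t > now - NONVOLATILE_EXPIRY]
```

The key property is that retention is the exception, not the default: unless an officer acts and a court order follows, every path leads to deletion.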

In the case of a 9/11 or 7/7 type of event, the officer would simply toggle all cameras to non-volatile mode and would continue to re-enable non-volatile mode every hour for as long as necessary (days, if necessary). The action of toggling the cameras would again be recorded, documented and discoverable.

To prevent the system from being subverted by corrupt law enforcement (think J. Edgar Hoover and massive illegal surveillance), the systems would be physically sealed, and the software and storage for both the volatile and non-volatile recordings would be unavailable to law enforcement.

There would be some form of crypto/hash/signing that enables tracking the recordings back to a specific camera and assures that the recordings have not been altered by law enforcement.
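One minimal way to sketch that integrity scheme, assuming a per-camera secret key held in sealed hardware (HMAC here for brevity; a real system would more likely use public-key signatures so anyone can verify without being able to forge):

```python
import hmac
import hashlib

def sign_segment(camera_key: bytes, camera_id: str, timestamp: int,
                 video_bytes: bytes) -> str:
    """Produce a MAC binding a recording to a specific camera and time.

    camera_key would live in sealed hardware inside the camera, never
    available to law enforcement; verification would be performed by an
    independent custodian holding the same key.
    """
    message = camera_id.encode() + timestamp.to_bytes(8, "big") + video_bytes
    return hmac.new(camera_key, message, hashlib.sha256).hexdigest()

def verify_segment(camera_key: bytes, camera_id: str, timestamp: int,
                   video_bytes: bytes, tag: str) -> bool:
    """True only if the recording is unaltered and from the claimed camera."""
    expected = sign_segment(camera_key, camera_id, timestamp, video_bytes)
    return hmac.compare_digest(expected, tag)
```

Any single flipped bit in the video, the camera ID, or the timestamp invalidates the tag, which is exactly the "has not been altered by law enforcement" guarantee.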

The key concepts are:

  1. the system defaults to automatically destroying all recordings.
  2. a sworn officer of the law must observe an event before triggering non-volatile storage.
  3. specific actions are required to store the recordings.
  4. those actions are logged, documented and discoverable.
  5. a court action of some sort is required for storage of any recording beyond a short period of time.
  6. the system would be tamper-proof. The act of law enforcement tampering with the systems to defeat the privacy controls would be a felony.
  7. the system would maintain the integrity of the recordings for as long as the video exists.

And most importantly, the software would be open source.

EPIC Video Surveillance and Wikipedia contain quite a few other thoughts on surveillance.

Thursday, December 3, 2009

IP addresses for well known services

I don't have an opinion (yet) on Google's new DNS service, but I have to admit that they snagged some pretty cool IP addresses for their public resolvers: 8.8.8.8 and 8.8.4.4.

DNS is one of the few services that benefits from a memorizable IP address.

Google chose wisely.

Tuesday, November 24, 2009

Cargo Cult System Administration

“imitate the superficial exterior of a process or system without having any understanding of the underlying substance” --Wikipedia

During and after WWII, some native South Pacific islanders erroneously associated the presence of war-related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers and wait for cargo to fall from the sky.

The question is, how amusing are we?

We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?

Here are some common system administration failures that might be ‘cargo cult’:

Failing to understand the difference between necessary and sufficient. A daily backup is necessary, but it may not be sufficient to meet RPO and RTO requirements.

Failing to understand the difference between causation and correlation.[4] Event A may have caused Event B, or some third event may have caused A and B, or the two events may be unrelated and coincidental.

Failing to understand the difference between cause and effect.

Following a security recipe without understanding the risks you are addressing. If you don't understand how hackers infiltrate your systems and exfiltrate your data, then your DLP, firewalls, IDS, SIEM, etc. are cargo cult. You've built the superficial exterior of a system without understanding the underlying substance. If you do understand how your systems get infiltrated, then you'll probably consider simple controls like database and file system permissions and auditing as important as expensive, complex packaged products.

Asserting that [Technology O] or [Platform L] or [Methodology A] is inherently superior to all others and blindly applying it to all problems. When you make such claims, are you applying science or religion?

Systematic troubleshooting is one of the hardest parts of system management and often the first to 'go cargo'. Here are some examples:

Treating symptoms, not causes. A reboot will not solve your problem. It may make the problem go away for a while, but your problem still exists. You've addressed the symptom of the problem (memory fragmentation, for example), not the cause of the problem (a memory leak, for example).

Troubleshooting without a working hypothesis.

Changing more than one thing at a time while troubleshooting. If you make six changes and the problem goes away, how will you determine root cause? Or worse, which of the six changes will cause new problems at a future date?

Making random changes while troubleshooting. Suppose you have a problem with an (application|operating system|database) and you hypothesize that changing a parameter will resolve the problem, so you change the parameter. If the problem recurs, your hypothesis was wrong, right?

Troubleshooting without measurements or data.

Troubleshooting without being able to recreate the problem.

Troubleshooting application performance without a benchmark to compare performance against. If you don’t know what’s normal, how do you know what’s not normal?
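The "know what's normal" point can be made concrete with a trivial baseline check. This is illustrative only; the function name and the three-sigma threshold are assumptions, not a recommendation:

```python
import statistics

def is_abnormal(samples, baseline, threshold=3.0):
    """Flag a set of response-time samples that deviates from the baseline.

    baseline is a list of response times recorded while the application
    was known to be healthy. Without it ("if you don't know what's
    normal") there is nothing to compare against.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    # flag when current mean latency exceeds baseline mean by N std devs
    return statistics.mean(samples) > mean + threshold * stdev
```

The point isn't the statistics; it's that the baseline must exist before the outage, not after.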

Blaming the (network|firewall|storage) without analysis or a hypothesis that points to either. One of our application vendors insisted that the 10 Mbps of traffic on a 100 Mbps interface was the cause of the slow application, and that we needed to upgrade to GigE. We upgraded it (overnight), just to shut them up. Of course it didn't help. Their app was broke.

Blaming the user or the customer, without an analysis or hypothesis that points to them as the root cause. A better plan would be to actually find the problem and fix it.

Declaring that the problem is fixed without determining the root cause. If you don't know the root cause, but the problem appears to have gone away, you haven't solved the problem, you've only observed that the problem went away. Don't worry, it'll come back, just after you’ve written an e-mail to management describing how you’ve “fixed” the problem.

It's easy to fall into cargo cult mode.

Just re-boot it, it'll be fine.

[1] Richard Feynman, Cargo Cult Science:
[2] Mike Speiser, Cargo Cult Management:
[3] Wikipedia: Cargo Cult Programming:
[4] L. Kip Wheeler, Correlation and Causation:

Saturday, November 14, 2009

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:
“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures.  Very few do and I see related down-time in the news every month or so.....We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”

I've had high visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan is something as simple as a manual intervention to disable features in a controlled manner rather than letting them fail in an uncontrolled manner.

On some applications we've been able to plan and execute graceful service degradation by disabling non-critical features. In one case, we disabled a scheduling widget in order to maintain sufficient headroom for more important functions like quizzing and exams, in other cases, we have the ability to limit the size of shopping carts or restrict financial aid and grade re-calcs during peak load.

Degraded operations isn't just an application layer concept. Network engineers routinely build forms of degraded operations into their designs. Networks have been congested since the day they were invented, and as you'd expect, the technology available for handling degraded operations is very mature. On a typical network, QOS (Quality of Service) policy and configuration is used to maintain critical network traffic and shed non-critical traffic.

As an example, on our shared statewide backbone, we assume that we'll periodically end up in some sort of degraded mode, either because a primary circuit has failed and the backup paths don't have adequate bandwidth, because we experience inbound DOS attacks, or perhaps because we simply don't have adequate bandwidth. In our case, the backbone is shared by all state agencies, public colleges and universities, including state and local law enforcement, so inter-agency collaboration is necessary when determining what needs to get routed during a degraded state.

A simplified version of the traffic priority on the backbone, from highest to lowest:

  • Router traffic (BGP, OSPF, etc.)
  • Law enforcement
  • Interactive video
  • Intra-state data
  • Commodity Internet data

When the network is degraded, we presume that law enforcement traffic should be near the head of the queue. We consider interactive video conferencing to be business critical (i.e. we have to cancel classes when interactive classroom video conferencing is broke), so we keep it higher in the priority order than ordinary data. We have also decided that commodity Internet should be the first traffic to be discarded when the network is degraded.
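The shed-lowest-priority-first behavior described above can be sketched as a strict-priority queue. The class names and the drop rule here are simplified assumptions for illustration, not our actual router configuration:

```python
import heapq

# Priority classes from the backbone policy, highest (0) to lowest (4).
PRIORITY = {"routing": 0, "law_enforcement": 1, "video": 2,
            "intrastate": 3, "internet": 4}

class PriorityScheduler:
    """Strict-priority queue that sheds the least important traffic first."""
    def __init__(self, capacity):
        self.capacity = capacity  # queue depth before we start shedding
        self.heap = []            # (priority, seq, packet) tuples
        self.seq = 0              # preserves FIFO order within a class

    def enqueue(self, traffic_class, packet):
        prio = PRIORITY[traffic_class]
        if len(self.heap) >= self.capacity:
            worst = max(self.heap)   # numerically largest = least important
            if prio >= worst[0]:
                return False         # new packet is no better: drop it
            self.heap.remove(worst)  # degraded mode: shed low-priority work
            heapq.heapify(self.heap)
        heapq.heappush(self.heap, (prio, self.seq, packet))
        self.seq += 1
        return True

    def dequeue(self):
        # highest-priority (lowest number) packet always goes first
        return heapq.heappop(self.heap)[2] if self.heap else None
```

Real routers implement this in hardware with far more nuance (weighted fairness, per-class buffers), but the policy decision is the same: someone has to decide, in advance, whose packets get dropped.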

Unfortunately, on the part of the application stack that's hardest to scale, the database, there is no equivalent to network QOS or traffic engineering. As far as I know, I don't have the ability to tag a query or stored procedure with a few extra bits that tell the database engine to place the query at the head of the work queue, discarding other less important work if necessary. It's not hard to imagine a 'discard eligible' bit that could be set on certain types of database processes or on work submitted by certain clients. The database, if necessary, would discard that work, or place the work in a 'best effort' scheduling class and run it if and when it has free CPU cycles.

If the engineers at the major database vendors would Google 'Weighted Fair Queuing' or 'Weighted Random Early Detect' we might someday see interesting new ways of managing degraded databases.
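A toy version of that imagined 'discard eligible' bit might look like the following. This is purely hypothetical; no database engine I know of exposes such a flag, and the names are invented:

```python
from collections import deque

class WorkQueue:
    """Hypothetical query scheduler honoring a 'discard eligible' bit."""
    def __init__(self, max_backlog):
        self.critical = deque()     # work that must run
        self.best_effort = deque()  # work marked discard-eligible
        self.max_backlog = max_backlog

    def submit(self, query, discard_eligible=False):
        """Returns False when discard-eligible work is shed under load."""
        if discard_eligible:
            if len(self.critical) + len(self.best_effort) >= self.max_backlog:
                return False  # degraded mode: shed marked work immediately
            self.best_effort.append(query)
        else:
            self.critical.append(query)
        return True

    def next_query(self):
        # critical work always goes to the head of the queue
        if self.critical:
            return self.critical.popleft()
        if self.best_effort:
            return self.best_effort.popleft()
        return None
```

In the earlier example, grade recalculations would be submitted with the bit set and exam queries without, so under load the engine sheds the right work automatically instead of requiring manual intervention.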

Creative Server Installs - WAN Boot on Solaris (SPARC)

Sun's SPARC servers have the ability to boot a kernel and run an installer across a routed network using only HTTP or HTTPS. On SPARC platforms, the (BIOS|Firmware|Boot PROM) can download a bootable kernel and mini root file system via HTTP/HTTPS, boot from the mini root, and then download and install Solaris. This allows booting a server across a local or wide area network without having any bootable media attached to the chassis. All you need is a serial console, a network connection, an IP address, a default gateway and a web server that's accessible from the bare SPARC server. You set a few variables, then tell it to boot. Yep, it's cool.

From the Boot PROM prompt (the SPARC equivalent of the BIOS)
OK> setenv network-boot-arguments host-ip=client-IP,

OK> boot net -v install

Our base Solaris install is fairly small - on the order of a few hundred megabytes - so booting across a WAN through a proxy or an SSH tunnel works pretty well. We usually build a temporary SSH tunnel from our management infrastructure out to another server in the same security container and point the new server at the tunnel end point.

PXE is an attempt to provide similar functionality. It's got a dependency on having DHCP available on the deployed subnet, something I absolutely do not want to enable on non-desktop networks, and it's based on UDP, which makes it slightly less suitable for booting across WANs where packet loss might be an issue. In any case, we've had enough issues with network boots on x86/x64 platforms that we've pretty much defaulted to using bootable USB drives or CD/DVDs for remote installs. That makes an x86/x64 deploy significantly more work, as we have to arrange for bootable media to be delivered on site, or we need to leave bootable media installed in production servers.

Linux has 'BKO', but as far as I can tell, it's still dependent on having either bootable media or PXE.

SPARC's WAN boot is pretty slick, but not as slick as Cisco's AutoInstall. AutoInstall allows you to drop-ship an unconfigured router to a remote site. The router will learn its IP address from its upstream router via either SLARP or BootP, automatically download a configuration file, and reboot with a valid configuration.

A couple of closing thoughts:
  • If the SPARC platform ever goes away, I'll miss it.
  • If router engineers ever decide to build application servers, they'd probably come up with radically new ways of solving old problems. 

Tuesday, November 3, 2009

Pandemic Planning – The Dilbert Way

I normally don’t embed things in this blog, but this one is too good to pass up:

Deciding who is important is interesting.

Senior management wants to see a plan. The middle manager needs to decide who is important. If the middle manager says only 8 of 20 are critical, what does that say about the other 12? The only answer that most managers offer is ‘all my employees are critical to the enterprise’.

I’m assuming that many or most readers have been a part of some sort of pandemic planning. In our EDU system, the plan isn’t interesting because of the criticality of anything that we do. In a major pandemic, deadlines can be extended, semester start and end dates can be changed, faculty can adapt. It’s interesting because of what our facilities can do. In the rural towns served by many of our colleges, the campus is the best connected building in town. In many cases, our college serves as the local or regional backbone connection point for T1’s from other state agencies, some of which have critical public health, safety or law enforcement roles. I suspect some of those agencies are more important than an exam, lecture or quiz. It’s possible that for us, the critical resources in a pandemic might not have anything to do with education. HVAC, power, and routers might be the top priority.

Then there’s payroll. You’ve got to keep that going no matter what. Sick employees don’t have the energy to mess with bounced checks and overdrawn accounts.

Tuesday, October 27, 2009

No, we are not running out of bandwidth.

The sky is falling! the sky is falling!

Actually, we’re running out of bandwidth (PDF) (again).

Supposedly all the workers who stay home during the pandemic will use up all the bandwidth in the neighborhood. Let me guess, instead of surfing p0rn and hanging out on Reddit all day at work, they’ll be surfing p0rn and hanging out on Reddit all day from home.

The meat of the study:
“Specifically, at the 40 percent absenteeism level, the study predicted that most users within residential neighborhoods would likely experience congestion when attempting to use the Internet”
So what’s the problem?
“..under a cable architecture, 200 to 500 individual cable modems may be connected to a provider’s CMTS, depending on average usage in an area. Although each of these individual modems may be capable of receiving up to 7 or 8 megabits per second (Mbps) of incoming information, the CMTS can transmit a maximum of only about 38 Mbps.”
Ooops – someone is oversubscribed just a tad. At least now we know how much.

Wait – 40% of the working population is at home, working or sick, bored to death, surfing the web. And they’ll be transferring large documents! Isn’t that what we call the weekend? So is the Internet broke on weekends? If so, I never noticed.

What about evenings? We have a secondary utilization peak on our 24/7 apps around 10pm local time. That peak is almost exclusively people at home, working. Presumably this new daytime peak will dwarf the late evening peak?

Here’s a reason to panic. If it gets bad enough, the clowns who threw away ten trillion dollars of other people’s money on math they didn’t understand will not be able to throw away other people’s money while telecommuting:
“If several of these large firms were unable or unwilling to operate, the markets might not have sufficient trading volume to function in an orderly or fair way.”
My thought? Slow them down. When they flew at Mach 2, they smacked into a wall and took us with them.

Got to love this:
“Providers identified one technically feasible alternative that has the potential to reduce Internet congestion during a pandemic, but raised concerns that it could violate customer service agreements and thus would require a directive from the government to implement.”
Provider: “Ya know? It’d be cool if we could get the government to make us throttle that bandwidth. Yep, that’d be cool.”

How about Plan B – shut off streaming video:
“Shutting down specific Internet sites would also reduce congestion, although many we spoke with expressed concerns about the feasibility of such an approach.”
Wait – isn’t that what CDNs are for? The Akamai cache at the local ISP has the content; all that matters is the last mile, right? For reference, with a few hundred thousand students working hard at surfing the web all day, we slurp up about 1/3 of our Internet bandwidth from the local Akamai rack directly attached to the Internet POP (settlement free), and another 1/3 by peering directly with big content providers (also settlement free).

I’m not worried about bandwidth. If any of this were serious, we’d have been able to detect the effect of 10% unemployment on home bandwidth. Or the Internet would have broke during the 2008 election. Or what’s-his-name’s sudden death.

A more interesting potential outcome of a significant pandemic would be the gradual degradation of services as the technical people get sick and/or stay home with their families. I’d expect a significantly longer MTTR on routine outages during a real pandemic.

Would the cable tech show up at my house today, with two people flu’d out? Not if she’s smart.

  • The report is oriented toward the financial sector. The trades must go on. There are quarterly bonuses to be had.
  • The DHS commented on the draft in the appendices. They’ve attempted to inject a bit of rationality into the report.

Wednesday, October 14, 2009

Maintenance, Downtime and Outages

Via Data Center Knowledge - Maintenance, Downtime and Outages a quote from Ken Brill of The Uptime Institute:

"The No. 1 reason for catastrophic facility failure is lack of electrical maintenance,” Brill writes. “Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.”

Of course the obvious solution is to have redundant power in the data center & perform the power maintenance on one leg of the power at a time. One of our leased spaces does that. The other of our leased spaces has cooling and power shutdowns often enough that we have a very well rehearsed shutdown & startup plan. The point is well taken though. If you don’t do routine maintenance, you can expect certain types of failure.

In IBM's case, it was the routine electrical maintenance that caused the outage. Apparently IBM didn't build out sufficient power redundancy for Air New Zealand's mainframe. A routine generator test failed and Air NZ’s mainframe lost power.

Air New Zealand CEO Rob Fyfe wasn't happy:

"In my 30-year working career, I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologise[sic] to its client and its client's customers,"

"We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers."

I wonder if Air NZ contracted with IBM for a Tier 4 data center and/or a hot site with remote clustering? If so, Rob Fyfe has a point. If Air NZ went the cheap route, he really can't complain. It's not like data centers don't get affected by storms, rats, power outages, floods & earthquakes. Especially power, power, and occasionally fire, fire or cooling. Oh yeah, and don't forget storage failures.

In the Air NZ case, the one-hour power outage seems to have resulted in a six-hour application outage. If you spend any time at all thinking about MTTR (Mean Time To Repair), an application suite that takes five hours to recover from a one-hour power failure isn't a well-thought-out architecture for a service as critical as an airline check-in/ticketing/reservation system.

Unfortunately the aftermath of a power failure can be brutal. Even in our relatively simple environment, we spend at least a couple hours cleaning up after a power related outage, generally for a couple of reasons:

  • It’s the 21st century and we still have applications that aren't smart enough to recover from simple network and database connectivity errors. This is beyond dumb. It shouldn't matter what order you start servers and processes in. I keep thinking that we need to make developers' desktops & test servers less reliable, just so they'll build better error handling into their apps.
  • It’s the 21st century and we still have software that doesn’t crash gracefully. Dumb software is expensive (Google thinks so…).
  • The larger the server, the longer it takes to boot. In some cases, boot time is so bad that you can't have an outage shorter than an hour.
  • It’s the 21st century and we still have complex interdependent scheduled jobs that need to be restarted in a coordinated (choreographed) dance.

An amusing aside: as I was using an online note taking tool to rough in this post, the provider (Ubernote) went offline. They came back online about 10 minutes later - with about 10 minutes of data loss.

Tuesday, October 13, 2009

Data Loss, Speculation & Rumors

My head wants to explode. Not because of Microsoft's catastrophic data loss. That's a major failure that should precipitate a significant number of RGEs at Microsoft.

My head wants to explode because of the speculation, rumors and misinformation surrounding the failure. An example: rumors reported by Daniel Eran Dilger in Microsoft’s Sidekick/Pink problems blamed on dogfooding and sabotage that point to intentional data loss or sabotage.

"Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure."


"the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a disgruntled employee."


"someone with access to the servers at the datacenter must have inserted a time bomb to wipe out not just all of the data, but also all of the backup tapes, and finally, I suspect, reformatting the server hard drives so that the service itself could not be restarted with a simple reboot (and to erase any traces of the time bomb itself)."

Intentional sabotage?

How about some rational speculation, without black helicopters & thermite charges? How about simple mis-management? How about failed backup jobs followed by a failed SAN upgrade? I've had both backup and SAN failures in the past & fully expect to continue to have both types of failures again before I retire. Failed backups are common. Failed SANs and failed SAN upgrades are less common, but likely enough that they must be accounted for in planning.

Let's assume, as Dan speculates, it’s an Oracle RAC cluster. If all the RAC data is on a single SAN, with no replication or backups to separate media on separate controllers, then a simple human error configuring the SAN can easily result in spectacular failure. If there is no copy of the Oracle data on separate media & separate controllers on a separate server, you CAN lose data in a RAC cluster. RAC doesn't magically protect you from SAN failure. RAC doesn't magically protect you from logical database corruption or DBA fat-fingering. All the bits are still stored on a SAN, and if the SAN fails or the DBA fails, the bits are gone. Anyone who doesn't think that's true hasn't driven a SAN or a database for a living.

We've owned two IBM DS4800s for the last four years and have had three controller-related failures that could have resulted in data loss had we not taken the right steps at the right time (with IBM advanced support looking over our shoulders). A simple thing like mismatched firmware somewhere between the individual drives and the controllers or the HBAs and the operating system has the potential to cause data loss. Heck - because SAN configs can be stored on the individual drives themselves, plugging a used drive with someone else's SAN config into your SAN can cause SAN failure - or at least catastrophic, unintentional SAN reconfiguration.

I've got an IBM doc (sg246363) that says:

"Prior to physically installing new hardware, refer to the instructions in IBM TotalStorage DS4000 hard drive and Storage Expansion Enclosure Installation and Migration Guide, GC26-7849, available at:  [...snip...] Failure to consult this documentation may result in data loss, corruption, or loss of availability to your storage."

Does that imply that plugging the wrong thing into the wrong place at the wrong time can 'f up an entire SAN? Yep. It does, and from what the service manager for one of our vendors told me, it did happen recently – to one of the local Fortune 500s.

If you don't buy that speculation, how about a simple misunderstanding between the DBA and the SAN team?

DBA: "So these two LUNs are on separate controllers, right?"

SAN: "Yep."

Does anyone work at a place where that doesn't happen?

As for the ‘insider’ quote: "I don't understand why they would be spending any time upgrading stuff", a simple explanation would be that somewhere on the SAN, out of the hundreds of attached servers, a high-profile project needed the latest SAN controller firmware or feature. Let's say, for example, that you wanted to plug a shiny new Windows 2008 server into the fabric and present a LUN from an older SAN. You'd likely have to upgrade the SAN firmware. To have a supported configuration, there is a fair chance that you'd have to upgrade HBA firmware and HBA drivers on all SAN-attached servers. The newest SAN controller firmware that's required for 'Project 2008' then forces an across-the-board upgrade of all SAN-attached servers, whether they are related to ‘Project 2008’ or not. It's not like that doesn't happen once a year or so, and it’s the reason that SAN vendors publish ‘compatibility matrices’.

And upgrades sometimes go bad.

We had an HP engineer end up hospitalized a couple days before a major SAN upgrade. A hard deadline prevented us from delaying the upgrade. The engineer's manager flew in at the last minute, reviewed the docs, did the upgrade - badly. Among other f'ups, he presented a VMS LUN to a Windows server. The Windows server touched the LUN & scrambled the VMS file system. A simple error, a catastrophic result. Had that been a RAC LUN, the database would have been scrambled. It happened to be a VMS LUN that was recoverable from backups, so we survived.

Many claim this is a cloud failure. I don't. As far as I can see, it's a service failure, plain and simple, independent of how it happens to be hosted. If the data was stored on an Amazon, Google or Azure cloud, and if the Amazon, Google or Azure cloud operating system or storage software scrambled and/or lost the data, then it'd be a cloud failure. The data appears to have been on ordinary servers in an ordinary database on an ordinary SAN.

That makes this an ordinary failure.

Tuesday, August 11, 2009

A Zero Error Policy – Not Just for Backups

In What is a Zero Error Policy, Preston de Guise articulates the need for aggressive follow up and resolution on all backup related errors. It’s a great read.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.
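The three rules translate directly into code. Here's a minimal sketch of how they might be encoded in a monitoring script – a toy illustration, not Preston's implementation; the error IDs and the escalation threshold are invented for the example:

```python
from collections import Counter

class ZeroErrorPolicy:
    """Every error is recorded, resolved deliberately, escalated if it recurs."""
    def __init__(self, recurrence_limit=3):
        self.open_errors = {}     # rule 1: all errors shall be known
        self.seen = Counter()     # occurrence count per error
        self.limit = recurrence_limit

    def record(self, error_id, message):
        self.open_errors[error_id] = message
        self.seen[error_id] += 1
        # rule 3: no error is allowed to continue occurring indefinitely
        return self.seen[error_id] >= self.limit

    def resolve(self, error_id):
        # rule 2: errors are closed deliberately, never silently aged out
        del self.open_errors[error_id]

policy = ZeroErrorPolicy()
policy.record("tape-eject", "library marked tape bad")
policy.resolve("tape-eject")
policy.record("tape-eject", "library marked tape bad")             # it came back
escalate = policy.record("tape-eject", "library marked tape bad")  # third strike
print("escalate to root-cause analysis:", escalate)
```

The point of the sketch is the absence of any "discard" path: an error either gets resolved or it eventually forces escalation.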


I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

I agree. This is a great summary of an important philosophy.

Don’t apply this just to backups though. It doesn’t matter what the system is, if you ignore the little warning signs, you’ll eventually end up with a major failure. In system administration, networks and databases, there is no such thing as a ‘transient’ or ‘routine’ error, and ignoring them will not make them go away. Instead, the minor alerts, errors and events will re-occur as critical events at the worst possible time. If you don’t follow up on ‘routine’ errors, find their root cause and eliminate them, you’ll never have the slightest chance of improving the security, availability and performance of your systems.

I could list an embarrassing number of situations where I failed to follow up on a minor event and had it cascade to a major, service affecting event. Here’s a few examples:

  • A strange undecipherable error when plugging a disk into an IBM DS4800 SAN. IBM didn’t think it was important. A week later I had a DS4800 with a hung mirrored disk set & a 6 hour production outage.
  • A pair of internal disks on a new IBM 16 CPU x460 that didn’t perform consistently in a pre-production test with IOzone. During some tests, the whole server would hang for a minute and then recover. IBM couldn’t replicate the problem. Three months later the drives on that controller started ‘disappearing’ at random intervals. After three more months, a hundred person-hours of messing around, uncounted support calls and a handful of on-site part-swapping fishing expeditions, IBM finally figured out that they had a firmware bug in their OEM’d Adaptec RAID controllers.
  • An unfamiliar looking error on a DS4800 controller at 2am. Hmmm… doesn’t look serious, let’s call IBM in the morning. At 6am, controller zero dropped all its LUNs and the redundant controller claimed cache consistency errors. That was an 8 hour outage.

Just so you don’t think I’m picking on IBM:

  • An HA pair of Netscaler load balancers that occasionally would fail to sync their configs. During a routine config change a month later, the secondary crashed and the primary stopped passing traffic on one of the three critical apps that it was front-ending. That was a two hour production outage.
  • A production HP file server cluster that was fiber channel attached to both a SAN and a tape library would routinely kick out tapes and mark them bad. Eventually it happened often enough that I couldn’t reliably back up the cluster. The cluster then wedged itself up a couple times and caused production outages. The root cause? An improperly seated fiber channel connector. The tape library was trying really, really hard to warn me.

In each case there was plenty of warning of the impending failure and aggressive troubleshooting would have avoided an outage. I ignored the blinking idiot lights on the dashboard and kept driving full speed.

I still end up occasionally passing over minor errors, but I’m not hiding my head in the sand hoping it doesn’t return. I do it knowing that the error will return. I’m simply betting that when it does, I’ll have better logging, better instrumentation, and more time for troubleshooting.

Tuesday, August 4, 2009

Content vs. Style - modern document editing

On ars technica, Jeremy Reimer shares some great thoughts on how we use word processing.

His description of modern document editing:

Go into any office today and you'll find people using Word to write documents. Some people still print them out and file them in big metal cabinets to be lost forever, but again this is simply an old habit, like a phantom itch on a severed limb. Instead of printing them, most people will email them to their boss or another coworker, who is then expected to download the email attachment and edit the document, then return it to them in the same manner. At some point the document is considered "finished", at which point it gets dropped off on a network share somewhere and is then summarily forgotten...
We use an application that was optimized to format printed documents in a world where printing is irrelevant, and our ‘document versioning’ is managed by the timestamps on the e-mail messages that we used to ‘collaborate’ on writing the document. What a mess, yet it's our perverse idea of what technology should be in the 21st century.

I'm sold on the idea of
  • online collaborative editing of documents
  • minimal formatting
  • continuous versioning
In other words, I like wikis. Some of my wiki docs are a decade old. I can find them. I can revert them back a decade if I want. I can rely on them in a DR event. I know who changed them & when they changed. I know what they contained before they were changed. They have bold, italics and headline fonts. I'm happy.

I'm even happier after I delete the hundred-odd useless fonts that come with my computers. I figure one or two each of serif, sans-serif and monospace is more than adequate. If I see more than a handful in the drop-down font menu, I'm annoyed enough to start deleting them. We can thank Apple for that mess. The really cool people who bought early Macs needed to show off their GUI text editors by printing docs with six different fonts on a page (on a really crappy dot-matrix printer). It took them a while to figure out that it’s the content, not the style.

I'm really amused when archaic processes are updated by superficially skinning them over with technology.

True story, happens all the time:
  1. Senior manager with long title dictates memo to clerical staff.
  2. Clerical staff types memo in word processing software.
  3. Clerical staff prints memo.
  4. Senior manager signs memo.
  5. Clerical staff scans signed memo and saves as a PDF.
  6. Clerical staff e-mails memo to staff with subject line 'Please read attached memo from senior manager with long title'.
Someone isn't getting this whole technology thing. If the message from the senior manager with long title was really important, I'd have thought that it'd be in the opening paragraph of an e-mail from the senior manager with long title directly to the interested parties. If it were, I'd have read it instead of deleting it. It's the content that matters, not the container.

Equally amusing are the vast resources that we spend making web sites look pretty. It seems to me that the priorities for a web site should be something like
  1. world class content
  2. decent writing style and readability
  3. make it look pretty
Instead we do something like:
  1. make it look pretty
  2. game the search engines
  3. optimize for ad revenue
  4. generate content (optional)
If you want me to read your content, don't waste your time making your site look pretty. I'll likely use a formatting tool to strip all that prettiness out anyway. That is – of course – if you have any interesting content amid all that prettiness.

Thursday, July 30, 2009

We Have Failed to Sufficiently Confuse our Users

Ran across this post. It’s a cute way of changing the color of your Firefox address bar.

Like this:


Which of course is not to be confused with this:


  Or this:


It seems like we haven’t yet sufficiently confused our users.

We need to try harder.

Quote Worth Remembering

"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."

Personal observations on the reliability of the Shuttle

by R. P. Feynman

Wednesday, July 29, 2009

Infrastructure – Security and Patching

An MRI machine hosting Conficker:

“The manufacturer of the devices told them none of the machines were supposed to be connected to the Internet and yet they were […] the device manufacturer said rules from the U.S. Food and Drug Administration required that a 90-day notice be given before the machines could be patched.”

Finding an unexpected open firewall hole or a device that isn’t supposed to be on the Internet is nothing new or unusual. If someone asked “what’s the probability that a firewall has too many holes” or “how likely is it that something got attached to the network that wasn’t supposed to be”, in both cases I’d say the probability is one.

Patching a machine that can’t be patched for 90 days after the patch is released is a pain. It’s an exception, and exceptions cost time and money.

Patching a machine that isn’t supposed to be connected to the Internet is a pain. I’m assuming that one would need to build a separate ‘dark net’ for the machines. I can’t imagine walking around with a CD and patching them.

Locating and identifying every operating system instance in a large enterprise is difficult, especially when the operating systems are packaged as a unit with an infrastructure device of some sort. Assuring that they all are patched is non-trivial. When vendors package an operating system (Linux, Windows) in with a device, they rarely acknowledge that you or they need to harden, patch, and update that operating system.

Major vendors have Linux and Windows devices that they refer to as ‘SAN Management Appliances’, ‘Enterprise Tape Libraries’, and ‘Management Consoles’. They rarely acknowledge that the underlying OS  needs to be hardened and patched, and sometimes even prohibit customer hardening and patching. The vendor supplies a ‘turnkey system’ or ‘appliance’ and fails to manage the patches on the same schedule as the OS that they embedded into their ‘appliance’.

This isn’t a Microsoft problem. Long before Windows was considered fit to be used for infrastructure devices (building controls, IVR, HVAC, etc) hackers were routinely root kitting the Solaris and Linux devices that were running the infrastructure. We tend to forget that though.

Tuesday, July 28, 2009

DOS, Backscatter, ATT and 4chan

If I’ve got the story straight, it goes something like this:

  1. 4chan is subject to a DDOS attack.
  2. 4chan attempts to block the DOS, and in the process, unintentionally DOS’s one or more of ATT’s customers.
  3. ATT ‘blocks’ 4chan.
  4. The world comes to an end.

Forget network neutrality, evil empires, Halliburton, black helicopters and aliens from Area 51. Let’s try for a simple explanation, assuming that nobody is evil and everyone is incompetent. It’s speculative, but has the advantage of holding to Occam’s Razor far better than the other speculative explanations floating around.

  1. 4chan gets DDOS’d, presumably from spoofed IP addresses. For high profile sites, that’s a normal thing. For low profile sites with large address spaces (us), that’s a normal thing. The sun rises in the east, someone on the Internet gets DDOS’d from spoofed addresses. It’s been that way for a decade or so. Life goes on. Get used to it. If you think you can run a high profile site and not get DOS’d, then perhaps you need to rethink your career. Try accounting. Accountants rarely get DOS’d from spoofed addresses.
  2. 4chan responds by filtering packets. Unfortunately they appear to have filtered in a manner that sent ICMP destination unreachable or perhaps TCP RST or ACK packets back at the spoofed IP addresses. It’s called backscatter.
  3. One of the spoofed addresses is an ATT customer. From ATT’s point of view they see a DOS sourced from 4chan. They don’t know or care if it’s real or backscatter. ATT puts up with the DOS for a while, weeks perhaps, and finally says screw it – null route the dumb SOB’s.
  4. The Internet panics, providing amusing news for what otherwise would have been a slow news weekend.
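The backscatter mechanism in step 2 can be illustrated with a toy simulation (plain Python, using made-up addresses from the RFC 5737 documentation ranges): the victim's filter answers the *claimed* source address of each spoofed packet, so the replies land on an uninvolved bystander, who then sees what looks like a DOS sourced from the victim.

```python
# Hypothetical addresses from the RFC 5737 documentation ranges.
ATTACKER = "203.0.113.66"      # sends the spoofed packets
VICTIM = "198.51.100.10"       # the DDOS target (4chan, in the story)
BYSTANDER = "192.0.2.25"       # uninvolved host whose address is forged

def spoofed_syn():
    # The attacker forges the source address, so replies never reach it.
    return {"src": BYSTANDER, "dst": VICTIM, "flags": "SYN"}

def filter_response(pkt):
    # The victim's filter answers the *claimed* source: that's backscatter.
    return {"src": pkt["dst"], "dst": pkt["src"], "flags": "RST"}

inbound = [spoofed_syn() for _ in range(5)]
backscatter = [filter_response(p) for p in inbound]

# Every reply lands on the bystander, who sees an apparent DOS from the victim.
print(sum(1 for p in backscatter if p["dst"] == BYSTANDER), "RSTs at the bystander")
```

From the bystander's side of the wire, the RSTs are indistinguishable from a deliberate attack originating at the victim, which is exactly ATT's vantage point in step 3.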

Let’s pretend you are an ATT operator or engineer.

If ATT’s ops had decided to follow up with the owner of the IP address, there is no reason to think they’d have gotten very far. Presumably someone like ATT is subjected to DDOS’s more or less continuously, so to expect them to follow up on every one is pretty unreasonable.

As of today, the IP address for the 4chan web server’s Whois points to a generic hosting provider with no rwhois. A PTR lookup ends in ‘Server Failed’. Even if they had followed up and tracked it back to 4chan, the domain’s whois points at a generic Network Solutions phone number with no other useful information. Surfing to the IP would have worked, but the only contact info on the web site is a few e-mail addresses, and surfing to an unknown web site to see who owns it is pretty unsafe nowadays. There is no reason for them to think that 4chan is anything other than a mickey mouse, incompetently run BBS that’s badly hacked and originating a DOS against them. It’s not like they haven’t seen that a hundred times before. Back in the day when I thought it would matter, I followed up on a thousand-odd hacked web servers that port scanned, DOS’d or tried to hack me. Twenty or so per day for a couple or more years. Trust me, it’s a bloody pain.

ATT’s network ops see a DOS from an IP address aimed at one of their  customers. They null route it. It goes away. They move on to the next crisis.

That’s what I’d do, and that’s what I’ve done a hundred times over the last decade.

The 4chan crowd think they are famous, and in their own minds they no doubt are, but I’ll bet that the vast majority of network engineers don’t know or care who 4chan is or what it does. They didn’t have a clue who 4chan was until Monday a.m., when they saw themselves in the news.

Saturday, July 25, 2009

You have Moved an Icon on Your Desktop

Your computer must be restarted for the change to take effect.

We used to joke about Windows 95 and its ridiculous reboot requirements. The line we used was:

“You have moved an icon on your desktop. Windows must be restarted for the change to take effect.”

Those were not the days, and I thought that they were pretty much over.

Apparently not:


I can’t think of any circumstances where a reboot should be necessary to complete the installation of application software. That was last century. I’m OK with reboots for things like kernel updates, firmware updates, and perhaps even driver updates.

But a browser?

It’s possible that the reboot is being forced by non-Mozilla browser add-ons – I don’t have any way of knowing. But if an add-on to an application can force the application to force the operating system to reboot, then the application and OS design are both defective.

Wednesday, July 22, 2009

Off Topic: Stadium Construction Resumes (Update)

UPDATE: The Italian authorities have responded to the exposure of the corruption surrounding the construction of the stadium. The Minister of Stadiums has investigated the mismanagement and corruption, issued a report, and taken corrective action. The corrupt and incompetent officials have been identified, prosecuted and punished. Additionally, the authorities have agreed to resolve other outstanding issues that have prevented the completion of the project.



In an unusual turn of events, the authorities have issued severe consequences for the perpetrators of the scandal. Here you can see the  gruesome results of this most effective punishment. The statues of the wives of the corrupt officials have been decapitated (NSFW).

Shocking though it may be, this extreme punishment, reserved exclusively for the most severe offenses, has a long tradition in Italian culture. Over the millennia, many statues of spouses of famous officials have been similarly mutilated and placed for exhibition in great air conditioned halls. In many cases admission is charged for viewing the mutilated statues.

Improved Security

To prevent a recurrence of the material thefts, guards have been hired to prevent the misappropriation of construction materials. In this exclusive undercover photo, taken at great risk by staff reporters of this publication, guards in uniform are secretly training for their role in the protection of the project. The Minister of Stadiums has assured us that the guards have been trained to defend the construction materials using all means available. Unnamed officials have confirmed that in extreme cases, guards may be authorized to defend the construction materials with realistic plastic swords.

To supplement the guards, the Ministry has installed special road materials on all paths leading from the construction site. These roads are cleverly designed to impart a vibration of a specific destructive frequency and amplitude to any Chrysler Fiat that passes over them. The Chrysler Fiats are not expected to traverse more than halfway up the path before requiring the intervention of a mechanic, giving the authorities sufficient time to apprehend the thieves.

Resumption of Construction

The security measures have been deemed sufficient to permit the resumption of construction on the stadium. New groups of highly skilled workers are restarting the project. These workers have been specially recruited from all parts of the world, arriving daily in great numbers by airplane, bus, car and train. As you can see in the photograph to the right, the Italian authorities have spared no expense in this area.

As shown in the photo below, the new workers have made great progress in the very short amount of time since the resumption of construction. Scaffolding has already been erected and workers have restarted construction on the western facade.

Authorities have assured this reporter that the problems have been resolved and are optimistic that the new workers will make rapid progress on the stadium.

Enhancements to the Original Design

Additionally, according to stadium officials, significant improvements will be made to the original design. The changes are specifically designed to improve the usability and assure a steady revenue stream. Assurances have been given that the additions to the project scope will not increase the cost of the project nor delay its completion.

In the first enhancement, a new steel playing field is being built to replace the original wood and masonry field. The new field is a significant improvement to the original field and is expected to last as long as the construction project. The new playing field, shown here, is already partially constructed. Authorities claim the new field will result in less severe injuries and fewer weather related event cancellations. Additionally, the new playing field will permit the display of banner ads directly on the surface of the field.

As with modern stadiums, special seating will be built for sponsors and executives. A premium will be charged for the seating and the proceeds will be used to finance the completion of the stadium. The special seating is expected to generate significant revenue toward the completion of the stadium. Pictured here is the entrance to one of the reserved sections. Notice that the executives and sponsors will be separated from the ordinary visitors by modern, unobtrusive security systems. Studies have shown that separate seating for sponsors and executives will result in higher revenue for the normal seating sections, as the plebeian attendance will not be adversely affected by the antics of the upper classes.

In the finest of classical Roman tradition, the perimeter of the stadium will be ringed with great works of art specially commissioned for this project. Authorities have commissioned a statue for each entry gate, IV through XXIX.

One such work, shown here, represents a French interpretation of a Roman copy of a Greek original of the goddess Giunone, despondent over the transgressions of her consort Jupiter. The statue is completed and ready to be moved to its place near gate XXIV.

Other great works of art are ready for installation. Here you see a French copy of a Roman interpretation of a Greek original of the same goddess Giunone, after Jupiter came home from work and discovered her in a compromising position with an unnamed consort. This statue is scheduled to be installed near gate XVIII.

To demonstrate the sincerity of the officials, the Ministry of Stadiums has permitted an exclusive inside tour of the factory that produces the great works of art. Below is a photograph of the nursery where the great statues are grown. Viewing from right to left, you can clearly see the maturation of the statues from larvae through pupae, nymph and bewilderment stages.


Completion of the Stadium

Unnamed officials of the Ministry of Stadiums have confirmed that the newly revised schedule for the completion of construction is uncharacteristically aggressive. The officials have assured this reporter that although they have specified 64 bit time_t structs for the project management software, the project will be completed before they are required. Project specifications listed Unix 2038 time compatibility as a requirement only because of EU regulations.


There has been much written about the demise of mainstream newspapers and the effect on the future of investigative reporting. By the example of this report, one should rest assured that the new media stand ready to expose and document the corruption, inefficiencies and transgressions of governments throughout the world.

Oh – and if you haven’t figured it out yet, it’s a joke.

Tuesday, July 21, 2009

Off Topic: Stadium Construction Scandal


In the center of Rome are the remnants of a large stadium. Tradition tells that the stadium was completed during the period of the Roman Empire and allowed to decay, unmaintained, during the centuries since construction.

This photograph, taken with a special filter and timed exactly as the planets Venus and Mars intersected a polyline bounded by the vertices of the tops of the arches and extending into space, clearly shows that we have been misled about the true origins of the facility. Careful analysis of the photo shows that the stadium is not a decaying ghost of a once great stadium, but rather an uncompleted, scandal-ridden construction project gone bad.

By normal Italian standards, construction projects of this nature typically take decades. Corruption and mis-management are assumed, delays are inevitable. But even by those standards, after nearly two millennia the stadium should have been completed. To give a comparative example, the construction on the shopping center in the photograph shown here was started at about the same time as the stadium, and as you can see, today it is nearly 90% complete.

How could it be, that after nearly two millennia of construction, the stadium is not yet complete? Through careful translation of long lost documents and inscriptions on pottery shards, the story of the stadium can finally be told.

The Scandal

The Emperor Flaviodius started building the stadium in 72 AD. Over time it became clear that the Emperor was not particularly adept at managing large projects. Little did the emperor know that this would someday be the largest construction scandal in the western world.

Shortly after construction began, the ever generous Emperor Flaviodius made provisions for the entertainment of the construction workers. A large circus (Circus Maximus) was built near the site of the stadium. In the premier act of the circus, specially trained Christians performed great feats of daring with hungry lions. The entertainment was such that the workers spent much time in amusement and very little time working on the stadium. After a period of time, Flaviodius discontinued the entertainment, unfortunately without formal consultation with the workers’ bargaining units. The workers responded with a decades-long work slowdown, causing delays and cost overruns. Flaviodius eventually compensated the workers for the missing entertainment and work resumed.

Scandal again disrupted the schedule shortly after construction resumed. Emperor Flaviodius had to be removed from the project after a late night altercation with visiting Goths (Visigoths). The Emperor, shown above just after the altercation, suffered a broken nose and a bruised chin. The visiting Goths appear to the right in an undated photo, apparently unaware that what seemed to be a minor incident has delayed the largest construction project in Rome. The presence of the Goths, who are culturally averse to large structures, appears to have caused a work stoppage lasting several centuries.

As with many large projects, revisions to the plans were frequent and seemingly random. Ancient sources indicate that later project managers authorized changes to the shape, orientation, color and number of tiers of seating. The scope changes resulted in a series of project extensions, forcing significant re-work and lost time. Additionally, the project documentation requirements were such that handwritten documentation was impossible to maintain, thereby bringing the project to a standstill until the printing press could be invented.

A major problem throughout the construction was the theft of building materials. When one walks through Rome today, one sees fragments of brick and marble originally purchased for the stadium randomly incorporated into the foundations of other, newer buildings. While theft is common in building projects, in this case it appears to have been on a grand scale. The building material theft caused the construction to stall for much of the period that we call the middle ages. The photograph above shows an example, even today, of construction materials laying about completely unguarded.

Further delays  were apparently caused by a miscommunication between the powerful Italian construction worker unions and the authorities. Unnamed sources indicated that although the union declared a strike, the authorities failed to receive the notification due to the fact that the postal union was also on strike. Because the work on the construction had not noticeably slowed during the strike, it was several hundred years before the authorities noticed the walkout. Negotiations have yet to be restarted.

The investigation continued with calls and e-mails to unnamed officials. When presented with incontrovertible evidence of the scandal, few officials were willing to acknowledge the corruption, mismanagement, tangente and incompetence, fearing that the resulting scandal would jeopardize their pension benefits.

Stay tuned for more information as the scandal of the stadium unfolds.

Sunday, July 12, 2009

A Crash Course in Failure

One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.

In A Crash Course in Failure, Craig Stuntz discusses the concept of building crash only software – or software for which a crash and a normal shutdown are functionally equivalent.

“Hardware will fail. Software will crash. Those are facts of life.”
"…if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself."
"…maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts."
“…it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.”
Why shouldn't continuous and automatic state saving be the default for any/all applications? A CAD system I bought in 1984 did exactly that. If the system crashed or terminated abnormally, the post-crash reboot would do a complete 'replay' of every edit since the last normal save. In fact you'd have to sit and watch every one of your drawing edits in sequence like a VCR on fast forward, a process that was usually pretty amusing in a Keystone Cops sort of way. It can't be that hard to append serialized changes to the end of the document and only re-write the whole doc when the user explicitly saves it, or to journal every change to another file. That CAD system did it twenty-five years ago on a 4MHz CPU and 8" floppies. Some applications are at least attempting to gracefully recover after a crash, a step in the right direction. It certainly is not any harder than what Etherpad does – and they are doing it multi-user, real time, on the Internet.
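That replay scheme is easy to sketch. Here's a minimal, assumed-for-illustration version of append-only journaling with replay on restart – the journal file name and the edit format are invented, and real crash-only software would also fsync each write and checksum entries to survive torn writes:

```python
import json
import os

JOURNAL = "doc.journal"

def apply_edit(doc, edit):
    # Each edit is a (position, text) insertion; deliberately trivial.
    pos, text = edit
    return doc[:pos] + text + doc[pos:]

def record_edit(edit):
    # Append-only: a crash loses at most the last partial line.
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(edit) + "\n")

def recover():
    # "Do nothing special on restart": just replay the journal,
    # like the CAD system's fast-forward replay of every edit.
    doc = ""
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            for line in f:
                doc = apply_edit(doc, json.loads(line))
    return doc

for edit in [(0, "hello"), (5, " world")]:
    record_edit(edit)

recovered = recover()   # survives a "crash" at any point between edits
print(recovered)
os.remove(JOURNAL)      # cleanup for the demo
```

Because every edit hits durable storage before the in-memory document, a crash and a normal exit recover identically – which is the whole point of crash-only design.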
“Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures. … If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge.” -- Michael Nygard

A Crash Course in Failure, Craig Stuntz
Design your Failure Modes, Michael Janke
'Everything will ultimately fail', Michael Nygard

Wednesday, July 8, 2009

Error Handling – an Anecdote

A long time ago, shortly after the University I was attending migrated students off of punch cards, I had an assignment to write a batch based hotel room reservation program. We were on top of the world - we had dumb terminals instead of punch cards. The 9600 baud terminals were reserved for professors, but if you got lucky, [WooHoo!] you could get one of the 4800 baud terminals instead of a 2400 or 1200 baud DECwriters.

The instructor's mantra - I'll never forget - was that students need to learn how to write programs that gracefully handle errors. 'You don't want an operator calling at 2am telling you your program failed. That sucks.' He was a part time instructor and full time programmer who got tired of getting woken up, and he figured that we needed our sleep, so he made robustness part of his grading criteria.

Here's how he made that stick in my mind for 30 years:  When the assignment was handed to us, the instructor gave us the location of sample input data files to use to test our programs. The files were usually laced with data errors. Things like short records, missing fields and random ASCII characters in integer fields were routine, and we got graded on our error handling, so students quickly learned to program with a healthy bit of paranoia and lots of error checking.

That was a great idea and we learned fast. But here's how he caught us all: A few hours before the assignment was due, the instructor gave us a new input file that we had to process with our programs, the results of which would determine our grade.

What was in the final data file?

……[insert drum roll  here]……

Nothing. It was a zero byte file.

Try to picture this - the data wasn’t available until a couple hours before the deadline, it was a frantic dash to get a terminal (long lines of students on most days, especially at the end of the semester), edit the source file to gracefully handle the error and exit (think ‘edlin’ or ‘ed’ ), submit it into the batch queue for the compiler (sometimes that queue was backed up for an hour or more) and re-run it against the broken data file, all by the deadline.

How many students caught that error the first time? Not many, and certainly not me. My program crashed and I did the frantic thing. The rest of the semester? We all had so damned many paranoid if-thens in our code you'd probably laugh if you saw it.

He was teaching us to think about building robust programs - to code for what goes wrong, not just what goes right. For him this was an availability problem, not a security problem. But what he taught is relevant today, except the bad guys are feeding your programs the data, not your instructor. That makes it a security problem.

I can't remember the operating system or platform (PDP-something?), I can't remember the language (Pascal, I think, but we learned SNOBOL and FORTH in that class too, so it could have been one of those), but I'll never forget that !@$%^# zero byte file!

Sunday, July 5, 2009

Sometimes Hardware is Cheaper than Programmers

In Hardware is Expensive, Programmers are Cheap II I promised that I'd give an example of a case where hardware is cheap compared to designing and building a more efficient application. That post pointed out a case where a relatively small investment in program optimization would have paid for itself through dramatic hardware savings across a small number of the software vendor's customers.

Here’s an example of the opposite.

Circa 2000/2001 we started hosting an ASP application running on x86 app servers with a SQL Server backend. The hardware was roughly 1GHz/1GB per app server. Web page response time was a consistent 2000ms. Each app server could handle no more than a handful of page views per second.

By 2004 or so, application utilization grew enough that the page response time and the scalability (page views per server per second) were both considered unacceptable. We did a significant amount of investigation into the application, focusing first on the database, and then on the app servers. After a week or so of data gathering we determined that the only significant bottleneck was a call to an XSLT/XML transformation function. The details escape me - and aren't really relevant anyway - but what I remember is that most of the page response time was buried in that library call, and that call used most of the app server CPU. Figuring out how to make the app go faster was pretty straightforward.

  • The app servers were CPU bound on a single library call.
  • The library wasn’t going to get re-written or optimized with any reasonable work effort. (If I remember correctly, it was a Microsoft-provided library; the software developers' only option would have been a major re-write.)
  • The servers were somewhere around 4 years old and due for a routine replacement.
  • The new servers would clock 3x as fast, have better memory bandwidth and larger caches. The CPU bound library call would likely scale with processor clock speed, and if it fit in the processor cache might scale better than clock.

Conclusion: Buy hardware. In this case, two new app servers replaced four old app servers, the page response time improved dramatically, and the page views per server per second went up enough to handle normal application growth. It was clear that throwing hardware at the problem was the simplest, cheapest way to make it go away.

In The Quarter Million Dollar Query I outlined how we attached an approximate dollar cost to a specific poorly performing query. “The developers - who are faced with having to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - likely will favor the former over the latter.”

Unless of course they have data comparing hardware, software licenses and hosting costs to their development costs. My preference is to express the operational cost of solving a performance problem in ‘programmer-salaries’ or ‘programmer-months’. Using units like that helps bridge the communication gap.
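As a hypothetical illustration of that unit conversion (all dollar figures and effort estimates invented, not taken from the case above), the back-of-envelope comparison might look like:

```python
# Back-of-envelope cost comparison in programmer-months.
# All numbers are hypothetical, for illustration only.
PROGRAMMER_MONTH = 10_000          # loaded cost of one programmer-month, dollars

server_cost     = 2 * 15_000       # two replacement app servers
optimize_effort = 4                # estimated programmer-months to rework the app

hardware_in_pm = server_cost / PROGRAMMER_MONTH

print(f"hardware fix: {hardware_in_pm:.1f} programmer-months equivalent")
print(f"software fix: {optimize_effort:.1f} programmer-months")

# If the rework costs more programmer-months than the hardware does,
# buy the hardware -- and spend those months on user-requested features.
buy_hardware = hardware_in_pm < optimize_effort
```

Expressed this way, the trade-off is immediately legible to both the hosting team and the development manager, which is the whole point of the unit.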

My conclusion in that post: “To properly prioritize the development work effort, some rational measurement must be made of the cost of re-working existing functionality to reduce [server or database] load versus the value of using that same work effort to add user requested features.”


The Quarter Million Dollar Query
Hardware is Cheap, Programmers are Expensive
Hardware is Expensive, Programmers are Cheap
Hardware is Expensive, Programmers are Cheap II

Friday, July 3, 2009

Cisco IOS hints and tricks: What went wrong: end-to-end ATM

I enjoy reading Ivan Pepelnjak's Cisco IOS hints and tricks blog. Having been a partner in a statewide ATM wide area network that implemented end-to-end RSVP, I find his thoughts on What went wrong: end-to-end ATM interesting.

I can't figure out how to leave a comment on his blog though, so I'll comment here:
I'd add a couple more reasons for ATM's failure.

(1) Cost. Host adapters, switches and router interfaces were more expensive. ATM adapters used more CPU, so larger routers were needed for a given bandwidth.

(2) Complexity, especially on the LAN side. (On a WAN, ATM isn't necessarily more complex than MPLS for a given functionality. It might even be simpler).

(3) 'Good enough' QoS on Ethernet and IP routing. Inferior to ATM? Yes. Good enough? Considering the cost and complexity of ATM, yes.

Ironically, core IP routers maintain a form of session state anyway (CEF).
On an ATM wide area network, H.323 video endpoints would connect to a gatekeeper and request a bandwidth allocation for a video call to another endpoint (384kbps, for example). The ATM network would provision a virtual circuit and guarantee the bandwidth and latency end to end. There was no 'best effort'. If bandwidth wasn't available, rather than allowing new calls to overrun the circuit and degrade existing calls, the new call attempt would fail. If a link failed, the circuit would get re-routed at layer 2, not layer 3. Rather than band-aid add-on QoS like DSCP and priority queuing, ATM provided reservations and guarantees.
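The admission-control idea at the heart of that model can be sketched in a few lines. This is a toy illustration with invented names and numbers - real ATM CAC reasons about cell rates, traffic contracts, and QoS classes, not a single kbps counter:

```python
class Link:
    """A link with hard bandwidth reservations: admit or reject, no best effort."""

    def __init__(self, capacity_kbps):
        self.capacity = capacity_kbps
        self.reserved = 0

    def request_call(self, kbps):
        # Admission control: reject the new call rather than degrade
        # the calls that already hold reservations.
        if self.reserved + kbps > self.capacity:
            return False          # call setup fails -- effectively a busy signal
        self.reserved += kbps
        return True

    def release_call(self, kbps):
        self.reserved -= kbps


# A 1 Mbps link fits two 384kbps video calls; the third is refused.
link = Link(1000)
first  = link.request_call(384)   # admitted
second = link.request_call(384)   # admitted
third  = link.request_call(384)   # rejected: 1152 > 1000
```

Contrast this with best-effort IP, where the third call would be admitted and all three would degrade together.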

It was a different way of thinking about the network.

Friday, June 19, 2009

No, I Don’t Want iTunes Installed. You can quit asking.

I don’t like software vendors that try to sneak software onto my computers. I really don’t like software vendors that don’t pay attention to my requests to not run in the background at startup.

This evening I came home and saw the Apple Software Update dialog popped up on my Vista desktop:


Problem one: iTunes is checkmarked by default. I don’t want iTunes. I don’t need iTunes. And I don’t like having software vendors try to sneak software onto my computers. This isn’t unique to Vista; Apple does the same thing on OS X. It’s annoying enough that I’ll probably uninstall QuickTime and throw away the $29 that I paid for it.

Problem two: I specifically instructed Apple’s QuickTime not to automatically update, and I specifically disabled the QuickTime service from running at startup, but somehow it ran anyway.


I’ve also checked the Software Explorer in Windows Defender and the ‘Run’ registry keys for Apple related startup programs & didn’t find any. I’d sure like to know what’s triggering the Apple updater so I can nuke it.

Something makes me think that the only way I’ll get rid of this malware infestation is to search and destroy all Apple related registry keys.

Friday, May 29, 2009

Availability & SLA’s – Simple Rules

From The Daily WTF, a story about an impossible availability/SLA conundrum that’s worth a read. It’s a good lead-in to a couple of my rules of thumb.

“If you add a nine to the availability requirement, you’ll add a zero to the price.”

In other words, to go from 99.9% to 99.99% (adding a nine to the availability requirement), you’ll increase the cost of the project by a factor of 10 (adding a zero to the cost).

There is a certain symmetry to this. Assume that it’ll cost $20,000 to build the system to support three nines; then:

99.9%  = $20,000
99.99% = $200,000
99.999% = $2,000,000

The other rule of thumb that this brings up is

Each technology in the stack must be designed for one nine more than the overall system availability.

This one is simple in concept. If the whole system must have three nines, then each technology in the stack (DNS, WAN, firewalls, load balancers, switches, routers, servers, databases, storage, power, cooling, etc.) must be designed for four nines. Why? ‘cause your stack has about 10 technologies in a serial dependency chain, and each one of them contributes to the overall MTBF/MTTR. Of course you can over-design some layers of the stack and ‘reserve’ some outage time for other layers of the stack, but in the end, it all has to add up.
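The arithmetic behind that rule: availabilities in a serial dependency chain multiply, so ten technologies each designed to four nines come out to roughly three nines overall. A quick check:

```python
def serial_availability(component, count):
    """Overall availability of `count` equal components in a serial dependency chain.

    Any one component failing takes the whole system down, so the
    availabilities multiply.
    """
    return component ** count


# Ten technologies in the stack, each designed for four nines (99.99%).
overall = serial_availability(0.9999, 10)
print(f"{overall:.4%}")   # roughly 99.90% -- three nines overall
```

Which is exactly why each layer has to be designed one nine better than the SLA you sign for the whole stack.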

Obviously these are really, really, really rough estimates, but for a simple rule of thumb to use to get business units and IT work groups thinking about the cost and complexity of providing high availability, it’s close enough. When it comes time to sign the SLA, you will have to have real numbers.

Via The Networker Blog

More thoughts on availability, MTTR and MTBF:

NAC or Rootkit - How Would I Know?

I show up for a meeting, flip open my netbook and start looking around for a wireless connection. The meeting host suggests an SSID. I attach to the network and get directed to a captive portal with an ‘I agree’ button. I press the magic button and get a security warning dialogue.

It looks like the network is NAC’d. You can’t tell that from the dialogue though. ‘Impulse Point LLC’ could be a NAC vendor or a malware vendor. How would I know? If I were running a rogue access point and wanted to install a root kit, what would it take to get people to run the installer? Probably not much. We encourage users to ignore security warnings.

Anyway – it was amusing. After I switched to my admin account, installed the ‘root kit’ service agent, and switched back to my normal user, I got blocked anyway. I’m running Windows 7 RC without anti-virus. I guess NAC did what it was supposed to do: it kept my anti-virus-free computer off the network.

I’d like someone to build a shim that fakes NAC into thinking I’ve got AV installed. That’d be useful.

Thursday, May 28, 2009

Consulting Fail, or How to Get Removed from my Address Book

Here are some things that consultants do that annoy me.

Some consultants brag about who is backing their company or whom they claim as their customers. I’ve never figured that rich people are any smarter than poor people, so I’m not impressed by consultants who brag about who is backing them or who founded their company. Recent Ponzi and hedge fund implosions confirm my thinking. And it seems like the really smart people who invented technology 1.0 and made a billion are not reliably repeating their success with technology 2.0. It happens, but not predictably, so mentioning that [insert famous web 1.0 person here] founded or is backing your company is a waste of a slide IMHO.

I’m also not impressed by consultants who list [insert Fortune 500 here] as their clients. Perhaps [insert Fortune 500 here] has a world class IT operation and the consultant was instrumental in making them world class. Perhaps not. I have no way of knowing. It’s possible that some tiny corner of [insert Fortune 500 here] hired them to do [insert tiny project here] and they screwed it up, but that’s all they needed to brag about how they have [insert Fortune 500 here] as their customer and add another logo to their power point.

I’m really unimpressed when consultants tell me that they are the only ones competent enough to solve my problems, or that I’m not competent enough to solve my own. One consulting house tried that on me years ago, claiming that firewalling fifty campuses was beyond the capability of ordinary mortals, and that if we did it ourselves, we’d botch it up. That got them a lifetime ban from my address book. They didn’t know that we had already ACL’d fifty campuses, that inserting a firewall in line with a router was a trivial network problem, that converting the router ACLs to firewall rules was scriptable, and that I had already written the script.

I’ve also had consultants ‘accidentally’ show me ‘secret’ topologies for the security perimeters of [insert Fortune 500 here] on their conference room whiteboard. Either they were incompetent for disclosing customer information to a third party, or they drew up a bogus whiteboard to try to impress me. Either way, I’m not impressed. Another lifetime ban.

Consultants who attempt to implement technologies, projects, or processes that the organization can’t support or maintain are another annoyance. I’ve seen people come in and try to implement processes or technologies that, although they might be what the book says or what everyone else is doing, aren’t going to fit the organization, for whatever reason. If the organization can’t manage the project, application, or technology after the consultant leaves, a perceptive consultant will steer the client toward a solution that is manageable and maintainable. In some cases, the consultant obtained the necessary perception only after significant effort on my part with the verbal equivalent of a blunt object.

Recent experiences with a SaaS vendor annoyed me pretty badly when they insisted on pointing out how great their whole suite of products integrates, even after I repeatedly and clearly told them I was only interested in one small product, and that they were on site to tell me about that product and nothing else. “I want to integrate your CMDB with MY existing management infrastructure, not YOUR whole suite. Next slide please. <dammit!>”. Then it went downhill. I asked them what protocols they use to integrate their product with the other products in their suite. The reply: a VPN. Technically they weren’t consultants though; they were pre-sales.

That’s not to say that I’m anti-consultant. I’ve seen many very competent consultants who have done excellent work. At times I’ve been extremely impressed.

Obviously I’ve also been disappointed.