Thursday, January 29, 2009

Rogue Sysadmin Sabotage Attempt

 

A terminated system admin attempted massive data sabotage with a script that, had it run, would have wiped out all disks on all servers.

"If this script were executed, the total damage would include cleaning out and restoring all 4,000 ABC servers, restoring and securing the automation of mortgages, and restoring all data that was erased."

It was detected before it could execute.

Think about your conversion from tape-based backups to disk-based backups. Would that script have wiped out the disk pools that store your most recent backups?

Unless you have clear separation of duties and rights between the system admins who support production and those who administer the backup software and servers, this is a tough risk to mitigate. I'll bet that most shops use the same system admins for both the production servers and the backup infrastructure. If so, a rogue 'wipe all' script would take out the disk-based backups as well.

That’d be a bad day.

(Via Security Circus).

Hardware is Expensive, Programmers are Cheap II

In response to Hardware is Cheap, Programmers are Expensive at Coding Horror:

The million dollar question: What’s wrong with this picture?

[Image: cpu-day]

20% CPU utilization, that’s what’s wrong. It’s way too low.

The hardware that's running at 20% on a busy day is 32 cores of IBM's finest x3950-series servers and a bunch of terabytes of IBM DS4800 storage. The application has three such systems (active/passive plus remote DR) at a total cost of about $1.5 million. That's right: $1.5 million in database hardware running at less than 20% CPU utilization on a normal day and barely 30% on the busiest day of the year.

How did that happen?

Because a software vendor, run by programmers, decided that programmers were too expensive to spend on designing an efficient, optimized application. Instead they spent their precious and valuable time adding shiny new features. So the customer had no choice but to buy hardware. Lots of it. Then, after the hardware was bought, the software vendor figured out that it actually could write an efficient, optimized application, and that its customers couldn't buy enough hardware to compensate for the poor programming.

Too late though. The hardware was already bought.

The app in question was delivered with a whole series of performance-limiting design and coding flaws. The worst of them:

  • No session caching combined with a bug that forced two database updates to the same session state table for each session state change (several hundred updates/second and a really, really nasty page latch issue)
  • Broken connection pooling caused by poor application design, forcing app servers to log in & out of the database server several hundred times per second.
  • Session variables not cached, forcing database round trips for user information like language, home page customizations, background colors, etc., once per component per web page. Thousands per second.
  • Failure to properly parameterize SQL calls, forcing hundreds of recompilations per second of the same damned friggin' query, and of course filling the procedure cache with a nearly infinite set of query/parameter combinations (see the sketch after this list).
  • Poorly designed on-screen widgets and components, some of which used up 30% of 32 database cores all by themselves.
  • A design that prevents anything resembling horizontally scaled databases.
  • (The whole list wouldn't fit in a blog post, so I'll quit here…)
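
To make the recompilation item concrete, here is a minimal sketch of the difference, using Python and pyodbc against a hypothetical user_prefs table (the table, column and DSN names are my own illustration, not the vendor's code): string-built SQL forces a fresh compile and a new plan-cache entry for every distinct value, while a parameterized call lets SQL Server reuse a single cached plan.

```python
import pyodbc  # assumed driver; any DB-API library with parameter markers works the same way

conn = pyodbc.connect("DSN=appdb")   # hypothetical DSN, for illustration only
cur = conn.cursor()
user_id = 42

# Anti-pattern: every distinct user_id yields a different SQL string, so the
# server compiles it again and caches yet another ad-hoc plan.
cur.execute(
    f"SELECT language, home_page FROM user_prefs WHERE user_id = {user_id}"
)

# Parameterized: the SQL text is constant, so one plan is compiled once and
# reused for every call, no matter what value user_id takes.
cur.execute(
    "SELECT language, home_page FROM user_prefs WHERE user_id = ?", user_id
)
row = cur.fetchone()
```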

After suffering nasty performance problems and system outages, after spending tens of thousands of dollars on consulting and tens of thousands more on Tier 3 Microsoft support, and after discovering the above flaws and reporting them to the software vendor, the customer was advised to buy more hardware. Lots of it.

The database server growth went something like this:

  • 8 CPUs, 100% busy, 100% growth per year, 18-month life
  • 16 CPUs, 80% busy, 50% growth per year, 6-month life
  • 32 cores (16 dual-core sockets), 50% busy, 30% growth per year

Throw in Microsoft database licenses, Windows Datacenter Edition software and support, and IBM storage (because HP wouldn't support Datacenter Edition on an EVA), and it's not hard to see seven figures getting thrown at the production database cluster and failover servers. And we shouldn't forget the added expense of power, cooling and floor space in the datacenter.

Fast-forward a few years, a handful of application code upgrades, and a million and a half hardware dollars later:

  • Beautifully designed session and user-variable caching, intelligent enough to cache only what it needs and to use the database only when it has to (a sketch of the idea follows this list).
  • Fully optimized widgets.
  • Minimal SQL recompilations.
  • An optimized data model.
  • An efficient, well running application.
  • A pleasure to host.
  • And 30% peak CPU.
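
For contrast with the original no-caching design, here is a minimal read-through cache sketch of the kind of session and user-variable caching described above. The function, table and column names are placeholders of my own, not the vendor's code, and `db` is assumed to be a DB-API-style cursor.

```python
import time

_CACHE: dict[int, tuple[float, dict]] = {}
_TTL_SECONDS = 300   # assumption: five minutes is fresh enough for display preferences


def get_user_prefs(user_id: int, db) -> dict:
    """Read-through cache: serve from memory, hit the database only on a miss."""
    now = time.time()
    hit = _CACHE.get(user_id)
    if hit and now - hit[0] < _TTL_SECONDS:
        return hit[1]                      # cache hit: no database round trip

    # Cache miss: one round trip, then remember the result for later requests.
    db.execute(
        "SELECT language, home_page, background FROM user_prefs WHERE user_id = ?",
        user_id,
    )
    language, home_page, background = db.fetchone()
    prefs = {"language": language, "home_page": home_page, "background": background}
    _CACHE[user_id] = (now, prefs)
    return prefs
```

The point is simply that the database sees a round trip only on a miss; everything else is served from memory.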

Had the app been as efficient three years ago as it is today, I estimate that about half of what was spent on hardware and related licensing and support would not have been necessary. They would not have had to buy Datacenter Edition when they did, if at all. The existing EVAs would have been supported, eliminating the need to buy IBM storage. Overall support and licensing costs would have been much lower, but more importantly, the customer would have been riding the downhill side of Moore's law instead of climbing uphill against it. Realistically, they still would have bought hardware, but they would have bought it later and gotten faster bits for fewer dollars.

If each of the optimizations and bug fixes that the software vendor applied over the last four years of upgrades had been available just six months earlier, the customer still would have saved a pile of money. That six-month acceleration probably would have been enough to let them wait for dual-core processors instead of buying single-cores and upgrading to dual-cores six months later, and those dual-cores would have lasted until quad-cores came out. That would have let the customer stick with eight-socket boxes and save a programmer's salary worth of licensing and operating-system costs.

What’s the worst part of all this?

There are lots of 'worst' parts of this.

  • More than one customer had to burn seven figures compensating for the poor application. There is at least one more customer at the same scale that made the same hardware decisions.

And:

  • The customer detected and advised the vendor of potential solutions to most of the above problems. The vendor’s development staff insisted that there were no significant design issues.

And:

  • The vendor really didn’t give a rat’s ass about efficiency until they started hosting large customers themselves. When they figured out that hardware was expensive, their programmers suddenly were cheap enough to waste on optimization.

And:

  • The dollars burned are not the customers'. They are yours. Taxes and tuition paid for it all.

Hardware is expensive, Programmers are cheap.

In this case, a couple of customers burned something like 10 programmer-years worth of salary on unnecessary hardware, when the cost to optimize software was clearly an order of magnitude lower than the cost to compensate with hardware.

To be fair though, I’ll post another example of a case where hardware was cheaper than programmers.


Related:

Hardware is Expensive, Programmers are Cheap

Hardware is Cheap, Programmers are Expensive

Sunday, January 18, 2009

A Simple Solution, Well Executed

I’m trying out a new mantra:
All other things being equal, a simple solution, well executed, is superior to a complex solution, poorly executed.
Since data destruction[1] discussions seem to have resurfaced, I'll try it out on that topic.

In Imperfect, but Still Useful, Jim Graves writes:
“Almost any method of data destruction is so much better than nothing that any differences between methods are usually insignificant.”[2]
[Image: Drive-Destruction]

As Jim indicates, choosing a technical method or algorithm for destroying data shouldn't be the problem we spend significant resources solving. Ensuring that all media gets processed by some destruction method is the problem that needs solving. In other words, it is critical that all data on all media is destroyed and that no media bypasses the destruction process. That is a different problem, requiring a different solution and a different skill set, than determining the best method of destroying data. The problem to solve is completeness of coverage, not completeness of destruction. It's a process problem, not a technology problem.
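
To make the completeness-of-coverage point concrete, here is a rough sketch (my own illustration, not our actual tooling) of the kind of reconciliation that matters far more than the wipe algorithm: compare the inventory of decommissioned drives against the destruction log and flag anything that never made it through the process. The file names and column names are hypothetical.

```python
import csv


def unprocessed_media(inventory_csv: str, destruction_log_csv: str) -> set[str]:
    """Return serial numbers of decommissioned drives with no destruction record."""
    with open(inventory_csv, newline="") as f:
        decommissioned = {row["serial"] for row in csv.DictReader(f)}
    with open(destruction_log_csv, newline="") as f:
        destroyed = {row["serial"] for row in csv.DictReader(f)}
    return decommissioned - destroyed


# Any serial in this set is a drive that bypassed the destruction process.
missing = unprocessed_media("decommissioned.csv", "destruction_log.csv")
print(sorted(missing))
```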

In 2002 our organization had an HGE (Headline Generating Event) related to improper disposal of media. The reaction from the technical people (me) was to research the effectiveness of various media deletion techniques, which inevitably went down the path of data remanence and magnetic force microscopes. It was pretty obvious at the time that the problem wasn't determining the correct number of wipes to subject our media to, but rather ensuring that all media get some form of destruction, even if that destruction isn't perfect. We didn't need to make the data unrecoverable by all known technology. We needed to make the data significantly more expensive to recover than it was worth, and we needed to make sure that of the tens of thousands of disks we disposed of every year, as few as possible leaked through the destruction process un-destroyed.

In my opinion the real problem (completeness of coverage) needed to be addressed by making the data destruction process as simple as possible, thereby increasing the probability that the process would actually be executed on all media. That isn't a technical problem; it's a process and people problem. I used to work on an assembly line. For any process involving humans and repetition, simple is good, and the process the person follows must be person-proof. In this case, a simple, person-proof process was what was needed.

Unfortunately, our internal legal staff was driving the bus, and they focused on the technical questions of numbers of passes and zeros versus ones. Attempts to steer them toward a simple destruction process with a low probability of being bypassed were not successful, nor were attempts to weigh the value of the data against the effort required to recover it. We ended up with a complex, time-consuming process that ensures the media that went through it is unrecoverable, but does little to ensure that no media escaped the process.

In a related discussion at Black Fist Security, the principal of the blog writes:
“What if you had one person wipe the drive with all zeros. Then have a second person run a script that randomly checks a representative sample of the disk to see if it finds anything that isn't a zero.”[3]
That's the kind of person-and-process thinking that can solve security problems. It's simple (one pass plus a sampling check) and has a good chance of being well executed on any media that goes through the process. A rough sketch of that sampling check follows.
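
This is only a sketch of the sampling idea, assuming a Unix-style raw device path and an arbitrarily chosen sample count; it reads random blocks from the wiped drive and reports whether any byte is nonzero.

```python
import os
import random


def sample_is_clean(device: str, samples: int = 1000, block: int = 4096) -> bool:
    """Spot-check a wiped drive: read random blocks and look for any nonzero byte."""
    fd = os.open(device, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)          # device size in bytes
        for _ in range(samples):
            offset = random.randrange(0, max(size - block, 1))
            os.lseek(fd, offset, os.SEEK_SET)
            if any(os.read(fd, block)):               # any nonzero byte means the wipe missed something
                return False
        return True
    finally:
        os.close(fd)


# Example (run with sufficient privileges against an already-wiped drive):
# print(sample_is_clean("/dev/sdb"))
```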


References:


[1] Single drive wipe protects data, research finds, SecurityFocus
[2] Imperfect, but Still Useful, Jim Graves, Graves Concerns
[3] Fear and Terror! All your data are being stolen!, Black Fist Security

Sunday, January 11, 2009

Windows 7 – Looks good so far

I bit the bullet and installed the new beta. Here are my first-day impressions, written using Live Writer on Windows 7.

It's an easy install. After freeing up and formatting a 20GB partition, I downloaded a 2.5GB ISO, burned it, mounted it and ran setup. A handful of questions and a couple of reboots later, it was up and running. Fast, easy and painless compared to a typical new-computer first-time setup.

No crapware needed. I have a functional computer with no add-on or third-party drivers. I gave up on Linux as a desktop years ago simply because of the hassle of managing hardware compatibility, and my experience with Vista is that removing non-essential vendor-provided crud helps performance and memory utilization. Win 7's first impression is that 1GB of memory will be adequate as long as I don't have to load up on third-party drivers and utilities, and so far I haven't needed to.

It's fast. I dual-boot Vista on this computer (a cheap 1.6GHz T2060 dual-core notebook with 1GB RAM) and there is no comparison between the two. Either it's faster or it's got me fooled into thinking it's faster. Either way, Win 7 wins. Vista on the same hardware is like a two-year-old on the potty. A simple thing, like opening a Vista control panel app, seems like a big production, with lots of effort and whining. "I donwaannaaa poop." "You can do it, just try it." "I caaann't poooop." "Sure you can, squeeze harder." "Waaaa... I caaann't." Eventually a turd the size of a dime comes out. "Yay! Good job!" And the Vista control panel app finally opens. You feel like you need to thank it and give it a cookie for being so good.

Bluetooth A2DP works, but I haven't been able to pair with my phone. Tethering is a must-have, and ideally I won't have to add a third-party Bluetooth stack to get it.

The user interface is cleaned up. It still has some non-intuitive spots, but it's a step up from Vista. I've gotten a lot farther with this UI in the first day than I did with either Vista or OS X. Part of that is because I'm used to Vista, so it's an easy transition, but some of it is because this UI is genuinely more intuitive.

The Resource Monitor is a step up from Vista's and now includes netstat-equivalent functionality, including TCP sockets and ports per process.

[Image: ResourceMonitor-Network]

It’s got a nifty and simple memory usage graph. You can see that after a fresh install, I’ve got about half a gig used.

[Image: Win7-Memory-bar]

The Resource Monitor also includes a process explorer that shows open file handles and DLLs.

[Image: Win7-Process]

Microsoft is finally providing what I think is the minimum functionality for that type of tool. Mapping a process to its file handles and network sockets is essential for troubleshooting.
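
The Resource Monitor does this in the GUI; for a scripted equivalent, here is a rough sketch using the third-party psutil library (my choice for illustration, not a Windows built-in) that walks processes and prints their sockets and open files.

```python
import psutil

# For each process that has network connections, show its sockets and open files.
for proc in psutil.process_iter(["pid", "name"]):
    try:
        conns = proc.connections(kind="inet")
        if not conns:
            continue
        print(f"{proc.info['name']} (pid {proc.info['pid']})")
        for c in conns:
            print(f"  socket: {c.laddr} -> {c.raddr or '*'} [{c.status}]")
        for f in proc.open_files():
            print(f"  file:   {f.path}")
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue  # some processes require elevated privileges to inspect
```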

And, check this out, it's got a real shell:

[Image: Shell]

(OK, I had to add the Unix subsystem to get it, but it's there and it works.)

So far, it's a win.

Thursday, January 8, 2009

Pack Rat or Prudent Record Keeping?

I've kept paper copies of all financial transactions that I've had with banks, credit card companies, utilities and the like ever since I opened up my first checking account 30+ years ago, and to this day I maintain a paper trail of all electronic statements, electronic bill payments, canceled checks and pay stubs, filed away in boxes stuffed into attics and basements. I've always assumed that someday I'd need those records, but I never knew why.

Now I know why. The victims of Madoff's Ponzi scheme are being asked to provide records:
They are asked to provide their most recent account statements, and proof of wire transfers or canceled checks showing deposits "from as far back as you have documentation."   (Emphasis added)
The form also asks for all information regarding any withdrawals or payments received from Madoff.
Supplying such records could be nearly impossible for many longstanding clients, said Harry Susman, a partner at law firm Susman Godfrey LLP. "You've got people who were investing with Madoff for 20 years and didn't keep records," he said. 

I'll bet that the people with good records come out better than the ones with poor or no records.

In an age of electronic-everything, where is the 20 year record trail that a person needs in a case like this?

By the way - if anybody wants to track dollars, therms, kilowatt-hours or BTU per square foot per heating degree day for a couple houses in Minnesota over a 25 year period of time, I've got the data. I just need to get it all out of the attic and into a spreadsheet so I can play with it.
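
When I do, the normalization itself is one line of arithmetic. Here is a toy sketch; the therm, square-footage and heating-degree-day figures below are invented purely for illustration.

```python
BTU_PER_THERM = 100_000  # standard conversion: one therm is 100,000 BTU


def btu_per_sqft_per_hdd(therms: float, sqft: float, hdd: float) -> float:
    """Normalize heating energy use by house size and heating degree days."""
    return therms * BTU_PER_THERM / (sqft * hdd)


# Made-up January numbers: 180 therms, 1,800 sq ft, 1,500 heating degree days.
print(round(btu_per_sqft_per_hdd(180, 1800, 1500), 2))  # ~6.67 BTU/sqft/HDD
```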

Sunday, January 4, 2009

Hardware is Expensive, Programmers are Cheap

The case that Jeff Atwood attempts to make is basically that hardware is generally cheap enough that code optimization doesn't pay (or, don't bother optimizing code until after you've tried solving the problem with cheap hardware). I read and re-read the argument, and I'm convinced that in the general case it doesn't add up.

In my experience, the 'hardware is cheap, programmers are expensive' mantra really only applies to small, lightly used systems, where the fully loaded hardware cost actually is trivial compared to the cost of putting a programmer in a cube. Once the application is scaled to the point where adding a server or two no longer solves the scalability problem, or where database, middleware or virtualization licenses are involved, the cost of adding hardware is not trivial. The relative importance of software optimization versus more hardware then shifts toward smart, optimized software, and the cheap-hardware argument at Coding Horror quickly falls apart.

The comments at Coding Horror descended pretty quickly into the all-too-common 'if I had better monitoring I could write better code' nonsense, which of course misses the point of the post. Some of the commenters got it right, though:
“The initial cost of hardware (servers) is not the only cost, and - yes hardware is cheap, but is a drop in the proverbial bucket compared to the total cost of ownership” – JakeBrake
“Throwing more hardware at problems does not make them go away. It may delay them, but as an application scales either…you may get a combinatoric issue pop up outstripping you ability to add hardware…[or]…you just shift the problem to systems administration who have to pay to maintain the hardware. Someone pays either way.” – PJL
“Throwing hardware at a software problem has its place in smaller, locally hosted data facilities. When you're running in a hardened facility the leasing of space, power, etc. begins to hurt. One could argue the amount of time and labor necessary to design and implement a new server, along w/ the hardware costs, space, power -- and don't forget disk if you're running on a SAN (fibre channel disk isn't cheap!) -- can easily negate the time of a programmer to fix bad code.” – Jonathan Brown
The above comments correctly emphasize that the purchase price of a server is only a fraction of its cost. A fully loaded server cost must include space, power, cooling, a replacement server every 3-4 years, system management, security, hardware and software maintenance, and software licensing. And if the server needs a SAN attach, the Fibre Channel port costs can equal the server hardware cost. Some estimates (here and here) imply that the loaded power, space and cooling cost of a server roughly equals the cost of the server itself.
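
To put rough numbers on "fully loaded", here is a back-of-the-envelope sketch; every figure in it is invented for illustration and not taken from any real quote or from the estimates linked above.

```python
# All figures are hypothetical, per server, for illustration only.
purchase = 8_000                    # server hardware sticker price
power_cooling_space = 8_000         # roughly equal to the purchase price, per the estimates above
san_ports = 3_000                   # a pair of Fibre Channel ports, if SAN-attached
licenses_and_maintenance = 12_000   # OS, database, and hardware/software support over its life
admin_share = 10_000                # a slice of a sysadmin's time

loaded_cost = (purchase + power_cooling_space + san_ports
               + licenses_and_maintenance + admin_share)
print(loaded_cost)                  # 41_000: roughly five times the sticker price of the box
```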

Fortunately the hardware-is-cheap argument was promptly rebutted in a well-written post by David Berk:
“To add linear capacity improvements the organization starts to pay exponential costs. Compare this with exponential capacity improvements with linear programming optimization costs.”
In other words, time spent optimizing code pays back cost saving dividends continuously over the life of the application, with little or no additional ongoing costs. Money spent on hardware that only compensates for poorly written code costs money every day, and as the application grows, that cost rises exponentially.

That's basically where we're at with a couple of our applications. They are at the size/scale where doubling the hardware and the associated maintenance, power, cooling and database licenses will cost more than a small team of developers, and because of the inherent scalability limits in the design of these applications, the large capital outlay will at best result in minor capacity/scalability/performance improvements.

Adding to David Berk's response, I'd note that one should also consider greenhouse gases (a ton or two per server per year!) and database licensing costs (the list price for one CPU's worth of Oracle Enterprise plus Oracle RAC is close to a programmer's salary).

Another way of looking at it: well-written, properly optimized software pays for itself in hardware, datacenter, cooling and system-management costs across a broad range of scenarios, the exception being small, lightly used applications. For those, throw hardware at the problem and hope it goes away.