Last In - First Out: 2008

If it can browse the Internet, it cannot be secured

Tired of IE’s vulnerabilities?

You could switch to Firefox, but if you were honest, you’d have to admit that you still can’t declare yourself secure. Or you could try Opera, but then you’d have to manage critical patches also, though perhaps less frequently. There is nothing about Chrome or Safari that indicates that using them will make you secure. They may have fewer vulnerabilities, or it may be that fewer of their vulnerabilities have been discovered and published. You may be more vulnerable or less vulnerable by switching browsers, but you will still be vulnerable. Throw in cross platform vulnerabilities and the combined vulnerabilities of the various third party browser addons & the menu looks pretty bleak.

Frankly, as the threats from the Internet have evolved over the last decade or so, I’ve not seen a huge difference between the security profiles of the various browsers. Some have fewer vulnerabilities, some have more; some have an easier selection of somewhat more secure browsing modes, others are more difficult to configure reasonably securely. None, as far as I can tell, are bug free, hardened, or easily configurable in a manner that is sufficiently secure such that ordinary users can fearlessly browse the Internet. There are differences between the browsers, and I have a strong preference for one browser, but fundamentally the choices are only that of relative security, not absolute security. The most popular browser likely has the most problems, but it also is the biggest target. When or if a less used browser that currently appears to be more secure ends up the most widely distributed browser, it’s pretty safe to assume that it will be targeted and it will get hit, and the results will be more or less the same.

Even if you could build a perfectly secure browser, you still have the infamous simian keyboard-chair interface that will routinely click on the banner ad that installs malicious fake ‘security’ software or stumble upon widely distributed malicious content. I don’t think it is possible to secure that particular interface using current technology.

My conclusion is simple:

If it can browse the Internet, it cannot be secured.

Start with that premise. The security model that you begin to derive is significantly different than where we are today.

Startups and Early Adopting Customers

Any product, no matter whose it is, will only meet a fraction of your needs. Working with early startups gives you the ability to influence a product early in its life cycle and increase that fraction. You get to nudge a product in a direction that matters to you, while the startup gets unvarnished, raw, but valuable product feedback.

In a recent post on Security for All, Joseph Webster describes the risk to innovation that startups face when transitioning to established corporations. From the point of view of a customer of startups, the transition that the startup needs to make is also interesting:

“In a small startup everyone is intimately familiar with the customers, whereas large corporations have to make concerted efforts to allow a design engineer to even have marginal contact with a customer - and that’s usually second hand through either a sales or marketing initiative.^[1]”

As a customer, I’ve seen both ends of the spectrum. My team was one of the early customers of LogLogic back when the founder, VP, and the guy who came on site and racked up the appliance were one and the same. We had pretty good response from them on bug & feature requests, sometimes overnight, at least until they hooked up with some vastly larger customers. Once they hooked up with large customers, the startup correctly judged that focusing on the large customers would keep them in business.

On the big corporation side, Microsoft has a free customer-developer program called ‘Frontline’, where Microsoft’s own engineers and developers come on site to see how their customers are using their products. This gives individual developers & engineers access to customers in an informal one-on-one setting, where in theory each can learn from the other. We’ve had a couple of Microsoft SQL database developers on site as a part of that program and are planning on doing it again. The feedback loop in a program like that isn’t anywhere near as direct though.

There are also times that for some reason the early customers of a startup don’t seem to be able to influence the product. We are one of the earliest and largest customers of a small software company. It is surprising how little influence we seem to have on moving critical features and functionality forward within that company, even when we team up with the rest of the large customers and present a unified, prioritized feature request list. They are a small and agile startup, so they should be able to do much better than they are.

Many of us are big enough that we probably can influence a small startup if we get connected up early enough, but we are small enough that we’ll lose our influence once the startup catches a big customer, hires executives and moves to Silicon Valley. At some point in time, as Webster indicates, the lead engineer no longer calls on customers directly and at best can only focus on a few of the largest, if any. The feedback loop from customer to engineer to product isn’t simple and direct anymore, and the ability to move the product in a direction that solves your problems declines.

The transition point is about when you start getting sales calls from someone you’ve never met and who can’t pronounce your name.

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:

In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual servers), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the provider’s network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These differences make certain failure modes much more difficult to design around.

Janke’s Official 2009 Technology Predictions

I’ll take Anton’s bait.

Here they are:

Prediction 1: The rate of adoption of IPv6 will greatly accelerate. Estimates of the final shutdown date for the last v4 global route will be moved up from ‘when hell freezes over’ to ‘long after I’m retired’, placing the problem right next to the Year 2038 Unix timestamp problem on CTOs’ priority lists.

Prediction 2: Gadget freaks will continue to search for the holy grail of multifunction all-in-one gadgets. They will continue to be disappointed.

Prediction 3: Apple will announce a new product. The product will generate a media frenzy. Apple fans will crash servers looking for the latest product leaks or fuzzy prototype pics, and arguing via blog comments the merits of the features the product may or may not have. Unfortunately, the product will be missing cut and paste.

Prediction 4: Hardware and network vendors will continue making faster and cheaper bits at a rate that matches Moore's law. Software will continue to bloat at a rate just slightly faster than Moore's law, ensuring that state-of-the-art software running on new hardware will be slightly slower than last year.

Prediction 5: Disks will double in capacity. The average file size will double. The number of files stored will also double. All hard drives on the planet will continue to be 95% full. No progress will be made toward identifying the owner, data classification, or destruction date of the files.

Prediction 6: There will be a major security panic over some widely used but inherently insecure Internet protocol. The problem will not get resolved.

Prediction 7: Touch screen devices will continue to collect fingerprints.

Prediction 8: Sun Microsystems will rename two of their core technologies, ensuring that their loyal customers will remain confused.

Prediction 9: Web Apps will continue to be deployed with a 1:1 ratio of new web applications to applications that are vulnerable to SQL injection, XSS or XSRF. A few new applications will not be vulnerable. The rest will make up for those few with multiple vulnerabilities, keeping the overall ratio constant.

Prediction 10: Virtualization will explode, replacing hundreds of thousands of real servers with virtual servers. Unfortunately, the number of virtual servers will grow so fast that the number of physical servers will not decrease, and all datacenters on the planet will continue to have cooling and power problems.

Prediction 11: Endless e-mail threads will continue to replace mindless meetings as the preferred venue for designing, building and maintaining complex systems. After-hours meetings at local brew pubs will continue to be the actual venue for designing, building and maintaining complex systems.

And – For the bonus prediction – Someone, somewhere will figure out how to define cloud computing. The rest of us will argue over the definition for at least another year.

Notice how I didn’t stick my neck out on any of these predictions?

The Power Consumption of Home Electronics

I learned something last week. Xbox and PlayStation Game consoles are pathetically bad at energy consumption. The Wii doesn’t suck (power) quite as badly.

The Data:

The Natural Resources Defense Council did an interesting study^[1] of game consoles and attempted to estimate annual energy usage and cost.

The good part:

Ouch. Unlike half watt wall warts, a hundred and some odd watts might actually show up on your monthly electric bill. And from what NRDC can tell, the game consoles are not real good at powering themselves off when unused, which makes the problem worse.

This is really discouraging. The idea that energy consuming devices should automatically drop themselves down into a low-power state when idle isn’t new, yet we continue to build (and buy) devices with poor power management. I suspect that part of the problem is that there isn’t sufficient information available to consumers at the time of purchase to make a rational ‘green’ decision. Unlike refrigerators, clothes washers, and automobiles (here in the USA), energy consumption isn’t part of the marketing propaganda of most home electronics.

It should be.

Someday smart retailers will figure out how to market energy costs on home electronics, much like they already do for large home appliances. For my last clothes washer/dryer (tumbler, to those on the wrong side of the pond) the sales dude tried to push me up to a higher cost model based on features. When I explained that for clothing related appliances, my feature requirements were a step above a rock in a river, he wisely and quickly pointed me to an expensive but efficient washer & dryer model.

Sold.

As for the report as a whole, I’m skeptical of the annual gross energy costs and savings shown in the report, mostly because the estimates are highly dependent on user actions. I suspect that we really don’t know how many game consoles are left on continuously versus powered down after each use, and more importantly, the NRDC doesn’t consider the cost of cooling the heat generated by the consoles in those parts of the country where air conditioning normally is used.

So if you are like my neighbors and you leave your air conditioner running all summer, your summertime gaming costs will be much higher. The hundred plus watts of heat needs more than a hundred plus watts of cooling. But if like me, you live in a climate where heating is the norm for more than half the year, the waste heat generated by the console gets subtracted from the heat that your furnace needs to supply, making the cost of gaming somewhat less.

In any case, don’t sweat the wall warts. Look around for things that suck up a hundred or more watts and unplug those.

12/17/2010: Scientific American published a similar article.

^[1]Lowering the Cost of Play, Natural Resources Defense Council

Amusing log messages

I give Cisco credit for fully documenting firewall log messages. In theory this gives users the ability to set up a system for catching interesting log messages and ignoring uninteresting messages. More vendors should be so bold as to actually acknowledge that their products log messages, and that those messages need to be documented.

This level of disclosure has an interesting side effect. I'm not sure what I'd do if one of our ASA's logged this error:

Error Message %ASA-2-716515:internal error in: function: OCCAM failed to allocate memory for AK47 instance

Explanation The OCCAM failed to allocate memory for the AK47 instance.

Or this error:

Error Message %ASA-2-716508: internal error in: function: Fiber scheduler is scheduling rotten fiber. Cannot continuing terminating

Explanation The fiber scheduler is scheduling rotten fiber, so it cannot continue terminating.

Fiber rot?

An AK47 instance?

No doubt those messages mean something to someone at the TAC. For the rest of us, they are mostly just amusing.
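The "catch the interesting, ignore the uninteresting" idea could be sketched roughly like this. It is only an illustration: it assumes the `%ASA-<severity>-<message id>` tag shape seen in the two examples above (one severity digit, six-digit ID), and the `interesting` policy, its threshold and its ignore list are hypothetical, not anything Cisco ships.

```python
import re

# Matches the "%ASA-<severity>-<message id>" tag seen in the examples above.
# Assumption: one severity digit and a six-digit message ID.
ASA_TAG = re.compile(r"%ASA-(\d)-(\d{6})")

def parse_asa(line):
    """Return (severity, message_id), or None if the line has no ASA tag."""
    m = ASA_TAG.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def interesting(line, max_severity=3, ignore_ids=frozenset()):
    """Hypothetical policy: keep severe messages unless explicitly ignored.

    Lower severity numbers are more severe, as in syslog.
    """
    parsed = parse_asa(line)
    if parsed is None:
        return False
    severity, msg_id = parsed
    return severity <= max_severity and msg_id not in ignore_ids

sample = ("%ASA-2-716515:internal error in: function: "
          "OCCAM failed to allocate memory for AK47 instance")
print(parse_asa(sample), interesting(sample))
```

Because the message IDs are documented, the ignore list can be built from the vendor's message guide rather than by guessing at free-text patterns.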

The Cloud – Outsourcing Moved up the Stack

Why is outsourcing to ‘the cloud’ any different than what we’ve been doing for years?
The answer: It isn’t.
We’ve been outsourcing critical infrastructure to cloud providers for decades. This isn’t a new paradigm, it’s not a huge change in the way we are deploying technology. It’s pretty much the same thing we’ve always been doing. It’s just moved up the technology stack.
We’ve been outsourcing layer 1 forever (WAN circuits), layer 2 for a couple decades (frame relay, ATM, MPLS), and sometimes even layer 3 (IP routing, VPNs) to cloud providers. Now we have something new – outsourcing layers 4 through 7 to a cloud provider.

So we are scratching our heads trying to figure out what this ‘new’ cloud should look like, how to fit our apps into a cloud and what the cloud means^[1] for security, availability and performance. Heck we’re not even sure how to patch the cloud^[2], or even who is responsible for patching a cloud.
I’ll argue that outsourcing CPU, database or storage to a server/application cloud isn’t fundamentally different than outsourcing transport to an MPLS cloud, as practically everyone with a large footprint is already doing. In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual servers), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider.
What would happen if we took a look at the parts of our infrastructure that are already outsourced to a cloud provider and saw whether we can apply lessons learned from layers 1 through 3 to the rest of the stack?

Lesson 1: The provider matters. We use both expensive Tier 1 providers and cheap local transport providers for a reason. We have expectations of our providers and we have SLA’s that cover among other things, availability, management, reporting, monitoring, incident handling and contract dispute resolution. When a provider fails to live up to SLA’s, we find another provider (See Lesson 3). If we’ve picked the right provider, we don’t worry about their patch process. They have an obligation to maintain a secure, available, reliable service, and when they don’t, we have means to redress the issue.

Lesson 2: Design for failure. We provision multiple Tier 1 ISP’s to multiple network cores for a reason. The core is spread out over 4 sites in two cities for a reason. We have multiple providers, multiple paths of 10 Gig’s between the cores for a reason. We use two local providers to each hub for a reason. The reason is – guess what – sh!t happens! But we know that if we do it correctly, we can lose a 10 Gig connection to a Tier 1 and nobody will know, because we designed for failure. And when we decide to cut costs and use a low cost provider or skimp on redundancy, we accept the increased risk of failure, and presumably have a plan for dealing with it.
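A back-of-the-envelope sketch of why the second provider is worth paying for. It assumes the two paths fail independently, which is exactly the assumption a shared conduit or a shared power feed breaks, and the 99.9% per-provider figure is purely illustrative.

```python
def parallel_availability(*availabilities):
    """Availability of redundant paths, assuming failures are independent."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

HOURS_PER_YEAR = 8760

# Illustrative figure only: 99.9% per provider is an assumption, not a measurement.
single = 0.999
dual = parallel_availability(single, single)

print(f"one provider:  {(1 - single) * HOURS_PER_YEAR:.2f} hours down/year")
print(f"two providers: {(1 - dual) * HOURS_PER_YEAR:.4f} hours down/year")
```

The same function works for the low cost option: plug in a worse availability for the cheap provider and see how much risk the savings are buying.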

Lesson 3: Deploy a standard technology. We don’t care if our MPLS providers use Juniper, Cisco or Extreme for layer 2 transport, because it doesn’t matter. We don’t deploy vendor specific technology, we deploy standardized interoperable technology. We all agree on what a GigE handoff with jumbo MTU’s and single mode long reach lasers looks like. It’s the same everywhere. We can bring in a new ISP or backbone transport provider, run the new one in parallel to the old, seamlessly cut over to the new, and not even tell our customers.

What parallels can we draw as we move the cloud up the stack?

My provider doesn’t prioritize my traffic (CPU, memory, disk I/O): Pay them for QoS. Priority bits (CPU cycles, I/O’s) cost more than ‘best effort’ bits (CPU cycles, I/O’s). They always have and always will.

My provider doesn’t provide reliable transport (CPU, Memory, Operating Systems, App Servers, Databases): Get a Tier 1 network provider (cloud provider), or get two network providers (cloud providers) and run them in parallel.

My provider might not have enough capacity: Contract for burst network (CPU, I/O) capacity. Contract and pay for the ability to determine which bits (apps) get dropped when oversubscribed. Monitor trends and anticipate growth and load, and add capacity proactively.

My provider might go bankrupt or have catastrophic failure of some sort: You’ve got a plan for that, right? They call it a backup network provider (cloud host). And your app is platform and technology neutral so you can seamlessly move your app to the new provider, right?

My provider might not have a secure network (Operating System, Database): Well, you’ll just have to encrypt your traffic (database) and harden your edge devices (applications) against the possibility that the provider isn’t secure.

Instead of looking back at what we are already doing and learning from what we’ve already done, we are acting like this is something totally new. It isn’t totally new.
It’s just moved up the stack.
The real question: Can the new top of stack cloud providers match the security, availability and reliability of the old layer 1-2-3 providers?

^[1]Techbuddha, Cloud Computing, the Good, The Bad, and the Cloudy, Williams
^[2] Rational Survivability, Patching The Cloud?, Hoff

The Patch Cycle

The patch cycle starts again, this time with a bit of urgency. A 'patch now' recommendation has hit the streets for what seems to be an interesting Windows RPC bug.

What does 'patch now' mean this time? Hopefully it means a planned, measured and tested patch deployment, but at an accelerated schedule.

It's a Microsoft patch, and that's a good thing. The routine of monthly Microsoft security patches has been honed to a fine art in most places, making Windows OS patches by far the simplest and most trouble free of the platforms that we manage. This one appears to be no exception, at least so far.

Just for grins I drew up a picture of what a typical Microsoft Windows patch cycle looks like. The patch kits show up once a month. Most months have at least one 'important' patch, so most monthly patches get applied. Life is easier if you can fit the patch cycle into a one month window, just because the probability of missing a patch or patching out of order is greatly reduced, even if the WSUS toolkit simplifies the process to the point where it's pretty foolproof.
The Microsoft Windows patch cycle typically looks something like this:

It's more or less a month long cycle that sometimes drags out to more than a month, and occasionally even drags on far enough that we roll two months into one. We deviate from the linear plan somewhat, because we have servers that manage the infrastructure that we patch sooner, and we have less critical applications that we patch early, leaving the most critical applications for last. There are also obnoxious, clueless application vendors that don't support patched servers, so some of those get held back also.

Once a year or so, a critical vulnerability shows up. In an extreme case, the patch cycle follows pretty much the same path, but with an accelerated time line, something like this:

That's a fast time line, but in Windows-land the process is practiced often enough that even an accelerated time line is fairly low risk. In this strange world practice makes perfect, and nobody has more practice at patching than Windows sysadmins.

Compare this to another platform, one without the well honed, routine, trouble free patching system that Microsoft has developed.

There are a whole bunch of those to choose from, so let's randomly pick Oracle, just for the heck of it. Here's what a typical Oracle patch time line looks like:

Can you see the difference?

Maybe that's why so many DBA's don't patch Oracle.

Missing the Point

ExtremeTech reviewed the new Fit-PC Slim.

Conclusion:

CompuLabs really needs to step up to a more modern platform if it wants to stay competitive in the rapidly growing market for small, net-top PCs.

They missed the point. It's not a "net-top" or desktop replacement, it's an extremely low wattage home server.

The spec that matters:

Power consumption: 4-6W

Compare that to the 50-100 W of typical desktops that are used as home servers & left running 24 hours per day, or the 20+ watts of a typical notebook. Even an Eee PC uses 15 watts.

If what you need is a home server to use as a samba share, a web server or similar always-on device, a 5 watt brick looks pretty interesting. That's 500 kWh/yr less power, 400 kg less CO2, and $50 less on your electric bill per year than the old desktop-turned-server that you have stuffed under your desk.
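For anyone who wants to redo the arithmetic with their own numbers, here is a rough sketch. The desktop wattage, electricity price and CO2 factor below are assumptions picked to land near the figures quoted above; they are not measurements and will differ by machine and by region.

```python
HOURS_PER_YEAR = 24 * 365  # always-on device

def annual_kwh(watts):
    """Energy used in a year by a device drawing `watts` continuously."""
    return watts * HOURS_PER_YEAR / 1000.0

# Assumed inputs, chosen to be roughly consistent with the figures in the post.
OLD_DESKTOP_W = 62      # within the 50-100 W range quoted above
FIT_PC_W = 5            # middle of the 4-6 W spec
PRICE_PER_KWH = 0.10    # USD, assumed
CO2_KG_PER_KWH = 0.8    # assumed grid emission factor

saved_kwh = annual_kwh(OLD_DESKTOP_W) - annual_kwh(FIT_PC_W)
print(f"{saved_kwh:.0f} kWh/yr, "
      f"{saved_kwh * CO2_KG_PER_KWH:.0f} kg CO2, "
      f"${saved_kwh * PRICE_PER_KWH:.0f} per year")
```

Swap in the nameplate or measured wattage of whatever is under your desk; a 100 W box makes the gap considerably larger.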

And don't whine about the 500 MHz processor and 500 MB RAM. We ran LAMP stacks that served up more users than your house ever will on a quarter of that.

Wide Area Network Outage Analysis

The following is a brief analysis of unplanned network outages on a large state wide network with approximately 70 sites at bandwidths from DS3 to GigE. The data might be interesting to persons who need to estimate expected availability of wide area networks.

The network is standard core, hub, spoke, leaf. The core is fully redundant. The hubs have redundant circuits connecting to multiple hubs or cores, redundant power and partially redundant hardware. The spokes and leaf sites are non-redundant.

The source of the data was a shared calendar where outages were recorded as calendar events. The data was gathered at analysis time and is subject to omissions and misinterpretation. Errors likely are undercounts.

Raw data, by approximate cause

88 Total Outages
290 Total Hours of Outage
2 years calendar time

Failures by type and duration

Cause	# of Incidents	Percent	# of Hours	Percent
Circuit Failures	34	39%	168	58%
Equip Failures	24	66%	60	79%
Power Failures	22	91%	53	97%
Unknown	5	97%	7	99%
Other	3	100%	2	100%

Total	88		290

Column Definitions

# of Incidents	=	Raw count of outages affecting one or more sites
# of Hours	=	Sum of duration of outages affecting one or more sites
Percent	=	Cumulative Percentage of corresponding column
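The cumulative Percent columns can be reproduced from the raw counts with a few lines. This is just a sketch of the running-total calculation; the numbers are copied from the table above and the function name is mine.

```python
# Incident counts and outage hours copied from the table above.
causes = [
    ("Circuit Failures", 34, 168),
    ("Equip Failures", 24, 60),
    ("Power Failures", 22, 53),
    ("Unknown", 5, 7),
    ("Other", 3, 2),
]

def cumulative_percent(values):
    """Running total of each value as a rounded percentage of the sum."""
    total = sum(values)
    running, out = 0, []
    for v in values:
        running += v
        out.append(round(100 * running / total))
    return out

incident_pct = cumulative_percent([c[1] for c in causes])
hours_pct = cumulative_percent([c[2] for c in causes])

for (name, n, h), ip, hp in zip(causes, incident_pct, hours_pct):
    print(f"{name:<17}{n:>3} {ip:>4}% {h:>4} {hp:>4}%")
```

Sorting the causes from largest to smallest before taking the running total is what makes this a Pareto ordering.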

Cause Definitions

Circuit Failures	=	Failures determined to be circuit related, primarily fiber cuts
Equip Failures	=	Failures determined to be router, firewall or similar
Power Failures	=	Failures where site power was cause of outage
Unknown	=	Failure cause undetermined, missing information
Other	=	All other failures

Pareto Chart - Number of Incidents

A visual representation of the failures shows causes by number of outages. If I remember my statistical process control training from 20 years ago, a Pareto chart is the correct representation for this type of data. The chart shows outage cause on the X-axis, outage count on the left Y-axis and cumulative percent of outages on the right Y-axis.

Using the Pareto 80/20 rule, solving circuit failure resolves 40% of outages by count. Solving equipment failures resolves another 25%. Solving power failures resolves another 25% of outages.

Power failures are probably the least costly to resolve. Long running UPS's are inexpensive. The individual sites supply power and UPS for network equipment at the leaf sites. The sites have variable configurations for power and UPS run times. The area has frequent severe weather, so power outages are common.

Circuit failures are the most expensive to solve. Circuits have high ongoing costs compared to hardware. The sites are already configured with the lowest cost available carrier, so redundant or protected circuits tend to be more costly than the primary circuit. Circuit failures also appear to be more frequent in areas with rapid housing growth, construction and related activity. For fiber paths provisioned above ground, storm related failures are common.

Pareto Chart - Hours of Outage

A representation of total outage duration in hours by cause is also interesting.

When considering the total number of hours without service, the causes occur in the same relative order. Solving circuit failures resolves 60% of the total outage hours. Circuit outages have a disproportionate share of total outage duration, likely because circuit failures take longer to resolve (MTTR is higher).

Availability Calculations

The network is composed of approximately 70 sites (the number varies over time). The time frame of the raw data is approximately two years. The numbers are approximations.

Outage Frequency:

70 sites * 2 years = 140 site-years.
88 outages / 140 site-years = 0.6 outages per site per year.
140 site-years / 88 outages = 1.6 years MTBF

Sites should expect to have slightly less than one unplanned outage per year on average, over time. Caution is advised, as the nature of this calculation precludes using it to predict the availability of a specific site.
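The frequency arithmetic, spelled out as a sketch so the inputs are easy to change. The site count is the approximate figure from the text, and the result is an average across all sites, not a prediction for any one of them.

```python
SITES = 70          # approximate, per the post
YEARS = 2
OUTAGES = 88

site_years = SITES * YEARS
outages_per_site_year = OUTAGES / site_years
mtbf_years = site_years / OUTAGES

print(f"{site_years} site-years")
print(f"{outages_per_site_year:.2f} outages per site per year")
print(f"{mtbf_years:.2f} years between outages at an average site")
```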

Outage Duration:

Availability is calculated simply as

(Hours Actually Available)/(Hours Possibly Available)

70 sites * 2 years * 8760 hours/year = 1.23m Hours possible
1.23m hours - 290 hours = Hours actually available
(1.23m hours - 290 hours)/(1.23m hours) = 99.98% availability.

Availability on average should be three nines or better.
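Here is the same ratio as a small sketch, using the totals from the raw data (about 70 sites, 2 years, 290 outage hours). Keep in mind that each outage is counted once even if it affected more than one site, and that the calendar data likely undercounts, so the result probably errs on the optimistic side.

```python
HOURS_PER_YEAR = 8760

def availability(sites, years, outage_hours):
    """(hours actually available) / (hours possibly available)."""
    possible = sites * years * HOURS_PER_YEAR
    return (possible - outage_hours) / possible

def downtime_per_site_year(sites, years, outage_hours):
    """Average unplanned outage hours per site per year."""
    return outage_hours / (sites * years)

a = availability(sites=70, years=2, outage_hours=290)
print(f"{a:.4%} available")
print(f"{downtime_per_site_year(70, 2, 290):.1f} outage hours per site-year")
```

Expressing the result as hours per site-year makes it easier to compare against a provider SLA or a rule of thumb than a string of nines does.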

This syncs up fairly well with what we've intuitively observed for sites with non-redundant networks. Our seat of the pants rule is that a non-redundant site should expect about 8 hours unplanned outage per year. We assume that Murphy's Law will make the failure on the most critical day of the year, and we expect that areas with rapid housing development or construction will have more failures.

This also is consistent with service provider SLA’s. In most cases, our providers offer 99.9% availability SLA’s on non-redundant, non-protected circuits.

A uniquely regional anomaly is the seasonal construction patterns in the area. Frost depth makes most underground construction cost prohibitive for 5 months of the year, so construction related outages tend to be seasonal.

The caveat of course, is that some sites may experience much higher or lower availability than other sites.

Related posts: Estimating the Availability of Simple Systems

There are some things about computers I really don’t miss…

There are some things about computers I don’t think I’m ever going to miss. Nostalgia has limits.

I’m not going to miss:

Programming machine tools using paper tape and a Flexowriter, and copying the paper tape to Mylar tape for production. But only if it was a good program, one that didn't drill holes in the wrong place on an expensive casting or smash the machine tool spindle into the tooling fixture and break really expensive stuff.

Submitting a punch card deck to the mainframe operators, waiting four hours for the batch scheduler to compile and run the program, only to find a syntax error. Especially for a required assignment the last week of the semester.

Waiting for a goofy little homemade PDP-8 to assemble, link and load a 50 line assembler program (about 40 minutes of watching tape cartridges spin, if memory serves.)

Booting CAD/CAM systems by toggling bits and loading instructions from front panel switches. And then programming complex machine tools using a strange path description language, a pen plotter, a teletype, and punched tape. State of the art, at the time. The plotter even waited for you to put a new pen in every time it needed to draw a line in a new color.

Running CAD/CAM systems from floppies. (A CAD system that could do 3D wire frame views no less). Floppies though, were a vast improvement over paper or magnetic tape. You could access the data on them randomly. Amazing.

NetWare 2.0a server kernels, each one built from object modules custom linked specifically for the hardware using a linker and modules spread out over boxes of floppies, some of which had to be inserted more than once, and dozens of menu choices, including IRQ's, I/O ports, and memory addresses. If any of them were wrong, the kernel didn't boot. If the kernel didn't boot, you started over with disk #1. If it DID boot, you bought a round at the local pub, because life was good, and celebration was required.

NetWare client installations, when the NetWare drivers were custom linked to match the I/O, IRQ and memory jumpers on the network card. Move a jumper to avoid an IRQ conflict and you'll have to re-run the linker and generate a new driver.

Using NetWare BRGEN to build routers, and linking the kernel of a four-port Arcnet router made out of an old XT and using it as the core router of a campus network. It worked though, and best of all I didn't have to walk across the building to manage departmental servers. And yes, it was possible to allocate IRQ's and I/O ports for four Arcnet cards in a single PC.

CGA graphics. The choices were four colors at 320x200 pixels, or two colors at 640x200 pixels(!). For serious CAD/CAM graphics, the 640x200 black & white mode was the preferred choice.

Endless hours spent moving jumpers on ISA cards, trying to get all the video, memory and interface cards to work together without IRQ, I/O port and memory address conflicts.

ROM BASIC

Electronic Typewriters. The ones that cost you two weeks of wages and had one whole line of memory.

Even more hours spent trying to get the drivers for the interface cards all stuffed into 640k and still have enough memory left to run AutoCAD or Ventura Desktop Publishing.

Recovering lost files from damaged floppies by manually editing the file allocation table. (Norton Utilities Disk Editor!)

TSR's.

Writing programs and utilities using 'copy con' and debug scripts copied from magazines.

Abort, Retry, Fail?

Early Mac fans and their dot-matrix printed documents that had eight different fonts on a page. Just because you can…doesn’t mean you should….

Sneakernet.

Running Linux kernel config scripts, hoping that the dozens of choices that you made, not knowing what most of them meant, would compile and link a bootable kernel. (Bootable kernels tended to be far more useful than non-bootable kernels).

Installing Linux from Floppies.

Patching Linux kernel source code to block Ping-of-Death so the primary name server would stay up for more than five minutes.

Editing X config files, specifying graphics card and monitor dot-clock settings, hoping that your best guess wouldn't smoke your shiny new $2000 NEC 4D 16" monitor.

OS/2 installations from 30-odd floppy disks, then another 20 or so for the service pack or PTF (or whatever they called it). CD-ROMs were expensive.

Broadcast storms.

I’m pretty sure that a whole bunch of the things we do today will look pretty archaic a decade or two from now. So what’s this list going to look like twenty years from now?

Bank of America SafePass Authorization

Unlike American Express, Bank of America seems to have pretty decent account claiming, user id and password requirements. Additionally, BofA allows account holders to set up SMS alerts on various types of account activity.

The login process can be tied to SafePass^® SMS based authentication. To complete the login process, BofA sends a six digit code to your cell phone. The code and your normal password are both required for access to your on line account.

Additionally, BofA automatically uses the SMS based SafePass^® for changes to the account, including alerts, e-mail address changes, account claiming etc. You also can set up your account to send SMS alerts on significant account activity and any/all changes to account profiles, including on line charges, charges greater than a specific amount and international charges.

The user id and passwords are also allowed to be significantly more complex than American Express, allowing more than 8 characters and permitting various non-alphanumeric characters.

Your Online ID:

Must be 6 to 32 characters.
Can also contain these characters: @ # % * ( ) + = { } / ? ~ ; , . – _
Can contain all letters, otherwise must be a combination of 2 character types (Alpha, numeric & special)
Cannot contain spaces.
Cannot be the same or contain your Social Security number or Check Card number.

Your Passcode:

Must be between 8 - 20 characters
Must include at least 1 number and 1 letter
Can include uppercase and lowercase letters
Can contain the following characters: @ # % * ( ) + = { } /\ ? ~ ; : " ' , . - _ |
Cannot contain any spaces
Cannot contain the following characters: $ < > & ^ ! [ ]
Cannot be the same as your Online ID
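
The passcode rules above translate fairly directly into a checker. What follows is a minimal Python sketch, not Bank of America's implementation. The function name passcode_problems is made up for illustration. It assumes that any character not explicitly listed as allowed is rejected and that the Online ID comparison is an exact match, since the published rules do not say either way.

```python
import string

# Special characters the passcode list above explicitly permits.
ALLOWED_SPECIALS = set("@#%*()+={}/\\?~;:\"',.-_|")
# Assumption: anything outside letters, digits and the listed specials is rejected.
ALLOWED = set(string.ascii_letters) | set(string.digits) | ALLOWED_SPECIALS


def passcode_problems(passcode, online_id):
    """Return a list of rule violations; an empty list means the passcode passes."""
    problems = []
    if not 8 <= len(passcode) <= 20:
        problems.append("must be 8 to 20 characters")
    if not any(c in string.digits for c in passcode):
        problems.append("must include at least 1 number")
    if not any(c in string.ascii_letters for c in passcode):
        problems.append("must include at least 1 letter")
    # Covers spaces and the explicitly banned $ < > & ^ ! [ ] in one pass.
    bad = sorted(set(passcode) - ALLOWED)
    if bad:
        problems.append("contains disallowed characters: " + "".join(bad))
    # Assumption: exact, case-sensitive comparison with the Online ID.
    if passcode == online_id:
        problems.append("cannot be the same as your Online ID")
    return problems


print(passcode_problems("year2k", "jdoe123"))  # ['must be 8 to 20 characters']
```

Returning every violation at once, rather than stopping at the first, is a design choice that makes the check friendlier to whoever is picking the password.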

These features, plus the availability of merchant specific temporary credit card numbers (ShopSafe^®), make the banking experience appear to be much closer to what one would think was needed for 21st century banking.

Trivial Account Reset on American Express Accounts (Updated)

2008-10-06 Update: I did eventually get an e-mail notice sent to the e-mail associated with the account about 6 hours after I reset my password. It still looks to me like the account can be hijacked, and the password restrictions and suggested examples are pathetic.

Account claiming is an interesting problem. The tradeoffs necessary to balance ease of use, security and help desk call volume are non-trivial.

2008-10-05 9:59 PM:

I'm a bit disappointed how easy it was to recover online access to my American Express account.

Enter the card number
Enter the four digit card ID number on the front of the card
Enter my mother's maiden name

That's all you need. The first two numbers are obtainable by possession of the card, the third is readily available from on line searches. Enter those three bits of info and you get a screen with your user name and the option to set a new password. Set up a new password and you have full access, including the ability to request new cards, change e-mail and billing addresses, etc. Go ahead and reset your password, but whatever you do, don't let the password be more than 8 characters or contain

"spaces or special characters (e.g., &, >, *, $, @)"

That makes choosing a password tough. My normal &mex$uck$ password will not work. But fortunately for me, the help screens on picking a new password contain useful examples:

"Examples of a valid password are: snowman4, 810main, and year2k."

Never mind that whole dictionary thing. Nobody will ever guess a password like 'year2k'.

The Amex account is set up to send me an SMS alert for any 'Irregular Account Activity'. I did not get an SMS, even though on line recovery of both userid and password would certainly be worth an SMS in my book.

There are better ways of doing this. They could have asked me for some secret number that only exists on my last statement, or information on recent card activity, or perhaps like my health care provider, the account reset could generate a letter with a token, sent to my home address via good old fashioned postal mail.

Essential Complexity versus Accidental Complexity

This axiom by Neal Ford^[1] on the 97 Things wiki:

It’s the duty of the architect to solve the problems inherent in essential complexity without introducing accidental complexity.

should be etched on to the monitor of every designer/architect.

The reference is to software architecture, but the axiom really applies to designing and building systems in general. The concept expressed in the axiom is really a special case of ‘build the simplest system that solves the problem', and is related to the hypothesis I proposed in Availability, Complexity and the Person Factor^[2]:

When person-resources are constrained, highest availability is achieved when the system is designed with the minimum complexity necessary to meet availability requirements.

Over the years I’ve seen systems that badly violate the essential complexity rule. They’ve tended to be systems that were evolved over time without ever really being designed, or systems where non-technical business units hired consultants, contractors or vendors to deliver ‘solutions’ to their problems in an unplanned, ad-hoc manner. Other systems are ones that I built (or evolved) over time, with only a vague thought as to what I was building.

The worst example that I’ve seen is a fairly simple application that essentially manages lists of objects and equivalencies (i.e., object ‘A’ is equivalent to object ‘B’) and allows various business units to set and modify the equivalencies. The application builds the lists of equivalencies, allows business units to update them and push them out to a web app. Because of the way that the ‘solutions’ were evolved over the years by the random vendors and consultants, just to run the application and data integration it requires COBOL/RDB on VMS; COBOL/Oracle/Java on Unix; SQL server/ASP/.NET on Windows; Access and Java on Windows; DCL scripts, shell scripts and DOS batch files. It’s a good week when that process works.

Other notable quotes from the axiom:

…vendors are the pushers of accidental complexity.

And

Developers are drawn to complexity like moths to flame, frequently with the same result.

The first quote relates directly to a proposal that we have in front of us now, for a product that will cost a couple hundred grand to purchase and implement, and likely will solve a problem that we need solved. The question on the table though, is ‘can the problem be solved without introducing a new multi-platform product, dedicated servers to run the product, and the associated person-effort to install, configure and manage the product?’

The second quote also applies to people who like deploying shiny new technology without a business driver or long term plan. One example I’m familiar with is a virtualization project that ended up violating essential complexity. I know of an organization that deployed a 20 VM, five node VMware ESX cluster complete with every VMware option and tool, including VMotion. The new (complex) environment replaced a small number of simple, non-redundant servers. The new system was introduced without an overall design, without an availability requirement and without analysis of security, cost, complexity or maintainability.

Availability decreased significantly, cost increased dramatically. The moth met the flame.

Perhaps we can generalize the statement:

~~Developers~~ Geeks are drawn to complexity like moths to flame, frequently with the same result.

A more detailed look at essential complexity versus accidental complexity in the context of software development appears in The Art of Unix Programming^[3].

[1] Neal Ford, “Simplify essential complexity; diminish accidental complexity [97 Things]”, 97 Things,
[2] Michael Janke, “Last In - First Out: Availability, Complexity and the Person-Factor,” http://blog.lastinfirstout.net/2008/03/availability-complexity-and-person.html.
[3] Eric Steven Raymond, The Art of Unix Programming (http://www.catb.org/esr/writings/taoup/html/graphics/taoup.pdf: Eric Steven Raymond, 2003), pp 339-343.

Unplug Your Wall Warts and Save the Planet?

Do wall warts matter?

(09/29-2008 - Updated to correct minor grammatical errors. )

Let's try something unique. I’ll use actual data to see if we can save the planet by unplugging wall transformers.

Step one – Measure wall wart power utilization.
Remember that Volts x Amps = Watts, and Watts are what we care about. Your power company charges you for kilowatt-hours. (One thousand watts for one hour is a kWh).

Start with one clamp-on AC ammeter, one line splitter with a 10x loop (the meter measures 10x actual current) and one wall wart (a standard Nokia charger for an N800).

And we have zero amps on the meter.

OK - That meter is made for measuring big things, so maybe I need a different meter.

Lesson one

Wall warts don't draw much current. They don't show up on the ammeter's scale even when amplified by a factor of 10.

Try again - this time with an in-line multimeter with a 300mA range.

Children - don't try this at home - unless you are holding on to your kid brother and he is properly grounded to a water pipe.

(just kidding.....)

That's better. It looks like we have a couple milliamps current draw.

Try a few more. The ones I have (Motorola, Samsung, Nokia) are all pretty close to the same. Let's use a high estimate of 5mA @ 120v, or about a half of a watt.

Similarly, checking various other parasitic transformers, like notebook computer power bricks, yields currents in the low milliamp ranges. When converted to watts, the power draw for each brick is somewhere between one-half and two watts.

To make sure the numbers are rational and that I didn't make a major error somewhere, I did the simplest check of all. I placed my hand on the power bricks. When they are plugged into the wall and nothing is plugged into them, they are not warm. (Warm = watts).

One more sanity check. Plugging three notebook power supplies and three phone power supplies into a power strip shows about 30mA @ 120v for all six bricks, which is under 4 watts, or less than a watt each. My measurements are rough (I don't have a proper milliamp meter), but for estimates for a blog that nobody actually reads, they should be close enough.

So let's pretend that I want to save the planet, and that unplugging power bricks is the way I'm going to do it. I'll need to periodically plug them in to charge whatever they are supposed to charge. Let's assume they'll be plugged in for 4 hours per day and unplugged 20 hours per day. If I have a half dozen power bricks, I'll save around 5 watts x 20 hours = 100 watt-hours per day, or the amount of electricity that one bright light bulb uses in one hour. That would add up to 35kWh (Kilowatt-hours) per year. Not bad, right?
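
For anyone who wants to rerun the arithmetic with their own meter readings, here is a small Python sketch of the same calculation. The function names are made up for illustration, and the inputs are the post's rough figures (5mA at 120v per wall wart, about 5 watts for six bricks), not precise measurements.

```python
def watts(volts, amps):
    """Volts x Amps = Watts."""
    return volts * amps


def annual_kwh(load_watts, hours_per_day, days=365):
    """Energy used per year, in kilowatt-hours, by a constant load."""
    return load_watts * hours_per_day * days / 1000.0


per_wart_watts = watts(120.0, 0.005)        # the post's high estimate, roughly 0.6 W
saved_per_year = annual_kwh(5.0, 20)        # six bricks unplugged 20 hours a day
always_on_wart = annual_kwh(per_wart_watts, 24)

print(f"One idle wall wart: about {per_wart_watts:.1f} W")
print(f"Unplugging six bricks 20 h/day saves about {saved_per_year:.1f} kWh/year")
print(f"One wall wart left plugged in all year: about {always_on_wart:.1f} kWh")
```

The unrounded savings come out a little above the 35kWh quoted above, which is well within the error of the measurements.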

Until you put it into perspective.

Perspective
Let's take the other end of the home appliance spectrum. The clothes dryer (clothes tumbler to those on the damp side of the pond). That one is a bit harder to measure. The easiest way is to open up the circuit breaker box and locate the wires that go to the dryer.

Hooking up to the fat red wire while the dryer is running shows a draw of about 24 amps @ 220 volts. I did a bit of poking around (Zzzztttt! Oucha!!.....Damn...!!!) and figured out that the dryer, when running on warm (versus hot or cold) uses about 20 amps for the heating element and about 4 amps for the motor. The motor runs continuously for about an hour per load. The heating element runs at about a 50% duty cycle for the hour that the dryer is running on medium heat.

Assume that we dry a handful of loads per week and that one load takes one hour. If the motor runs 4 hours/week and the heating element runs half the time, or two hours per week, we'll use about a dozen kWh per week, or about 600 kWh per year. That's about the same as 100 wall warts.
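
The dryer numbers can be checked the same way. This is a rough Python sketch using the readings above (4 amps for the motor, 20 amps for the heating element at a 50% duty cycle, 220 volts, four loads a week). The variable names are illustrative, and the 6 kWh per wall wart per year figure is the round number this post uses in its CO² section.

```python
VOLTS = 220.0
MOTOR_AMPS = 4.0
HEATER_AMPS = 20.0
HEATER_DUTY_CYCLE = 0.5          # heating element is on about half the time
LOADS_PER_WEEK = 4               # one hour per load
WALL_WART_KWH_PER_YEAR = 6.0     # round figure used later in the post

motor_hours = LOADS_PER_WEEK * 1.0
heater_hours = motor_hours * HEATER_DUTY_CYCLE

weekly_kwh = (VOLTS * MOTOR_AMPS * motor_hours
              + VOLTS * HEATER_AMPS * heater_hours) / 1000.0
annual_kwh = weekly_kwh * 52
wall_wart_equivalents = annual_kwh / WALL_WART_KWH_PER_YEAR

print(f"Dryer: about {weekly_kwh:.1f} kWh/week, {annual_kwh:.0f} kWh/year")
print(f"That is roughly {wall_wart_equivalents:.0f} wall warts")
```

The exact result lands slightly above the rounded "dozen kWh per week" and "600 kWh per year", so the 100 wall warts comparison is, if anything, generous to the dryer.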

How about doing one less load of clothes in the dryer each week? You can still buy clothes lines at the hardware store - they are over in the corner by the rest of the End-of-Life merchandise, and each time that you don't use your clothes dryer, you'll save at least as much power as a wall wart will use for a whole year.

Let's do another quick check. Say that I have a small computer that I leave running 24 hours per day. Mine (an old SunBlade 150 that I use as a chat and file server) uses about 60 watts when powered up but not doing anything. That's about 1.4 kWh per day or about 500kWh per year, roughly the same as my clothes dryer example and roughly the same as 100 wall warts. Anyone with a gaming computer is probably using twice as much power. So how about swapping it out for a lower powered home server?

Notebooks, when idling with the screen off, seem to draw somewhere between 15 and 25 watts. (Or at least the three that I have here at home are in that range). That's about half of what a low-end PC draws and about the same as 25 wall warts. Using a notebook as your home server will save you (and the planet) far more than a handful of wall warts. And better yet, the difference between a dimly lit notebook screen and a brightly lit one is about 5 watts. Yep - that's right, dimming your screen saves more energy than unplugging a wall wart.

Make this easier!

How about a quick and dirty way of figuring out what to turn off without spending a whole Sunday with ammeters and spreadsheets? It's not hard.

If it is warm, it is using power.

The warmer it is, the more power it uses. (Your laptop is warm when it is running and cold when it is shut off, right?). And if you can grab onto it without getting a hot hand, like you can do with a wall wart, (and like you can't do with an incandescent light bulb) it isn't using enough electricity to bother with.

The CO²

So why do we care? Oh yeah - that global warming thing. Assuming that it's all about the CO², we could throw a few more bits into the equation. Using the CO² calculator at the National Energy Foundation in the UK and some random US Dept of Energy data, and converting wall warts to CO² at a rate of 6kWh per wall wart per year and 1.5lbs of CO² per kWh, it looks like you'll generate somewhere around 4kg of CO² per year for each wall wart, +/- a kg or two, depending on how your electricity was generated.

Compare that to something interesting, like driving your car. According to the above NEF calculator and other sources, you'll use somewhere around a wall-wart-year's worth of CO² every few miles of driving. (NEF and Sightline show roughly 1kg of CO² every two miles of driving). On my vacation this summer I drove 6000 miles and probably used something like 3000kg of CO². That's about 700 wall-wart-year equivalents (+/- a couple hundred wwy's).
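
Here is the wall-wart-year conversion as a short Python sketch, using only the rates quoted above (6kWh per wall wart per year, 1.5lbs of CO² per kWh, roughly 1kg of CO² per two miles of driving). The constant names are illustrative, and the rates are this post's rough estimates rather than authoritative emission factors.

```python
LB_TO_KG = 0.45359237            # exact definition of the pound in kilograms

KWH_PER_WART_YEAR = 6.0          # rate used in the post
LBS_CO2_PER_KWH = 1.5            # rate used in the post
MILES_PER_KG_CO2 = 2.0           # roughly 1 kg of CO2 every two miles

kg_per_wart_year = KWH_PER_WART_YEAR * LBS_CO2_PER_KWH * LB_TO_KG
trip_miles = 6000
trip_kg = trip_miles / MILES_PER_KG_CO2
trip_wart_years = trip_kg / kg_per_wart_year

print(f"One wall-wart-year: about {kg_per_wart_year:.1f} kg of CO2")
print(f"{trip_miles} mile trip: about {trip_kg:.0f} kg of CO2")
print(f"Equivalent to about {trip_wart_years:.0f} wall-wart-years")
```

The trip works out to a bit over 700 wall-wart-years, comfortably inside the stated +/- a couple hundred.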

Take a look at a picture. (Or rather ... take a look at a cheezy Google chart with the axis labels in the wrong order....)

Can you see where the problem might be? (Hint - It's the long bright blue bar)
Obviously my numbers are nothing more than rough estimates. But they should be adequate to demonstrate that if you care about energy or CO², wall warts are not the problem and unplugging them is not the solution.

Should you unplug your wall warts?

You can do way better than that!

Disclaimer: No wall warts were harmed in the making of this blog post. Total energy consumed during the making of the post: 5 - 23 watt CFL bulbs for 2 hours = 230 watt-hours; 5 - 25 watt incandescent bulbs for 1/2 hour = 62.5 watt-hours; one 18 watt notebook computer for 3 hours = 54 watt-hours; one 23 watt notebook for 3 hours = 69 watt-hours; Total of 415 watt-hours, or 28.8 wall-wart-days. Any relationship between numbers in this blog post and equivalent numbers in the real world is coincidental. See packaging for details.

The Path of Least Resistance Isn’t

09/29-2008 - Updated to correct minor grammatical errors.

When taking a long term view of system management

The path of least resistance is rarely the path that results in the least amount of work.

As system managers, we are often faced with having to trade off short term tangible results against long term security, efficiency and stability. Unfortunately when we take the path of least resistance and minimize near term work effort, we often are left with systems that will require future work effort to avoid or recover from performance, security and stability problems. In general, when we short cut the short term, we are creating future work effort that ends up costing more time and money than we gained with the short term savings.

Examples of this are:

Opening up broad firewall rules rather than taking the time to get the correct, minimal firewall openings, thereby increasing the probability of future resource intensive security incidents.
Running the install wizard and calling it production, rather than taking time to configure the application or operating system to some kind of least bit, hardened, secured, structured configuration.
Deferring routine patching of operating systems, databases and applications, making future patching difficult and error prone and increasing the probability of future security incidents.
Rolling out new code without load and performance testing, assuring that future system managers and DBA's will spend endless hours battling performance and scalability issues.

Another way of thinking of this is that sometimes 'more work is less work'; meaning that often times doing more work up front reduces future work effort by more than the additional initial work effort. I learned this from a mechanic friend of mine. He often advised that doing what appeared to be more work often ended up being less work, because the initial work effort paid itself back at the end of the job. For example - on some vehicles, removing the entire engine and transmission to replace the clutch instead of replacing it while in the car is less work overall, even though it appears to be more work. With the engine and transmission in the car, the clutch replacement can be a long, tedious knuckle busting chore. With everything out of the car, it is pretty simple.

In the world of car collectors, a similar concept is called Deferred Maintenance. Old cars cost money to maintain. Some owners keep up with the maintenance, making the commitments and spending the money necessary to keep the vehicles well maintained. They 'pay as they go'. Other owners perform minimal maintenance, only fixing what is obviously broken, leaving critical preventative or proactive tasks undone. So which old car would you want to buy?

In the long run, the car owners who defer maintenance are not saving money, they are only deferring the expense of the maintenance until a future date. This may be significant, even to the point where the purchase price of the car is insignificant compared to the cost of bringing the maintenance up to date. And of course people who buy old collector cars know that the true cost of an old car is the cost of purchasing the car plus the cost of catching up on any deferred maintenance, so they discount the purchase price to compensate for the deferred maintenance.

In system and network administration, deferred maintenance takes the form of unhardened, unpatched systems; non-standard application installations, ad-hoc system management, missing or inaccurate documentation, etc. Those sorts of shortcuts save time in the short run, but end up costing time in the future.

We often decide to short cut the short term work effort, and sometimes that's OK. But when we do, we need to make the decision with the understanding that whatever we saved today we will pay for in the future. Having had the unfortunate privilege of inheriting systems with person-years of deferred maintenance and the resulting stability and security issues, I can attest to the person-cost of doing it right the second time.

Privacy, Centralization and Security Cameras

The hosting of the Republican National Convention here in St Paul has one interesting side effect. We finally have our various security and traffic cameras linked together:
http://www.twincities.com/ci_10339532

“The screens will also show feeds from security cameras controlled by the State Patrol, Minnesota Department of Transportation, and St. Paul, Minneapolis and Metro Transit police departments.
Before the RNC, there was no interface for all the agencies' cameras to be seen in one place. Local officials could continue to use the system after the RNC.” (Emphasis mine)

So now we have state and local traffic cameras, transit cameras and various police cameras all interconnected and viewable from a central place. This alone is inconsequential. When however, a minor thing like this is repeated many times across a broad range of places and technologies and over a long period of time, the sum of the actions is significant. In this case, what’s needed to turn this into something significant is a database to store the surveillance images and a way of connecting the ~~security and~~ ~~traffic~~ surveillance camera images to cell phone roaming data, WIFI roaming data, social network traffic data, Bluetooth scans and RFID data from automobile tires. Hmm…that actually doesn’t sound too difficult, or at least it doesn’t sound too much more difficult than security event correlation in a large open network. Is there any reason to think that something like that will not happen in the future?

If it did, J Edgar Hoover would be proud. The little bits and pieces that we are building to solve daily security and efficiency ‘problems’ are building the foundation of a system that will permit our government to efficiently track anyone, anywhere, anytime. Hoover tried, but his index card system wasn’t quite up to the task. He didn’t have Moore’s Law on his side.

As one of my colleagues indicates, hyper-efficient government is not necessarily a good thing. Institutional inefficiency has some positive properties. In the particular case of the USA there are many small overlapping and uncoordinated units of local, state and federal government and law enforcement. In many cases, these units don’t cooperate with each other and don’t even particularly like each other. There is an obvious inefficiency to this arrangement. But is that a bad thing?

Do we really want our government and police to function as a coordinated, efficient, centralized organization? Or is governmental inefficiency essential to the maintenance of a free society? Would we rather have a society where the efficiency and intrusiveness of the government is such that it is not possible to freely associate or freely communicate with the subversive elements of society? A society where all movements of all people are tracked all the time? Is it possible to have an efficient, centralized government and still have adequate safeguards against the use of centralized information by future governments that are hostile to the citizens?

As I wrote in Privacy, Centralization and Databases last April:

What's even more chilling is that the use of organized, automated data indexing and storage for nefarious purposes has an extraordinary precedent. Edwin Black has concluded that the efficiency of Hollerith punch cards and tabulating machines made possible the extremely "...efficient asset confiscation, ghettoization, deportation, enslaved labor, and, ultimately, annihilation..." of a large group of people that a particular political party found to be undesirable.
History repeats itself. We need to assume that the events of the first half of the twentieth century will re-occur someday, somewhere, with probably greater efficiency.
What are we doing to protect our future?

We are giving good guys full spectrum surveillance capability so that sometime in the future when they decide to be bad guys, they’ll be efficient bad guys.

There have always been bad governments. There always will be bad governments. We just don’t know when.

09/29-2008 - Updated to correct minor grammatical errors.

Scaling Online Learning - 14 Million Pages Per Day

Some notes on scaling a large online learning application.

09/29-2008 - Updated to correct minor grammatical errors.

Stats:

29 million hits per day, 700/second
14 million .NET active content pages per day^[1]
900 transactions per second
2000 database queries per second
20 million user created content files
Daily user population of over 100,000
Database server with 16 dual core x64 CPU's, 128GB RAM, Clustered
Nine IIS application servers, load balanced
The largest installation of the vendor's product

Breadth and complexity. The application is similar to a comprehensive ERP application, with a couple thousand stored procedures and thousands of unique pages of active web content covering a full suite of online learning applications, including content creation and delivery, discussions, quizzing, etc. The application has both breadth and depth, and is approximately as complex as a typical ERP application. This makes tuning interesting. If a quarter million dollar query pops up, it can be tuned or re-designed, but if the load is spread more or less evenly across dozens or hundreds of queries & stored procedures, the opportunities for quick wins are few.

Design. Early design decisions by the vendor have been both blessings and curses. The application is not designed for horizontal scalability at the database tier. Many normal scaling options are therefore unavailable. Database scalability is currently limited to adding cores and memory to the server, and adding cores and memory doesn’t scale real well.

The user session state is stored in the database. The original version of the application made as many as ten database round trips per web page, shifting significant load back to the database. Later versions cached a significant fraction of the session state, reducing database load. The current version has stateless application servers that also cache session state, so database load is reduced by caching, but load balancing decisions can still be made without worrying about user session stickiness. (Best of both worlds. Very cool.)

Load Curve. The load curve peaks early in the semester after long periods of low load between semesters. Semester start has a very steep ramp up, with first day load as much as 10 times the load the day before (See chart). This reduces opportunity for tuning under moderate load. The app must be tuned under low load. Assumptions and extrapolations are used to predict performance at semester startup. There is no margin for error. The app goes from idle to peak load in about 3 hours on the morning of the first day of classes. Growth tends to be 30-50% per semester, so peak load is roughly predicted at 30-50% above last semester peak.
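
The peak prediction described above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule of thumb, not the actual capacity planning process. The function name and the sample peak value are hypothetical.

```python
def projected_peak(last_peak, growth_low=0.30, growth_high=0.50):
    """Return the (low, high) range for next semester's peak load."""
    return last_peak * (1 + growth_low), last_peak * (1 + growth_high)


# Hypothetical example: last semester peaked at 600 transactions per second.
low, high = projected_peak(600)
print(f"Plan for a peak between {low:.0f} and {high:.0f} transactions per second")
```

Because there is no chance to tune under moderate load, the high end of the range is the one that matters when sizing hardware.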

Early Problems

Unanticipated growth. We did not anticipate the number of courses offered by faculty the first semester. Hardware mitigated some of the problem. The database server grew from 4CPU/4GB RAM to 4CPU/24GB, then 8CPU/32GB in 7 weeks. App servers went from four to six to nine.

Database fundamentals: I/O, Memory, and problems like ‘don’t let the database engine use so much memory that the OS gets swapped to disk’ were not addressed early enough.

Poor monitoring tools. If you can’t see deep into the application, operating system and database, you can’t solve problems.

Poor management decisions. Among other things, the project was not allowed to draw on existing DBA resources, so non-DBA staff were forced to a very steep database learning curve. Better take the book home tonight, 'cause tomorrow you're gonna be the DBA. Additionally, options for restricting growth by slowing the adoption rate of the new platform were declined, and critical hosting decisions were deferred or not made at all.

Unrealistic Budgeting. The initial budget was also very constrained. The vendor said 'You can get the hardware for this project for N dollars'. Unfortunately N had one too few zeros on the end of it. Upper management compared the vendor's N with our estimate of N * 10. We ended up compromising at N * 3 dollars, watching that hardware last only a month, and spending N * 10 within a year and a half anyway.

Application bugs. We didn’t expect tempDB to grow to 5 times the size of the production database and we didn’t expect tempDB to be busier than the production database. We know from experience that SQL Server 2000 can handle 200 database logins/logouts per second. But just because it can, doesn’t mean it should. (The application broke its connection pooling.)

32 bits. We really were way beyond what could rationally be done with 32 bit operating systems and databases, but the application vendor would not support ~~Itanic~~ Itanium at all, and SQL 2005 in 64 bit mode wasn’t supported until recently. The app still does not support 64 bit .NET application servers.

Query Tuning and uneven index/key distribution. We had parts of the database where the cardinality looked like a classic long tail problem, making query tuning and optimization difficult. We often had to make a choice of optimizing for one end of the key distribution or the other, with performance problems at whatever end we didn't optimize.

Application Vendor Denial. It took a long time and lots of data to convince the app vendor that not all of the problems were the customer's fault. Lots of e-mail, sometimes rather rude, was exchanged. As time went on, they started to accept our analysis of problems, and as of today, are very good to work with.

Redundancy. We saved money by not making the original file and database server clustered. That cost us in availability.

Later Problems

Moore's Law. Our requirements have tended to be ahead of where the hardware vendors were with easily implementable x86 servers. Moore's Law couldn’t quite keep up to our growth rate. Scaling x86 SQL server past 8 CPU’s in 2004 was hard. In 2005 there were not very many options for 16 processor x86 servers. Scaling to 32 cores in 2006 was not any easier. Scaling to 32 cores on a 32 bit x86 operating system was beyond painful. IBM’s x460 (x3950) was one of the few choices available, and it was a painfully immature hardware platform at the time that we bought it.

The “It Works” Effect. User load tended to ramp up quickly after a smooth, trouble free semester. The semester after a smooth semester tended to expose new problems as load increased. Faculty apparently wanted to use the application but were held back by real or perceived performance problems. When the problems went away for a while they jumped on board, and the next semester hit new scalability limits.

Poor App Design. We had a significant number of high volume queries that required re-parsing and re-optimization on each invocation. Several of the most frequently called queries were not parameterizable and hence had to be parsed each time they were called. At times we were parsing hundreds of new queries per second, using valuable CPU resources on parsing and optimizing queries that would likely never get called again. We spent person-months digging into query optimization and building a toolkit to help dissect the problem.
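
To show why non-parameterized queries hurt, here is a small self-contained Python sketch. It uses SQLite from the standard library as a stand-in, so it does not reproduce SQL Server's parser or plan cache, and the table and column names are invented. It only counts how many distinct statement texts each style hands to the database, which is a rough proxy for how many statements have to be parsed and optimized from scratch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO course (id, title) VALUES (?, ?)",
    [(i, f"Course {i}") for i in range(1, 101)],
)

literal_texts = set()   # distinct SQL strings sent with values baked in
param_texts = set()     # distinct SQL strings sent with placeholders

for course_id in range(1, 101):
    # Literal style: the value is part of the SQL text, so every call
    # looks like a brand new statement to the database.
    sql = f"SELECT title FROM course WHERE id = {course_id}"
    literal_texts.add(sql)
    conn.execute(sql).fetchone()

    # Parameterized style: the text never changes, only the bound value does.
    sql = "SELECT title FROM course WHERE id = ?"
    param_texts.add(sql)
    conn.execute(sql, (course_id,)).fetchone()

print(len(literal_texts), "distinct literal statements")
print(len(param_texts), "distinct parameterized statement")
```

One hundred lookups produce one hundred different statement texts in the literal style and a single text in the parameterized style. Multiply that by hundreds of calls per second and the parsing overhead described above is easy to picture.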

Database bugs. Page latches killed us. Tier 3 database vendor support, complete with a 13 hour phone call and Gigs of data finally resolved a years old (but rarely occurring) latch wait state problem, and also uncovered a database engine bug that only showed up under a particularly odd set of circumstances (ours, of course). And did you know that when 32-bit SQL server sees more than 64GB of RAM it rolls over and dies? We didn't. Neither did Microsoft. We eventually figured it out after about 6 hours on the phone with IBM Advanced Support, MS operating system tier 3 and MSSQL database tier 3 all scratching their heads. /BURNMEM to the rescue.

High End Hardware Headaches. We ended up moving from a 4-way HP DL580 to an 8-way HP DL740 to 16-way IBM x460's (and then to 32 core x3950's). The x460's and x3950's ended up being a maintenance headache, beyond anything that I could have imagined. We hit motherboard firmware bugs, disk controller bugs, had bogus CPU overtemp alarms, hardware problems (bad voltage regulators on chassis interface boards), and even ended up with an IBM 'Top Gun' on site. (That's her title. And no, there is no contact info on her business card. Just 'Top Gun'.)

File system management. Maintaining file systems with tens of millions of files is a pain, no matter how you slice it.

Things that went right.

We bought good load balancers right away. The Netscalers have performed nearly flawlessly for 4 years, dishing out a thousand pages per second of proxied, content switched, SSL’d and Gzip’d content.

The application server layer scales out horizontally quite easily. The combination of proxied load balancing, content switching and stateless application servers allows tremendous flexibility at the app server layer.

We eventually built a very detailed database statistics and reporting engine, similar to Oracle AWR reports. We know, for example, what the top N queries are for CPU, logical I/O, physical I/O, etc., at ten minute intervals any time during the last 90 days.
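The core of a top-N report is simple enough to sketch. The following is an illustration only, not our actual reporting engine: the sample rows, the query IDs and the column layout are all invented, and a real collector would pull these counters from the database's own statistics views at each interval.

```python
from collections import defaultdict

# Hypothetical samples, one row per query per ten minute interval:
# (interval_start, query_id, cpu_ms, logical_reads, physical_reads)
samples = [
    ("2008-09-01 10:00", "q1", 1200, 50000, 300),
    ("2008-09-01 10:00", "q2", 300, 90000, 20),
    ("2008-09-01 10:00", "q3", 4500, 1000, 5),
    ("2008-09-01 10:10", "q1", 800, 20000, 100),
    ("2008-09-01 10:10", "q2", 2500, 70000, 900),
]

# Column position of each metric within a sample row.
METRICS = {"cpu_ms": 2, "logical_reads": 3, "physical_reads": 4}

def top_n(samples, interval, metric, n=2):
    """Return the n most expensive queries for one interval and metric."""
    col = METRICS[metric]
    totals = defaultdict(int)
    for row in samples:
        if row[0] == interval:
            totals[row[1]] += row[col]
    # Highest consumer first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_n(samples, "2008-09-01 10:00", "cpu_ms"))
print(top_n(samples, "2008-09-01 10:10", "physical_reads", n=1))
```

Keeping the raw per-interval samples, rather than only the ranked output, is what makes it possible to go back and ask the same question about any ten minute window later.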

HP Openview Storage Mirroring (Doubletake) works pretty well. It's keeping 20 million files in sync across a WAN with not too many headaches.

We had a few people who dedicated person-years of their lives to the project, literally sleeping next to their laptops, going for years without ever being more than arm's reach from the project. And they don’t get stock options.

I ended up with a couple quotable phrases to my credit.

On Windows 2003 and SQL server:

"It doesn't suck as bad as I thought it would"

and

"It's displayed an unexpected level of robustness"

Lessons:

Details matter. Enough said.

Horizontal beats vertical. We know that. So does everyone else in the world. Except perhaps our application vendor. The application is still not horizontally scalable at the database tier and the database vendor still doesn't provide a RAC-like horizontally scalable option. Shards are not an option. That will limit future scalability.

Monitoring matters. Knowing what to monitor and when to monitor it is essential to both proactive and reactive application and database support.

AWR-like reports matter. We have consistently decreased the per-unit load on the back end database by continuously pounding down the top 'N' queries and stored procedures. The application vendor gets a steady supply of data from us. They roll tweaks and fixes from their customers into their normal maintenance release cycle. It took a few years, but they really do look at their customers' performance data and develop fixes. We fed the vendor data all summer. They responded with maintenance releases, hot fixes and patches that reduced database load by at least 30%. Other vendors take note. Please.

Vendor support matters. We had an application that had 100,000 users, and we were using per-incident support for the database and operating system rather than premier support. That didn't work. But it did let us make a somewhat amusing joke at the expense of some poor first tier help desk person.

Don’t be the largest installation. You’ll be the load test site.

The Quarter Million Dollar Query

Unlimited Resources

Naked Without Strip Charts
^[1] For our purposes, a page is a URL with active content that connects to the database and has at least some business logic.

Acronyms

Two acronyms worth remembering.

RGE:

RGE: (Resume Generating Event) – An event that forces a person, or the person's manager, to generate an updated resume.

An RGE is something most of us don't want to experience, at least not too often. RGEs are often followed by changes in income, housing, marital status, etc.

HGE:

HGE: (Headline Generating Event) – An event that causes news reporters to write stories and generate headlines.

HGEs can be either positive or negative, depending on the causes and effects of the event, although with the exception of dot-com startups, most IT initiated HGEs are negative events related to system or project failures of some sort.

HGEs are often followed by RGEs.

Obviously a goal of system managers, security people and IT folks in general is to make sure that acronyms like the above don’t show up unexpectedly. Those of us in public service are particularly sensitive to HGEs. There are not too many circumstances where public service IT organizations can generate positive headlines. Odds are that if there are headlines, they are not good. There is no incentive for the local news broadcast to begin with a segment on your shiny and fast new servers or your four nines of application uptime.

We spend a lot of time analyzing risk in security decisions, system designs, deployments and upgrades. If we do it right, we can design, build, manage and maintain systems that meet user/customer requirements while minimizing the probability of triggering an HGE and the follow-on RGEs.

And if we are REALLY doing it right, we'll have fun while we are doing it.

Design your Failure Modes

In his axiom 'Everything will ultimately fail', Michael Nygard writes that in IT systems, one must:

"Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. [....] If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge."

I'm pretty sure that I've seen a whole bunch of systems and applications where that sort of thinking isn't on the top of the software architect's or developer's stack. For example, I've seen:

apps that spew out spurious error messages to critical log files at a rate that makes 'tail -f' useless. Do the errors mean anything? Nope - just some unhandled exceptions. Could the app be written to handle the exceptions? Yeh, but we have deadlines......
apps that log critical application server/database connectivity error messages back to the database that caused the error. Ummm...if the app server can't connect to the database, why would you attempt to log that back to the database? Because that's how our error handler is designed. Doesn't that result in a recursive death spiral of connection errors that generate errors that get logged through the connections that are in an error state? Ummm.. let me think about that.....
apps that stop working when there are transient network errors, and need to be restarted to recover. Network errors are normal. Really? We never had that problem with our ISAM files! Can you build your app to gracefully recover from them? Yeh, but we have deadlines......
apps that don't start up if there are leftover temp files from when they crashed and left temp files all over the place. Could you clean up old temp files on startup? How would I know which ones are old?
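
Two of the failure modes above have fairly small designed alternatives. What follows is a rough sketch, not anyone's production code: the function names, the DatabaseDown exception and the retry settings are all made up for illustration. The error logger never sends a database connectivity error back to the database, and falls back to a local log if the database write itself fails. The retry wrapper treats a transient network error as normal and tries again with increasing delays instead of requiring a restart.

```python
import logging
import sys
import time

# Local fallback log; it must not depend on the database being reachable.
local_log = logging.getLogger("app.fallback")
local_log.addHandler(logging.StreamHandler(sys.stderr))

class DatabaseDown(Exception):
    """Hypothetical error raised when the database cannot be reached."""

def log_error(message, db_write, is_db_error):
    """Record an error without routing database errors to the database."""
    if is_db_error:
        local_log.error(message)
        return "local"
    try:
        db_write(message)
        return "db"
    except DatabaseDown:
        # The log write failed too; keep the original error locally.
        local_log.error("db log failed; original error: %s", message)
        return "local"

def with_retry(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Run operation, retrying ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

The retry count and delays would need tuning for any real application, and a retry is only safe around operations that can be repeated without side effects.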

I suspect that mechanical engineers and metallurgists, when designing motorcycles, autos, and things that hurt people, have that sort of axiom embedded into their daily thought processes pretty deeply. I suspect that most software architects do not.

So the interesting question is - if there are many failure modes, how do you determine which failure modes that you need to engineer around and which ones you can safely ignore?

On the wide area network side of things, we have a pretty good idea what the failure modes are, and it is clearly a long tail sort of problem, something like:

We've seen that circuit failures, mostly due to construction, backhoes, and other human/mechanical problems, are by far the largest cause of failures and are also the slowest to get fixed. Second place, for us, is power failures at sites without building generator/UPS, and a distant third is hardware failure. In a case like that, if we care about availability, redundant hardware isn't anywhere near as important as redundant circuits and decent power.

Presumably each system has a large set of possible failure modes, and coming up with a rational response to the failure modes that are on the left side of the long tail is critical to building available systems, but it is important to keep in mind that not all failure modes are caused by non-animate things.

In system management, human failures, as in a human pushing the wrong button at the wrong time, are common and need to be engineered around just like mechanical or software failures. I suspect that is why we need things like change management or change control, documentation and the other not-so-fun parts of managing systems. And humans have the interesting property of being able to compound a failure by attempting to repair the problem, which is perhaps the reason why some form of outage/incident handling is important.

In any case, Nygard's axiom is worth the read.

Using the DNS Question to Carry Randomness - a Temporary Hack?

I read Mark Rothman's post "Boiling the DNS Ocean". This led me to a thought (just a thought), that somewhere within the existing DNS protocol, there has to be a way of introducing more randomness in the DNS question and getting that randomness back in the answer, simply to increase the probability that a resolver can trust an authoritative response. Of course having never written a resolver, I'm not qualified to analyze the problem -- but this being the blogosphere, that's not a reason to quit posting.

So at the risk of committing bloggo-suicide....Here goes......

Patch Now - What Does it Mean?

When security researchers/bloggers announce to the world 'patch now', are they implying that the world should 'patch now without consideration for testing, QA, performance or availability'? Or are they advising an accelerated patch schedule, but in a change managed, tested, QA’d rollout of a patch that considers security and availability? And when they complain about others not patching fast enough, are they assuming that the foot draggers are incompetent? Or are they ignoring the operational realities of making untested changes to critical infrastructure?

Consider that:

All patches have a probability of introducing new bugs. That probability is always > 0 and <= 1. The probability is never equal to zero. (And for a certain large database vendor, our experience is that the probability of introducing new bugs is very close to one).
There are many, many bugs that are only relevant under high loads.
A patch that corrupts data, as in databases or file systems, can be impossible to back out or recover from without irretrievable data loss.
Building test cases that can put realistic real world loads on test servers is very difficult, very expensive, and may not uncover the new bugs anyway.
A failed system or application has known, documented consequences. It is not a game of probability or chance. An unpatched security vulnerability is a game of chance where in most cases the odds against you are not known.
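
The first point compounds quickly. As a back-of-the-envelope sketch, assuming each patch independently has the same chance of introducing a new bug (a simplification, and the 5% figure below is invented purely for illustration), the chance that at least one patch in a series bites you is 1 - (1 - p)^n:

```python
def p_any_new_bug(p_per_patch, patches):
    """Chance that at least one of several patches introduces a new bug.

    Assumes the patches are independent and share one per-patch
    probability, which is a simplification.
    """
    return 1 - (1 - p_per_patch) ** patches

# Hypothetical numbers: a 5% per-patch chance, one patch a month.
for months in (1, 6, 12):
    print(months, "patches:", round(p_any_new_bug(0.05, months), 3))
```

This only covers the side of the ledger where the odds can be estimated at all. It says nothing about the unpatched vulnerability, where the odds are usually not known.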

As an operations person with real responsibilities, who is accountable to a very large group of paying customers, and who has to make security versus availability decisions almost every day, I need security researchers to uncover, analyze and communicate risks, threats, vulnerabilities and mitigation techniques. The best of the researchers already do that very well, and for that I am very grateful. To those who are doing that for public service, fame, fortune or personal ego, I sincerely thank you, no matter what your motivation. You are adding value to the Internet community.

But when security people push recommendations out to the world without consideration for availability and/or performance, their recommendations remove value from the Internet community.

Security Researchers add value when

Uncovering and analyzing vulnerabilities and active exploits. (Research)
Analyzing probable and improbable attack vectors and calculating and communicating probabilities. (Research)
Testing and verifying attack vectors. (Research)
Communicating to the community the relative and absolute risks of vulnerabilities and consequences of exploitation. (Public Service)
Developing and communicating mitigation options. (Research)

Security Researchers do not add value when

Making blanket patch advice without consideration for performance or availability. (Operations)
Complaining about enterprises that do not follow their advice. (Carping)

(non-exhaustive lists, of course.)

In that context, when I hear 'patch now' advice, you can bet that I will filter the advice through the prism of availability, performance and operational reality.

I'll listen to 'patch now no matter what' advice from a security researcher/blogger who has real time operational responsibility for a large customer base, perhaps 100,000 or more customers, and who, if the patch fails, would be responsible for interruption of service for those hundred thousand customers, and who, if the patch fails, could or would be terminated for non-performance.
I'll listen to 'patch now no matter what' advice from a security researcher/blogger who has had a system with a hundred thousand customers down hard, has escalated to the vendor's highest support level, and who has been on a tech support conference call for 13 continuous hours or more.
I'll listen to 'patch now no matter what' advice from consultants who are putting the reputation and existence of their consultancy on the line every time they give a customer advice.
I'll listen to 'patch now no matter what' advice from our own security staff, who I know will not point fingers, duck and hide when the patch goes bad and my systems fail.

As far as I am concerned, if you are in a position like one of the above, you can complain about service providers who do not patch fast enough to suit your preferences. If you are not in that position, you cannot complain when I don't (or your service provider doesn't) patch fast enough for you.

The bottom line is that unless the people who give the world advice to 'patch now no matter what' are also going to write my e-mails and presentations explaining why my systems failed, unless they will absorb the inevitable backlash from customers, senior management and governing boards, and will stand up in front of representatives from my internal business units and get grilled, castigated, chewed up and spit out for my decision, I don't need them to complain that I am not 'patching now'.

I've been in the 7pm vendor conference call with the vendor's VP and development supervisor, where our CIO came to the meeting with his/her letter of resignation, to be turned in to our CEO should the vendor fail to deliver performance fixes for the business critical application by 7am the next day.
It was not a fun meeting.

'Patch now' advice must be filtered through the prism of availability, performance and operational reality.