Last In - First Out: October 2008

Amusing log messages

I give Cisco credit for fully documenting firewall log messages. In theory this gives users the ability to set up a system for catching interesting log messages and ignoring uninteresting messages. More vendors should be so bold as to actually acknowledge that their products log messages, and that those messages need to be documented.

This level of disclosure has an interesting side effect. I'm not sure what I'd do if one of our ASAs logged this error:

Error Message %ASA-2-716515:internal error in: function: OCCAM failed to allocate memory for AK47 instance

Explanation The OCCAM failed to allocate memory for the AK47 instance.

Or this error:

Error Message %ASA-2-716508: internal error in: function: Fiber scheduler is scheduling rotten fiber. Cannot continuing terminating

Explanation The fiber scheduler is scheduling rotten fiber, so it cannot continue terminating.

Fiber rot?

An AK47 instance?

No doubt those messages mean something to someone at the TAC. For the rest of us, they are mostly just amusing.
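
Because every message carries a documented tag, the "catch the interesting ones, ignore the rest" idea is easy to sketch. The snippet below is a minimal illustration in Python, not anything Cisco ships: it relies only on the `%ASA-<severity>-<message number>` tag format visible in the messages above, and the alert threshold, the watch list and the function name are all my own assumptions.

```python
import re

# Tag format as seen in the messages above: %ASA-<severity>-<message number>:
ASA_TAG = re.compile(r"%ASA-(?P<severity>[0-7])-(?P<msg_id>\d{6}):\s*(?P<text>.*)")

# Hypothetical policy: lower severity numbers are more severe, so alert on 0-2,
# plus any message numbers we explicitly want to see regardless of severity.
ALERT_MAX_SEVERITY = 2
ALWAYS_ALERT_IDS = {"716508", "716515"}


def classify(line):
    """Return (msg_id, severity, text) for interesting lines, None otherwise."""
    match = ASA_TAG.search(line)
    if match is None:
        return None  # not an ASA-tagged line
    severity = int(match.group("severity"))
    msg_id = match.group("msg_id")
    if severity <= ALERT_MAX_SEVERITY or msg_id in ALWAYS_ALERT_IDS:
        return msg_id, severity, match.group("text")
    return None


if __name__ == "__main__":
    samples = [
        "%ASA-2-716515:internal error in: function: OCCAM failed to allocate memory for AK47 instance",
        "%ASA-6-302013: Built outbound TCP connection",  # illustrative low-priority line
    ]
    for sample in samples:
        print(classify(sample))
```

In practice the watch list would be built from the vendor's message documentation, which is exactly why having that documentation matters.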

The Cloud – Outsourcing Moved up the Stack

Why is outsourcing to ‘the cloud’ any different than what we’ve been doing for years?
The answer: It isn’t.
We’ve been outsourcing critical infrastructure to cloud providers for decades. This isn’t a new paradigm, it’s not a huge change in the way we are deploying technology. It’s pretty much the same thing we’ve always been doing. It’s just moved up the technology stack.
We’ve been outsourcing layer 1 forever (WAN circuits), layer 2 for a couple decades (frame relay, ATM, MPLS), and sometimes even layer 3 (IP routing, VPNs) to cloud providers. Now we have something new – outsourcing layers 4 through 7 to a cloud provider.

So we are scratching our heads trying to figure out what this ‘new’ cloud should look like, how to fit our apps into a cloud and what the cloud means^[1] for security, availability and performance. Heck, we’re not even sure how to patch the cloud^[2], or even who is responsible for patching a cloud.
I’ll argue that outsourcing CPU, database or storage to a server/application cloud isn’t fundamentally different than outsourcing transport to an MPLS cloud, as practically everyone with a large footprint is already doing. In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual servers), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider.
What would happen if we took a look at the parts of our infrastructure that are already outsourced to a cloud provider and saw if we could apply lessons learned from layers 1 through 3 to the rest of the stack?

Lesson 1: The provider matters. We use both expensive Tier 1 providers and cheap local transport providers for a reason. We have expectations of our providers and we have SLA’s that cover among other things, availability, management, reporting, monitoring, incident handling and contract dispute resolution. When a provider fails to live up to SLA’s, we find another provider (See Lesson 3). If we’ve picked the right provider, we don’t worry about their patch process. They have an obligation to maintain a secure, available, reliable service, and when they don’t, we have means to redress the issue.

Lesson 2: Design for failure. We provision multiple Tier 1 ISP’s to multiple network cores for a reason. The core is spread out over 4 sites in two cities for a reason. We have multiple providers, multiple paths of 10 Gig’s between the cores for a reason. We use two local providers to each hub for a reason. The reason is – guess what – sh!t happens! But we know that if we do it correctly, we can lose a 10 Gig connection to a Tier 1 and nobody will know, because we designed for failure. And when we decide to cut costs and use a low cost provider or skimp on redundancy, we accept the increased risk of failure, and presumably have a plan for dealing with it.

Lesson 3: Deploy a standard technology. We don’t care if our MPLS providers use Juniper, Cisco or Extreme for layer 2 transport, because it doesn’t matter. We don’t deploy vendor specific technology, we deploy standardized interoperable technology. We all agree on what a GigE handoff with jumbo MTU’s and single mode long reach lasers looks like. It’s the same everywhere. We can bring in a new ISP or backbone transport provider, run the new one in parallel to the old, seamlessly cut over to the new, and not even tell our customers.
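
The payoff from Lesson 2 can be put in rough numbers. The sketch below is a back-of-the-envelope illustration, not a model of any real network: it assumes the redundant paths fail independently of each other (shared conduits, shared power and shared software bugs all break that assumption), and the 99.9% figure is just an example input.

```python
def combined_availability(availabilities):
    """Availability of redundant paths, assuming each one fails independently."""
    all_down = 1.0
    for availability in availabilities:
        # Probability that this path is down at the same moment as all the others
        all_down *= 1.0 - availability
    return 1.0 - all_down


if __name__ == "__main__":
    one_path = combined_availability([0.999])
    two_paths = combined_availability([0.999, 0.999])
    print(f"one provider:  {one_path:.6f}")
    print(f"two providers: {two_paths:.6f}")
```

The same arithmetic applies whether the redundant things are circuits or cloud hosts, provided the application can actually run on either one.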

What parallels can we draw as we move the cloud up the stack?

My provider doesn’t prioritize my traffic (CPU, memory, disk I/O): Pay them for QoS. Priority bits (CPU cycles, I/O’s) cost more than ‘best effort’ bits (CPU cycles, I/O’s). They always have and always will.

My provider doesn’t provide reliable transport (CPU, Memory, Operating Systems, App Servers, Databases): Get a Tier 1 network provider (cloud provider), or get two network providers (cloud providers) and run them in parallel.

My provider might not have enough capacity: Contract for burst network (CPU, I/O) capacity. Contract and pay for the ability to determine which bits (apps) get dropped when oversubscribed. Monitor trends and anticipate growth and load, and add capacity proactively.

My provider might go bankrupt or have catastrophic failure of some sort: You’ve got a plan for that, right? They call it a backup network provider (cloud host). And your app is platform and technology neutral so you can seamlessly move your app to the new provider, right?

My provider might not have a secure network (Operating System, Database): Well, you’ll just have to encrypt your traffic (database) and harden your edge devices (applications) against the possibility that the provider isn’t secure.

Instead of looking back at what we are already doing and learning from what we’ve already done, we are acting like this is something totally new. It isn’t totally new.
It’s just moved up the stack.
The real question: Can the new top of stack cloud providers match the security, availability and reliability of the old layer 1-2-3 providers?

^[1] Techbuddha, Cloud Computing, the Good, The Bad, and the Cloudy, Williams
^[2] Rational Survivability, Patching The Cloud?, Hoff

The Patch Cycle

The patch cycle starts again, this time with a bit of urgency. A 'patch now' recommendation has hit the streets for what seems to be an interesting Windows RPC bug.

What does 'patch now' mean this time? Hopefully it means a planned, measured and tested patch deployment, but at an accelerated schedule.

It's a Microsoft patch, and that's a good thing. The routine of monthly Microsoft security patches has been honed to a fine art in most places, making Windows OS patches by far the simplest and most trouble free of the platforms that we manage. This one appears to be no exception, at least so far.

Just for grins I drew up a picture of what a typical Microsoft Windows patch cycle looks like. The patch kits show up once a month. Most months have at least one 'important' patch, so most monthly patches get applied. Life is easier if you can fit the patch cycle into a one month window, just because the probability of missing a patch or patching out of order is greatly reduced, even if the WSUS toolkit simplifies the process to the point where it's pretty foolproof.
The Microsoft Windows patch cycle typically looks something like this:

It's more or less a month long cycle that sometimes drags out to more than a month, and occasionally even drags on far enough that we roll two months into one. We deviate from the linear plan somewhat, because we have servers that manage the infrastructure that we patch sooner, and we have less critical applications that we patch early, leaving the most critical applications for last. There are also obnoxious, clueless application vendors that don't support patched servers, so some of those get held back also.

Once a year or so, a critical vulnerability shows up. In an extreme case, the patch cycle follows pretty much the same path, but with an accelerated time line, something like this:

That's a fast time line, but in Windows-land the process is practiced often enough that even an accelerated time line is fairly low risk. In this strange world practice makes perfect, and nobody has more practice at patching than Windows sysadmins.

Compare this to another platform, one without the well honed, routine, trouble free patching system that Microsoft has developed.

There are a whole bunch of those to choose from, so let's randomly pick Oracle, just for the heck of it. Here's what a typical Oracle patch time line looks like:

Can you see the difference?

Maybe that's why so many DBA's don't patch Oracle.

Missing the Point

ExtremeTech reviewed the new Fit-PC Slim.

Conclusion:

CompuLabs really needs to step up to a more modern platform if it wants to stay competitive in the rapidly growing market for small, net-top PCs.

They missed the point. It's not a "net-top" or desktop replacement, it's an extremely low wattage home server.

The spec that matters:

Power consumption: 4-6W

Compare that to the 50-100 W of typical desktops that are used as home servers and left running 24 hours per day, or the 20+ watts of a typical notebook. Even an Eee PC uses 15 watts.

If what you need is a home server to use as a Samba share, a web server or similar always-on device, a 5 watt brick looks pretty interesting. That's 500 kWh/yr less power, 400 kg less CO2, and $50 less on your electric bill per year than the old desktop-turned-server that you have stuffed under your desk.
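
For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic. The inputs are assumptions chosen to land near the figures above, not measurements: a 62 W desktop (inside the 50-100 W range), electricity at $0.10 per kWh, and 0.8 kg of CO2 per kWh. Your utility and your old desktop will differ.

```python
HOURS_PER_YEAR = 24 * 365  # always-on server


def annual_savings(old_watts, new_watts, usd_per_kwh, kg_co2_per_kwh):
    """Return (kWh saved, dollars saved, kg CO2 avoided) per year."""
    kwh = (old_watts - new_watts) * HOURS_PER_YEAR / 1000.0
    return kwh, kwh * usd_per_kwh, kwh * kg_co2_per_kwh


# All four inputs below are illustrative assumptions, not measured values.
kwh, dollars, co2 = annual_savings(
    old_watts=62, new_watts=5, usd_per_kwh=0.10, kg_co2_per_kwh=0.8
)

if __name__ == "__main__":
    print(f"{kwh:.0f} kWh, ${dollars:.0f}, {co2:.0f} kg CO2 per year")
```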

And don't whine about the 500 MHz processor and 500 MB RAM. We ran LAMP stacks that served up more users than your house ever will on a quarter of that.

Wide Area Network Outage Analysis

The following is a brief analysis of unplanned network outages on a large statewide network with approximately 70 sites at bandwidths from DS3 to GigE. The data might be interesting to persons who need to estimate expected availability of wide area networks.

The network is standard core, hub, spoke, leaf. The core is fully redundant. The hubs have redundant circuits connecting to multiple hubs or cores, redundant power and partially redundant hardware. The spokes and leaf sites are non-redundant.

The source of the data was a shared calendar where outages were recorded as calendar events. The data was gathered at analysis time and is subject to omissions and misinterpretation. Errors are likely undercounts.

Raw data, by approximate cause

88 Total Outages
290 Total Hours of Outage
2 years calendar time

Failures by type and duration

Cause	# of Incidents	Percent	# of Hours	Percent
Circuit Failures	34	39%	168	58%
Equip Failures	24	66%	60	79%
Power Failures	22	91%	53	97%
Unknown	5	97%	7	99%
Other	3	100%	2	100%

Total	88		290

Column Definitions

# of Incidents	=	Raw count of outages affecting one or more sites
# of Hours	=	Sum of duration of outages affecting one or more sites
Percent	=	Cumulative Percentage of corresponding column

Cause Definitions

Circuit Failures	=	Failures determined to be circuit related, primarily fiber cuts
Equip Failures	=	Failures determined to be router, firewall or similar
Power Failures	=	Failures where site power was cause of outage
Unknown	=	Failure cause undetermined, missing information
Other	=	All other failures

Pareto Chart - Number of Incidents

A visual representation of the failures shows causes by number of outages. If I remember my statistical process control training from 20 years ago, a Pareto chart is the correct representation for this type of data. The chart shows outage cause on the X-axis, outage count on the left Y-axis and cumulative percent of outages on the right Y-axis.

Using the Pareto 80/20 rule, solving circuit failures resolves about 40% of outages by count. Solving equipment failures resolves another 27%. Solving power failures resolves another 25% of outages.
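
The cumulative percentages in the table are easy to reproduce. A small sketch, using only the counts and hours from the raw data above (the variable and function names are mine):

```python
# (cause, number of incidents, hours of outage), already sorted by incident count
outages = [
    ("Circuit Failures", 34, 168),
    ("Equip Failures", 24, 60),
    ("Power Failures", 22, 53),
    ("Unknown", 5, 7),
    ("Other", 3, 2),
]


def cumulative_percent(values):
    """Running total of each value as a whole-number percent of the grand total."""
    total = sum(values)
    running = 0
    result = []
    for value in values:
        running += value
        result.append(round(100 * running / total))
    return result


by_count = cumulative_percent([count for _, count, _ in outages])
by_hours = cumulative_percent([hours for _, _, hours in outages])

if __name__ == "__main__":
    for (cause, _, _), pct_count, pct_hours in zip(outages, by_count, by_hours):
        print(f"{cause:18} {pct_count:4}% {pct_hours:4}%")
```

A Pareto chart expects the causes sorted from largest to smallest before the running total is taken; this data happens to be in that order already for both columns.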

Power failures are probably the least costly to resolve. Long running UPS's are inexpensive. The individual sites supply power and UPS for network equipment at the leaf sites. The sites have variable configurations for power and UPS run times. The area has frequent severe weather, so power outages are common.

Circuit failures are the most expensive to solve. Circuits have high ongoing costs compared to hardware. The sites are already configured with the lowest cost available carrier, so redundant or protected circuits tend to be more costly than the primary circuit. Circuit failures also appear to be more frequent in areas with rapid housing growth, construction and related activity. For fiber paths provisioned above ground, storm related failures are common.

Pareto Chart - Hours of Outage

A representation of total outage duration in hours by cause is also interesting.

When considering the total number of hours without service, the causes occur in the same relative order. Solving circuit failures resolves 60% of the total outage hours. Circuit outages have a disproportionate share of total outage duration, likely because circuit failures take longer to resolve (MTTR is higher).

Availability Calculations

The network is composed of approximately 70 sites (the number varies over time). The time frame of the raw data is approximately two years. The numbers are approximations.

Outage Frequency:

70 sites * 2 years = 140 site-years
88 outages / 140 site-years = 0.6 outages per site per year
140 site-years / 88 outages = 1.6 years MTBF

Sites should expect about one unplanned outage every year and a half on average, over time. Caution is advised, as the nature of this calculation precludes using it to predict the availability of a specific site.
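
The same frequency arithmetic as a few lines of Python, for anyone who wants to rerun it with their own site and outage counts. The inputs are the approximate figures from this post.

```python
sites = 70      # approximate; the number varied over time
years = 2       # approximate calendar time covered by the data
outages = 88    # recorded unplanned outages

site_years = sites * years
outages_per_site_year = outages / site_years   # how often an average site goes down
mtbf_years = site_years / outages              # average time between outages at a site

if __name__ == "__main__":
    print(f"{site_years} site-years")
    print(f"{outages_per_site_year:.2f} outages per site per year")
    print(f"{mtbf_years:.2f} years ({mtbf_years * 12:.0f} months) between outages")
```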

Outage Duration:

Availability is calculated simply as

(Hours Actually Available)/(Hours Possibly Available)

70 sites * 2 years * 8760 hours/year = 1.23m hours possible
1.23m hours - 290 hours = hours actually available
(1.23m hours - 290 hours) / (1.23m hours) = 99.98% availability

Availability on average should be three nines or better.
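
A quick sketch of the availability ratio, using the 290 total outage hours from the raw data. It treats every recorded outage hour as one site-hour, which is an assumption: an outage that took down several sites at once is only counted once, so the real figure is somewhat lower.

```python
HOURS_PER_YEAR = 8760


def availability(sites, years, outage_hours):
    """(hours actually available) / (hours possibly available)."""
    possible = sites * years * HOURS_PER_YEAR
    return (possible - outage_hours) / possible


measured = availability(sites=70, years=2, outage_hours=290)

if __name__ == "__main__":
    print(f"{measured * 100:.2f}% availability")
```

With these inputs the ratio comes out near 99.98%, comfortably above the three nines mark.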

This syncs up fairly well with what we've intuitively observed for sites with non-redundant networks. Our seat of the pants rule is that a non-redundant site should expect about 8 hours unplanned outage per year. We assume that Murphy's Law will make the failure on the most critical day of the year, and we expect that areas with rapid housing development or construction will have more failures.

This also is consistent with service provider SLA’s. In most cases, our providers offer 99.9% availability SLA’s on non-redundant, non-protected circuits.
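
Converting between an availability percentage and hours of downtime per year shows why the 8 hour rule of thumb and a 99.9% SLA are close to the same statement. A minimal sketch (function names are mine):

```python
HOURS_PER_YEAR = 8760


def allowed_downtime_hours(sla_percent):
    """Hours per year a circuit can be down and still meet the SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


def availability_from_downtime(hours_down_per_year):
    """Availability percentage implied by a yearly downtime figure."""
    return (1 - hours_down_per_year / HOURS_PER_YEAR) * 100


if __name__ == "__main__":
    print(f"99.9% SLA allows {allowed_downtime_hours(99.9):.2f} hours/year")
    print(f"8 hours/year is {availability_from_downtime(8):.2f}% availability")
```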

A uniquely regional anomaly is the seasonal construction patterns in the area. Frost depth makes most underground construction cost prohibitive for 5 months of the year, so construction related outages tend to be seasonal.

The caveat of course, is that some sites may experience much higher or lower availability than other sites.

Related posts: Estimating the Availability of Simple Systems

There are some things about computers I really don’t miss…

There are some things about computers I don’t think I’m ever going to miss. Nostalgia has limits.

I’m not going to miss:

Programming machine tools using paper tape and a Flexowriter, and copying the paper tape to Mylar tape for production. But only if it was a good program, one that didn't drill holes in the wrong place on an expensive casting or smash the machine tool spindle into the tooling fixture and break really expensive stuff.

Submitting a punch card deck to the mainframe operators, waiting four hours for the batch scheduler to compile and run the program, only to find a syntax error. Especially for a required assignment the last week of the semester.

Waiting for a goofy little homemade PDP-8 to assemble, link and load a 50 line assembler program (about 40 minutes of watching tape cartridges spin, if memory serves.)

Booting CAD/CAM systems by toggling bits and loading instructions from front panel switches. And then programming complex machine tools using a strange path description language, a pen plotter, a teletype, and punched tape. State of the art, at the time. The plotter even waited for you to put a new pen in every time it needed to draw a line in a new color.

Running CAD/CAM systems from floppies. (A CAD system that could do 3D wire frame views no less). Floppies though, were a vast improvement over paper or magnetic tape. You could access the data on them randomly. Amazing.

NetWare 2.0a server kernels, each one built from object modules custom linked specifically for the hardware using a linker and modules spread out over boxes of floppies, some of which had to be inserted more than once, and dozens of menu choices, including IRQ's, I/O ports, and memory addresses. If any of them were wrong, the kernel didn't boot. If the kernel didn't boot, you started over with disk #1. If it DID boot, you bought a round at the local pub, because life was good, and celebration was required.

NetWare client installations, when the NetWare drivers were custom linked to match the I/O, IRQ and memory jumpers on the network card. Move a jumper to avoid an IRQ conflict and you'll have to re-run the linker and generate a new driver.

Using NetWare BRGEN to build routers, and linking the kernel of a four-port Arcnet router made out of an old XT and using it as the core router of a campus network. It worked though, and best of all I didn't have to walk across the building to manage departmental servers. And yes, it was possible to allocate IRQ's and I/O ports for four Arcnet cards in a single PC.

CGA graphics. The choices were four colors at 320x200 pixels, or two colors at 640x200 pixels(!). For serious CAD/CAM graphics, the 640x200 black & white mode was the preferred choice.

Endless hours spent moving jumpers on ISA cards, trying to get all the video, memory and interface cards to work together without IRQ, I/O port and memory address conflicts.

ROM BASIC

Electronic Typewriters. The ones that cost you two weeks of wages and had one whole line of memory.

Even more hours spent trying to get the drivers for the interface cards all stuffed into 640k and still have enough memory left to run AutoCAD or Ventura Desktop Publishing.

Recovering lost files from damaged floppies by manually editing the file allocation table. (Norton Utilities Disk Editor!)

TSR's.

Writing programs and utilities using 'copy con' and debug scripts copied from magazines.

Abort, Retry, Fail?

Early Mac fans and their dot-matrix printed documents that had eight different fonts on a page. Just because you can…doesn’t mean you should….

Sneakernet.

Running Linux kernel config scripts, hoping that the dozens of choices that you made, not knowing what most of them meant, would compile and link a bootable kernel. (Bootable kernels tended to be far more useful than non-bootable kernels).

Installing Linux from Floppies.

Patching Linux kernel source code to block Ping-of-Death so the primary name server would stay up for more than five minutes.

Editing X config files, specifying graphics card and monitor dot-clock settings, hoping that your best guess wouldn't smoke your shiny new $2000 NEC 4D 16" monitor.

OS/2 installations from 30-odd floppy disks, then another 20 or so for the service pack or PTF (or whatever they called it). CD-ROMs were expensive.

Broadcast storms.

I’m pretty sure that a whole bunch of the things we do today will look pretty archaic a decade or two from now. So what’s this list going to look like twenty years from now?

Bank of America SafePass Authorization

Unlike American Express, Bank of America seems to have pretty decent account claiming, user id and password requirements. Additionally, BofA allows account holders to set up SMS alerts on various types of account activity.

The login process can be tied to SafePass^® SMS based authentication. To complete the login process, BofA sends a six digit code to your cell phone. The code and your normal password are both required for access to your on line account.

Additionally, BofA automatically uses the SMS based SafePass^® for changes to the account, including alerts, e-mail address changes, account claiming etc. You also can set up your account to send SMS alerts on significant account activity and any/all changes to account profiles, including on line charges, charges greater than a specific amount and international charges.

The user id and passwords are also allowed to be significantly more complex than American Express allows, with more than 8 characters and various non-alphanumeric characters permitted.

Your Online ID:

Must be 6 to 32 characters.
Can also contain these characters: @ # % * ( ) + = { } / ? ~ ; , . – _
Can contain all letters, otherwise must be a combination of 2 character types (Alpha, numeric & special)
Cannot contain spaces.
Cannot be the same or contain your Social Security number or Check Card number.

Your Passcode:

Must be between 8 - 20 characters
Must include at least 1 number and 1 letter
Can include uppercase and lowercase letters
Can contain the following characters: @ # % * ( ) + = { } /\ ? ~ ; : " ' , . - _ |
Cannot contain any spaces
Cannot contain the following characters: $ < > & ^ ! [ ]
Cannot be the same as your Online ID
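
The passcode rules above are concrete enough to write down as a checker. This is only my reading of the published rules, sketched in Python; it is not the bank's code, and rejecting any character that appears on neither list is an assumption on my part.

```python
import string

LETTERS_AND_DIGITS = set(string.ascii_letters + string.digits)
ALLOWED_SPECIALS = set("@#%*()+={}/\\?~;:\"',.-_|")
FORBIDDEN = set("$<>&^![]")


def passcode_problems(passcode, online_id):
    """Return a list of rule violations; an empty list means the passcode passes."""
    problems = []
    if not 8 <= len(passcode) <= 20:
        problems.append("must be 8 to 20 characters")
    if not any(c in string.digits for c in passcode):
        problems.append("must include at least 1 number")
    if not any(c in string.ascii_letters for c in passcode):
        problems.append("must include at least 1 letter")
    if " " in passcode:
        problems.append("cannot contain spaces")
    if any(c in FORBIDDEN for c in passcode):
        problems.append("contains a forbidden character")
    # Assumption: anything not explicitly listed as allowed is rejected too.
    unlisted = set(passcode) - LETTERS_AND_DIGITS - ALLOWED_SPECIALS - FORBIDDEN - {" "}
    if unlisted:
        problems.append("contains an unlisted character")
    if passcode == online_id:
        problems.append("cannot be the same as the Online ID")
    return problems


if __name__ == "__main__":
    # Hypothetical passcodes and Online ID, for illustration only
    for candidate in ("correct-horse7", "pa$$word1", "short1"):
        print(candidate, passcode_problems(candidate, online_id="jsmith01"))
```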

These features, plus the availability of merchant specific temporary credit card numbers (ShopSafe^®), make the banking experience appear to be much closer to what one would think was needed for 21st century banking.

Trivial Account Reset on American Express Accounts (Updated)

2008-10-06 Update: I did eventually get an e-mail notice sent to the e-mail associated with the account about 6 hours after I reset my password. It still looks to me like the account can be hijacked, and the password restrictions and suggested examples are pathetic.

Account claiming is an interesting problem. The tradeoffs necessary to balance ease of use, security and help desk call volume are non-trivial.

2008-10-05 9:59 PM:

I'm a bit disappointed how easy it was to recover online access to my American Express account.

Enter the card number
Enter the four digit card ID number on the front of the card
Enter my mother's maiden name

That's all you need. The first two numbers are obtainable by possession of the card, the third is readily available from on line searches. Enter those three bits of info and you get a screen with your user name and the option to set a new password. Set up a new password and you have full access, including the ability to request new cards, change e-mail and billing addresses, etc. Go ahead and reset your password, but whatever you do, don't let the password be more than 8 characters or contain

"spaces or special characters (e.g., &, >, *, $, @)"

That makes choosing a password tough. My normal &mex$uck$ password will not work. But fortunately for me, the help screens on picking a new password contain useful examples:

"Examples of a valid password are: snowman4, 810main, and year2k."

Never mind that whole dictionary thing. Nobody will ever guess a password like 'year2k'.
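
To put the 8 character, no special character limit in perspective, here is a rough upper bound on the search space, compared with the Bank of America passcode rules from the post above. The sketch assumes passwords are case sensitive letters and digits (I don't know whether Amex actually distinguishes case) and that every character is chosen at random, which 'year2k' most certainly is not.

```python
import math


def max_entropy_bits(alphabet_size, length):
    """Upper bound on entropy for a randomly chosen password of the given length."""
    return length * math.log2(alphabet_size)


# Assumption: 26 lowercase + 26 uppercase + 10 digits, no specials, 8 characters max
amex_bits = max_entropy_bits(alphabet_size=62, length=8)

# BofA passcode rules: the same 62 plus the 23 listed special characters, 20 max
bofa_bits = max_entropy_bits(alphabet_size=62 + 23, length=20)

if __name__ == "__main__":
    print(f"8 chars, letters and digits: about {amex_bits:.1f} bits")
    print(f"20 chars, with specials:     about {bofa_bits:.1f} bits")
```

These are ceilings, not estimates. A dictionary word with a digit tacked on sits far below either number.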

The Amex account is set up to send me an SMS alert for any 'Irregular Account Activity'. I did not get an SMS, even though on line recovery of both userid and password would certainly be worth an SMS in my book.

There are better ways of doing this. They could have asked me for some secret number that only exists on my last statement, or information on recent card activity, or perhaps like my health care provider, the account reset could generate a letter with a token, sent to my home address via good old fashioned postal mail.