Availability & SLAs – Simple Rules

From The Daily WTF, a story worth reading about an impossible availability/SLA conundrum. It’s a good lead-in to a couple of my rules of thumb.

“If you add a nine to the availability requirement, you’ll add a zero to the price.”

In other words, to go from 99.9% to 99.99% (adding a nine to the availability requirement), you’ll increase the cost of the project by a factor of 10 (adding a zero to the cost).

There is a certain symmetry to this. Assume that it’ll cost $20,000 to build the system to support three nines; then:

99.9%   = $20,000
99.99%  = $200,000
99.999% = $2,000,000
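
The rule is easy enough to turn into a toy formula. Here’s a minimal sketch; the $20,000 three-nines baseline is just the example figure above, not real cost data:

  # Rough sketch of the "add a nine, add a zero" rule of thumb.
  # The $20,000 three-nines baseline is the example figure from above.

  def rough_cost(nines: int, three_nines_cost: float = 20_000) -> float:
      """Estimate build cost for a given number of nines of availability."""
      return three_nines_cost * 10 ** (nines - 3)

  for n in (3, 4, 5):
      availability = 100 - 10 ** (2 - n)   # 3 -> 99.9, 4 -> 99.99, 5 -> 99.999
      print(f"{availability:.{n - 2}f}% -> ${rough_cost(n):,.0f}")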

The other rule of thumb that this brings up is:

Each technology in the stack must be designed for one nine more than the overall system availability.

This one is simple in concept. If the whole system must have three nines, then each technology in the stack (DNS, WAN, firewalls, load balancers, switches, routers, servers, databases, storage, power, cooling, etc.) must be designed for four nines. Why? ‘cause your stack has about 10 technologies in a serial dependency chain, and each one of them contributes to the overall MTBF/MTTR. Of course you can over-design some layers of the stack and ‘reserve’ some outage time for other layers of the stack, but in the end, it all has to add up.
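
The arithmetic behind that rule is worth seeing once. A minimal sketch, assuming the ten-layer serial chain mentioned above, with every layer designed for four nines:

  # Availabilities in a serial dependency chain multiply, so ten layers
  # at four nines each land right around three nines overall.

  layers = 10
  per_layer = 0.9999                               # four nines per technology

  overall = per_layer ** layers
  print(f"overall availability: {overall:.5f}")    # ~0.99900, about three nines

  minutes_per_year = 365 * 24 * 60
  print(f"downtime budget: {(1 - overall) * minutes_per_year:.0f} minutes/year")  # ~525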

Obviously these are really, really, really rough estimates, but as a simple rule of thumb to get business units and IT work groups thinking about the cost and complexity of providing high availability, they’re close enough. When it comes time to sign the SLA, you’ll need real numbers.

Via The Networker Blog

NAC or Rootkit – How Would I Know?

I show up for a meeting, flip open my netbook and start looking around for a wireless connection. The meeting host suggests an SSID. I attach to the network and get directed to a captive portal with an ‘I agree’ button. I press the magic button and get a security warning dialogue.

It looks like the network is NAC’d. You can’t tell that from the dialogue, though. ‘Impulse Point LLC’ could be a NAC vendor or a malware vendor. How would I know? If I were running a rogue access point and wanted to install a rootkit, what would it take to get people to run the installer? Probably not much. We encourage users to ignore security warnings.

Anyway – it was amusing. After I switched to my admin account, installed the ‘rootkit’ service agent and switched back to my normal user, I got blocked anyway. I’m running Windows 7 RC without anti-virus. I guess NAC did what it was supposed to do: it kept my anti-virus-free computer off the network.

I’d like someone to build a shim that fakes NAC into thinking I’ve got AV installed. That’d be useful.

Consulting Fail, or How to Get Removed from my Address Book

Here are some things that consultants do that annoy me.

Some consultants brag about who is backing their company or whom they claim as their customers. I’ve never figured that rich people are any smarter than poor people, so I’m not impressed by consultants who brag about who is backing them or who founded their company. Recent Ponzi and hedge fund implosions confirm my thinking. And it seems like the really smart people who invented technology 1.0 and made a billion are not reliably repeating their success with technology 2.0. It happens, but not predictably, so mentioning that [insert famous web 1.0 person here] founded or is backing your company is a waste of a slide, IMHO.

I’m also not impressed by consultants who list [insert Fortune 500 here] as their clients. Perhaps [insert Fortune 500 here] has a world-class IT operation and the consultant was instrumental in making it world class. Perhaps not. I have no way of knowing. It’s possible that some tiny corner of [insert Fortune 500 here] hired them to do [insert tiny project here] and they screwed it up, but that’s all it takes for them to claim [insert Fortune 500 here] as a customer and add another logo to their PowerPoint.

I’m really unimpressed when consultants tell me that they are the only ones competent enough to solve my problems, or that I’m not competent enough to solve my own. One consulting house tried that on me years ago, claiming that firewalling fifty campuses was beyond the capability of ordinary mortals, and that if we did it ourselves, we’d botch it up. That got them a lifetime ban from my address book. They didn’t know that we had already ACL’d fifty campuses, that inserting a firewall in line with a router was a trivial network problem, that converting the router ACLs to firewall rules was scriptable, and that I had already written the script.

I’ve also had consultants ‘accidentally’ show me ‘secret’ topologies for the security perimeters of [insert Fortune 500 here] on their conference room whiteboard. Either they are incompetent for disclosing customer information to a third party, or they drew up a bogus whiteboard to try to impress me. Either way, I’m not impressed. Another lifetime ban.

Consultants who attempt to implement technologies, projects or processes that the organization can’t support or maintain are another annoyance. I’ve seen people come in and try to implement processes or technologies that, although they might be what the book says or what everyone else is doing, aren’t going to fit the organization, for whatever reason. If the organization can’t manage the project, application or technology after the consultant leaves, a perceptive consultant will steer the client toward a solution that is manageable and maintainable. In some cases, the consultant obtained the necessary perception only after significant effort on my part with the verbal equivalent of a blunt object.

Recent experiences with a SaaS vendor annoyed me pretty badly: they insisted on pointing out how well their whole suite of products integrates, even after I repeatedly and clearly told them that I was only interested in one small product, and that they were on site to tell me about that product and nothing else. “I want to integrate your CMDB with MY existing management infrastructure, not YOUR whole suite. Next slide please. <dammit!>” Then it went downhill. I asked them what protocols they use to integrate their product with other products in their suite. The reply: a VPN. Technically they weren’t consultants, though. They were pre-sales.

That’s not to say that I’m anti-consultant. I’ve seen many very competent consultants do an excellent job. At times I’ve been extremely impressed.

Obviously I’ve also been disappointed.

Your Application is a Rotting Old Shack, Now What?

In response to A Shack in the Woods, Crumbling at the Core, colleague Jim Graves commented:
“…it only works if application owners are like long-term homeowners, not house flippers.”
Good point. Who cares if the shack gets a cheap paint job instead of a foundation and a comprehensive remodeling? Will the business owner know or care? Do the contractors you hired care? Are you going to be around long enough to care? Are you and your employees, managers and consultants acting as house flippers, painting over the flaws so you can update your resumes, take the profits and move on?

Jim asks:
“Are long-term employees more likely to care about problems that may happen five years from now? Are Highly Paid Consultants much less likely to?”
Good question. Suppose that I want to fix the shack. Maybe I’m tired of having to empty the buckets that catch the drips from the roof (or restart the J2EE app that runs itself out of database connections a couple times a week). If this repair is to be anything other than a paint-over, at least one element in the business owner-->employee-->consultant-->contractor chain will have to care enough about the application to ensure that the remodeling is oriented toward long-term structural repairs and not just a paint-over. The other elements in the chain need to concur.

For much of what I host and/or manage, I’m on the second or third remodeling cycle. I’ve seen the consultants parachute in, slap a coat of paint on the turd and walk off with a half decade of my salary. I don’t like it. This puts me squarely in the camp of putting the effort into fixing foundations instead of slapping on paint and shingles. I’ve seen apps that have been around for ten years and two refresh cycles, have had ten million dollars spent on them, and still have the mold, rot and leaks from a decade ago. But they have a shiny new skin and a state-of-the-art UI. For wireframes, UI models, usability studies and really nice PowerPoints? Spare no expense. Do they have a sane data model and even trivial application security? Not even close. Granite countertops are in scope; fixing the foundation is out of scope.

Things like that make me crabby.

For now I’ll assume that the blame lies with IT. Somewhere along the line, the non-technical business owners have been led to believe that the shiny face and the UI are the application, and that the foundational elements (back-end code, databases, servers, networks and security) are invisible, unimportant, or otherwise non-essential.

Resume Driven Design

Sam Buchanan, a long-time colleague, commenting on a consultant’s design for a small web application:

“I'm telling you: this app reeks of resume-driven design”

In ‘Your Application is a Rotting Old Shack’ I whined, er, mused about applications that get facelifts while core problems get ignored. Let’s assume for a moment that business units finally figure out that their apps have a crumbling foundation and need structural overhauls. Assuming that internal resources don’t exist, how do we know that the consultants and contractors we hire to design and build our systems aren’t more interested in building their resumes than our applications?

I’d like to think that I would be able to tell if a consultant recommended an architecture or design that exists more to pad their resume than to solve my problems. It’s probably not that straightforward, though. Consultants have motivations that may intersect with your needs, or motivations that deviate significantly from what you need, and if their motivations are resume-driven, there is a chance that you’ll end up with a design that helps someone’s resume more than it helps you.

Short-term employees may share some of the same motivations. If they are using you to fill out their resume, you’d better have needs that line up with the holes in their resume. I’m pretty sure that ‘slogged through a decade-old, poorly written application and identified unused code and database objects’ or ‘documented and cleaned up an ad hoc, poorly organized data model’ isn’t the first thing people want on their resume.

They probably want something shiny.

A Mime in a Box

Picking up on a thread by Andy the IT Guy, which of these things is not like the other?

  1. A developer who doesn’t understand databases, networks or firewalls.
  2. A system manager or DBA who doesn’t understand applications, networks and firewalls.
  3. A firewall or network administrator who doesn’t understand operating systems and applications.
  4. A mime in a box.

Trick question. They’re the same. The mime’s box is imaginary, as are the cross-disciplinary restrictions that we place on developers, system and network administrators.

In the example from Andy’s post, the developer didn’t understand the difference between an app installed on a desktop and an app installed on a server. Similarly, non-network people often don’t understand the critical difference between source and destination when an app server connects to a database.

For example, I often see this diagram:

  app <------> database

showing an application updating a database, when from a network point of view, what we really need to see is:

 app --connects to--> database

showing the application making a network connection to the database. But that subtle difference doesn’t mean much unless the person understands firewalls. They’ll need to understand them though, because I’m going to do this:

 app --connects to--> [firewall] --> database

If they don’t understand the difference between TCP and UDP, between Inbound and Outbound, and between Source and Destination, that firewall is probably going to break things.
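
To make the source/destination point concrete, here’s a hypothetical sketch. The addresses and port are invented, and a real firewall has its own rule syntax; the point is only that the rule has to be written in terms of who initiates the connection:

  # Hypothetical firewall rule, expressed as plain data. The app server is
  # the SOURCE of the TCP connection and the database is the DESTINATION,
  # regardless of which way the data in the application diagram flows.

  from dataclasses import dataclass

  @dataclass
  class Rule:
      proto: str
      src: str          # who initiates the connection
      dst: str          # who listens
      dst_port: int
      action: str = "permit"

  # The rule the firewall actually needs: app server -> database listener.
  app_to_db = Rule(proto="tcp", src="10.1.1.10", dst="10.2.2.20", dst_port=1433)

  # Written the other way around it matches the whiteboard arrow, but it
  # permits a connection nobody ever makes and blocks the real one.
  db_to_app = Rule(proto="tcp", src="10.2.2.20", dst="10.1.1.10", dst_port=1433)

  print(app_to_db)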

This problem seems to occur nearly universally.

Let's call this System Management Principle #5:

Each technology specialist must understand enough about the adjoining technologies to design and build systems that make maximum use of those technologies. 

(If you’ve got a better way of phrasing that, let me know.)


I’ve got System Management Principle #6, and now, with Andy’s help, principle #5. Someday we’ll dream up the rest of them.
Photo by B. Tse.

Home Server Energy Consumption

I'm moving toward 'less is more', where 'less' is measured in watts. Right now my entire home entertainment and technology stack uses about 150 watts total (server + network + storage + Sun Rays + laptops + wall warts). I no longer use the stereo or television – that stack is unplugged and consuming zero energy, and I don't have any watt-sucking game consoles. My next iteration of home entertainment and technology should use about 25 watts for all servers and storage and about 20 watts for each user end point (laptop). The server and network should be the only devices that run continuously. End points should suspend and resume quickly and reliably so that no more than one is normally running at a time, so the net of all server, network and user devices should be under 50 watts.
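
For perspective, here’s what those wattages work out to over a year. A rough sketch; the wattages are from above, and the $0.10/kWh electric rate is my own assumption:

  # Back-of-the-envelope annual energy and cost for the numbers above.
  # The electric rate is an assumed figure; adjust for your utility.

  hours_per_year = 24 * 365
  rate_per_kwh = 0.10

  for label, watts in [("current stack", 150), ("target stack", 50)]:
      kwh = watts * hours_per_year / 1000
      print(f"{label}: {watts} W = {kwh:.0f} kWh/year, about ${kwh * rate_per_kwh:.0f}/year")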

Expecting Stewardship Without Understanding

What are the consequences of building a society where we rely on technology that we don’t understand? Is lack of stewardship one of those consequences?

From Wayne Porter:
“Most people no longer understand anything about the technology they use everyday and because of this ignorance many people use it without good stewardship. We drive cars we cannot fix, eat food we cannot make or produce, and many operate in an environment they do not understand with a false sense of security. We run and gun this technology with fuel that has probably reached its peak point.”

Can we expect people who don’t understand a technology to be good stewards of the technology? Should we expect application developers, who largely don’t understand relational databases, database security, firewalls or networks, to write applications that rationally utilize or properly protect those resources? Should we expect ordinary computer users, who understand almost nothing of how their computers work, to operate their computers in a manner that protects them and us from themselves and the Internet?

For some technologies (automobiles for example) we’ve almost completely given up on users understanding the technology well enough to make rational decisions and exhibit good stewardship. Drivers will never understand tire contact patches and slip angles, so we give them speed limits, ABS brakes, stability control, crumple zones and air bags. Drivers don’t understand engines and engine maintenance, so we give them idiot lights and dashboard messages. Drivers don’t understand the consequences of fossil fuel consumption, so we legislate minimum mileage and emission standards. We force drivers to be good stewards whether they like it or not.

Home owners don’t understand strength of beams, dynamic wind loads and electricity's propensity to escape to the ground via the path of least resistance, so we have building codes and permits, building inspectors, fire inspectors, licensed contractors and tradesmen to force homeowners into reasonable stewardship of their property.

On the other hand, most computer users don’t have even a basic understanding of how their computer works, yet we give them administrative access, allow them to install random software from the Internet, and then somehow expect them to keep their computers secure and functional. We expect them to be good stewards of the technology, and to not let their home computers become malware-infested botnet nodes, without even a vague understanding of how those computers work.

That’s probably not going to work.

The Irony Of Preventing Security Failures

Gadi Evron muses about the possibility that a successful security program might result in difficulty justifying future spending.

The Irony Of Preventing Security Failures, Gadi Evron, Dark Reading

But what if nothing happens because we stopped it? That may be the most dangerous option in the long term […] The obvious risk is that the security industry will be accused of crying wolf and not believed next time when something serious happens.

Roll back to 2001 and the hype surrounding Code Red. The lead story on major news outlets was the impending implosion of the Internet. The Internet didn’t implode. The hype went away. Slammer, circa 2003, snuck up on the world, wreaked havoc, major corporate networks imploded, and the Internet hiccupped for a few hours. I’d like to think that Code Red was pretty good at culling out the incompetent sysadmins and raising the awareness of patching and hardening amongst the competent but clueless, and that Slammer was pretty good at culling out the incompetent IT departments and raising the awareness of the clueless CIOs and executives.

Do we need to fear our own success?

Here’s a proposal: simply allow your peers (or competitors) to continue to fail at security, and use their failures to justify continued spending on your own success. You shouldn’t have much trouble finding peer failures to use as benchmarks. I’m pretty sure that the average executive can compare the impact of security events on peers and competitors with the absence of similar events internally, and associate the difference with the competence and funding of internal IT.

If corporate executives can’t figure that out on their own, I’ll bet we can come up with a couple of PowerPoints with impressive-looking but indecipherable charts to help them out.

Secret Questions are not a Secret

Technology Review took a look at an advance copy of a study that validates what Ms. Palin already knew. Secret questions don’t help much: 
In research to be presented at the IEEE Symposium on Security and Privacy this week […] the researchers found that 28 percent of the people who knew and were trusted by the study's participants could guess the correct answers to the participant's secret questions. Even people not trusted by the participant still had a 17 percent chance of guessing the correct answer to a secret question. 
This is a fundamental and well-known problem. Putting real numbers on it should help those who are in the design meeting where secret questions get brought up.

To re-hash the secret question problem, either I answer the questions correctly and risk a 1 in 5 chance that a stranger will guess them, or I fabricate unique, nonsensical answers. If the fabricated answers are such that they can’t be reasonably guessed, then there isn’t much chance that I’ll remember what I answered, so I’m stuck writing them down somewhere and tracking them for a decade or so.

Obviously there are solutions that I can implement myself, like using a password safe of some type to store the made-up questions and answers. But what about the vast majority of ordinary users? How many of them are going to set up a password safe, figure out how to keep it up to date, replicate it to a safe location and not lock themselves out? Not many. They’ll have no choice but to write everything down.
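
For the technically inclined, the fabrication half is easy enough to script; it’s everything after that which gets hard. A minimal sketch using only the Python standard library (the questions shown are just examples):

  # Fabricate unguessable answers to secret questions and let the
  # password safe remember them, instead of answering truthfully.

  import secrets

  questions = [
      "What was your first pet's name?",
      "What is your mother's maiden name?",
  ]

  for q in questions:
      answer = secrets.token_urlsafe(12)   # random, unique, meaningless
      print(f"{q} -> {answer}")            # store this in the safe, not on paper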

I can imagine trying to explain to non-technical users that they need to have made-up answers to made-up questions, and that the made-up questions and answers must be unique for each online account, and the questions and answers need to be atypical enough that someone close to them can’t guess the answers even if they know the questions, and that instead of writing the questions and answers down, they need to store the made-up questions and answers in a magic piece of software called a ‘password safe’, and they need to put a really strong password on the password safe that they’ll remember and nobody else will guess, and that they can’t write that down either, and they need to replicate the password safe data file to some other media, and if they forget the password to the password safe or lose the password safe data file, they’ll lose access to just about everything.

“Hey ma – here’s what I need you to do…download something called password safe…”

“I already have a safe….”

There’s got to be a better way.

Hijacking a Botnet – What Can We Learn?

The Computer Security Group at UC Santa Barbara hijacked a botnet long enough to grab interesting data. But if you are keeping up with security news, you already know that.

Among the findings:
  • Passwords tend to be weak (not new).
  • Passwords tend to get reused across multiple web sites (not new). 
  • Botnet sizes may be overestimated (interesting). 

One, Two, Buckle My Shoes. How Many Laptops Can We Lose?

Last summer Dell commissioned a study[1] to determine how many laptops were lost or stolen at airports. The study reported 12,000 lost laptops per week at US airports, and it was reported as fact just about everywhere, including a whole bunch of high-profile security-related blogs. I did some quick mental math and thought, ‘Where do they store them all? There must be a heck of a big pile of them somewhere...’ So I bookmarked the study and thought it’d make a good data-loss-related blog post someday.
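
The quick mental math, roughly. The annualization is just arithmetic on the study’s own number; the average laptop weight is my own assumption, thrown in only to illustrate the size of the hypothetical pile:

  # Annualize the study's 12,000-laptops-per-week figure.

  per_week = 12_000
  per_year = per_week * 52
  print(f"{per_year:,} laptops per year")          # 624,000

  avg_weight_kg = 2.5                              # assumed average laptop weight
  print(f"roughly {per_year * avg_weight_kg / 1000:,.0f} metric tons of laptops")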