From The Daily WTF, a story about an impossible availability/SLA conundrum that’s worth a read. It’s a good lead-in to a couple of my rules of thumb.
“If you add a nine to the availability requirement, you’ll add a zero to the price.”
In other words, to go from 99.9% to 99.99% (adding a nine to the availability requirement), you’ll increase the cost of the project by a factor of 10 (adding a zero to the cost).
There is a certain symmetry to this. Assume that it’ll cost 20,000 to build the system to support three nines, then:
99.9 = 20,000
99.99 = 200,000
99.999 = 2,000,000
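The same arithmetic as a minimal Python sketch. The function names, the 20,000 baseline for three nines, and the 365-day year are just assumptions for illustration; the downtime figure is included only to show what each extra nine buys you.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def estimated_cost(nines: int, base_cost: int = 20_000, base_nines: int = 3) -> int:
    """Rule-of-thumb cost: every nine beyond the baseline multiplies cost by 10."""
    if nines < base_nines:
        raise ValueError("this sketch only extrapolates upward from the baseline")
    return base_cost * 10 ** (nines - base_nines)


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for `nines` nines of availability (3 means 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


for n in (3, 4, 5):
    print(f"{n} nines: cost {estimated_cost(n):,}, "
          f"about {allowed_downtime_minutes(n):.1f} minutes of downtime per year")
```

Each added nine cuts the downtime budget by a factor of ten while the estimated cost grows by the same factor.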
The other rule of thumb that this brings up is:
Each technology in the stack must be designed for one nine more than the overall system availability.
This one is simple in concept. If the whole system must have three nines, then each technology in the stack (DNS, WAN, firewalls, load balancers, switches, routers, servers, databases, storage, power, cooling, etc.) must be designed for four nines. Why? Because your stack has about 10 technologies in a serial dependency chain, and an outage in any one of them takes down the whole system, so each one contributes to the overall MTBF/MTTR. Ten layers at 99.99% each multiply out to roughly 99.9% overall. Of course you can over-design some layers of the stack and ‘reserve’ some outage time for other layers of the stack, but in the end, it all has to add up.
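A rough sketch of the serial-chain arithmetic behind this rule, assuming the layers fail independently and that any single layer being down takes the whole system down. The function names are hypothetical, and real stacks with redundancy or correlated failures will not follow this simple product.

```python
import math
from typing import Sequence


def serial_availability(layers: Sequence[float]) -> float:
    """Availability of a serial chain: the product of each layer's availability."""
    return math.prod(layers)


def per_layer_target(system_target: float, layer_count: int) -> float:
    """Availability each layer needs if all layers share the outage budget equally."""
    return system_target ** (1 / layer_count)


# Ten layers, each designed for four nines, versus each designed for three.
print(f"10 layers at 99.99%: {serial_availability([0.9999] * 10):.4%}")
print(f"10 layers at 99.9%:  {serial_availability([0.999] * 10):.4%}")

# What each of 10 layers must deliver for the system to reach three nines.
print(f"per-layer target:    {per_layer_target(0.999, 10):.4%}")
```

Ten layers at three nines each land at only about 99%, which is why each layer needs the extra nine. Over-designing one layer simply means passing a higher number for that layer and a lower one somewhere else.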
Obviously these are really, really, really rough estimates, but for a simple rule of thumb to use to get business units and IT work groups thinking about the cost and complexity of providing high availability, it’s close enough. When it comes time to sign the SLA, you will have to have real numbers.
More thoughts on availability, MTTR and MTBF: