
Maintenance, Downtime and Outages

Via Data Center Knowledge - Maintenance, Downtime and Outages, here's a quote from Ken Brill of The Uptime Institute:

"The No. 1 reason for catastrophic facility failure is lack of electrical maintenance,” Brill writes. “Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.”

Of course the obvious solution is to have redundant power in the data center & perform power maintenance on one leg at a time. One of our leased spaces does that. The other has cooling and power shutdowns often enough that we have a very well rehearsed shutdown & startup plan. The point is well taken though: if you don’t do routine maintenance, you can’t complain about certain types of failure.

In IBM's case, it was the routine electrical maintenance that caused the outage. Apparently IBM didn't build out sufficient power redundancy for Air New Zealand's mainframe. A routine generator test failed and Air NZ’s mainframe lost power.

Air New Zealand CEO Rob Fyfe wasn't happy:

"In my 30-year working career, I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologise[sic] to its client and its client's customers,"

"We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers."

I wonder if Air NZ contracted with IBM for a Tier 4 data center and/or a hot site with remote clustering. If so, Rob Fyfe has a point. If Air NZ went the cheap route, he really can't complain. It's not like data centers don't get affected by storms, rats, power outages, floods & earthquakes. Power especially, and occasionally fire or cooling. Oh yeah, and don't forget storage failures.

In the Air NZ case, the one-hour power outage seems to have resulted in a six-hour application outage. If you spend any time at all thinking about MTTR (Mean Time To Repair), an application suite that takes 5 hours to recover from a one-hour power failure isn't a well thought out architecture for a service as critical as an airline check-in/ticketing/reservation system.
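
To put rough numbers on it, here's a back-of-the-envelope sketch in Python (the once-a-year assumption is mine, not anything Air NZ has published). The point it makes is that the recovery tail, not the power event itself, dominates both the outage and the availability math:

    # Back-of-the-envelope: the recovery tail dominates the outage.
    power_outage_hours = 1.0       # the power event itself
    recovery_hours = 5.0           # application restart, cleanup, catch-up
    total_outage_hours = power_outage_hours + recovery_hours

    hours_per_year = 24 * 365
    availability = 1 - total_outage_hours / hours_per_year
    print(f"{total_outage_hours:.0f} h total outage, "
          f"{recovery_hours / total_outage_hours:.0%} of it recovery, "
          f"~{availability:.4%} availability if this happens once a year")

Shaving the recovery time is the only lever you control after the fact; the power event has already happened.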

Unfortunately the aftermath of a power failure can be brutal. Even in our relatively simple environment, we spend at least a couple of hours cleaning up after a power-related outage, generally for a few reasons:

  • It’s the 21st century and we still have applications that aren't smart enough to recover from simple network and database connectivity errors. This is beyond dumb. It shouldn't matter what order you start servers and processes (a minimal retry sketch follows this list). I keep thinking that we need to make developers' desktops & test servers less reliable, just so they'll build better error handling into their apps.
  • It’s the 21st century and we still have software that doesn’t crash gracefully. Dumb software is expensive (Google thinks so…).
  • The larger the server, the longer it takes to boot. In some cases, boot time is so bad that you can't have an outage shorter than an hour.
  • It’s the 21st century and we still have complex interdependent scheduled jobs that need to be restarted in a coordinated (choreographed) dance (see the dependency-ordering sketch below).
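
As an illustration of the first point, here's a minimal retry-with-backoff sketch in Python. It isn't from any of our applications; the connect callable, attempt counts and delays are placeholders, but the idea is that a transient database or network hiccup should cost the application a few retries, not a call to the on-call admin:

    import logging
    import random
    import time

    log = logging.getLogger("startup")

    def connect_with_retry(connect, attempts=8, base_delay=1.0, max_delay=60.0):
        """Call connect() until it succeeds, sleeping with exponential backoff
        (plus jitter) between attempts. connect is any callable that raises on
        failure and returns a usable connection on success."""
        for attempt in range(1, attempts + 1):
            try:
                return connect()
            except Exception as exc:      # narrow this to your driver's errors
                if attempt == attempts:
                    raise                 # out of patience; fail loudly
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                delay += random.uniform(0, delay / 2)   # jitter
                log.warning("connect failed (%s); retry %d/%d in %.1fs",
                            exc, attempt, attempts, delay)
                time.sleep(delay)

    # Hypothetical usage: the app comes up cleanly even if the database
    # is still booting when this process starts.
    # conn = connect_with_retry(lambda: psycopg2.connect("dbname=app"))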

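And for the last point, the choreographed restart is really just a topological sort over a job dependency graph. A rough sketch, with made-up job names (the graph and the restart command would be whatever your scheduler actually uses):

    from graphlib import TopologicalSorter   # standard library, Python 3.9+

    # Hypothetical dependency graph: each job lists the jobs it needs first.
    jobs = {
        "database":        [],
        "message_queue":   [],
        "billing_batch":   ["database"],
        "nightly_extract": ["database", "message_queue"],
        "report_builder":  ["nightly_extract", "billing_batch"],
    }

    def restart_order(graph):
        """Return the jobs in an order that satisfies every dependency."""
        return list(TopologicalSorter(graph).static_order())

    for job in restart_order(jobs):
        print("restarting", job)    # stand-in for the real restart command

Encode the dependencies once, and the restart order stops living in somebody's head.
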
An amusing aside: as I was using an online note taking tool to rough in this post, the provider (Ubernote) went offline. They came back online about 10 minutes later - with about 10 minutes of data loss.

Comments

  1. All too often I think power is the failing point in data centres - particularly outsourced data centres.

    5+ years ago, one data centre I had to visit was so overloaded, power-wise, that you could only turn the lights on for the bank directly above the servers you were working on, and they basically had operations or other staff monitoring power loads 24x7, with lists of hosts and systems that could be turned off in an emergency.

    In another case, a major outsourcer happily traded on the fact that they had three independent tracks that power took to come into their data centre. What they weren't so keen to mention was that all three came from the same grid substation...

  2. We looked at a couple of years of WAN outages on an 80-site network. Fiber cuts were #1, power was #2, and hardware failures were a distant third.

    For our worst data center (the one with power and cooling problems), power is our biggest problem.

