
Backup Performance or Recovery Performance?

“There is not a guaranteed 1:1 mapping between backup and recovery performance…” Preston de Guise, “The Networker Blog”

Preston’s post reminded me of one of our attempts to build a sane disaster recovery plan. The attempt went something like this:

  1. Hire consultants
  2. Consultants interview key staff
  3. Consultants draft recovery plan
  4. Consultants present recovery plan to executives

In the general case, consultants may or may not add value to a process like this. Consultants are in it for the money. The distinguishing factor (in my eyes) is whether the consultants are attempting to perform good, cost-effective work so that they maintain a long-term relationship with the organization, or whether they are attempting to extract maximum income from a particular engagement. There is a difference.

On this particular attempt, the consultants did a reasonably good job of building a process and documentation for declaring an event; notifying executives, decision makers and technical staff; and managing communication. The first fifty pages of the fifty-thousand-dollar document were generally useful. They fell down badly on page 51, where they described how we would recover our data center.

Their plan was:

  • choose a recovery site
  • buy dozens of servers
  • hire an army of technicians (one for each server)
  • simultaneously recover each server from a dedicated tape drive that came pre-installed in each of the shiny new servers.
  • complete recovery in fifty seven hours

To emphasize to the executives how firm they were on the fifty-seven-hour recovery, they pasted Veritas-specific server recovery documentation as an addendum to the fifty-thousand-dollar plan.

Unfortunately, their recovery plan bore no relationship to how we backed up our servers. That made it unusable.

The reality at the time of the engagement:

  • we did not have a recovery site
  • we had not started looking for a recovery site
  • we did not have one tape drive per server. All backups were multiplexed onto four fiber channel attached tape drives
  • we did not have Veritas Netbackup, we had Legato Networker
  • we could not recover individual servers from individual tape drives. All backup jobs were multiplexed onto shared tapes
  • we could not recover dozens of servers simultaneously. All backup jobs were multiplexed onto shared tapes

Unfortunately, the executive layer heard ‘fifty seven hours’, declared victory and moved on.

I tried to feed the consultants useful information, such as the necessity of having the SAN up first, the architecture of our Legato Networker system, the number of groups and pools, the single-threaded nature of our server restores (vs. the multi-threaded backups), the improbability of being able to purchase servers that exactly match our hardware (hence the unlikelihood of a successful bare metal recovery on new hardware), not having a recovery site pre-planned, not having power and network at the recovery site, and various other failures of their plan.

You get the idea.

The consultants objected to my objections. They basically told me that their plan was perfect, and that it was proven so by its adoption by a very large nationwide electronics retailer headquartered nearby. I suggested that we prepare a realistic recovery plan, accounting for the above deficiencies, and that it be substituted for the ‘fifty seven hours’ part of the consultants’ plan. They declared me a crackpot and ignored my objections.

Using what I thought were optimistic estimates for an actual recovery, I built a marginally realistic Gantt chart. It looked something like this:

  • Order all new hardware – 48 hours. Including an HP EVA SAN and fiber channel switches, an HP GS160, DLT tape changers, a Sun E10K and miscellaneous SPARC & Wintel servers. Call in favors from vendors; beg, borrow or extra-legally appropriate hardware as necessary. HP had a program called ‘Recoverall’ that would have facilitated hardware replacement. Sun didn’t.
  • Locate new site – 48 hours. Call in favors from other state agencies, the governor’s office, other colleges and universities, and Uncle Bob. Can be done in parallel with hardware ordering.
  • Provision new site with power, network, fiber channel – 72 hours. I’m optimistic. At the time (a half dozen years ago) we could have brought most systems up with duct tape and baling wire for a network, skipped inconveniences like VLANs and firewall rules, used gaffer’s tape to protect the fiber channel runs, etc.
  • Deliver and install hardware – 72 hours. (Optimistic).
  • Configure SAN fabric, zoning, LUNs, tape drives, network – 12 hours.
  • Bootstrap Legato, connect up DLT drives, recover indexes – 8 hours.
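
For the curious, here’s a rough critical-path tally of that list. The durations come straight from the chart above, but the dependencies (which tasks can overlap) are my own assumptions, not anything the consultants signed off on; it’s a sketch, not a schedule.

```python
# Rough critical-path estimate for the infrastructure phase of the recovery.
# Durations are from the list above; the dependencies (what can run in
# parallel) are my assumptions.
from functools import cache

tasks = {
    # task:             (hours, prerequisites)
    "order_hardware":    (48, []),
    "locate_site":       (48, []),                   # parallel with ordering
    "provision_site":    (72, ["locate_site"]),      # power, network, fiber channel
    "deliver_install":   (72, ["order_hardware", "provision_site"]),
    "configure_san":     (12, ["deliver_install"]),  # fabric, zoning, LUNs
    "bootstrap_legato":  ( 8, ["configure_san"]),    # recover indexes
}

@cache
def finish(task: str) -> int:
    """Earliest finish time, in hours from the disaster declaration."""
    hours, deps = tasks[task]
    return hours + max((finish(d) for d in deps), default=0)

total = max(finish(t) for t in tasks)
print(f"{total} hours (~{total / 24:.1f} days) before the first server restore can start")
# 48/48 in parallel, then 72 + 72 + 12 + 8 on the critical path: 212 hours.
# Call that 'roughly a week' only if everything goes right.
```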

Then (roughly a week into the recovery) we’d be able to start recovering individual servers. When estimating the server recovery times, I assumed:

  • that because we threaded all backups into four tape drives, and because each tape had multiple servers on it, we’d only be able to recover four servers at a time.
  • that a server recovery would take twice as long as the server backup
  • that staff could only work 16 hours per day. If a server finished restoring while staff were sleeping, the next server recovery would start when staff woke up.

Throw in a few more assumptions, add a bit of friction, temper my optimism, and my Gantt chart showed three weeks as the best possible outcome. That’s quite a stretch from fifty seven hours.
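
To show where the three weeks comes from, here’s a minimal sketch of that estimate. The fleet size and per-server backup times below are hypothetical placeholders, not our real numbers; the point is the shape of the math: restores doubled relative to backups, four concurrent streams, and staff who sleep.

```python
# Minimal restore-time estimator. Fleet size and backup durations are
# hypothetical; the constraints (four drives, restore = 2x backup,
# 16-hour staffed days) are the assumptions listed above.

def estimate_restore_days(backup_hours, drives=4, restore_factor=2.0,
                          staffed_hours_per_day=16):
    """Return elapsed calendar days to restore every server."""
    restore_hours = sorted((h * restore_factor for h in backup_hours), reverse=True)
    drive_load = [0.0] * drives          # working hours queued per drive
    for job in restore_hours:
        # Longest jobs first, each assigned to the least-loaded drive.
        i = drive_load.index(min(drive_load))
        drive_load[i] += job
    busiest = max(drive_load)
    # Staff only work 16 of every 24 hours, so a restore that finishes
    # overnight waits for morning; approximate that by spreading each
    # drive's working hours across staffed hours per day.
    return busiest / staffed_hours_per_day

# Hypothetical fleet: 60 servers averaging 6 hours of backup each.
fleet = [6.0] * 60
print(f"~{estimate_restore_days(fleet):.0f} calendar days of restores")
# 60 servers x 12 restore-hours / 4 drives = 180 working hours per drive,
# about 11 calendar days. Add the week-plus of infrastructure build-out
# and 'fifty seven hours' turns into three weeks.
```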

The outcome of the consulting gig was generally a failure. Their plan was only partially useful. If we had followed the plan, we would have known whom to call in a disaster, who the decision makers were, what the communication plans were, etc., but we would not have had a usable plan for recovering a data center.

It wasn’t a total loss though. I used that analysis internally to convince management that, given organizational expectations for recovery versus the complexity of our applications, a pre-built, fully redundant recovery site was the only valid option.

That’s the path we are taking.

Comments

  1. Great post! I had a similar experience in my company with a CDP solution. The RTO was way bigger than what they initially told us.

