Friday, February 11, 2011

Backup Performance or Recovery Performance?

“There is not a guaranteed 1:1 mapping between backup and recovery performance…” Preston de Guise, “The Networker Blog

Prestons post reminded me of one of our attempts to build a sane disaster recovery plan. The attempt went something like this:

  1. Hire consultants
  2. Consultants interview key staff
  3. Consultants draft recovery plan
  4. Consultants present recovery plan to executives

In the general case, consultants may or may not add value to a process like this. Consultants are in it for the money. The distinguishing factor (in my eyes) is whether consultants are attempting to perform good, cost effect work such that they maintain a long term relationship with the organization, or whether  the consultants are attempting to extract maximum income from a particular engagement. There is a difference.

On this particular attempt, the consultants did a reasonably good job of building a process and documentation for declaring and event, notifying executives, decision makers and technical staff; and managing communication. The first fifty pages of the fifty thousand dollar document we generally useful. They fell down badly on page 51, where they described how we would recover our data center.

Their plan was:

  • choose a recover site
  • buy dozens of servers
  • hire an army of technicians (one for each server)
  • simultaneously recover each server from a dedicated tape drive that came pre-installed in each of the shiny new servers.
  • complete recovery in fifty seven hours

To emphasize to the executives how firm they were on the fifty seven hour recovery, they pasted Veritas specific server recovery documentation as an addendum to the fifty thousand dollar plan.

Unfortunately, their recovery plan bore no relationship to how we backed up our servers. That made it unusable.

Reality: at the time of the engagement:

  • we did not have a recovery site
  • we had not started looking for a recovery site
  • we did not have one tape drive per server. All backups were multiplexed onto four fiber channel attached tape drives
  • we did not have Veritas Netbackup, we had Legato Networker
  • we could not recover individual servers from individual tape drives. All backups jobs were multiplexed onto shared tapes
  • we could not recover dozens of servers simultaneously. All backups jobs were multiplexed onto shared tapes

Unfortunately, the executive layer heard ‘fifty seven hours’, declared victory and moved on.

I tried to feed the consultants useful information, such as the necessity of having a the SAN up first, the architecture of our Legato Networker system, the number of groups and pools, the single threaded nature of our server restores (vs the multi-threaded backups), the improbability of being able to purchase servers that exactly match our hardware (hence the unlikelihood of a successful bare metal recovery on new hardware), not having recovery site pre-planned, not having power and network at the recovery site, and various other failures of their plan.

You get the idea.

The consultants objected to my objections. They basically told me that their plan was perfect, and that it was proven so by it’s adoption by a very large nation wide electronics retailer headquartered nearby. I suggested that we prepare a realistic recovery plan, accounting for the above deficiencies, and that plan be substituted for the ‘fifty seven hours’ part of the consultants plan. The declared me to be a crackpot and ignored my objections.

Using what I thought were optimistic estimates for an actual recovery I built a marginally realistic Gantt chart. It looked something like this:

  • Order all new hardware – 48 hours. Including an HP EVA SAN and fiber channel switches, an HP GS160, DLT tape changers, A Sun E10K and miscellaneous SPARC & Wintel servers. Call in favors from vendors, beg, borrow or extra-legally appropriate hardware as necessary. HP had a program called ‘Recoverall’ that would have facilitated hardware replacement. Sun didn’t.
  • Locate new site – 48 hours. Call in favors from other state agencies, the governors office, other colleges and universities, and uncle Bob. Can be done in parallel with hardware ordering.
  • Provision new site with power, network, fiber channel – 72 hours. I’m optimistic. At the time (a half dozen years ago) we could have brought most systems up with a duct tape and bailing wire for a network, skipped inconveniences like VLAN’s and firewall rules. used gaffers tape to protect the fiber channel runs, etc.
  • Deliver and install hardware – 72 hours. (Optimistic).
  • Configure SAN fabric, zoning, LUN’s, tape drives, network – 12 hours.
  • Bootstrap Legato, connect up DLT drives, recover indexes – 8 hours.

Then (roughly a week into the recovery) we’d be able to start recovering individual servers. When estimating the server recovery times, I assumed:

  • that because we threaded all backups into four tape drives, and because each tape had multiple servers on it, that we’d only be able to recover four servers at a time.
  • that a server recovery would take twice as long as the server backup
  • that staff could only work 16 hours per day. If a server finished restoring while staff were sleeping, the next server recovery would start when staff woke up.

Throw in a few more assumptions, add a bit of friction, temper my optimism, and my Gantt chart showed three weeks as the best possible outcome. That’s quite a stretch from fifty seven hours.

The outcome of the consulting gig was generally a failure. Their plan was only partially useful. If we would have followed the plan, we would have known whom to call in a disaster, decision makers, communication plans, etc.,but we would not have had a usable plan for recovering a data center.

It wasn’t a total loss though. I used that analysis internally to convince management that given organizational expectations for recovery vs. the complexity of our applications, a pre-built fully redundant recovery site was the only valid option.

That’s the path we are taking.

1 comment:

  1. Great Post! I had a similar experince in my company with a CDP solution. The RTO was way bigger than what they initially told us.

    ReplyDelete