
Thirty-four years - Building out Disaster Recovery (Part 6)


In the mid-2000s, our organization started to get serious about disaster recovery. By that time our core application was a heavily used e-learning application (a hundred thousand students on a typical day), and that app had become critical to our mission.

To bootstrap a DR capability we paid consultants for what was at best a craptastic DR plan. The plan was not implementable under any realistic scenario.

The consultants ignored our total lack of a DR site, insisted that we could buy servers overnight, and assumed that because every server had its own tape drive, we could hire an army of techs from Geek Squad and recover all servers simultaneously from individual tape backups. Of course we had no failover site and no spare hardware, and instead of individual tape drives in each server we had tape changers and a Legato infrastructure that streamed and interleaved multiple backups onto a single tape. I couldn't imagine buying dozens of servers and successfully recovering in any reasonable time frame. The consultants formally presented a 56-hour RTO to our leadership, while my own Gantt charts showed a 3-week RTO after we had a DR site leased, a data center network built, and hardware purchased and racked. So I pushed back hard - and stopped getting invited to the meetings.

They used nice fonts though. Give them credit for that.

After seeing where the consultants were taking us, I pushed our organization toward full hardware and application redundancy and full data center failover capability for all data center hosted systems. My goal was two fully functional data centers, identically configured, with identical hardware, full redundancy at the failover site, and near real-time data replication between them, all matched to realistic and achievable RPO and RTO. My rationale was that an organization as small and under-resourced as ours would never be able to build, maintain, and routinely test a disaster recovery site that was not already built, running, and replicated; and that the failover hardware could double as pre-production, staging, or some other purpose.

The way we accomplished this was to tackle the longest lead-time constraints first, starting with space. We learned that our partners at the State of Minnesota had several thousand square feet of data center space sitting empty, as they had just consolidated down to smaller mainframes. I offered to lease that space, then worked with their electricians to preposition the correct power under the floor, having them build out PDUs and pigtails for the servers and storage that we'd parachute in if we had a disaster. That took care of the longest lead-time items - space and power. We then built out a data center network - stubbed out at first, but eventually fully configured and routed to the backbone.

We then invested heavily in failover hardware.

The 'full failover' strategy meant that if a back-end database required N CPUs in production, we had to purchase and maintain 2N overall - N in each of the primary and secondary data centers - and in most cases a fraction of N more for one or more QA and development instances. The QA and development instances sat behind a fully redundant network stack that the network team used to QA network, firewall, and load balancer technologies.
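To make the cost of that strategy concrete, here is a minimal sketch of the capacity arithmetic - the numbers, fractions, and function name are illustrative assumptions, not our actual sizing figures:

```python
# Rough capacity arithmetic for a "full failover" strategy.
# All numbers and names are illustrative, not actual sizing data.

def total_cpus(prod_cpus: int, qa_fraction: float = 0.5, dev_fraction: float = 0.25) -> dict:
    """Given N CPUs needed in production, estimate total CPUs purchased."""
    failover_cpus = prod_cpus                   # identical hardware at the failover site
    qa_cpus = round(prod_cpus * qa_fraction)    # fraction of N for QA
    dev_cpus = round(prod_cpus * dev_fraction)  # fraction of N for development
    return {
        "production": prod_cpus,
        "failover": failover_cpus,
        "qa": qa_cpus,
        "dev": dev_cpus,
        "total": prod_cpus + failover_cpus + qa_cpus + dev_cpus,
    }

if __name__ == "__main__":
    # A database needing 16 CPUs in production ends up at roughly 2.75N purchased capacity.
    print(total_cpus(16))  # {'production': 16, 'failover': 16, 'qa': 8, 'dev': 4, 'total': 44}
```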

As we cycled through normal hardware replacement and rotation, we first filled out the failover data center with one-generation-old hardware, figuring that half a loaf was better than none. Later we started buying and configuring identical hardware in both data centers - ideally upgrading failover first, so we were never in a spot where failover lagged behind production.

Hardware vendors loved us.

I felt strongly that if we didn't use the failover environment regularly, it would fall behind production and become unusable for failover - primarily because of configuration rot. This meant that wherever possible we needed to automate the configuration of devices and systems. It simply is not possible to ensure that two systems are identical in any case where they are manually configured. In other words, you must have Structured System Management - scripts, not clicks. For UNIX systems this was fairly straightforward. For Windows, the options were few and painful; on the Windows systems we had far more clicks than scripts.
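The kind of check this implies is simple in principle. Below is a minimal sketch of a configuration-drift report, assuming the config trees from the production and failover hosts have already been copied (rsync'd, for example) to local directories; the paths and layout are hypothetical:

```python
# Minimal configuration-drift check: compare two copies of a config tree by content hash.
# Paths are hypothetical; in practice the trees would be pulled from the production
# and failover hosts before running a check like this.
import hashlib
from pathlib import Path

def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 hash of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def report_drift(prod_root: str, failover_root: str) -> None:
    """Print files that are missing, extra, or different on the failover copy."""
    prod, failover = hash_tree(Path(prod_root)), hash_tree(Path(failover_root))
    for path in sorted(prod.keys() | failover.keys()):
        if path not in failover:
            print(f"MISSING on failover: {path}")
        elif path not in prod:
            print(f"EXTRA on failover:   {path}")
        elif prod[path] != failover[path]:
            print(f"DIFFERS:             {path}")

if __name__ == "__main__":
    report_drift("./prod-etc", "./failover-etc")
```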

We achieved usable DR capability for our primary e-learning application in 2006, full capability for that application in about 2009, and for our ERP some years later. The team that ran the e-learning environment conducted a run-from-failover exercise annually, so we were assured that we could meet our published RPO and RTO for that application and its supporting technology.

Selling Disaster Recovery is hard. Most teams did not buy into the 'failover is a first-class citizen' mantra that I'd been preaching. For example, even though we had identical failover hardware for the ERP, the ERP team did not maintain failover in a fully configured state - often not even acknowledging the existence of the failover servers - and hence was not capable of conducting a failover within a reasonable RTO.

We did however - after a 6-month reconfiguration and testing effort - fail the ERP over to a new data center, so we knew it was possible. That effort required reverse-engineering an app (one we had written ourselves) thoroughly enough to understand exactly how it was configured. We were then able to reconfigure both production and failover identically and successfully fail over the application. The team that ran the app didn't think it could be done. My team proved them wrong.
