In the mid-2000s,
our organization started to get serious about disaster recovery. By that time
our core application was an e-learning application that was heavily used (a
hundred thousand students on a typical day). That app became critical to our
mission.
To bootstrap a DR capability, we paid consultants for what was at best a craptastic DR plan. The plan was not
implementable under any realistic scenario.
The consultants ignored our total lack of a DR site, insisted that we could buy servers overnight and that, because every server had its own tape drive, we could hire an army of techs from Geek Squad and recover all servers simultaneously from individual tape backups. Of course we had no failover site and no hardware, and instead of individual tape drives in each server we had tape changers and a Legato infrastructure that streamed and interleaved multiple backups onto a single tape. I couldn't imagine buying dozens of servers and successfully recovering in any reasonable time frame. The consultants formally presented a 56-hour RTO to our leadership, when my own Gantt charts showed a 3-week RTO after we had a DR site leased, a data center network built, and hardware purchased and racked. So I pushed back hard - and stopped getting invited to the meetings.
They used nice fonts
though. Give them credit for that.
After seeing where the consultants were taking us, I pushed our organization toward full hardware and application redundancy and full data center failover capability for all data-center-hosted systems. My goal was to have two fully functional data centers - identically configured, with identical hardware, full redundancy at the failover site, and near real-time data replication between them - all matched to realistic and achievable RPO and RTO. My rationale was that an organization as small and under-resourced as ours would not be able to build, maintain, and routinely test a disaster recovery site that was not already built, running, and replicated; and that the failover hardware would be usable for pre-production, staging, or other purposes.
The way we
accomplished this was to tackle the longest lead-time constraints first,
starting with space. We learned that our partners at the State of Minnesota had
several thousand square feet of data center space sitting empty - as they had
just consolidated down to smaller mainframes. I offered to lease that space, and then worked with their electricians
to pre-position the correct power under the floor, having them build out PDUs
and pigtails for the servers and storage that we'd parachute in if we had a
disaster. That took care of the longest lead-time items - space and power. We then built out a data center network - stubbed out at first, but
eventually fully configured and routed to the backbone.
We then invested
heavily in failover hardware.
The 'full failover' strategy meant that if a back-end database required 'N' CPUs in production, we had to purchase and maintain '2N' - 'N' in each of the primary and secondary data centers - and in most cases a fraction of 'N' in one or more QA and development instances. The QA and development instances were configured behind a fully redundant network stack that the network team used to QA network, firewall, and load balancer technologies.
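To make the capacity math concrete, here is a minimal Python sketch of that sizing rule. The QA and development fractions and the 16-CPU example are illustrative assumptions, not the numbers we actually used.

#!/usr/bin/env python3
# Back-of-the-envelope sketch of the 'full failover' capacity math.
# The QA/dev fractions and the 16-CPU example are illustrative assumptions.

def cpus_to_own(production_cpus, qa_fraction=0.25, dev_fraction=0.25):
    """Given the CPUs a system needs in production, return what the
    full-failover strategy obliges you to buy and maintain."""
    return {
        "primary":  production_cpus,   # carries the production load
        "failover": production_cpus,   # identical hardware at the second site
        "qa":       round(production_cpus * qa_fraction),
        "dev":      round(production_cpus * dev_fraction),
    }

if __name__ == "__main__":
    sizing = cpus_to_own(16)   # a hypothetical 16-CPU back-end database
    print(sizing, "total CPUs:", sum(sizing.values()))
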
As we cycled through
normal hardware replacement and rotation, we first filled out the failover data
center with one-generation-old hardware, figuring that half a loaf was better than none. Later we started buying and configuring identical hardware in both data centers - ideally upgrading failover first, so we were never in a spot where failover was behind production.
Hardware vendors
loved us.
I felt strongly that
if we didn't use the failover environment regularly, it would fall behind
production and become unusable for failover - primarily because of
configuration rot. This meant that wherever possible we needed to automate the
configuration of devices and systems. It simply is not possible to ensure that
two systems are identical in any case where they are manually configured. In other words, you must have Structured System Management - scripts, not clicks. For UNIX systems this was fairly straightforward. For Windows, the options were few and painful. On the Windows systems we had far
more clicks than scripts.
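As a rough illustration of 'scripts, not clicks', the Python sketch below generates the configuration for both data centers from a single template, so the two sites cannot drift apart through hand edits. The hostnames, file names, and parameters are hypothetical, and our actual tooling in that era was shell scripts and vendor utilities rather than this exact code.

#!/usr/bin/env python3
# Minimal sketch of configuration-as-script: one template, per-site parameters.
# Hostnames and file names are hypothetical examples.

from string import Template

# Single source of truth for the service configuration.
NTP_CONF = Template(
    "# generated - do not edit by hand\n"
    "server $ntp_server iburst\n"
    "driftfile /var/lib/ntp/drift\n"
)

# The only thing that differs between the primary and failover sites.
SITES = {
    "primary":  {"ntp_server": "ntp1.example.edu"},
    "failover": {"ntp_server": "ntp2.example.edu"},
}

def render(site):
    """Render the configuration for one site from the shared template."""
    return NTP_CONF.substitute(SITES[site])

if __name__ == "__main__":
    for site in SITES:
        # In practice this would be pushed to hosts over ssh or by a
        # configuration tool; writing local files is enough to show the idea.
        filename = site + "-ntp.conf"
        with open(filename, "w") as handle:
            handle.write(render(site))
        print("wrote", filename)
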
We achieved usable
DR capability for our primary e-learning application in 2006, full capability
for that application in about 2009, and for our ERP some years later. The team
that ran the e-learning environment conducted a run-from-failover exercise annually,
so we were assured that we could meet our published RPO and RTO for that
application and its supporting technology.
Selling Disaster
Recovery is hard. Most teams did not buy into the 'Failover is a first-class
citizen' mantra that I'd been preaching. For example, even though we had
identical failover hardware for the ERP, the ERP team did not maintain failover
in a fully configured state - often not even acknowledging the existence of the
failover servers - and hence was not capable of conducting a failover within a
reasonable RTO.
We did, however - after a 6-month reconfiguration and testing effort - fail the ERP over to a new data center, so we knew it was possible. That effort required reverse-engineering an app (that we had written ourselves) well enough to understand exactly how it was configured. We were then able to reconfigure both production and failover identically and successfully fail over the application. The team that ran the app didn't think it could be done. My team proved them wrong.