As a side effect of
building and running the backbone, I introduced UNIX systems into what was then
a wholly VMS organization. We initially used Linux, roughly from 1994 to 1997. Over the next 20-plus years we briefly migrated to Solaris x86, then to Solaris SPARC, back to Solaris x86/x64, and finally back to Linux.
Our CIO at the time
recognized that a pure VMS/RDB shop was not a viable long-term strategy, so he had us host a UNIX/Oracle application on behalf of another organization as part of building out a capability he knew we'd need someday.
As our VMS/RDB team didn't appreciate (or were genuinely hostile toward) non-VMS platforms, they declined to take on building and managing the UNIX/Oracle stack. So my team and I did.
My team eventually
picked up responsibility for most of the enterprise-wide application hosting.
This included Windows and UNIX system administration, SQL Server and MySQL
administration, and (eventually) Oracle
database administration. As we started managing those technologies we went
through a process similar to our early network management work, where we
systematized and rationalized the servers and applications, brought them all up
to common operating system versions and patch levels, consistently configured
file systems, created common application installations, etc. In many cases the simplest, most straightforward path was to reinstall the apps on new operating system instances, then migrate the application data. Often, the servers and applications that we inherited from other teams had so much entropy that more than one fresh install was necessary - the first to learn how the application worked, the second to systematize and optimize the servers and applications for long-term reliability and maintenance.
Inheriting poorly
managed, unreliable systems was something that we did several times. In every
case, the team from which we inherited the technology or platform had trouble
with the basics - knowing which servers were where, what OS version and patch level was installed, where and how they were backed up, and so on. Step one for us was
discovery. On day one I asked my staff to tell me where every inherited server
was, what it did, and whether or not it was backed up.
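To make that concrete, here is a rough sketch (with made-up records) of the kind of inventory and gap report that question implies:

```python
#!/usr/bin/env python3
"""Sketch: a minimal discovery inventory and a report of the gaps in it."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class Server:
    name: str
    location: Optional[str]      # rack/room, None if unknown
    purpose: Optional[str]       # what it does, None if unknown
    backed_up: Optional[bool]    # None means nobody knows

# Hypothetical inherited servers; the point is flagging the unknowns.
INVENTORY = [
    Server("finapp01", "DC1 rack 12", "finance app server", True),
    Server("legacy-db", None, "unknown legacy database", None),
    Server("web03", "DC2 rack 4", None, False),
]

def gaps(servers: list[Server]) -> list[str]:
    """Return a human-readable list of everything we still don't know."""
    report = []
    for s in servers:
        if s.location is None:
            report.append(f"{s.name}: location unknown")
        if s.purpose is None:
            report.append(f"{s.name}: purpose unknown")
        if s.backed_up is not True:
            report.append(f"{s.name}: backups unconfirmed")
    return report

if __name__ == "__main__":
    for line in gaps(INVENTORY):
        print(line)
```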
My team also took on
the responsibility for managing data center infrastructure. At the time, the
infrastructure was all one-off and ad hoc, with little or no predictability or commonality between racks, servers, etc. We came up with a simple data center network design and common standards for building and powering racks, racking servers, and routing power, network, and Fibre Channel cabling. At all times we emphasized redundancy, consistency, simplicity, and structure.
At one point I also
moved our enterprise backup system into my team. I felt that the team that had
the responsibility for the RPO and RTO of an application should also have responsibility for its backup and recovery. And of course I could sleep at night knowing that my team was running the backups. We redesigned the Legato Networker-based system from the ground up, wrapped it in Perl scripts that covered us in places where Networker fell short, and took on the painful task of managing tape-based backups.
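As a rough illustration of the kind of wrapper involved - sketched here in Python rather than the Perl we actually ran, and with an mminfo query that is illustrative rather than verbatim - a nightly sanity check might look like:

```python
#!/usr/bin/env python3
"""Rough sketch of a nightly backup sanity check (illustrative, not production code)."""
import subprocess
import sys

# mminfo is NetWorker's media-database query tool; the query and report
# attributes below are illustrative assumptions, not a verbatim command line.
MMINFO = ["mminfo", "-q", "savetime>=yesterday", "-r", "client,name,level,sumflags"]

def main() -> int:
    try:
        out = subprocess.run(MMINFO, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"FAILED to query the backup server: {exc}", file=sys.stderr)
        return 2

    failures = []
    for line in out.splitlines()[1:]:          # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        client, name, level, flags = fields[0], fields[1], fields[2], fields[-1]
        # Treat an 'a' (aborted) in the summary flags as a failed save set -
        # an assumption about NetWorker's flag letters, used here for illustration.
        if "a" in flags:
            failures.append((client, name, level))

    if failures:
        print("Backups with problems last night:")
        for client, name, level in failures:
            print(f"  {client}  {name}  level={level}")
        return 1

    print("All of last night's save sets look clean.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```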
To ensure that our
backups were reliable, we preferred to incorporate data recovery into routine
processes. For example, one of our apps needed a development database refreshed
periodically. Even though we could have refreshed from production, we did not.
Instead we refreshed from a randomly chosen recent backup and a randomly chosen
point-in-time, thereby exercising full database recovery every few weeks or
months. When we did have to perform point-in-time recovery on production systems, we were able to recover easily.
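Conceptually, each refresh picked a random recovery point inside the retention window and drove a standard point-in-time restore and recover. The sketch below illustrates the idea in Python; the 30-day window, the RMAN invocation, and the date format are illustrative assumptions, not our actual scripts:

```python
#!/usr/bin/env python3
"""Illustrative sketch: refresh a dev database to a random point in time."""
import random
import subprocess
from datetime import datetime, timedelta

RETENTION_DAYS = 30  # assumption: roughly 30 days of recoverable backups

def pick_recovery_point(now: datetime) -> datetime:
    """Choose a random point in time within the retention window."""
    offset_minutes = random.randint(60, RETENTION_DAYS * 24 * 60)
    return now - timedelta(minutes=offset_minutes)

def build_rman_script(until: datetime) -> str:
    """Standard RMAN point-in-time restore/recover, ending with open resetlogs."""
    ts = until.strftime("%Y-%m-%d %H:%M:%S")
    return f"""
startup force mount;
run {{
  set until time "to_date('{ts}','YYYY-MM-DD HH24:MI:SS')";
  restore database;
  recover database;
}}
sql 'alter database open resetlogs';
"""

def main() -> None:
    until = pick_recovery_point(datetime.now())
    print(f"Refreshing dev database to {until:%Y-%m-%d %H:%M}")
    # Feed the script to RMAN against the dev instance; connection details omitted.
    subprocess.run(["rman", "target", "/"], input=build_rman_script(until),
                   text=True, check=True)

if __name__ == "__main__":
    main()
```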
I also drove home the importance of recoverable backups by regularly asking my sysadmins and DBAs, first thing in the morning, how the backups went last night. I set an expectation that by the time I came in, they'd better know. To further emphasize the importance of backups, I used to ask my sysadmins to delete a mailbox or directory structure while I watched, and then recover whatever they'd deleted using last night's backup. If they hesitated, I knew that they were not confident in their backups.
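Sketched out - with a hypothetical victim path and a placeholder standing in for the real NetWorker recover command - the drill amounts to: note what is there, delete it, restore from last night's backup, and sanity-check the result.

```python
#!/usr/bin/env python3
"""Sketch of the delete-and-recover drill; path and restore command are placeholders."""
import shutil
import subprocess
from pathlib import Path

VICTIM = Path("/srv/scratch/drill-target")                        # hypothetical directory
RESTORE_CMD = ["/usr/local/bin/restore-last-night", str(VICTIM)]  # placeholder command

def run_drill() -> bool:
    # Remember what was there before we destroy it.
    before = {p.relative_to(VICTIM) for p in VICTIM.rglob("*")}
    shutil.rmtree(VICTIM)                       # the scary part
    subprocess.run(RESTORE_CMD, check=True)     # restore from last night's backup
    after = {p.relative_to(VICTIM) for p in VICTIM.rglob("*")}
    # Files created since the backup won't come back; the point is that the
    # restore completes and brings back (nearly) everything it should.
    return len(before & after) >= 0.9 * len(before)  # 0.9 is an arbitrary sketch threshold

if __name__ == "__main__":
    print("drill passed" if run_drill() else "drill FAILED")
```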
We also built robust
remote management into every device in both the data center and backbone. Every
serial console was attached to a network-enabled terminal server, every
keyboard and monitor was attached to a network-enabled KVM, and every server
chassis had its lights-out board fully functional. The network
interfaces to the remote consoles were attached to partner networks - so that
even if we borked up our data center or backbone network completely, we could
probably recover it without going on site.
My goal was to minimize the need to visit the data center and maximize our ability to work remotely, including from home. Once we were fully remote-capable, we were able to perform major upgrades, database and server migrations, and data center failovers while working remotely.
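To make the idea concrete, here is a minimal sketch - with hypothetical hostnames and ports - of the sort of check that tells you whether every out-of-band path still answers before you actually need it:

```python
#!/usr/bin/env python3
"""Sketch: verify that every out-of-band management path answers."""
import socket

# Hypothetical inventory: each server's console paths (terminal-server port,
# lights-out board, etc.). The real inventory lived elsewhere, of course.
CONSOLES = {
    "db01": [("ts01.mgmt.example.net", 2001), ("ilo-db01.mgmt.example.net", 443)],
    "app01": [("ts01.mgmt.example.net", 2002), ("ilo-app01.mgmt.example.net", 443)],
}

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for server, paths in CONSOLES.items():
        for host, port in paths:
            status = "ok" if reachable(host, port) else "UNREACHABLE"
            print(f"{server}: {host}:{port} {status}")
```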
During the period when Solaris was making great strides in advancing the state of the art in UNIX systems, we fully exercised the advanced features of Solaris 10. We fully adopted ZFS, zones, live migrations, resource management, etc. For this I credit our top-notch lead UNIX sysadmin, whose skills were equal to anyone's, anywhere. We also pushed Solaris hard enough to uncover a couple of catastrophic ZFS bugs - resulting in corrupted ZFS file systems and full point-in-time database recoveries.
Once Oracle bought
Sun Microsystems, I stopped investing in Solaris.
As a side-effect of
hosting a particular application, we introduced content-aware load balancing.
We ended up with NetScaler load balancers - which turned out to be a very good
choice. We quickly implemented a standard that required all applications to be
layer-4+ load balanced, even if they were single-server, non-redundant. The
load balancers were implemented as reverse proxies with SSL termination,
content awareness and URL filtering. Our goal was that no application or server could be visible
to the Internet without a load balancer configuration in front of it.
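Conceptually, the content-aware piece was an allow-list of URL paths per published application, with anything else dropped at the proxy. The toy Python sketch below illustrates the policy idea only - it is not the actual NetScaler configuration, and the hostnames and paths are made up:

```python
#!/usr/bin/env python3
"""Toy sketch of per-application URL allow-listing at a reverse proxy."""
from urllib.parse import urlparse

# Hypothetical policy: each published application exposes only these path prefixes.
ALLOWED_PREFIXES = {
    "webmail.example.org": ["/owa/", "/exchange/"],
    "portal.example.org": ["/app/", "/static/", "/login"],
}

def permitted(host: str, url: str) -> bool:
    """Return True only if the request path starts with an allow-listed prefix."""
    path = urlparse(url).path or "/"
    prefixes = ALLOWED_PREFIXES.get(host, [])   # unknown hosts get nothing
    return any(path.startswith(p) for p in prefixes)

if __name__ == "__main__":
    tests = [
        ("webmail.example.org", "/owa/inbox"),       # allowed
        ("webmail.example.org", "/admin/console"),   # dropped at the proxy
        ("random.example.org", "/anything"),         # no config, not reachable
    ]
    for host, url in tests:
        print(f"{host}{url}: {'allow' if permitted(host, url) else 'deny'}")
```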
The load balancers
therefore provided a strict control plane that managed access to the applications and an extremely useful layer of abstraction and isolation between users and the server(s) that hosted the applications. At first - in the early 2000s - most
applications balked at being hosted behind a proxy. We often had to
reverse-engineer the vendor application sufficiently to make it work in our
environment.
The combination of
outbound default-deny on the data center firewalls and the reverse proxy layer was instrumental in helping secure the applications. In many cases we were
able to analyze the latest
vulnerabilities and determine that in our strictly controlled
environment, the attack vector was not viable. That allowed us to be far more
thoughtful and rational about when and how to accelerate patching and
vulnerability management.
Part 4 - Security and Firewalling
Part 6 - Building out Disaster Recovery