Thirty-four years in IT - System Administration, Backups, and Data Centers (Part 5)


As a side effect of building and running the backbone, I introduced UNIX systems into what was then a wholly VMS organization. We initially used Linux, roughly from 1994 to 1997; over the next 20+ years we migrated briefly to Solaris x86, then to Solaris SPARC, back to Solaris x86/x64, and finally back to Linux.

Our CIO at the time recognized that a pure VMS/RDB shop was not a valid long-term strategy, and as a result had us host a UNIX/Oracle application on behalf of another organization as part of building out a new capability he knew we'd need someday. Because our VMS/RDB team didn't appreciate (and in some cases was genuinely hostile toward) non-VMS platforms, they declined to take on building and managing the UNIX/Oracle stack. So my team and I did.


My team eventually picked up responsibility for most of the enterprise-wide application hosting. This included Windows and UNIX system administration, SQL Server and MySQL administration, and (eventually) Oracle database administration. As we started managing those technologies, we went through a process similar to our early network management work: we systematized and rationalized the servers and applications, brought them all up to common operating system versions and patch levels, configured file systems consistently, created common application installations, and so on. In many cases the simplest, most straightforward path was to reinstall the applications on new operating system instances and then migrate the application data. Often the servers and applications we inherited from other teams carried so much entropy that more than one fresh install was necessary - the first to learn how the application worked, the second to systematize and optimize the servers and applications for long-term reliability and maintenance.

Inheriting poorly managed, unreliable systems was something we did several times. In every case, the team from which we inherited the technology or platform had trouble with the basics: knowing which servers were where, what OS version and patch level was installed, and where and how they were backed up. Step one for us was discovery. On day one I asked my staff to tell me where every inherited server was, what it did, and whether or not it was backed up.
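
In practice, that first discovery pass doesn't need anything fancy - a script that walks an inventory list, asks each host what it is, and notes whether it answers covers most of day one. The sketch below is only illustrative of that idea; the servers.csv inventory file, its columns, and the SSH probe are hypothetical placeholders, not our actual tooling:

    #!/usr/bin/env python3
    """Illustrative inventory sweep: for each host in a (hypothetical) servers.csv,
    record whether it answers, what OS it reports, and whether it claims to be
    backed up."""
    import csv
    import subprocess

    INVENTORY = "servers.csv"   # hypothetical file with hostname,owner,backed_up columns

    def probe(host: str) -> str:
        """Return the OS/kernel string reported by the host, or 'unreachable'."""
        try:
            out = subprocess.run(
                ["ssh", "-o", "ConnectTimeout=5", host, "uname -sr"],
                capture_output=True, text=True, timeout=10)
            return out.stdout.strip() or "unreachable"
        except (subprocess.TimeoutExpired, OSError):
            return "unreachable"

    def main() -> None:
        with open(INVENTORY, newline="") as f:
            for row in csv.DictReader(f):
                host = row["hostname"]
                print(f"{host:30} {probe(host):20} backed_up={row.get('backed_up', '?')}")

    if __name__ == "__main__":
        main()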

My team also took on responsibility for managing data center infrastructure. At the time, the infrastructure was all one-off and ad hoc, with little or no predictability or commonality between racks, servers, etc. We came up with a simple data center network design and common standards for building and powering racks, racking servers, and routing power, network, and Fibre Channel cabling. At all times we emphasized redundancy, consistency, simplicity, and structure.

At one point I also moved our enterprise backup system into my team. I felt that the team responsible for the RPO and RTO of an application should also be responsible for its backup and recovery. And of course I could sleep nights knowing that my team was running the backups. We redesigned the Legato NetWorker-based system from the ground up, wrapped it in Perl scripts that covered us in the places where NetWorker fell short, and took on the painful task of managing tape-based backups.
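
Most of what those wrappers did was unglamorous: confirm that every client that was supposed to be backed up actually shows up as successful in last night's run, and complain loudly if not. A rough Python equivalent of that idea - the expected-client list and the completion-report format here are hypothetical, not actual NetWorker output:

    #!/usr/bin/env python3
    """Illustrative backup-completion check: compare the clients we expect to
    see against last night's completion report and flag anything missing.
    The report path and line format below are hypothetical."""
    import sys

    EXPECTED = {"db01", "db02", "app01", "app02", "files01"}   # hypothetical client list
    REPORT = "/var/log/backup/last_night.txt"                  # hypothetical report path

    def main() -> int:
        succeeded = set()
        with open(REPORT) as f:
            for line in f:
                # hypothetical line format: "<client> <saveset> <status>"
                parts = line.split()
                if len(parts) >= 3 and parts[2].lower() == "succeeded":
                    succeeded.add(parts[0])
        missing = sorted(EXPECTED - succeeded)
        if missing:
            print("BACKUP GAPS:", ", ".join(missing))
            return 1
        print("all expected clients backed up")
        return 0

    if __name__ == "__main__":
        sys.exit(main())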

To ensure that our backups were reliable, we preferred to incorporate data recovery into routine processes. For example, one of our apps needed its development database refreshed periodically. Even though we could have refreshed from production, we did not. Instead we refreshed from a randomly chosen recent backup to a randomly chosen point in time, thereby exercising full database recovery every few weeks or months. When we did have to perform point-in-time recovery on production systems, we were able to recover easily.
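
The selection logic itself is trivial; the point is that the backup and the recovery target are not cherry-picked, so each refresh genuinely rehearses a recovery. Something like the sketch below, where the retention window is a made-up placeholder and the actual restore/recover step is left to the database's own tooling:

    #!/usr/bin/env python3
    """Illustrative random point-in-time picker for a dev refresh: choose a
    random timestamp within the retention window so the refresh exercises a
    real point-in-time recovery rather than a copy of current production."""
    import random
    from datetime import datetime, timedelta

    RETENTION_DAYS = 28        # hypothetical retention window
    now = datetime.now()

    # Pick a uniformly random moment between (now - retention) and yesterday,
    # so we never target a point newer than the last completed backup.
    window_start = now - timedelta(days=RETENTION_DAYS)
    window_end = now - timedelta(days=1)
    span = (window_end - window_start).total_seconds()
    target = window_start + timedelta(seconds=random.uniform(0, span))

    print(f"Refresh dev from backup, recovering to: {target:%Y-%m-%d %H:%M:%S}")
    # The restore and recover-until-time steps belong to the database's own
    # tooling and are deliberately left out of this sketch.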

I also drove home the importance of recoverable backups by regularly asking my sysadmins and DBAs, first thing in the morning, how the backups went last night. I set the expectation that by the time I came in, they'd better know. To further emphasize the importance of backups, I used to ask my sysadmins to delete a mailbox or directory structure while I watched, and then recover whatever they'd deleted from last night's backup. If they hesitated, I knew they were not confident in their backups.

We also built robust remote management into every device in both the data center and backbone. Every serial console was attached to a network-enabled terminal server, every keyboard and monitor was attached to a network-enabled KVM, and every server chassis had its lights-out board fully functional. The network interfaces to the remote consoles were attached to partner networks - so that even if we borked up our data center or backbone network completely, we could probably recover it without going on site.

My goal was to minimize the need to visit the data center and maximize our ability to work remotely, including from home. Once we were fully remote-capable, we were able to perform major upgrades, database and server migrations, and data center failovers while working remotely.

During the period when Solaris was advancing the state of the art in UNIX systems, we fully exercised the advanced features of Solaris 10. We fully adopted ZFS, zones, live migrations, resource management, etc. For this I credit our top-notch lead UNIX sysadmin, whose skills are the equal of anyone's, anywhere. We also pushed Solaris hard enough to uncover a couple of catastrophic ZFS bugs - resulting in corrupted ZFS file systems and full point-in-time database recoveries.

Once Oracle bought Sun Microsystems, I stopped investing in Solaris.

As a side effect of hosting a particular application, we introduced content-aware load balancing. We ended up with NetScaler load balancers - which turned out to be a very good choice. We quickly implemented a standard that required all applications to be load balanced at layer 4 and above, even if they were single-server and non-redundant. The load balancers were implemented as reverse proxies with SSL termination, content awareness, and URL filtering. Our goal was that no application or server could be visible to the Internet without a load balancer configuration in front of it.
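
Enforcing that goal is largely an audit problem: every externally published name should resolve to a load balancer VIP, never directly to a server. A minimal sketch of that kind of check - the host list and VIP network below are hypothetical placeholders, not our actual configuration:

    #!/usr/bin/env python3
    """Illustrative audit: verify every externally published hostname resolves
    to a load balancer VIP rather than directly to a server."""
    import ipaddress
    import socket

    PUBLISHED_HOSTS = ["app.example.org", "portal.example.org"]   # hypothetical
    LB_VIP_NETWORK = ipaddress.ip_network("192.0.2.0/24")          # hypothetical VIP range

    def main() -> None:
        for host in PUBLISHED_HOSTS:
            try:
                addr = ipaddress.ip_address(socket.gethostbyname(host))
            except socket.gaierror:
                print(f"{host:30} does not resolve")
                continue
            ok = addr in LB_VIP_NETWORK
            print(f"{host:30} {addr}  {'behind LB' if ok else 'EXPOSED DIRECTLY'}")

    if __name__ == "__main__":
        main()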

The load balancers therefore provided a strict control plane that managed access to each application, and an extremely useful layer of abstraction and isolation between users and the server(s) hosting the applications. At first - in the early 2000s - most applications balked at being hosted behind a proxy, and we often had to reverse-engineer the vendor's application sufficiently to make it work in our environment.

The combination of outbound default-deny on the data center firewalls and the reverse proxy layer was instrumental in helping secure the applications. In many cases we were able to analyze the latest vulnerabilities and determine that in our strictly controlled environment, the attack vector was not viable. That allowed us to be far more thoughtful and rational about when and how to accelerate patching and vulnerability management.

Part 4 - Security and Firewalling
Part 6 - Building out Disaster Recovery