
Thirty-four years - System Administration, Backups, and Data Centers (Part 5)


As a side effect of building and running the backbone, I introduced UNIX systems into what was then a wholly VMS organization. We initially used Linux, roughly from 1994 to 1997; over the next 20+ years we migrated briefly to Solaris x86, then to Solaris SPARC, back to Solaris x86/x64, and finally back to Linux.

Our CIO at the time recognized that a pure VMS/RDB shop was not a valid long-term strategy, and as a result had us host a UNIX/Oracle application on behalf of another organization as part of building out a new capability that he recognized we'd need someday. Because our VMS/RDB team didn't appreciate non-VMS platforms (and in some cases was genuinely hostile toward them), they declined to take on building and managing the UNIX/Oracle stack. So I and my team did.

My team eventually picked up responsibility for most of the enterprise-wide application hosting. This included Windows and UNIX system administration, SQL Server and MySQL administration, and (eventually) Oracle database administration. As we started managing those technologies we went through a process similar to our early network management work: we systematized and rationalized the servers and applications, brought them all up to common operating system versions and patch levels, consistently configured file systems, created common application installations, etc. In many cases the simplest, most straightforward path was to reinstall the apps on new operating system instances and then migrate the application data. Often the servers and applications we inherited from other teams had accumulated so much entropy that more than one fresh install was necessary - the first to learn how the application worked, the second to systematize and optimize the servers and applications for long-term reliability and maintenance.

Inheriting poorly managed, unreliable systems was something we did several times. In every case, the team from which we inherited the technology or platform had trouble with the basics - which servers were where, what OS version and patch level was installed, where and how they were backed up, etc. Step one for us was discovery. On day one I asked my staff to tell me where every inherited server was, what it did, and whether or not it was backed up.
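That first inventory pass can be scripted. Below is a minimal sketch, assuming key-based SSH access, a seed list of hostnames in hosts.txt, and the presence of NetWorker's nsrexecd client daemon as a crude backed-up-or-not signal - all of those are illustrative assumptions, not our actual tooling (which was mostly Perl):

```python
#!/usr/bin/env python3
"""Day-one discovery sketch: inventory inherited servers over SSH.

Assumptions: key-based SSH access, a hosts.txt seed list, and a running
nsrexecd (the NetWorker client daemon) as the backed-up heuristic.
"""
import subprocess

def probe(host: str) -> dict:
    """Collect OS version and a crude backed-up/not-backed-up signal."""
    def ssh(cmd: str) -> str:
        out = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, cmd],
            capture_output=True, text=True, timeout=15,
        )
        return out.stdout.strip()

    return {
        "host": host,
        "os": ssh("uname -srm"),  # kernel, release, architecture
        "backup_client": ssh("pgrep -x nsrexecd || echo MISSING"),
    }

if __name__ == "__main__":
    with open("hosts.txt") as f:
        for line in f:
            host = line.strip()
            if host:
                print(probe(host))
```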

My team also took on responsibility for managing data center infrastructure. At the time, the infrastructure was all one-off and ad hoc, with little or no predictability or commonality between racks, servers, etc. We came up with a simple data center network design and common standards for building and powering racks, racking servers, and routing power, network, and fiber channel. At all times we emphasized redundancy, consistency, simplicity, and structure.

At one point I also moved our enterprise backup system into my team. I felt that the team that had responsibility for the RPO and RTO of an application should also have responsibility for its backup and recovery. And of course I could sleep nights knowing that my team was running the backups. We redesigned the Legato NetWorker-based system from the ground up, wrapped it in Perl scripts that covered us in places where NetWorker fell short, and took on the painful task of managing tape-based backups.
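Our wrappers were Perl, but the shape of a status check is easy to sketch in Python. This one queries NetWorker's mminfo for last night's save sets and flags anything missing or aborted; the client list is hypothetical, and the mminfo query syntax and report fields are from memory - verify them against your NetWorker version's documentation before relying on them:

```python
#!/usr/bin/env python3
"""Backup-status sketch in the spirit of our NetWorker wrappers.

The mminfo query and report fields below are illustrative assumptions;
check your NetWorker docs. The point is the wrapper: query last night's
save sets, flag anything missing or aborted, and fail loudly.
"""
import subprocess
import sys

EXPECTED_CLIENTS = {"db01", "app01", "app02"}  # hypothetical client list

def last_night_savesets() -> list[dict]:
    out = subprocess.run(
        ["mminfo", "-q", "savetime>=yesterday",   # query flags: assumption
         "-r", "client,name,level,sumflags"],     # report fields: assumption
        capture_output=True, text=True, check=True,
    )
    rows = []
    for line in out.stdout.splitlines()[1:]:      # skip the header row
        fields = line.split()                     # crude: names may have spaces
        if len(fields) >= 4:
            rows.append({"client": fields[0], "flags": fields[-1]})
    return rows

def main() -> int:
    rows = last_night_savesets()
    seen = {r["client"] for r in rows}
    missing = EXPECTED_CLIENTS - seen
    # 'a' as the aborted marker in sumflags is an assumption - verify.
    aborted = [r for r in rows if "a" in r["flags"]]
    for client in sorted(missing):
        print(f"MISSING: no save set for {client}")
    for r in aborted:
        print(f"ABORTED: {r['client']}")
    return 1 if (missing or aborted) else 0

if __name__ == "__main__":
    sys.exit(main())
```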

To ensure that our backups were reliable, we preferred to incorporate data recovery into routine processes. For example, one of our apps needed a development database refreshed periodically. Even though we could have refreshed from production, we did not. Instead we refreshed from a randomly chosen recent backup to a randomly chosen point in time, thereby exercising full database recovery every few weeks or months. When we later had to perform point-in-time recovery on production systems, we were able to recover easily.
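A sketch of the random-refresh idea, assuming an Oracle database recovered with RMAN; the retention window and invocation details are assumptions, but the essential trick is choosing the recovery target at random, so every dev refresh exercises a genuine point-in-time recovery:

```python
#!/usr/bin/env python3
"""Random point-in-time dev-refresh sketch.

Picks a random recovery point within an assumed retention window and
emits an Oracle RMAN until-time restore script. Window and invocation
are assumptions for illustration.
"""
import random
from datetime import datetime, timedelta

RETENTION_DAYS = 28  # assumed backup retention window

def random_recovery_point() -> datetime:
    window = timedelta(days=RETENTION_DAYS)
    offset = timedelta(seconds=random.randint(0, int(window.total_seconds())))
    return datetime.now() - offset

if __name__ == "__main__":
    until = random_recovery_point().strftime("%Y-%m-%d %H:%M:%S")
    # Standard RMAN until-time restore; feed this to "rman target /".
    # After it completes, open the database with ALTER DATABASE OPEN RESETLOGS.
    print(f"""RUN {{
  SET UNTIL TIME "TO_DATE('{until}', 'YYYY-MM-DD HH24:MI:SS')";
  RESTORE DATABASE;
  RECOVER DATABASE;
}}""")
```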

I also drove home the importance of recoverable backups by regularly asking my sysadmins and DBAs, first thing in the morning, how the backups went last night. I set an expectation that by the time I came in, they'd better know. To further emphasize the importance of backups, I used to ask my sysadmins to delete a mailbox or directory structure while I watched, and then recover whatever they'd deleted using last night's backup. If they hesitated, I knew they were not confident in their backups.
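The delete-and-recover drill can even be made routine. In this sketch, restore_from_backup() stands in for whatever your backup tool's restore command is (NetWorker's CLI restore tool is recover, though the exact flags vary by setup), and the victim path is hypothetical:

```python
#!/usr/bin/env python3
"""Restore-drill sketch: delete something, get it back from last night's backup.

restore_from_backup() is a placeholder for your real restore command;
the drill itself is tool-agnostic. Hash the victim file, delete it,
restore it, and verify the hashes match.
"""
import hashlib
import os
import subprocess

def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def restore_from_backup(path: str) -> None:
    # NetWorker's restore tool is "recover"; the exact invocation below
    # is an assumption - check the man page for your environment.
    subprocess.run(["recover", "-a", path], check=True)

def drill(path: str) -> bool:
    before = sha256(path)
    os.remove(path)                # delete it while the boss watches
    restore_from_backup(path)      # recover from last night's backup
    return sha256(path) == before  # prove the backup was real

if __name__ == "__main__":
    ok = drill("/var/tmp/drill-victim.txt")  # hypothetical victim file
    print("backup verified" if ok else "RESTORE FAILED - investigate now")
```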

We also built robust remote management into every device in both the data center and backbone. Every serial console was attached to a network-enabled terminal server, every keyboard and monitor was attached to a network-enabled KVM, and every server chassis had its lights-out board fully functional. The network interfaces to the remote consoles were attached to partner networks - so that even if we borked up our data center or backbone network completely, we could probably recover it without going on site.
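Out-of-band paths only help if they still work when you need them, so it's worth routinely confirming from the partner-network side that every console answers. A minimal sketch, with hypothetical device addresses and ports:

```python
#!/usr/bin/env python3
"""Out-of-band reachability sketch: verify the remote consoles answer.

Device names, addresses, and ports are assumptions. Run from the partner
network; a simple TCP connect to each terminal server, KVM, and
lights-out board confirms the out-of-band path before it's needed in anger.
"""
import socket

# Hypothetical out-of-band devices: (name, host, port)
CONSOLES = [
    ("ts-rack1",  "10.200.0.11", 22),   # terminal server, SSH
    ("kvm-rack1", "10.200.0.21", 443),  # network KVM, HTTPS
    ("ilo-db01",  "10.200.0.31", 443),  # lights-out board, HTTPS
]

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, host, port in CONSOLES:
        status = "ok" if reachable(host, port) else "UNREACHABLE"
        print(f"{name:12s} {host}:{port}  {status}")
```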

My goal was to minimize the need to visit the data center and maximize our ability to work remotely, including from home. Once we got fully remote-capable, we were able to perform major upgrades, database and server migrations, and data center failovers while working remotely.

During the period when Solaris was making great strides in advancing the state of the art in UNIX systems, we fully exercised the advanced features of Solaris 10. We fully adopted ZFS, zones, live migrations, resource management, etc. For this I credit our top-notch lead UNIX sysadmin, whose skills are equal to anyone's, anywhere. We also pushed Solaris hard enough to uncover a couple of catastrophic ZFS bugs - resulting in corrupted ZFS file systems and full point-in-time database recoveries.

Once Oracle bought Sun Microsystems, I stopped investing in Solaris.

As a side effect of hosting a particular application, we introduced content-aware load balancing. We ended up with NetScaler load balancers, which turned out to be a very good choice. We quickly implemented a standard that required all applications to be layer-4+ load balanced, even if they were single-server and non-redundant. The load balancers were implemented as reverse proxies with SSL termination, content awareness, and URL filtering. Our goal was that no application or server could be visible to the Internet without a load balancer configuration in front of it.

The load balancers therefore provided a strict control plane that managed access to the applications and an extremely useful layer of abstraction and isolation between users and the server(s) that hosted the applications. At first - in the early 2000s - most applications balked at being hosted behind a proxy. We often had to reverse-engineer the vendor application sufficiently to make it work in our environment.
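To illustrate the pattern (this is a teaching sketch, not NetScaler configuration), here is a toy reverse proxy with URL filtering; the backend address and allow-list are assumptions, and TLS termination is omitted for brevity:

```python
#!/usr/bin/env python3
"""Toy reverse proxy with URL filtering, illustrating the pattern.

A sketch of what the load balancer layer did for us: reverse proxying
plus an allow-list of URL prefixes. Not production code; backend and
prefixes are assumptions.
"""
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKEND = "http://127.0.0.1:8080"          # hypothetical application server
ALLOWED_PREFIXES = ("/app/", "/static/")   # everything else is rejected

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Content awareness: only forward URLs matching the allow-list.
        if not self.path.startswith(ALLOWED_PREFIXES):
            self.send_error(403, "URL not permitted by proxy policy")
            return
        with urlopen(BACKEND + self.path) as resp:  # forward to the backend
            body = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type",
                                              "application/octet-stream"))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8443), ProxyHandler).serve_forever()
```

Anything that doesn't match the allow-list never reaches the application server at all, which is exactly the isolation property described above.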

The combination of outbound default-deny on the data center firewalls and the reverse proxy layer was instrumental in helping secure the applications. In many cases we were able to analyze the latest vulnerabilities and determine that in our strictly controlled environment, the attack vector was not viable. That allowed us to be far more thoughtful and rational about when and how to accelerate patching and vulnerability management.

Part 4 - Security and Firewalling
Part 6 - Building out Disaster Recovery

