
Solaris Live Upgrade, Thin Servers and Upgrade Strategies

Sun recently upgraded Solaris Live Upgrade to permit live upgrades of servers with zones, and from what I can tell, seems to be positioning Live Upgrade as the standard method for patching and upgrading Solaris servers. This gives Solaris sysadmins another useful tool in the toolkit. For those who don't know Solaris:
Solaris Live Upgrade provides a method of upgrading a system while the system continues to operate. While your current boot environment is running, you can duplicate the boot environment, then upgrade the duplicate. Or, rather than upgrading, you can install a Solaris Flash archive on a boot environment.
The short explanation is that a system administrator can create a new boot disk from a running server (or an image of a bootable server) while the server is running, then upgrade, patch or otherwise tweak the new boot disk, and during a maintenance window, reboot the server from the new image. In theory, the downtime for a major Solaris upgrade, say from Solaris 9 to Solaris 10, is only the time that it takes to reboot the server. If the reboot fails, the old boot disk is still usable, untouched. The system administrator simply reboots from the old boot disk. (Actually the required change window would be the time to reboot, test, and reboot again if the test fails. But the fallback is simple and low risk, so the window can still be shorter than with other strategies.)
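As a rough illustration of that workflow, here is a minimal Python sketch that only assembles the command sequence and does not run anything. The commands `lucreate`, `luupgrade`, `luactivate` and `init` are the standard Live Upgrade tools, but the boot environment names, disk slice and image path are made-up placeholders, and the sketch assumes a simple UFS root on a spare slice. Check the man pages for your release before using any of it.

```python
def live_upgrade_plan(new_be, root_slice, os_image):
    """Build the Live Upgrade command sequence as a list of argv lists.

    Nothing is executed here; the caller decides how and when to run it.
    """
    return [
        # Copy the running boot environment onto a spare slice (UFS root assumed).
        ["lucreate", "-n", new_be, "-m", "/:" + root_slice + ":ufs"],
        # Upgrade the inactive copy from an OS image while production keeps running.
        ["luupgrade", "-u", "-n", new_be, "-s", os_image],
        # Mark the upgraded copy as the boot environment for the next boot.
        ["luactivate", new_be],
        # Maintenance window: Live Upgrade expects init or shutdown here, not reboot.
        ["init", "6"],
    ]


def fallback_plan(old_be):
    """Commands to return to the untouched original boot environment.

    Assumes the new environment boots far enough to run commands at all.
    """
    return [["luactivate", old_be], ["init", "6"]]


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    plan = live_upgrade_plan("s10_be", "/dev/dsk/c0t1d0s0",
                             "/net/installserver/export/s10")
    for argv in plan + fallback_plan("s9_be"):
        print(" ".join(argv))
```

Only the last step of the plan needs the maintenance window. Everything before it runs against the inactive copy.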

Other high-end operating systems support something like this, and many system administrators maintain multiple bootable disks in a server to protect themselves against various operating system failures. Sun's Live Upgrade makes maintaining the alternate boot disks relatively painless.

If you are in Windows land, imagine this: While your Windows 2003 server is running, copy your C: drive, including all registry entries, system files, tweaks, user files and configuration bits to an empty partition. Then while the server is still up and in production, without rebooting, upgrade that new partition to Windows 2008, preserving configs, tweaks, installed software and the registry. Then reboot the server to the new 2008 boot partition, and have everything work. The downtime would be the time it takes to reboot, not the time it takes to do an in-place upgrade. And if it didn't work, a reboot back to the original 2003 installation puts you back where you were. Pretty cool.

We tried Live Upgrade when it first came out, but ran into enough limitations that we rarely used it for routine upgrades. Early versions couldn't upgrade logical volumes or servers with zones, and the workarounds for a live upgrade with either were pretty ugly. Also, by the time a server was ready for a major version upgrade, we usually were ready to replace hardware anyway. I've got one old dog of a server at home that has been live upgraded from Solaris 8, to 9, through the current Solaris 10, and a whole bunch of mid-cycle upgrades and patches in between, but because of the limitations and our aggressive hardware upgrade cycle, we've generally not used Live Upgrade on our critical production servers. We might re-try it, now that zones and logical volumes are supported.

Alternative Upgrade Strategies

Don't upgrade. Reinstall.
The thinner the server, the more attractive a non-upgrade strategy becomes. In the ideal thin server world, servers would be minimized to their bare essentials (the least-bit principle), configured identically, and deployed using standard operating system images. Applications would be similarly pre-configured, self contained, with all configuration and customization contained in well defined and well known files and directory structures. Operating systems, application software and application data would all be logically separated, never mixed together.

In this ideal world, a redundant server could be taken offline, booted from a new master operating system image, fully patched and configured, and have required applications deployed to the new server with some form of semi-automatic deployment process. Data would be on data volumes, untouched. Upgrades would be re-installs, not upgrades. Applications would be well understood and trivial to install and configure, and system administrators would know exactly what is required for an application to function. All the random cruft of dead files that accumulates on servers would get purged. Life would be good.

Obviously if you've ever installed to /usr/local/bin, or descended into CPAN hell, or if you've ever run the install wizard with your eyes closed and fingers crossed, you are not likely to have success with the 're-install everything' plan unless you've got duplicate hardware and can install and test in parallel.

We are starting to follow this strategy on a subset of our Solaris infrastructure. Some of our applications are literally deployed directly from Subversion to app servers, including binaries, war files, jars and all. So a WAN boot from flar, followed by a scripted zone creation and an SVN deploy of the entire application in theory will re-create the entire server, ready for production.
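The post-install part of that rebuild could be scripted along these lines. This is a sketch under assumptions, not our actual tooling: the zone name, config file, repository URL and target directory are hypothetical, it assumes `svn` is available inside the zone, and it uses `svn export` where a checkout might be preferred. The WAN boot from the flar happens before any of this and is not shown.

```python
def rebuild_plan(zone, zone_cfg_file, svn_url, app_dir):
    """Build the post-install steps as a list of argv lists (nothing is run).

    Assumes the global zone has already been installed from the flash archive.
    """
    return [
        # Define the zone from a prepared zonecfg command file.
        ["zonecfg", "-z", zone, "-f", zone_cfg_file],
        # Install and boot the zone.
        ["zoneadm", "-z", zone, "install"],
        ["zoneadm", "-z", zone, "boot"],
        # Pull the whole application tree from Subversion inside the zone.
        ["zlogin", zone, "svn", "export", svn_url, app_dir],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    steps = rebuild_plan("app1", "/export/zones/app1.cfg",
                         "http://svn.example.com/repo/app1/trunk", "/opt/app1")
    for argv in steps:
        print(" ".join(argv))
```

Keeping the steps as data makes it easy to review or log the plan before anything touches a server.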

Throw new hardware at the problem. For applications that use large numbers of redundant servers, one strategy is to build a new operating system on a new or spare server, re-install and test the application, and cut over to the new hardware and operating system. If the old hardware is still serviceable, it can be re-used as the new hardware for the next server. Rolling upgrades, especially of redundant load-balanced servers, can be pretty risk-free.
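The "old hardware becomes the next spare" rotation can be sketched in a few lines of Python. The role and host names and the `build_and_test` callback are hypothetical stand-ins for whatever build, test and cutover process you actually use. The sketch stops at the first failed build so the remaining servers stay in service on the old operating system.

```python
def leapfrog_upgrade(roles, spare, build_and_test):
    """Rotate a spare host through a pool, one role at a time.

    roles:          dict mapping a role name to the host currently serving it
    spare:          hostname of the idle machine to build first
    build_and_test: callable(host, role) -> True if the rebuilt host is good
    Returns the updated role map and whichever host ended up as the spare.
    """
    roles = dict(roles)  # work on a copy; leave the caller's map alone
    for role in list(roles):
        # Build the new OS and application on the spare, then test it.
        if not build_and_test(spare, role):
            break  # old host for this role is still in service, untouched
        # Cut over: the spare takes the role, the old host becomes the spare.
        old_host = roles[role]
        roles[role] = spare
        spare = old_host
    return roles, spare


if __name__ == "__main__":
    pool = {"web1": "hostA", "web2": "hostB"}
    new_pool, new_spare = leapfrog_upgrade(pool, "hostC", lambda host, role: True)
    print(new_pool, new_spare)
```

With every build passing, each role ends up on freshly built hardware and the last machine taken out of service is left over as the spare.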

Unfortunately, major software and operating systems have conspired to make major upgrades difficult to do on clustered servers and databases. So dropping a Windows 2003 SQL server out of a cluster, reinstalling or upgrading it to 2008, and having it re-join the cluster, isn't an option. Workarounds that stop the cluster, manipulate data LUNs and bring up the old data on the new server are doable though. That makes it possible to do major upgrades in short windows, but not 'zero window'.

Abandon in place. A valid strategy is to simply avoid major upgrades for the entire life cycle of the hardware. Operating system, application and hardware upgrades are decreed to be synchronous. Major operating system upgrades do not occur independently of hardware upgrades. Once a server is installed as O/S version N, it stays that way until it reaches the end of its life cycle, with patches only for security or bug fixes. This depends on the availability of long term vendor support, and requires that system managers maintain proficiency in older operating system versions. In this case, application life cycle and hardware life cycle are one and the same.

Upgrade in place, fingers crossed. This would probably be the least desirable option. Stop the server, insert the upgrade CD, and watch it upgrade (or trash) your server. The risk of failure is high, and the options for fall back are few. (Recover from tape, anyone?) Odds are fair that your test lab server, if you have one, is just different enough that the upgrade tests aren't quite valid, and some upgrade or compatibility snafu will cause a headache. The risk can be mitigated somewhat by creating another bootable disk and storing it somewhere during the upgrade. The fall back in that case is to switch boot disks and reboot from the old disk.

Other options, or combinations of the above, are possible.

