Solaris Live Upgrade, Thin Servers and Upgrade Strategies

Sun recently upgraded Solaris Live Upgrade to permit live upgrades of servers with zones, and from what I can tell, seems to be positioning Live Upgrade as the standard method for patching and upgrading Solaris servers. This gives Solaris sysadmins another useful tool in the toolkit. For those who don't know Solaris:

Solaris Live Upgrade provides a method of upgrading a system while the system continues to operate. While your current boot environment is running, you can duplicate the boot environment, then upgrade the duplicate. Or, rather than upgrading, you can install a Solaris Flash archive on a boot environment.

The short explanation is that a system administrator can create a new boot disk from a running sever (or an image of a bootable server) while the server is running, then upgrade, patch or otherwise tweak the new boot disk, and during a maintenance window, re-boot the server from the new image. In theory, the downtime for a major Solaris upgrade, say from Solaris 9 to Solaris 10, is only the time that it takes to reboot the server. If the reboot fails, the old boot disk is still usable, untouched. The system administrator simply re-boots from the old boot disk. (the required change window would be the time to reboot, test, reboot. But the fallback is simple and low risk, so the window can still be shorter than other strategies.)

Other high-end operating systems support something like this, and many system administrators maintain multiple bootable disks in a server to protect themselves against various operating system failures. Sun's Live Upgrade makes maintaining the alternate boot disks relatively painless.

If you are in Windows land, imagine this: While your Windows 2003 server is running, copy your C: drive, including all registry entries, system files, tweaks, user files and configuration bits to an empty partition. Then while the server is still up & in production, without rebooting, upgrade that new partition to Windows 2008, preserving configs, tweaks, installed software and the registry. Then reboot the server to the new 2008 boot partition and have everything work. The down time would be the time it takes to re-boot, not the time it takes to do an in-place upgrade. And if it didn't work, a reboot back to the original 2003 installation puts you back were you were. Pretty cool.

We tried Live Upgrade when it first came out, but ran into enough limitations that we rarely used it for routine upgrades. Early versions couldn't upgrade logical volumes or servers with zones, and the workarounds for a live upgrade with either were pretty ugly. Also, by the time a server was ready for a major version upgrade, we usually were ready to replace hardware anyway. I've got one old dog of a server at home that has been live upgraded from Solaris 8, to 9, through the current Solaris 10, and a whole bunch of mid-cycle upgrades and patches in between, but because of the limitations and our aggressive hardware upgrade cycle, we've generally not used Live Upgrade on our critical production servers. We might re-try it now that zones and logical volumes are supported.

Alternative Upgrade Strategies

Don't upgrade. Reinstall. The thinner the server, the more attractive a non-upgrade strategy becomes. In the ideal thin server world, servers would be minimized to their bare essentials (the least-bit principle), configured identically, and deployed using standard operating system images. Applications would be similarly pre-configured, self contained, with all configuration and customization contained in well defined and well known files and directory structures. Operating systems, application software and application data would all be logically separated, never mixed together.

In this ideal world, a redundant server could be taken offline, booted from a new master operating system image, fully patched and configured, and have required applications deployed to the new server with some form of semi-automatic deployment process. Data would be on data volumes, untouched. Upgrades would be re-installs, not upgrades. Applications would be well understood and trivial to install and configure, and system administrators would know exactly what is required for an application to function. All the random cruff of dead files that accumulates on servers would get purged. Life would be good.

Obviously if you've ever installed to /usr/local/bin, or descended into CPAN hell, or if you've ever run the install wizard with your eyes closed and fingers crossed, you are not likely to have success with the 're-install everything' plan unless you've got duplicate hardware and can install and test in parallel.

We are starting to follow this strategy on a subset of our Solaris infrastructure. Some of our applications are literally deployed directly from Subversion to app servers, including binaries, war files, jars and all. So a WAN boot from flar, followed by a scripted zone creation and an SVN deploy of the entire application in theory will re-create the entire server, ready for production.

Throw New Hardware at the problem. For applications that use large numbers of redundant servers, a strategy is to build a new operating system on a new or spare server, re-install and test the application, and cut over to the new hardware and operating system. If the old hardware is still serviceable it can be re-used as the new hardware for the next server. Rolling upgrades, especially of redundant load balanced servers, can be pretty risk free.

Unfortunately major software and operating systems have conspired to make major upgrades difficult to do on clustered servers & databases. Dropping a Windows 2003 SQL server out of a cluster, reinstalling or upgrading it to 2008, and having it re-join the cluster, isn't an option. Workarounds that stop the cluster, manipulate data LUN's and bring up the old data on the new server are do-able though. That makes it possible to do major upgrades in short windows, but not 'zero window'.

Abandon in place. A valid strategy is to simply avoid major upgrades for the entire life cycle of the hardware. Operating system, application and hardware upgrades are decreed to be synchronous. Major operating system upgrades do not occur independently of hardware upgrades. Once a server is installed as O/S version N, it stays that way until it reaches the end of its life cycle, with patches only for security or bug fixes. This depends on the availability of long-term vendor support and requires that system managers maintain proficiency in older operating system versions. In this case, application life cycle and hardware life cycle are one in the same.

Upgrade in place, fingers crossed. This would probably be the least desirable option. Stop the server, insert the upgrade CD, and watch it upgrade (or trash) your server. The risk of failure is high, and the options for falling back are few. (Recover from tape anyone?). Odds are fair that your test lab server, if you have one, is just different enough that the upgrade tests aren't quite valid, and some upgrade or compatibility snafu will cause a headache. The risk can be mitigated somewhat by creating another bootable disk and storing it somewhere during the upgrade. The fall back in that case is to switch boot disks and reboot from the old disk.

Other options, or combinations of the above are possible.