Simple Steps to Improving Availability - Five Essential Transitions

A poorly managed high availability cluster will have lower availability than a properly managed non-redundant system.

That's a bold statement, but I'm pretty sure it's true. The bottom line is that the path to improving system availability begins with the fundamentals of system management, not with redundant or HA systems. Only after you have executed the fundamentals will clustering or high availability make a positive contribution to system availability.

Here are the five transitions that are the critical steps on the path to improved availability:

Transition #1: From ad-hoc system management to structured system management.

Structured system management implies that you understand the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications, that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. Ad hoc system management doesn't cut it.

Transition #2: From ad-hoc changes to simple change management.

Simple change management means that you have controls around changes sufficient to determine who/what/when/why on any change to any system or application critical file. Changes are predicted. Changes are documented. Changes are not random, and they do not 'just happen'. A text file 'changes.txt' edited with notepad.exe and stored in c:/changelog/ is not as comprehensive as a million dollar consultant-driven enterprise CMDB that takes years to implement, but is a huge step in the right direction, and certainly provides more incremental value at less cost than the big solution.
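To make the 'changes.txt' idea concrete, here is a minimal Python sketch that appends one line per change recording when/who/what/why. The function names, the tab-separated layout and the log location are all assumptions for illustration; any consistent format would do.

```python
from datetime import datetime, timezone

def format_change(who, what, why, when=None):
    """Return one tab-separated change record: when, who, what, why."""
    when = when or datetime.now(timezone.utc)
    fields = [when.isoformat(timespec="seconds"), who, what, why]
    # Collapse tabs and newlines so each change stays on exactly one line.
    return "\t".join(" ".join(str(field).split()) for field in fields)

def log_change(path, who, what, why, when=None):
    """Append a change record to the log file at `path` (hypothetical location)."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(format_change(who, what, why, when) + "\n")
```

A call such as log_change("changes.txt", "alice", "edited /etc/hosts on web01", "add alias for new db server") is all the discipline requires; the value comes from doing it for every change.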

Transition #3: From 'i dunno....maybe.....' to root cause analysis.

Failures have a cause. All of them. The 'cosmic ray did it' excuse is bullshit. Find the root cause. Fix the core problem. You need to be able to determine that 'the event was caused by .... and can be resolved by ... and can be prevented from ever happening again by ...'. If you cannot find the cause and you have to resort to killing a process or rebooting a server to restore service, then you must add instrumentation, monitoring or debugging to your system sufficient so that the next time the event happens, you will find the cause.

Transition #4: From 'try it...I think it will work..' to 'my tests show that......'

Comprehensive pre-production testing ensures that the systems that you build and the changes that you make will work as expected. You know that they will work because you tested them, and in the rare case that they do not work as expected, you will be able to determine the variation between test and production and devise a test that accommodates the differences.

Transition #5: From non-redundant systems to simple redundancy.

Finally, after you've made transitions one through four, you are ready for implementation of basic active/passive redundancy. Skipping ahead to transition #5 isn't going to get you to your availability goals any sooner.

Remember though, keep it simple. Complexity doesn't necessarily increase availability.


Virtualization Security

Theo's thoughts on virtualization.

"x86 virtualization is about basically placing another nearly full
kernel, full of new bugs, on top of a nasty x86 architecture which
barely has correct page protection.  Then running your operating
system on the other side of this brand new pile of shit."



Ars Technica Articles on Commodore Amiga

Jeremy Reimer of Ars Technica is running an interesting series of articles on the birth and death of the Commodore Amiga.
It handled graphics, sound, and video as easily as other computers of its time manipulated plain text. It was easily ten years ahead of its time. It was everything its designers imagined it could be, except for one crucial problem: the world was essentially unaware of its existence.
We pretty much laughed at the primitive video capabilities of 1985-1990 Macs and PCs as compared to an Amiga. Those two platforms were pathetic. And when we combined a $1500 Amiga with a $1500 Video Toaster card, we had a decent video editing machine nearly as functional as dedicated editors costing far, far more. My guess is that few people today realize how far advanced that computer was compared to its contemporaries, and how long it took the rest of the industry to catch up. I use Windows Movie Maker and iMovie to play around with video, and other than the vast improvements in storage that allow direct editing of on-disk digital video files, the capabilities of today's software aren't that much greater than those of the 20 year old Amiga. We should really be ashamed at how little progress we made in two decades.

It's worth a read.

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. 

Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has its own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

Asynchronous Information Consumption - Random Thoughts

We spend lots of money on various forms of asynchronous media converters. People seem to be moving their media and information consumption habits, and perhaps even a portion of their social habits, toward asynchronous, store-and-forward type mechanisms.

I find the shift to asynchronous consumption interesting. When time independence is combined with the location independence that comes from being un-tethered, a fundamentally different lifestyle results.

Interesting examples, in no particular order:

DVR's convert synchronous media into asynchronous media. I find it wonderfully ironic that we send film crews to deserted islands for weeks, filming ordinary people doing un-ordinary things to each other, send the film back to a studio, where several months are spent slicing it up and gluing it back together to make it somehow more dramatic and presumably more watchable, then synchronously spew it out over the air every Tuesday at 8pm to an audience of twenty million people who are leaning forward from their couches, holding their breaths, remotes in hand, waiting for their DVR's to record the show so they can watch it on Thursday at 10pm.

Broadcast networks are extraordinarily good at delivering media synchronously to huge numbers of people at exactly the same time. Some of us have images from the 1950's of a family around the TV, waiting for the test pattern to morph into the Lone Ranger. The rest of us can google for images from the 1950's of a family around the TV waiting for the test pattern to morph into the Lone Ranger. And millions of media consumers have become extraordinarily good at synchronizing their wind-up clocks, their farm chores, their meals and their lives to the broadcast schedules of first AM radio, then television. Yes, my family, in the 1960's, scheduled our evening meal around the Huntley-Brinkley Report. Dinner never was allowed to intrude on the news. And the news was never a minute late. Producers would have gotten fired if that ever happened.

CNN's Headline News is a strange hybrid attempt to make a synchronous delivery mechanism appear to be asynchronous, by endlessly repeating the same 3 minutes of news and 7 minutes of advertising at 10 minute intervals.

Movies in the theater are synchronous. Video rentals are asynchronous. You don't have to arrive on time for a rental. Netflix is asynchronous, and you don't even have to arrive at all.

Magazines and newspapers are some sort of a hybrid delivery mechanism. They are edited and printed and delivered synchronously, but consumed asynchronously. My 'to be read' pile often has a month's worth of weekly magazines piled up. The editors, press operators, truck drivers and postal carriers who broke their backs getting the press run printed, bound and delivered on time must not be happy with me for waiting a month to read their work.

Twitter is asynchronous, instant messaging is synchronous. Blogs are asynchronous, except for really popular blogs, where a small number of people with apparently nothing else to do lean forward from their chairs, holding their breaths, mouse in hand, waiting for the blogger whom they worship to post a new article so they can be the first with an irrelevant reply.

Podcasting is completely asynchronous. NPR and BBC podcasting effectively saves me the trouble of having to get a Tivo for my radio. In fact if media companies simply podcasted everything, appropriately wrapped up in DRM and embedded with advertising, we'd all be able to toss our DVR's.

E-mail is supposed to be asynchronous. Unfortunately masses of workers put e-mail pop-up notification on their desktops and immediately read and respond to all e-mails, effectively turning a store-and-forward asynchronous medium into a synchronous real-time conversation. Didn't we invent chat for that? E-mail threads, though, replace synchronous meetings, even when they shouldn't.

Second Life is some kind of anomaly. There is no store-and-forward at all, yet it is still popular.

Traditional Learning is horribly synchronous. Report to class every morning at 08:05am. Leave for your next class at 08:55am. Report to your next class at 09:05am. Repeat. Repeat. Repeat. Repeat. Assignment #16 is due Tuesday, April 11th at 09:55. If you get behind, too bad for you. If you get ahead, shame on you. You are being far too disruptive. You need to stop that. If you are a grog in the morning and brilliant at night, too bad for you. Sounds synchronous to me.

On-line learning can be as synchronous or asynchronous as we decide to make it. On-line courses can, in theory, be delivered asynchronously. Most are not. Students still have to structure the course start date, end date and most quizzes and assignments around certain time-date restrictions arbitrarily imposed on them by faculty and burrocrats.

It doesn't have to be that way. Hutchinson Technical College and a few other Minnesota tech colleges conducted a twenty year bold experiment in asynchronous education. They delivered high quality advanced education, including lectures, work assignments, quizzes and exams with almost no date-time restrictions, other than the restriction that they were only able to staff the labs between 7 am and 5 pm. Students could, if they so desired, start their semester any day of the week, any week of the year, complete any assignment, project, exam or quiz any time they felt ready, complete their program, get a degree and graduate the day that they met the program requirements. There were no arbitrary semester boundaries, other than an administrative billing boundary that occurred several times per year. (If you want to keep learning, you gotta keep paying....)

In other news, one of the largest higher education systems in the US finally achieved its decade old goal of a common calendar for all of its member colleges and universities. For the first time since its founding, all colleges and universities in the Minnesota State system will start and end their semesters on exactly the same date.

Solaris Live Upgrade, Thin Servers and Upgrade Strategies

Sun recently upgraded Solaris Live Upgrade to permit live upgrades of servers with zones, and from what I can tell, seems to be positioning Live Upgrade as the standard method for patching and upgrading Solaris servers. This gives Solaris sysadmins another useful tool in the toolkit. For those who don't know Solaris:
Solaris Live Upgrade provides a method of upgrading a system while the system continues to operate. While your current boot environment is running, you can duplicate the boot environment, then upgrade the duplicate. Or, rather than upgrading, you can install a Solaris Flash archive on a boot environment.
The short explanation is that a system administrator can create a new boot disk from a running server (or an image of a bootable server) while the server is running, then upgrade, patch or otherwise tweak the new boot disk, and during a maintenance window, re-boot the server from the new image. In theory, the downtime for a major Solaris upgrade, say from Solaris 9 to Solaris 10, is only the time that it takes to reboot the server. If the reboot fails, the old boot disk is still usable, untouched. The system administrator simply re-boots from the old boot disk. (The required change window would be the time to reboot, test, reboot. But the fallback is simple and low risk, so the window can still be shorter than other strategies.)

Other high-end operating systems support something like this, and many system administrators maintain multiple bootable disks in a server to protect themselves against various operating system failures. Sun's Live Upgrade makes maintaining the alternate boot disks relatively painless.

If you are in Windows land, imagine this: While your Windows 2003 server is running, copy your C: drive, including all registry entries, system files, tweaks, user files and configuration bits to an empty partition. Then while the server is still up & in production, without rebooting, upgrade that new partition to Windows 2008, preserving configs, tweaks, installed software and the registry. Then reboot the server to the new 2008 boot partition and have everything work. The down time would be the time it takes to re-boot, not the time it takes to do an in-place upgrade. And if it didn't work, a reboot back to the original 2003 installation puts you back where you were. Pretty cool.

We tried Live Upgrade when it first came out, but ran into enough limitations that we rarely used it for routine upgrades. Early versions couldn't upgrade logical volumes or servers with zones, and the workarounds for a live upgrade with either were pretty ugly. Also, by the time a server was ready for a major version upgrade, we usually were ready to replace hardware anyway. I've got one old dog of a server at home that has been live upgraded from Solaris 8, to 9, through the current Solaris 10, and a whole bunch of mid-cycle upgrades and patches in between, but because of the limitations and our aggressive hardware upgrade cycle, we've generally not used Live Upgrade on our critical production servers. We might re-try it now that zones and logical volumes are supported.

Alternative Upgrade Strategies

Don't upgrade. Reinstall.
The thinner the server, the more attractive a non-upgrade strategy becomes. In the ideal thin server world, servers would be minimized to their bare essentials (the least-bit principle), configured identically, and deployed using standard operating system images. Applications would be similarly pre-configured, self contained, with all configuration and customization contained in well defined and well known files and directory structures. Operating systems, application software and application data would all be logically separated, never mixed together.

In this ideal world, a redundant server could be taken offline, booted from a new master operating system image, fully patched and configured, and have required applications deployed to the new server with some form of semi-automatic deployment process. Data would be on data volumes, untouched. Upgrades would be re-installs, not upgrades. Applications would be well understood and trivial to install and configure, and system administrators would know exactly what is required for an application to function. All the random cruft of dead files that accumulates on servers would get purged. Life would be good.

Obviously if you've ever installed to /usr/local/bin, or descended into CPAN hell, or if you've ever run the install wizard with your eyes closed and fingers crossed, you are not likely to have success with the 're-install everything' plan unless you've got duplicate hardware and can install and test in parallel.

We are starting to follow this strategy on a subset of our Solaris infrastructure. Some of our applications are literally deployed directly from Subversion to app servers, including binaries, war files, jars and all. So a WAN boot from flar, followed by a scripted zone creation and an SVN deploy of the entire application in theory will re-create the entire server, ready for production.

Throw New Hardware at the problem. For applications that use large numbers of redundant servers, a strategy is to build a new operating system on a new or spare server, re-install and test the application, and cut over to the new hardware and operating system. If the old hardware is still serviceable it can be re-used as the new hardware for the next server. Rolling upgrades, especially of redundant load balanced servers, can be pretty risk free.

Unfortunately major software and operating systems have conspired to make major upgrades difficult to do on clustered servers & databases. Dropping a Windows 2003 SQL server out of a cluster, reinstalling or upgrading it to 2008, and having it re-join the cluster, isn't an option. Workarounds that stop the cluster, manipulate data LUN's and bring up the old data on the new server are do-able though. That makes it possible to do major upgrades in short windows, but not 'zero window'.

Abandon in place. A valid strategy is to simply avoid major upgrades for the entire life cycle of the hardware. Operating system, application and hardware upgrades are decreed to be synchronous. Major operating system upgrades do not occur independently of hardware upgrades. Once a server is installed as O/S version N, it stays that way until it reaches the end of its life cycle, with patches only for security or bug fixes. This depends on the availability of long-term vendor support and requires that system managers maintain proficiency in older operating system versions. In this case, application life cycle and hardware life cycle are one and the same.

Upgrade in place, fingers crossed. This would probably be the least desirable option. Stop the server, insert the upgrade CD, and watch it upgrade (or trash) your server. The risk of failure is high, and the options for falling back are few. (Recover from tape anyone?). Odds are fair that your test lab server, if you have one, is just different enough that the upgrade tests aren't quite valid, and some upgrade or compatibility snafu will cause a headache. The risk can be mitigated somewhat by creating another bootable disk and storing it somewhere during the upgrade. The fall back in that case is to switch boot disks and reboot from the old disk.

Other options, or combinations of the above are possible.

System Management by the Least Bit Principle

System Management Principle Number 6:

If you remove one more bit, the system will fail.

Minimal installs, when applied to systems, result in systems that have higher security and availability.

This isn't new. This is just a restatement and perhaps an extension of normal securing and hardening of systems. The following are examples of how to apply least bit to various parts of your systems and applications.

File System Permissions (rights) are least bit when removing one more role, right or permission results in a failed application. The goal is to minimize permissions, either by starting with no permission bits and adding permissions until the application is fully functional, or by starting with a best guess on permissions and removing rights & permissions until the application fails. Ideally the vendor would have tested and configured the minimum permissions and rights for the application.

That rarely, if ever, happens. The best example I've encountered is a couple decades ago when installing WordPerfect Office (now Novell GroupWise). The WordPerfect Office instructions gave a whole page of detailed instructions on exactly what file system permissions were needed through the entire application directory structure (Read, Filescan, Create, Modify, Write, Erase, etc). I tested every one of them & found that they almost had the principle down perfect. They had only one extra bit on one directory.

Avoid, at all costs, the system admin shortcut of 'just give it full rights, we can figure out what to take away later'. Later never happens.

Oh - and for the Unix'ers reading this - Netware actually HAD file system permissions.
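For the Unix case, the 'one extra bit' test can be sketched in a few lines of Python. This assumes you already know the minimum mode an application needs for a given file (the hard part, and not something code can discover for you); the function names are made up for this example.

```python
import stat

def extra_permission_bits(actual_mode, required_mode):
    """Return the permission bits set in actual_mode that required_mode does not need."""
    return stat.S_IMODE(actual_mode) & ~stat.S_IMODE(required_mode)

def describe(bits):
    """Render permission bits as an rwxrwxrwx-style string for a report."""
    # stat.filemode() prefixes a file-type character; drop it and keep the nine permission flags.
    return stat.filemode(bits)[1:]
```

For a file that only needs 0o750 but was installed as 0o777, describe(extra_permission_bits(0o777, 0o750)) reports the group write bit and all three 'other' bits as excess. In practice the actual mode would come from os.stat(path).st_mode.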

File Systems are full of abandoned applications, abandoned installations, dead scripts and unused Java run-times (how many of those do you have on your servers?). Every one of them is a violation of the least bit principle.

Operating System Installations start with the most minimal installation possible. Use the 'Base' or 'Core' installation and add packages or options as needed for application functionality. Never, under any circumstances, do anything resembling 'full install'. Vendors might balk. System managers need to un-balk the vendors. We were told that Sun E25K's must have the full operating system installed and that our minimal, stripped-down install wasn't supported. We responded by advising that we could, if we so desired, run our 32 core Oracle servers on HPUX.

Microsoft, with Windows Server Core, appears to be moving in this direction also. Thank you Redmond. I'm starting to like you.

Minimal operating system installs carry security bonuses. It's a fine day when you can read down the weekly or monthly list of security vulnerabilities and X them off with 'not vulnerable, package not installed'.

This is a continuation of the age old hardening practice of minimizing the ports that a server listens on, and minimizing the number of services that are running. Except that the services are not even installed.
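The 'X them off' exercise is mechanical enough to sketch in Python. The advisory identifiers and package names below are invented for illustration; a real list would come from your vendor's security bulletins and your own package inventory.

```python
def triage_advisories(advisories, installed_packages):
    """Map each advisory id to a verdict based on whether its package is installed.

    advisories: iterable of (advisory_id, package_name) pairs (hypothetical format).
    installed_packages: iterable of package names present on the server.
    """
    installed = set(installed_packages)
    verdicts = {}
    for advisory_id, package in advisories:
        if package in installed:
            verdicts[advisory_id] = "review: package installed"
        else:
            verdicts[advisory_id] = "not vulnerable, package not installed"
    return verdicts
```

The shorter the installed package list, the more advisories land in the second bucket without anyone having to read them.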

Databases have options, features and configuration that keep us busy full-time figuring out what they do & why we need them. DBA's are pretty good at minimizing user rights and roles. The least bit principle applies not only to user rights and roles, but also to database feature installation. If the application isn't using a feature, the feature should not be installed. With a minimally installed and configured Oracle base, when you walk through the quarterly Oracle nightmare that they call a Critical Patch Update, half the time you get lucky. You get to mark yourself down as 'not vulnerable, package not installed'.

Applications ship with every feature enabled. Disable them all and re-enable the ones that you are using. Delete sample applications and configs. Please. Remove all unnecessary bits from configuration files. If you cannot walk down through the application installation directory and know what each file & directory does, you've got to start reading & calling your vendor tech support. Tomcat does NOT need sample bindings, or whatever it comes pre-configured with. An Apache config can be a couple thousand bytes, instead of 40k. There is no reason to load 30 modules when your Apache instance is serving up plain old HTML. Your goal is 'not vulnerable, module not loaded'.
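As a rough illustration of auditing toward 'module not loaded', this Python sketch lists the modules an Apache config loads and flags any that are not on a list you have decided you need. It only reads LoadModule lines in the text it is given; it does not follow Include directives, and the required-module list is an assumption you would have to supply.

```python
def loaded_modules(config_text):
    """Return module names from active (uncommented) LoadModule lines."""
    modules = []
    for line in config_text.splitlines():
        parts = line.split()
        # LoadModule takes a module name and a file path; commented lines start with '#'.
        if len(parts) >= 3 and parts[0].lower() == "loadmodule":
            modules.append(parts[1])
    return modules

def unneeded_modules(config_text, required):
    """Return loaded modules that are not in the required set, in config order."""
    needed = set(required)
    return [name for name in loaded_modules(config_text) if name not in needed]
```

An empty result from unneeded_modules is the goal; anything it returns is a module someone should be able to justify or remove.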

This also requires separation of application data from application code (or binaries, or executables). Any directory that has both executable code and data cannot be secured by the least bit principle.
Application vendors, including open-source projects, that ship fully configured applications are not doing system administrators and security people a favor. They are forcing us to walk through an entire application and figure out if all three deploy directories are really needed.

Application developers that write applications that mix executables and data should have their 401K's converted to Bear Stearns stock.

Firewalls and Load Balancers are obviously candidates for least bit. For firewalls, the least bit implies that every firewall rule can be tracked back to required application functionality. That also implies that the application managers or system managers know exactly which server talks to whom, in what direction and on what port(s). Time to call the vendor. If they cannot tell you exactly what ports and protocols are used by each component of their application, then you need to admit to yourself that you bought the wrong product from the wrong vendor. And if they wrote an application that uses randomly generated ephemeral ports that can't be firewalled, you know you bought the wrong product.

For load balancers, the least bit implies not only that old, dead load balancer configs are removed, but also that the load balancer is configured to only forward URL's that are required by the application. If the application has an index.html in the application root and a half dozen jar/ear/war files, the load balancer should only forward index.html and /application/* for each jar/ear/war. At least then you know that if someone screws up and deploys a new, unsecured application, or forgets to de-provision an old application, the load balancer will toss the request rather than forward it on. In this case, the least bit principle results in a substantially longer and more complex configuration.
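The forwarding decision described above amounts to an allowlist check. Real load balancers express this in their own rule language, so the Python below is only a sketch of the logic; the function name and the example paths are assumptions, and it looks at the URL path only, not query strings or host headers.

```python
import posixpath

def should_forward(path, exact_paths, app_roots):
    """Return True only if the request path is explicitly allowed.

    exact_paths: set of full paths to forward, e.g. {"/index.html"}.
    app_roots: application roots without a trailing slash, e.g. ["/app1"].
    """
    # Normalize first so '/app1/../admin' cannot slip through as an '/app1' request.
    normalized = posixpath.normpath(path)
    if normalized in exact_paths:
        return True
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in app_roots
    )
```

Anything not listed, including a forgotten old application or a newly deployed unsecured one, is rejected by default, which is exactly the point.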

Network Devices. For switches and routers, this applies to the device firmware or operating system and the configuration. The least bit principle requires us to load the minimum operating system or feature set that is required for required functionality or application support. That means that you load base IOS images, not enterprise images, unless you have a specific requirement for enterprise features.

As security devices, the configuration should permit minimum required traffic and should block protocols and services that are not required for the functionality of the device attached to the switch or router. This means enabling more features on the switch or router (DHCP snooping, spanning tree root guard, etc.) In other words, the network devices only pass required traffic, blocking all other traffic.

This also applies to standard hardening of network devices, such as disabling unnecessary services, as is presumably already being done by network administrators.

Application Deprovisioning is the tail end of the least bit principle. The final task when taking an application out of service is to remove its bits from servers, file systems, load balancers, databases and firewalls. As long as those bits are still in your datacenter, all the security baggage that application carried with it is still around your neck.

If this is System Management Principle #6, I suppose I'll have to dream up #1 through #5. When I think of them, I'll blog about them.

Privacy, Centralization and Databases

A fascinating article on the potential problems associated with privacy on a large government-run database was recently posted at ModernMechanix. The article appears to be in response to an effort to build a centralized data center that would contain personal records on US citizens. The interesting part is that the article appeared in The Atlantic in 1967. Reading it today makes it clear that not only did the author, Arthur R. Miller, lay bare the fundamental issues surrounding centralized government-managed data repositories, but that the issues have neither changed, nor been addressed. Unfortunately the article is probably far more relevant today.

The author is concerned that:
With its insatiable appetite for information, its inability to forget anything that has been put into it, a central computer might become the heart of a government surveillance system that would lay bare our finances, our associations, or our mental and physical health to government inquisitors or even to casual observers.
As we discuss a national database of health records, a national identity card; and as we already have centralized employment reporting, mandatory bank transaction reporting, centralized credit agencies, that though seemingly privately run, are protected from any reasonable privacy rules or laws by some unseen forces in DC, we can pretty much declare that except for cleaning up a few loose ends, we already have what Arthur Miller feared in 1967.

This sounds familiar:
The great bulk of the information likely to find its way into the center will be gathered and processed by relatively unskilled and unimaginative people who lack discrimination and sensitivity.
Show me a governmental agency where some clerk hasn't snooped around in other people's tax or health records. We know it happens. And occasionally we even hear about it in the media.

And of course, once you are in the database, how easy is it to get inaccurate records corrected?
An untested, impersonal, and erroneous computer entry such as “associates with known criminals” has marked him, and he is helpless to rectify the situation. Indeed, it is likely that he would not even be aware that the entry existed.
Does anyone actually think that the No-Fly list is accurate? Or that expunged records are expunged?

I'm still waiting for this to happen:
To ensure the accuracy of the center’s files, an individual should have an opportunity to correct errors in information concerning him. Perhaps a print-out of his computer file should be sent to him once a year.
Let me try that this weekend. I'll write a letter to every state and federal agency that has ever had any records on me and ask them for a copy. I'm sure that will work. While I'm at it, I'll ask for every photo from every traffic camera that I ever drove through also.

Who hasn't heard this idea:
One solution may be to store information according to its sensitivity or its accessibility, or both.
Wow. I just paid a $300/hr consultant to tell me that the new enterprise security best practices will require me to classify our data and secure it according to its sensitivity. I could have read a 40-year-old Atlantic for a dollar instead. (That was sarcasm. Data classification is obvious and self-evident.)

It probably will also be necessary to audit the programs controlling the manipulation of the files and access to the system to make sure that no one has inserted a secret “door” or a password permitting entry to the data by unauthorized personnel.
Fascinating. Code audits, account audits, integrity checking, intrusion detection.

If this is even the slightest bit interesting to you then page through this Harvard Law Review article also (PDF). The arguments are reiterated in greater detail. The notes in the margin are also very interesting.

I'm rather disappointed that my generation, the one that took computing from the early 1980s through today, seems to have neither come up with any significant new privacy issues, nor solved any longstanding privacy problems.

What's even more chilling is that the use of organized, automated data indexing and storage for nefarious purposes has an extraordinary precedent. Edwin Black has concluded that the efficiency of Hollerith punch cards and tabulating machines made possible the extremely "...efficient asset confiscation, ghettoization, deportation, enslaved labor, and, ultimately, annihilation..." of a large group of people that a particular political party found to be undesirable.

History repeats itself. We need to assume that the events of the first half of the twentieth century will re-occur someday, somewhere, with probably greater efficiency.

What are we doing to protect our future?