Sunday, March 30, 2008

Using RSS for System Status

It's time to re-think how status and availability information gets communicated to system managers. System status is sort of like news, so why not use a news reader? I don't mean using RSS for posting blog-like information to users, like "Our systems were borked last night from 03:00 to 05:00." That is a great idea, and many people already do a fine job in that space. I mean something more like "server16 HTTP response time greater than 150ms". The nerdy stuff.

Most IT organizations have monitoring systems that attempt to tell us the status of individual system components like servers, databases, routers, etc. The general idea is to make what we think are relevant measurements related to the performance or availability of components, record the measurements, and hopefully alert on measurements that fall outside of some boundaries. Some places attempt to aggregate that information into some kind of higher level view that depicts the status of an application or group of related applications or networks.

That status information used to be presented on anything from dedicated X-windows sessions on high end workstations to simple Windows application interfaces. If you wanted to see what was happening, you logged into an application or console of some sort and looked for blinking icons or red dots, and perhaps a scrolling log window with so much crud in it that you missed the important stuff anyway.

Somewhere along the line, some vendors moved the monitoring application interface to some form of HTTP-like interface, perhaps with a big ugly blob of Java or an ActiveX control or two, and made it possible to look at status, availability and performance information from an ordinary desktop. And perhaps, if the vendor had a bold vision of the future, they may have even made it possible for more than one client to view the same information at the same time, and maybe even from an ordinary browser.

All of that works, or worked, but none of it solved my problem. System managers need to know what interesting things happened in the last few hours, or better yet, what interesting things have happened since the last time they checked for interesting things. I'm tired of having to access a dedicated application monitoring interface just to make a quick check of system status.

To see if there are better, simpler methods for checking system status, I prototyped a system that automatically creates an RSS feed for each hosted application. The concept is simple. Slurp up interesting status information, like response time, CPU percent, I/O per second, etc. Then organize the information in some reasonably logical format and present it as an RSS or atom news feed.

Here's roughly how it looks:

  • One RSS or atom feed for each application.
  • One article per host or device in the application's dependency tree
  • With a primitive dump of host status as CDATA in the article body
  • With a link to the host status page as the article title.
  • Update the feeds every minute or so.
  • Re-publish the host/device individual article any time the host status changes, using the status change time as the article publish time.
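
As a concrete sketch of the layout above, a single device article in an RSS 2.0 feed might look something like this (the hostname, URL, GUID scheme and threshold here are invented for illustration):

```xml
<item>
  <!-- Title mangled to show status without opening the article -->
  <title>WARN: server16 HTTP response time greater than 150ms</title>
  <link>http://monitor.example.com/status/server16</link>
  <!-- GUID stays constant per device so the reader tracks one article -->
  <guid isPermaLink="false">status-feed:app1:server16</guid>
  <!-- pubDate is the last status change, not the feed generation time -->
  <pubDate>Sun, 30 Mar 2008 14:05:00 GMT</pubDate>
  <description><![CDATA[
server16  2008-03-30 14:05 UTC
HTTP response time: 212ms (threshold 150ms)
CPU: 35%   I/O: 420/sec
  ]]></description>
</item>
```

The feed generator only touches pubDate when the device status changes, which is what makes the read/unread tracking in the reader do the right thing.
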

So how does it work? From what I can tell, the news readers aren't really set up for real time monitoring. The minimum feed refresh time tends to be around 10 minutes, which for a real time application is pretty much a lifetime. But for non-real time status, or for recent historical status (like the last few hours), it seems to work pretty well.

The key seems to be to only update the article that corresponds to a host or device when the device status changes. That allows the news feed readers to bubble up to the top any status changes, even if the status changed from good to bad and back to good since the last time you viewed the feed. The reader sees the article (device) as re-published, so it presents it as a new article. The act of marking the article as read removes it or unhighlights it in the reader, effectively backgrounding it until the next time the device status changes. When the status changes, the reader sees the article as recently published and highlights it accordingly.

The reader has to be smart enough to drop off or un-highlight 'read' articles. If you know that server16 had slow response time an hour ago, you need to be able to mark the article as 'read', effectively suppressing that information until its status changes and the article gets republished.

The title of the device's article can be suitably mangled to present Good/Bad/Ugly status, so readers see the important information without opening the article, and the article body can contain the details of why a device has or had a particular status and appropriate timestamps for status changes. The GUID in RSS 2.0 has to uniquely identify the host or device, so the reader can accurately track the associated article.

So far, it works.

Wednesday, March 26, 2008

Availability, Longer MTBF and shorter MTTR

A simple analysis of system availability can be broken down into two ideas. First, how long will I go, on average, before I have an unexpected system outage or unavailability (MTBF)? Second, when I have an outage, how long will it take to restore service (MTTR)?
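Those two ideas combine into the usual steady-state availability ratio. A minimal sketch (the failure and repair numbers are invented for illustration):

```python
# Steady-state availability: the fraction of time the system is up.
# availability = MTBF / (MTBF + MTTR)

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server that fails on average every 2000 hours and takes
# 8 hours to restore is up about 99.6% of the time:
print(round(availability(2000, 8) * 100, 2))  # 99.6
```

Longer MTBF grows the numerator; shorter MTTR shrinks the second term of the denominator. The rest of this post is about doing both.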

Any discussion of availability, MTBF, MTTR can quickly descend into endless talk about exact measurements of availability and response time. That sort of discussion would be appropriate in cases where you have availability clauses in contractual obligations or SLA's. What I'll try to do is frame this as a guide to maintaining system availability for an audience that isn't spending consulting dollars, and who is interested in available systems, not SLA contract language.

Less failure means longer MTBF

What can I do to decrease the likelihood of unexpected system down time? Here's my list, in the approximate order that I believe they affect availability.

Structured System Management. You gain availability if you manage your systems with some sort of structured processes or framework. As a broad concept, this means that you have the fundamentals of building, securing, deploying, monitoring, alerting, and documenting networks, servers and applications in place, you execute them consistently, and you know all cases where you are inconsistent. (As compared to what I call ad hoc management, where every system has its own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.) This structure doesn't need to be a full ITIL framework that a million dollars worth of consultants dropped in your lap, but it has to exist, even if only in simple, straightforward wiki or paper based systems.

Stable Power, UPS, Generator. You need stable power. Your systems need dual power supplies or they need to be redundant, the redundant power supplies need to be on separate circuits, you need good quality UPS's with fresh batteries, and depending on your availability requirements you may need a generator. In some places, even in large US metro areas, we've had power failures several times per year, and in cases where we had no generator, we had system outages as soon as the UPS's ran out.

Good Hardware. I'm a fan of tier 1 hardware. I believe that tier one vendors add value to the process of engineering servers, storage and network hardware. That means paying the price, but that also means that you generally get systems that are intentionally engineered for performance and availability, rather than randomly engineered one component at a time. A tier one vendor with a 3 year warranty has a financial incentive to build hardware that doesn't fail.

Buying into tier one hardware also means that in the case of servers, you get the manufacturer's system management software for monitoring and alerting, you generally get some form of predictive failure, and you get tested software and component compatibility.

Tier one hardware has worked very well for us, with one exception. We have a clustered pair of expensive 'enterprise' class servers that have had more open hardware tickets on them than any 30 of our other vendors' servers.

Good logging, monitoring and management tools. Modern hardware and software components try really hard to tell you that they are going to fail, sometimes even a long time before they fail. ECC memory will keep trying to correct bit errors, hard disks will try to correct read/write errors, and all those attempts get logged somewhere. Operating systems and applications tend to log interesting things when they think they have problems. Detecting the interesting events and escalating them to e-mail, SMS or pagers gives you a fair shot at resolving a problem before it causes an outage. I'd much rather get paged by a server that thinks it is going to die, rather than from one that is already dead.

Reasonable Security. A security event is an unavailability event. The Slammer Worm was an availability problem, not a security problem. A security event that results in lost or stolen data will also be an availability event.

Systematic Test and QA. You have the ability to test patches, upgrades and changes, right? Any offline test environment is better than no test environment, even if the test environment isn't exactly identical to production. But as your availability requirements go up, your test and QA become far more important, and at higher availability requirements, test and QA that matches production as closely as possible is critical.

Simple Change Management. This can be really simple, but it is essential. A notice to affected parties, even if the change is supposed to be non-disruptive, a checklist of exactly what you are going to do, and a checklist of exactly how you are going to undo what you just did are the first steps. You need to know who changed what and when and why they changed it. If you have no change process, a simple system will improve availability. If you already have a system, making it more complex might or might not improve availability.

Neat cabling and racks. You cannot reliably touch a rack that is a mess. You'll break something that wasn't supposed to be down, you'll disconnect the wrong wire and generally wreak havoc on your production systems.

Installation Testing. You know that server has functional network multipathing and redundant SAN connections because you tested them when you installed them, and you made sure that when you forced a path failure during your test you got alerted by your monitoring system, right?

Reducing MTTR

Once the system is down, the clock starts ticking. Obviously your system is unavailable until you resolve the problem, so any time shaved off the problem resolution part of the equation increases availability.

Structured Systems Management and Change Management. You know where your docs are, right? And you can access them from home, in the middle of the night, right? You also know what last changed and who changed it, right? You know what server is on what jack on what switch, right? Finding that out after you are down adds to your MTTR.

Clustering, active/passive fail over. In many cases, the fastest MTTR is a simple active/passive fail over. Pairs of redundant firewalls, load balanced app servers, and clustered database and file servers do not improve your MTBF. They may, in fact, because of the complexity factor, increase your likelihood of failure. But when properly configured, they greatly decrease the time it takes to resolve the failure. A failed component on a non-redundant server is probably an 8 hour outage, when you consider the time to detect the problem, alert on-call staff in the middle of the night, call a vendor, wait for parts and installation, and restart the server. That would likely be a 3 minute cluster fail over on a simple active/passive fail over pair.
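To put rough numbers on that difference, using the steady-state availability ratio MTBF / (MTBF + MTTR) and an invented-for-illustration MTBF of one failure per year, cutting MTTR from 8 hours to 3 minutes moves you from about three nines to about five nines:

```python
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 8760  # hours; assume one unexpected failure per year

standalone = availability(mtbf, 8)       # 8-hour repair, single server
clustered = availability(mtbf, 3 / 60)   # 3-minute active/passive failover

print(f"standalone {standalone:.5f}  clustered {clustered:.5f}")
```

The MTBF is the same in both cases; only the repair time changed.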

Failover doesn't always work though. We have had situations where the problem was ambiguous: service is affected, but the clustering heartbeat is not, so the inactive device or server doesn't come online. In that case, the MTTR is the time it takes for your system manager to figure out that the cluster isn't going to fail over on its own, remote console in to the lights-out board on the misbehaving server, and disable it. The passive node then has an unambiguous decision to make, and odds are it will make the decision to become active.

Monitoring and logging. Your predictive failure messages, syslog messages, netflow logs, firewall logs, event logs and perfmon data will all be essential to quick problem resolution. If you don't have logging and alerting on simple parameters like high CPU, disks full, memory ECC error counts, process failures, or if you can't readily access that information, your response time will be increased by the amount of time it takes to gather that simple data.

Paging, on-call. Your people need to know when your stuff fails, and if they are on call, they need a good set of tools at home, ready to use. Your MTTR clock starts when the system fails. You start resolving the problem when your people are connected to their toolkit, ready to start troubleshooting.

Remote Consoles and KVM's. You really need remote access to your servers, even if you work in a cube right next to the data center. You don't live next to the server, right? And you probably only work 12 hours a day, so for the other 12 hours a day you need remote access.

Service Contracts and Spares. Your ability to quickly resolve an outage likely will depend on how soon you can bring your vendor's tech support online. That means having up to date support contracts, with phone numbers and escalation processes documented and readily available. It also means that you need to have appropriate contracts in place at levels of support that match your availability requirements. Your vendors need to be on the hook at least as badly as you are, and you need to have worked out the process for escalating to them before you have a failure.

Tested Recovery. You need to know, because you have tested and documented it, how you are going to recover a failed server or device. You cannot figure it out on the fly. Your MTTR window isn't long enough.

Related Posts:
Estimating Availability of Simple Systems – Introduction
Estimating Availability of Simple Systems - Non-redundant
Estimating Availability of Simple Systems - Redundant
Availability, Complexity and the Person-Factor
(2008-07-19 -Updated links, minor edits)

Sunday, March 23, 2008

A Redundant Array of Inexpensive Flash Drives

A redundant ZFS volume on flash drives? A RAID set on flash drives obviously has no practical value right now, but perhaps someday small redundant storage devices will be common. So why not experiment a bit and see what they look like?

Start with an old Sun desktop, Solaris 10 08/07, and a cheap USB 2.0 PCI card. Add a powered USB hub and a handful of cheap 1GB flash drives.

Plug it in, watch it light up.



If you've got functioning USB drivers on your Solaris install you should see the drives recognized by the kernel, and the links in /dev should automatically get created. USB events get logged to /var/adm/messages, so a quick check there should tell you if the flash drives are recognized. If you can't figure out what the drives got named in /dev, you should be able to match the device names up with the descriptions in /var/adm/messages. In my case, they ended up as c2t0d0, c3t0d0, c4t0d0, c5t0d0.

For this project, I didn't want the volume manager to automatically mount the drives, so I temporarily stopped volume management.

# svcs volfs
STATE          STIME    FMRI
online         21:50:41 svc:/system/filesystem/volfs:default

# svcadm disable volfs

# svcs volfs
STATE          STIME    FMRI
disabled       22:38:15 svc:/system/filesystem/volfs:default


I labeled each of the drives:


# fdisk -E /dev/rdsk/c2t0d0s2
# fdisk -E /dev/rdsk/c3t0d0s2
# fdisk -E /dev/rdsk/c4t0d0s2
# fdisk -E /dev/rdsk/c5t0d0s2


Then I created a RAIDZ pool called 'test' using all four drives.

# zpool create test raidz c2t0d0s2 c3t0d0s2 c4t0d0s2 c5t0d0s2

The pool got built and mounted in a couple seconds.



# zpool status
  pool: test
 state: ONLINE
 scrub: scrub completed with 0 errors on Sun Mar 23 21:55:32 2008
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c2t0d0s2  ONLINE       0     0     0
            c3t0d0s2  ONLINE       0     0     0
            c4t0d0s2  ONLINE       0     0     0
            c5t0d0s2  ONLINE       0     0     0

errors: No known data errors


The file system shows up as available:


# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   105K  2.81G  36.7K  /test


I'm not sure what I'm going to do with it yet. It certainly isn't something I'll carry around in my pocket.

But if someone would make a USB stick that was the size of a current USB drive, and let me plug 4 or 5 8GB micro SD's in to it, and build the RAID magic into the drive, I'd have a whole bunch of storage in a pocketable form factor and redundancy as a bonus.

Availability, Complexity and the Person-Factor

I am trying out a new hypothesis:

When person-resources are constrained, highest availability is achieved when the system is designed with the minimum complexity necessary to meet availability requirements.

My hypothesis, that minimizing complexity maximizes availability, assumes that in an environment where the number of persons is constrained or fixed, as systems become more complex the human factors in system failure and resolution become more important than technology factors.

This hypothesis also assumes that increased system availability generally presumes an increase in complexity. I am basing this on a simple analysis of availability combined with extensive experience managing technology.

Person Resources vs Complexity

  • As availability requirements are increased, the technology required becomes more complex.
  • As the technology gets more complex, the person-resources required to manage the technology increase.
  • Resources are generally constrained, so the ideal resource allocation is unlikely to occur in the real world.



In an ideal organization, as system availability requirements go up, both person-resources and technology resources would increase as necessary to support the increased availability requirement. The relationship between availability and person resources should look something like the first chart. In theory the initial investment in structured system management and simple redundancy will result in a large improvement in availability relative to the resources spent. Moving from ad-hoc system management toward structured system management will result in fewer unplanned downtimes and less time spent troubleshooting problems, so both MTBF and MTTR should improve. Moving from non-redundant systems to simple redundancy (load balanced app servers, active/passive firewall and network failover, active/passive clustering, etc) will result in faster recovery time on failures, so even though the MTBF will not improve, MTTR will improve, therefore availability will improve.

When simple active/passive redundancy is no longer adequate to achieve required availability, system complexity is greatly increased. Availability targets that require active/active clustering, multi-homed servers, redundant data centers, and layer-2 network redundancy with sub-minute recovery times require more person-resources relative to the resulting increase in availability. If person-resources are added along with the necessary technology resources, the availability will continue to increase.

If, however, person-resources are not available to support the increased complexity brought on by the increased availability requirements, the availability curve will look something like the second chart. The systems will be complex to manage, but the existing person-resources, if not supplemented, will be unable to adequately design, test and deploy the more complex environment. Most importantly though, in the event of a failure of the more complex environment, more time will be spent troubleshooting and resolving problems, potentially increasing MTTR and decreasing availability.

This may be nothing more than a restatement of the K.I.S.S. principle.

Related Posts:

Estimating Availability of Simple Systems – Introduction

Estimating Availability of Simple Systems - Non-redundant

Estimating Availability of Simple Systems – Redundant

Availability, Longer MTBF and shorter MTTR

Availability - MTBF, MTTR and the Human Factor

(2008-07-19 -Updated links, minor edits)

Tuesday, March 18, 2008

Tethered or Untethered?

I don't want to be tethered.

From The Free Dictionary:
A rope, chain, or similar restraint for holding an animal in place, allowing a short radius in which it can move about.

I don't think I am an animal, at least in the sense that animals are somehow distinguished from humans.

How about:
A similar ropelike restraint used as a safety measure, especially for young children and astronauts.
Nope - I'm not a young child, nor an astronaut. Maybe I wanted to be an astronaut when I was a young child, but that doesn't count.

Try:
The extent or limit of one's resources, abilities, or endurance: drought-stricken farmers at the end of their tether.
I might be at the end of my tether. But only because I'm tethered when I want to be untethered.

When we first connected two computers together to form a network, we created a connection, or a tether, that mechanically bound the computers to each other. That binding lasted until early wireless technology permitted a computer to be on a network without being wired to that network. Now we freely roam about on wireless networks, untethered. Sort of. We still have to be near a wireless network access point, but that's not too hard to do nowadays.

So we are untethered, right? Not quite - we still need our power cords, wall warts, power bricks and international adapter plugs. But to be tethered to a wall only by power requirements is a huge advancement in the overall state of personal freedom. We can go for a couple hours at a time freely moving about the home, workplace or nerd conference, and enjoy the ability to move about unhindered by a 'rope, chain or similar restraint'.

Until the energy runs out. Then we frantically re-attach ourselves to our tethers, this time the -/+ 12vdc ones. And as the day goes on, and we burn our battery reserves, we find ourselves budgeting our wall-connect time, scrambling for conference tables near the wall jack, pulling extension cords out of oversized briefcases (or undersized steamer trunks on wheels) in a desperate attempt to stay attached to the wireless network until the last BOF of the day. Yep - we frantically scramble to wire ourselves to a wall so that we can stay connected to a wireless network.

Of course as the day goes on, we find that the ratio of wall-tethered to untethered time isn't quite enough to keep us alive all day, and we inevitably enter the downhill spiral of ever shorter periods of untethered freedom followed by ever longer attachments to wall plugs, until we finally can leave our leash and our oversized briefcases (or undersized steamer trunks on wheels) in the hotel room and enjoy the conference.

Somewhere along the line we've decided to trade power or performance for battery life, almost as though we are afraid to be untethered for more than 2 or 4 hours. The thought of being completely disconnected, free to move more than 3 meters from a wall for more than 2 hours, must scare us. Or else we'd have figured out a solution by now.

There simply are no contemporary laptop computers that have reasonable battery life, where reasonable is defined as somewhere close to a normal day. The best we've come up with is 2 hours of freedom for $1000, or 4 hours of freedom for $2000. Or in the case of my home notebook, one hour of freedom for $500.

Annoying.

Monday, March 17, 2008

Security and Availability

A theory from Amrit Williams:

Companies spend money on security when:

  • They have a security incident.
  • They have to comply with a regulation or mandate.
  • The lack of security affects availability.

I don't disagree with this at all. But if true, I must be lucky. Where I work, most of us believe that as custodians of other people's data, we have a professional and moral obligation to protect that data from exposure or alteration. We also believe that as custodians of other people's tax dollars, we have an obligation to make wise, frugal choices on what resources we spend protecting other people's data.

I like it that way.

Introducing a new technology to an enterprise (ZFS)

The introduction of something as critical as a new file system results in an interesting exercise in introducing and managing new technology. Like most small or medium sized shops, we have limitations on our ability to experiment, test and QA new technology. Our engineering and operations staff together is a small handful of persons per technology. Dedicated test labs barely exist and all of our people have daily operational and on-call roles with no formal 'play' time. Spending large blocks of time on things that are too far ahead of where we are today isn't feasible. Yet the pace of technology introduction dictates that we do not slide too far behind the curve on things that are critical to our enterprise.

So how do you go about introducing something this critical to an enterprise under that sort of constraint? We try to find a mix of caution, mitigated risk taking and methodical deployment. Our resources do not permit dedicated test staff or formal test plans, so we compensate and reduce risk by methodical and measured deployment.

Introducing ZFS:

I'm assuming that most would agree that file systems are probably the most critical technology that IT professionals manage. Networks tend to be tolerant of occasional loss of packets or scrambled bits. The IP protocol stack tends to be tolerant of that sort of thing, having been designed with enough resiliency built in to recover from all sorts of errors. A file system isn't quite like that. Failure or corruption of a critical file system is certain to be an event that you'll not want to have happen too often in your career. New file systems don't come along very often, and because we tend to be rather risk averse on things like file systems, they tend to be difficult to introduce into an enterprise.

ZFS promises to be significant, but because it is radically different from previous Sun file systems, we have to assume that it will have bugs & need time to get sorted out, and we assume that we need time to get our skill set and operational proficiency focused on the new technology.

The process that we use to introduce the technology will be critical to future availability and performance of our systems.

Here's the path we took:
  • Sanity check
  • Test/lab environment
  • Limited deploy, non critical, non-customer, low I/O load
  • Limited deploy, non critical, non-customer, high I/O load
  • Limited deploy, critical, customer, low I/O load
  • Limited deploy, critical, customer, high I/O load
  • General deployment

Sanity Check:


We started out with a simple sanity check on the technology.

  • Does it offer significant advantages over current technology?
  • Does it appear to solve an identified operational or security problem?
  • Could a rough cost/benefit be calculated based on an initial review of the technology?
  • Is this the strategic direction for the vendor, and are we aligned with that vendor's strategic direction?
  • Are the vendor's claims reasonable and verifiable?
  • Will we be able to manage the technology?
  • Will we be able to replace or deprecate some other technology, or is it a duplicate of existing technology?

A pass through the sanity check indicated that we ought to at least spend a few spare cycles looking at ZFS. We spend significant effort in managing UFS file systems using traditional logical volume managers, and we have pain points around dynamically adding & removing disk space for databases and applications. Our current LVM model looks very much like pooled storage, but with the overhead of having to manually manage extents within logical volumes. The promise of pool based storage, similar to our EVA's but at the operating system layer, looked interesting. Sun claimed commitment to ZFS, and we have a significant investment in Sun technology.

So we started to 'play' around with ZFS on a low-priority, off hours basis, to determine if the excitement surrounding the technology was justified, and more importantly, whether the technology would fit with, and add value to, our enterprise hosting service. All the exercises outlined here were informal & ad hoc.

Lab #1:

Our initial exposure to ZFS was a simple series of informal tests on test servers running early access or developer preview ZFS code. We built pools and file systems, first on pools backed by ordinary files on UFS file systems, then on real disk slices. The initial tests were mostly just replicating the simple examples that Sun engineers and others posted about in their blogs. We built & destroyed the pools and file systems, intentionally failed disks and lun's, snapped & cloned, and otherwise explored the basic feature set. Based on these initial tests, we concluded that even this early, the technology was roughly as manageable as our existing technology, and that the potential for simplifying disk management might make the cost of implementation recoverable. In short, it was interesting enough to take a look at in a bit more detail.

Lab #2:

If lab #1 was a simple review of what the reviewers already blogged about, lab #2 was intended to explore the edges a little bit more, primarily looking at how gracefully the file system would fail. We wanted to know whether the technology had well defined edges and would fail cleanly, or at least in a predictable and recoverable manner.

Enter the 'RAIF' (Redundant Array of Inexpensive Flash devices). USB flash drives have some interesting properties. They are cheap, easy to plug in, configure and shuffle around, and they can easily be moved to other computers for testing write failures and to introduce data corruption. We looked at building temporary SCSI arrays, but for various time, space and power reasons, we picked a USB based test platform. The bill of materials was something like:

  • Two cheap USB adapters
  • Two powered 4 port USB hubs
  • 8 USB flash drives of various size (whatever was cheap.)

That was enough to build a handful of different ZFS pools in various configurations, and easily test physical and logical failure modes. The USB drives were pretty good at inducing failures in the file system, so they made a good platform for testing the general resiliency of ZFS and gave us a pretty good idea how well Sun thought through the edges of the technology and how well the file system managed its own failure modes and corner cases. The file system recovered when we expected it to, and failed when we expected it to, and vendor claims were generally verified by our tests. (The performance of the USB driver stack was not considered part of the test.)

Our conclusion was that we should keep looking at the technology on a low priority basis.

Lab #3

From the RAIF we went to a more conventional file system test on ordinary SCSI drives. The goal of this test was to simulate disk I/O load with ordinary test & benchmark suites and compare ZFS to UFS under something that resembles ordinary applications. A series of benchmark-like tests indicated to us that the technology lived up to its claims at least as well as any other new technology, and we agreed that the technology might be valuable to us if we could manage it and if it could be used to replace the UFS file systems that are under a logical volume manager.

Deployment #1, low I/O, low impact.

Eventually, as time permitted, we decided to try a file system on a server that was active, in production, but wasn't failure critical. We essentially gave ZFS a 'test run' by using it on a few servers that are part of our server management infrastructure. Any pain would be felt by our peers, not our customers. If we felt no pain, we could keep moving. By this time Sun had added ZFS to Solaris 10, so we could move ahead on a low impact server and be fully supported by Sun.

The first production obstacle was vendor support for 3rd party tools and utilities on ZFS. Legato qualified ZFS just about the time that we needed it, and our management infrastructure is mostly home-grown, so we didn't have other significant software compatibility issues. The file system performed as designed under the various low use, low impact environments.

Deployment #2, high I/O, low impact.

Our next step was to start using ZFS in places where we have interesting I/O loads, but where we don't have data that is absolutely irreplaceable. At the time that we were ready to move forward with another ZFS implementation, we were also re-engineering our enterprise backup to use a disk pool as a staging area for the tape backups. This gave us an opportunity to test ZFS in parallel to the technology it replaced at greatly reduced risk.

Our first large, high I/O ZFS implementation was a FATA disk pool on an EVA8000 that we use as the staging area for server backup jobs. Because we were in a position where if it didn't work we could back out and re-configure fairly easily, we took advantage of the opportunity and went with ZFS. We started out with a single 2TB LUN, so ZFS sees the LUN as one large disk, not many small disks. The performance was excellent, and because it was trouble free, we used ZFS for the entire disk pool.

This pool is now more or less made up of five 2TB LUNs in a single pool, for a total of 10TB. We initially write all Legato save sets to this disk pool, then clone them out to other media, both real tape & virtual tape. So far, that ZFS file system has performed very well. We routinely read & write well over 100MBps to the pool, or somewhere around 10TB per weekend, with no significant issues directly related to the file system. (There are kernel issues indirectly related to the file system that sometimes affect performance, but the file system itself works.)

ZFS pool #2 is a syslog server. We spool tens of thousands of syslog, apache and netflow log entries per second to a 2+TB disk pool on older Sun storage on a first generation T2000. That pool works as expected.

Both of these systems would be non-customer affecting if they failed. Backups can be re-run, and logs can be recovered from tape.

Deployment #3, low I/O, high impact.

We have other ZFS file systems in various spots where we have the opportunity to experiment, but now we are also using ZFS for production, customer-impact systems, though not on production databases. The customer facing implementations are all covered by load balancing or some other non-file-system-dependent redundancy. So far they all work as expected. We are also working through the details of hosting zones on ZFS file systems, with the intent of giving us more flexibility in hosting zoned applications. We have not yet put large Oracle instances on ZFS. For us, database file systems are the most critical, so we are the most cautious.

Future: high I/O, high impact.

We are exercising ZFS and gaining enough operational experience that we should soon be comfortable moving toward general, unrestricted deployment of ZFS. Our next implementation should be a customer facing, high I/O application, but probably not a database server. Right now we do not have an application that fits those requirements, so this phase is delayed.

General Deployment

Barring any major problems with the current ZFS implementations, and the above-mentioned kernel issue aside, it looks to us like ZFS will eventually be our default file system, to be used generally across Solaris servers. Our rollout has spanned almost two years, but we have had no setbacks, compatibility problems, outages or data loss related to any of the ZFS implementations.

Friday, March 14, 2008

Unconstraining a constrained resource

When a technology that is normally constrained is presented with unlimited resources, what will it do?

We've had a couple of interesting examples of what happens when software that is normally constrained by memory has that constraint removed. The results were interesting, or amusing, depending on your point of view.

Too much memory

The first time we ran into something like this was when we installed a new SQL server cluster on IBM x460 hardware. We bought the server to resolve a serious database performance problem on an application that was growing faster than we could purchase and install new hardware. To get ahead of the growth curve, we figured that we had to more than double the performance of the database server. And of course while we were at it, we'd cluster it also. And because HP wouldn't support Windows 2003 Datacenter Edition on an HP EVA, we ended up buying an IBM DS4800 also. And because we had neither the time nor the resources to spend endless hours tuning the I/O, we'd get enough memory that the database working set was always in memory. And no, the application vendor wouldn't support any form of horizontal scalability whatsoever.

So after a month of research, a month to get a decision, a month to get through our purchasing process, and a month for IBM to build the server, we got two new x460 16-CPU Xeon servers and a DS4800, all pre-racked and ready to boot. The servers hit the new datacenter floor at noon on Dec 24th. The application had to be moved to the new datacenter on the new servers and be ready for production on January 6th. We were stressed. The schedule was something like Day 1: build servers. Day 2: build SAN. Day 3: cluster and failure testing (and so on ....)

When we got to 'Day 4: Install SQL server 2000' we ran into a pretty significant issue. It wouldn't run. SQL server 2000 simply would not start. It gave a cryptic error message and stopped. Microsoft Tier 3 SQL support to the rescue. They had never heard of THAT error message. Escalate...wait...escalate...wait...bring IBM Advanced Support into the conference call...escalate...wait...now there are two of us and 6 of them on the call...finally:
'We've got a bug ID on SQL server 2005, that sounds similar. But it only occurs when you run a 32 bit database on a server with more than 64GB of memory.'
Hmm.... we are on 32 bit 2000, not 2005, but we do have 128GB of memory, so maybe?

The workaround suggested was to remove memory. But now it is late on a Friday, and we are going live next Wednesday, and spending hours carefully removing memory sounded like a headache. Fortunately /burnmem worked, the server only saw 64GB of memory, and SQL2000 was happy to run, though slightly confused. The cut over was successful, and we survived spring semester start up with a database that ran 16 CPU's at 70% busy on our peak day (instead of an 8 CPU database server at 140% CPU).

It probably never occurred to the SQL server developers, back when the database memory model was designed and database boot/init/load code was written, that a customer would try to run a 32-bit database with 128GB of memory. No way. What did the software do? It croaked. It rolled over and died. It didn't even try.

(That server lasted a semester. The application utilization continued to grow, and the two 16 CPU servers became 16 dual cores @ 50% utilization before the start of the next semester.)

Too much memory - Part 2

Fast forward (or move the slider bar on your OS X Time Machine) a year and a half. The app is still growing. A couple rounds of tuning, vendor code re-writes, and a couple of Microsoft engineering on-sites tells us that we really, really want to be running this thing with all 64 bits & SQL server 2005.

So we take the plunge. Windows 2003 Datacenter, x64, this time with SQL server 2005, all 64 bits, and all 128GB of memory. Life is good, right?

Not quite.

An interesting question: What does SQL server 2005 64-bit do when you feed it an application that forces it to parse, compile and cache hundreds of new queries every second, when the queries are such that they can't be parameterized? It parses, compiles and caches all of them. As long as the procedure cache is constrained, all you get is an inefficient database server and a high cache churn rate. But it works. When there is no constraint? Its procedure cache gets too big, and it gets really, really unhappy. And so do the hundred thousand students who thought they were going to use the application that day.

As best as we can figure, based on a 13 hour tech support call and handful of PSSDiags & such, we fed new, unparameterizable queries into the database engine, it cached them, and when the procedure cache got up somewhere around 8GB, the server pretty much spent all its cycles mucking around with managing its cache at the expense of doing useful database-like work. The cache never got trimmed or purged, probably because whatever process does that didn't see a reason to expire old cache entries. There was plenty of memory.

Fix the app, patch the database servers, and monitor the procedure cache, and all is well.

When a technology that is normally constrained is presented with unlimited resources, what will it do?

It'll probably croak.

Thursday, March 13, 2008

Autodeploying Servers - A Proof of Concept

A couple years ago, during the brief pause in the middle of a semester, I figured that some time spent re-thinking how we deploy remote servers might be time well spent. Fortunately we have sharp sysadmins who like challenges.

Our network is run by a very small group that has to cover a rather large state, 7x24. Driving a 600 mile round trip to swap some gear out isn't exactly fun, so we tend to be very cautious about what we deploy & where we deploy it. I've always had a strong preference for hardware that runs forever, and my background in mechanical sorts of things tells me that moving parts are generally bad. So if I have a choice, and if the device is a long ways away, I'll pick a device that boots from flash over an HDD any time.

That ties in with my thoughts on installing devices in general: we ought to design & build to the 'least bit' principle, where we minimize the software footprint, file system permissions, ports, protocols, etc., as much as possible while still maintaining required functionality.

Anyway, at the time, we'd had Snort IDS's out state for quite a while, and we were thinking that they needed a hardware swap. But I wasn't enthused about installing more spinning media 5 hours from home.

So that led to a proof-of-concept.
  • How close could we come, with no money, to making an IDS that had no moving parts?
  • How minimal could the software footprint be?
  • How would we reliably upgrade, patch and maintain the remote IDS's without travel?
  • Could we make the remote IDS's 'self deploying' (with no persistent per-device configuration)?
  • Could we run an Intel PC headless (no keyboard/monitor) and manage it via 9600,n,8,1?
Knoppix

We started by taking a look at the CD-bootable Knoppix. At the time, it didn't like being run headless. There was no access to the serial port early in the boot process, so if it didn't boot, we wouldn't know why. That doesn't work too well for remote devices, but fortunately there was enough prior work in that space to get us a bootable Knoppix that lit up the serial port early enough to make us happy. Booting from flash would have been better, but a spinning CD is way closer to 'no moving parts' than a spinning HDD, and we didn't have any no-cost flash-bootable computers laying around.

It was still way too fat to fit the 'least bit' principle, and there still was no easy way of maintaining the Snort binaries and configs without touching each server. I didn't want to have to do that. So we stripped Knoppix down to something reasonably sized, convinced it to look at the serial port instead of the keyboard, burned a new image, and took a look at how to maintain Snort.

We had just prototyped a process that used SVN to manage and deploy the binaries for a J2EE application, so why not look at SVN for Snort? Check the Snort RPMs, compiled binaries, libraries and rules into SVN, and when they change, deploy from SVN to the PC acting as the IDS. Snort generates logs, but it can log to a remote database, so persistent, writable storage is really not necessary. The spinning media problem more or less goes away. So far, so good.

Self Deploy

A long time ago, Cisco made routers that were smart enough to do an ARP/RARP/BOOTP & tftp to find a config. That meant that if the router, brand new and unconfigured, were attached to a network and found a tftp server with a 'router.conf' on it, the router would boot that config. That made it possible to buy a new router, ship it directly to a remote site, and with no router-savvy person on site, install the router unconfigured, have it 'self configure' and become a useful router. All you needed was an upstream network device (like the router on the near side of the T1) that was smart enough to feed a couple of bits of information to the new, unconfigured router, and the router would 'self deploy'.

So I wanted our new toys to 'self deploy'. Send a new, unconfigured, generic PC to a remote site, plug it in and have it learn enough from the network to figure out who it is and what it's supposed to do.

Having a minimal Knoppix-like boot/run CD that talked to its serial port in a reasonable manner in hand, we figured that the network could feed the server enough information to make it useful, like an IP address and gateway. But getting Snort and related bits onto the new server needed some more thinking. If we put Snort on the CD, we'd always have to burn new CDs when Snort needed updating, and we'd still have to manage the constantly changing Snort rules. That would either require lots of CD burning or persistent writable storage.

What we came up with was - put the Snort packages in SVN at the home base, add an SVN client to the boot CD, and install Snort and related dependencies from SVN on the fly on each boot. The remote server would boot from CD, learn who it is from the network (DHCP), figure out where its SVN repository was (either DNS or DHCP options), look up its MAC address in the SVN repository, and check out whatever it saw. If what it saw were Snort binaries, it checked them out & installed them.

Boot -> Learn -> Install -> Run.
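That Boot -> Learn -> Install -> Run sequence is simple enough to sketch. A rough Python illustration of the boot-time logic (the repository layout, URL scheme and install.sh hook here are hypothetical assumptions, not our actual implementation):

```python
import subprocess

def repo_url_for(mac: str, repo_base: str) -> str:
    """Map a sensor's MAC address to its checkout URL. The
    hosts/<mac>/trunk layout is an assumed convention, keyed on
    the one identifier the hardware already knows about itself."""
    normalized = mac.lower().replace(":", "").replace("-", "")
    return f"{repo_base}/hosts/{normalized}/trunk"

def bootstrap(mac: str, repo_base: str, checkout_dir: str = "/opt/payload") -> None:
    """Boot-time sequence: learn who we are, check out whatever the
    central repository says this host should run, then run it."""
    url = repo_url_for(mac, repo_base)
    subprocess.run(["svn", "checkout", url, checkout_dir], check=True)
    # Whatever was checked out decides what this box becomes:
    subprocess.run([f"{checkout_dir}/install.sh"], check=True)
```

The point is that nothing on the CD is per-device; change what sits in SVN under a given MAC address and the box becomes something else on its next boot.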

To upgrade the boot image, we'd still have to ship a CD to the remote site. But that doesn't happen too often. To upgrade Snort, we'd check a new version of Snort into the master SVN repository at the home base & reboot the remote servers. They'd learn about it on bootup. If we wanted to tweak the config on a sensor, we'd check a new config into SVN and reboot the remote.

Pretty simple.

And - If we didn't want the remote to be a Snort sensor anymore, we could check some other bits into SVN, reboot the remotes & they'd happily check out, install & run the new bits. Turn it into a web server? Just check in new binaries & reboot the remote.

It worked.

But then real projects came along, so we never deployed it.

Saturday, March 8, 2008

On Least Bit Installations and Structured Application Deployment

I ran across this post by 'ADK'.

It's not done until it's deployed and working!

This seems so obvious, yet my experience hosting a couple large ERP-like OLTP applications indicates that sometimes the obvious needs to be made obvious. Because obviously it isn't obvious.

A few years ago, when I inherited the app server environment for one of the ERP-like applications (from another workgroup), I decided to take a look around and see how things were installed, secured, managed and monitored. The punch list of things to fix got pretty long, pretty quickly. Other than the obvious 'everyone is root' and 'everything runs as root' type of problems, one big red flag was the deployment process. We had two production app servers. They were not identical. Not even close. The simplest things, like the operating system rev & patch level, the location of configuration files, file system rights and application installation locations were different, and in probably the most amusing and dangerous application configuration I've ever seen, the critical config files on one of the app servers were in the samples directory.

So how do we fix a mess like that? We started over, twice (or three times).

Our first pass through was intended to fix obvious security problems and to make all servers identical. We deployed a standard vendor operating system installation on four identical servers, 1 QA, 3 production, deployed the necessary JBoss builds on each of the servers in a fairly standard install, and migrated the application to the new servers. This fixed some immediate and glaring problems and gave us the experience we needed to take on the next couple of steps.

The second pass through the environment was designed to get us closer to the 'least bit' installation best practice, clean up some ugly library and application dependencies, and develop a standardized, scripted deployment, guaranteed to be identical on each servers, and guaranteed to be identical to test/QA.

The first part, the 'least bit' installation, simply means that in the JBoss directory structure, if one were to remove a single bit from file system permissions, config files, or the application itself, the app would break. This ensures that file system permissions are optimal and that the application has no extra junk (sample configs, sample JDBC connections) laying around that can cause security or availability problems.

The application deployment process that we developed was very interesting. We decided that we wanted a completely version controlled deployment that included every file the application needs for functionality, other than the vendor provided operating system files. We checked the entire JBoss application, including the binaries, configs, war's, jar's & whatever, into Subversion (SVN). The deployment is essentially a process that checks out the entire application from SVN, drops it onto a production app server and removes the entire old application. The idea is that we not only know what is deployed on each server and that all servers are identical, we also know exactly what was deployed on all servers at any arbitrary point in the past.
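Because the whole application tree is one versioned artifact, the swap step itself can stay simple. A minimal sketch of the idea (the paths and archive naming are illustrative, not our actual scripts):

```python
import shutil
import time
from pathlib import Path

def swap_in(new_build: Path, app_root: Path, archive_root: Path) -> Path:
    """Archive the running application tree and move a freshly
    checked-out build into its place. Returns the archive location."""
    archive_root.mkdir(parents=True, exist_ok=True)
    archived = archive_root / f"jboss-{time.strftime('%Y%m%d-%H%M%S')}"
    if app_root.exists():
        shutil.move(str(app_root), str(archived))  # keep the old tree for rollback
    shutil.move(str(new_build), str(app_root))     # the entire app, binaries and configs
    return archived
```

Since every deploy replaces the whole tree, 'what exactly is running on server N?' is always answerable with an SVN revision number.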

The process to get the application ready to deploy is also version controlled and scripted. Our dev team builds a JBoss instance and their application binaries and configs using their normal dev tool kit. When they think that they have a deployable application, including the code they wrote, the JBoss application and any configuration files, they package up the application, check it into a special deployment SVN repository, and deploy the entire application to a QA/test environment.  They've got the ability to deploy to the test/QA servers any time they want, but they do not have the ability to modify the QA/test environment other than through the scripted, version controlled deployment process. If the app doesn't work, they re-configure, bug fix, check it back into the deployment SVN repository and re-deploy to QA/test.  Once the app is QA'd and tested, a deployment czar in the dev team modifies any config files that need to be different between QA and prod (like database connect strings) and commits the application back into the deployment SVN repository.

Then we get a note to 'deploy version 1652 to prod next Monday'. On Monday, we take each app server offline, in turn,  run scripts that archive the old installation, copy the entire new application to the production server, test it, and bring the app server back on line. Repeat once for each app server & we are done. 

We made a third pass through the app servers. This time we changed the OS version, the platform and we implemented basic Solaris 'Zones'  to give us a sort of pseudo-virtualization, and we applied the 'least bit' principle to the operating system itself (reducing its disk footprint from 6GB down to 1GB). We also, as a result of the systematic deployment process, have the ability to light up new app servers fairly easily. A Solaris WAN boot from a standard image, a bit of one-time tweaking of the OS (hostname, IP address) and a deploy of the JBoss app from SVN gets us pretty close to a production app server. 

We have a ways to go yet. Some parts of the deploy process are clunky, and our testing methodology is pretty painful and time consuming, and we are not fully 'containerized' into Solaris zones.  The dev team wants at least one more pre-production environment, and we need to make the deployment process less time consuming. 

We aren't done yet.


Tuesday, March 4, 2008

Availability - MTBF, MTTR and the Human Factor

We've got a couple of data centers. By most measures, they are small, at just over 100 servers between the two of them. We've been working for the last year to build the second data center with the specific design goal of being able to fail over mission critical applications to the secondary data center during extended outages. Since the goal of the second data center is specifically to improve application availability, I decided to rough in some calculations on what kind of design we'd need to meet our availability and recovery goals.

Start with MTBF (Mean Time Between Failure). In theory, each component that makes up the infrastructure that supports an application has an MTBF, typically measured in thousands of hours. The second factor in calculating availability is MTTR (Mean Time To Recovery or Repair), measured in hours or minutes. Knowing the average time between failures, and the average time to repair or recover from the failure, one should be able to make an approximate prediction of the availability of a particular component.

Systems, no matter how complex, are made up of components. In theory, one can calculate the MTBF and MTTR of each component and derive the availability of an entire system. This can get to be a fairly involved calculation, and I'm sure somewhere there are people who make a living doing MTBF/MTTR calculations.

If this is at all interesting, read eventhelix.com's description of MTBF/MTTR and availability. Then follow along with my calculations, and amuse yourself at the conclusion. Remember, all the calculations are real rough. I was trying to get an order of magnitude estimate based on easily obtainable data, not an exact number.

Raw MTBF Data:

Single Hard Drive, high end manufacturer: 1.4 million hours
Single DIMM, moderately priced manufacturer: 4 million hours
Single CPU: 1 million hours
Single network device (switch): 200,000 hours
Single person, non-redundant: 2000 hours

OK - the Single Person, non-redundant I just made up. But I figure that each employee, if left alone with no constraints (change review, change management), will probably screw up once a year.

MTTR is simply an estimate of how long it takes to restore service in the event of a failure. For example, in the case of a failed non-redundant HDD that has a server OS and lots of data on it, the MTTR would be the time it takes to call the vendor, have a spare delivered, rebuild the server and recover everything from tape backup. Figure 8 hours for a simple server. Or, in the case of a memory DIMM, I figure one hour to get through the tech support queue at the manufacturer, one hour to argue with them about your contract terms & conditions, 4 hours to deliver the DIMM, and an hour to install and re-boot, or about 7 hours.

In the case of redundant or clustered devices, the MTTR is effectively the time it takes for the clustering or high availability software to figure out what has failed and take the failed device off line or route around it. Figure somewhere between 1 and 3 minutes.

To convert MTBF and MTTR to availability, some math is needed.

If systems have more than one component, and if all components are required for the system to operate, then the availability of the system is calculated by multiplying the availability of each component together to obtain the system availability. If the components are redundant, and if each of the redundant components is fully functional alone, then the availability of the system is (1 - (1-A)^n). (Or - use an online calculator.)

Obviously in the case of redundant components, the system is only failed when both redundant components fail at the same time. That tends to be rare, so redundant systems can have much better availability.
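Those two rules are easier to see in code than in prose. A quick sketch of the standard model (textbook formulas, not a vendor tool):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of one component: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*component_availabilities: float) -> float:
    """All components required: the individual availabilities multiply."""
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result

def parallel(a: float, n: int) -> float:
    """n redundant components, any one sufficient: 1 - (1 - A)^n."""
    return 1 - (1 - a) ** n
```

For example, a single switch at 200,000 hours MTBF and 8 hours MTTR comes out around 99.996% available on its own, while a redundant pair of them is better than six 9's.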

How does this work out using some rough numbers?

First of all, an interesting problem immediately becomes evident. For maximum performance, I'd build a server with more CPUs, more memory DIMMs, more network cards, and more disk controllers. The component count goes up. If I build the server without redundancy, the server is less reliable. But faster. Most server vendors compensate for this by building some components with error correction capability, so that even though a component fails, the system still runs. But more components, in theory, will result in a less reliable server. So in rough numbers, using only memory DIMMs in the calculation, and assuming no error correction or redundancy:

DIMM Count MTBF (Hours) MTTR (Hours) Availability
1 4 million 8 hours 99.9998%
4 4 million 8 hours 99.9992%
8 4 million 8 hours 99.9984%
16 4 million 8 hours 99.9968%
32 4 million 8 hours 99.9936%
64 4 million 8 hours 99.9872%

Adding non-redundant components doesn't help availability. A high end server with lots of DIMMs is already at four 9's, without even considering all the other components in the system.

In clustered, or high availability systems, the availability calculation changes dramatically. In the simple case of active/passive failover, like Cisco firewalls or Microsoft clustering, the MTBF doesn't change, but the MTTR does. The MTTR is essentially the cluster failover time, usually a minute or so. Now the time that it takes the tech to show up with parts is no longer relevant.

Take the same numbers above, but make the MTTR 3 minutes (cluster failover time) and the theoretical availability goes way up.

DIMM Count MTBF (Hours) MTTR (Hours) Availability
1 4 million .05 hours 99.9999%
4 4 million .05 hours 99.9999%
8 4 million .05 hours 99.9999%
16 4 million .05 hours 99.9999%
32 4 million .05 hours 99.9999%
64 4 million .05 hours 99.9999%


Active/passive clustering doesn't change MTBF, but it does change MTTR, and therefore availability. I ran through a handful of calculations, and figured out that if all we have is a reasonable level of active/passive redundancy on devices and servers, and a reasonably well designed network and storage, we should be able to meet our availability goals.
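The two tables above are just the series calculation run with different MTTRs. A quick check of the arithmetic (note that the 64-DIMM row at 8 hours MTTR works out to roughly 99.9872%):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Single component availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Every DIMM is required, so the per-DIMM availabilities multiply.
for mttr in (8, 0.05):  # wait-for-the-vendor repair vs. cluster failover
    print(f"MTTR {mttr} hours:")
    for count in (1, 4, 8, 16, 32, 64):
        a = availability(4_000_000, mttr) ** count
        print(f"  {count:2d} DIMMs: {a * 100:.4f}%")
```

Shrinking the MTTR from 8 hours to 3 minutes buys back everything the extra component count costs, which is the whole argument for failover.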

Except......

The Human Factor:

So I was well on my way down this path, trying to determine if Layer-3 WAN redundancy was adequate to meet availability goals, and if active/passive clusters combined with dual switch fabrics and HA load balancers and firewalls will result in MTBF and MTTR's that make sense, when I decided to figure in the human factor.

I figure that the more complex the system, the higher the probability of a human induced failure, and that humans have a failure rate that is based not only on skill and experience, but also on the structure of the processes that are used to manage the systems.

Assume, as above, that a person makes one error per year, and that the error results in 30 minutes of downtime. That already drops you down to 99.99% availability. That's without considering any hardware, software, power or other outages. If you figure that you have a handful of persons that are in a position to make an error that results in downtime, you've got a problem.
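The same arithmetic applies to the made-up human numbers (a sketch; the error rate and outage length are guesses, as above):

```python
HOURS_PER_YEAR = 8766  # 365.25 days

def human_availability(errors_per_year: float, downtime_hours_per_error: float) -> float:
    """Treat a person like any other component: availability is just
    uptime over total time."""
    return 1 - (errors_per_year * downtime_hours_per_error) / HOURS_PER_YEAR

one_admin = human_availability(1, 0.5)  # one mistake a year, 30 minutes each
five_admins = one_admin ** 5            # five people who can each break production
print(f"one admin: {one_admin * 100:.3f}%, five: {five_admins * 100:.3f}%")
```

Five unconstrained admins land you around 99.97% before a single piece of hardware fails, which is why person-level redundancy and change management matter as much as the clusters.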

Fortunately, persons can be made redundant, and processes (Change Management) can be designed to enforce person-level redundancy.

So now I'm thinking about Change Management, QA environments, test labs.

Related Posts:
Estimating Availability of Simple Systems – Introduction
Estimating Availability of Simple Systems - Non-redundant
Estimating Availability of Simple Systems - Redundant
Availability, Complexity and the Person-Factor
Availability, Longer MTBF and shorter MTTR

(2008-07-19 -Updated links, minor edits)

Sunday, March 2, 2008

The influence of Unix on Windows 2008

Sam Ramji has an excellent technet.com blog post on the influence of 'Open Source' on Windows 2008.

Technet.com also has an interesting interview with Andrew Mason on Windows Server Core that explains some of the fundamentals of Windows 2008.

System admin scripting is finally being elevated to a first class tool for administering servers. This is a fundamental change in direction for Windows system admins, and for me, changes the equation as far as determining the best operating system for deploying an application.

I'm really looking forward to being able to write tools, scripts and utilities to manage Windows servers, instead of the incredibly error-prone human mouse-click & check-box methods that we currently use. Mouse-click management is the worst possible way to ensure that our servers are configured identically and securely.

I'm also looking forward to seeing how close Windows Server Core comes to making it possible to build appliance-like stripped down systems on the Windows platform. I've always favored Unix-like OS's for building DNS, DHCP and similar light weight, single purpose servers. Right now, our standard Solaris build for a server that is fully functional as an Apache web server or DNS is about a 500MB bootable image. That 500MB gets us a fully functional, fully manageable server, suitable for hosting many or most of our applications. If we need a server that has to run Java, the image size almost doubles, but that is still pretty light weight as far as I am concerned. When we churn through the weekly Solaris security patch reports, we have the wonderful ability to draw a line through the vast majority of them with the simple notation: 'Not vulnerable, package not installed....'.

Today, with a Windows 2003 web or application server, we start out with a 20GB boot disk with about 4+ GB in use, not including the swap file, just to get a bootable server. That bootable server is fully loaded, with far more functionality, features, and vulnerable software than we'll ever use. We spend a day per month analyzing the latest patch Tuesday vulnerability list and figuring out if we have mitigation in place for any of the vulnerabilities. It would sure be nice to check 3/4 of them off the list with a simple -'Not vulnerable, package not installed....'.

It looks like Microsoft is starting to think differently about servers. I'm happy about that.