Sunday, March 30, 2008

Using RSS for System Status

It's time to re-think how status and availability information gets communicated to system managers. System status is sort of like news, so why not use a news reader? I don't mean using RSS for posting blog-like information to users, like "Our systems were borked last night from 03:00 to 05:00." That is a great idea, and many people already do a fine job in that space. I mean something more like "server16 HTTP response time greater than 150ms". The nerdy stuff.

Most IT organizations have monitoring systems that attempt to tell us the status of individual system components like servers, databases, routers, etc. The general idea is to make what we think are relevant measurements related to the performance or availability of components, record the measurements, and hopefully alert on measurements that fall outside of some boundaries. Some places attempt to aggregate that information into some kind of higher level view that depicts the status of an application or group of related applications or networks.

That status information used to be presented on anything from dedicated X-windows sessions on high end workstations to simple Windows application interfaces. If you wanted to see what was happening, you logged into an application or console of some sort and looked for blinking icons or red dots, and perhaps a scrolling log window with so much crud in it that you missed the important stuff anyway.

Somewhere along the line, some vendors moved the monitoring application interface to some form of HTTP-like interface, perhaps with a big ugly blob of Java or an ActiveX control or two, and made it possible to look at status, availability and performance information from an ordinary desktop. And perhaps, if the vendor had a bold vision of the future, they may have even made it possible for more than one client to view the same information at the same time, and maybe even from an ordinary browser.

All of that works, or worked, but none of it solved my problem. System managers need to know what interesting things happened in the last few hours, or better yet, what interesting things have happened since the last time they checked for interesting things. I'm tired of having to access a dedicated application monitoring interface just to make a quick check of system status.

To see if there are better, simpler methods for checking system status, I prototyped a system that automatically creates an RSS feed for each hosted application. The concept is simple. Slurp up interesting status information, like response time, CPU percent, I/O per second, etc. Then organize the information in some reasonably logical format and present it as an RSS or atom news feed.

Here's roughly how it looks:

  • One RSS or atom feed for each application.
  • One article per host or device in the application's dependency tree
  • With a primitive dump of host status as CDATA in the article body
  • With a link to the host status page as the article title.
  • Update the feeds every minute or so.
  • Re-publish the host/device individual article any time the host status changes, using the status change time as the article publish time.
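
As a concrete sketch of the layout above, a single device article in an RSS 2.0 feed might look something like this (the hostname, URL, GUID scheme and threshold here are invented for illustration):

```xml
<item>
  <!-- Title mangled to show status without opening the article -->
  <title>WARN: server16 HTTP response time greater than 150ms</title>
  <link>http://monitor.example.com/status/server16</link>
  <!-- GUID stays constant per device so the reader tracks one article -->
  <guid isPermaLink="false">status-feed:app1:server16</guid>
  <!-- pubDate is the last status change, not the feed generation time -->
  <pubDate>Sun, 30 Mar 2008 14:05:00 GMT</pubDate>
  <description><![CDATA[
server16  2008-03-30 14:05 UTC
HTTP response time: 212ms (threshold 150ms)
CPU: 35%   I/O: 420/sec
  ]]></description>
</item>
```

The feed generator only touches pubDate when the device status changes, which is what makes the read/unread tracking in the reader do the right thing.
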

So how does it work? From what I can tell, the news readers aren't really set up for real time monitoring. The minimum feed refresh time tends to be around 10 minutes, which for a real time application is pretty much a lifetime. But for non-real time status, or for recent historical status (like the last few hours), it seems to work pretty well.

The key seems to be to only update the article that corresponds to a host or device when the device status changes. That allows the news feed readers to bubble up to the top any status changes, even if the status changed from good to bad and back to good since the last time you viewed the feed. The reader sees the article (device) as re-published, so it presents it as a new article. The act of marking the article as read removes it or unhighlights it in the reader, effectively backgrounding it until the next time the device status changes. When the status changes, the reader sees the article as recently published and highlights it accordingly.

The reader has to be smart enough to drop off or un-highlight 'read' articles. If you know that server16 had slow response time an hour ago, you need to be able to mark the article as 'read', effectively suppressing that information until its status changes and the article gets republished.

The title of the device's article can be suitably mangled to present Good/Bad/Ugly status, so readers see the important information without opening the article, and the article body can contain the details of why a device has or had a particular status and appropriate timestamps for status changes. The GUID in RSS 2.0 has to uniquely identify the host or device, so the reader can accurately track the associated article.

So far, it works.

Wednesday, March 26, 2008

Availability, Longer MTBF and shorter MTTR

A simple analysis of system availability can be broken down into two ideas. First, how long will I go, on average, before I have an unexpected system outage or unavailability (MTBF)? Second, when I have an outage, how long will it take to restore service (MTTR)?
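Those two ideas combine into the usual steady-state availability ratio. A minimal sketch (the failure and repair numbers are invented for illustration):

```python
# Steady-state availability: the fraction of time the system is up.
# availability = MTBF / (MTBF + MTTR)

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server that fails on average every 2000 hours and takes
# 8 hours to restore is up about 99.6% of the time:
print(round(availability(2000, 8) * 100, 2))  # 99.6
```

Longer MTBF grows the numerator; shorter MTTR shrinks the second term of the denominator. The rest of this post is about doing both.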

Any discussion of availability, MTBF, MTTR can quickly descend into endless talk about exact measurements of availability and response time. That sort of discussion would be appropriate in cases where you have availability clauses in contractual obligations or SLA's. What I'll try to do is frame this as a guide to maintaining system availability for an audience that isn't spending consulting dollars, and who is interested in available systems, not SLA contract language.

Less failure means longer MTBF

What can I do to decrease the likelihood of unexpected system down time? Here's my list, in the approximate order that I believe they affect availability.

Structured System Management. You gain availability if you manage your systems with some sort of structured processes or framework. As a broad concept, this means that you have the fundamentals of building, securing, deploying, monitoring, alerting, and documenting networks, servers and applications in place, you execute them consistently, and you know all cases where you are inconsistent. (As compared to what I call ad hoc management, where every system has its own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.) This structure doesn't need to be a full ITIL framework that a million dollars worth of consultants dropped in your lap, but it has to exist, even if only in simple, straightforward wiki or paper based systems.

Stable Power, UPS, Generator. You need stable power. Your systems need dual power supplies or they need to be redundant, the redundant power supplies need to be on separate circuits, you need good quality UPS's with fresh batteries, and depending on your availability requirements you may need a generator. In some places, even in large US metro areas, we've had power failures several times per year, and in cases where we had no generator, we had system outages as soon as the UPS's ran out.

Good Hardware. I'm a fan of tier 1 hardware. I believe that tier one vendors add value to the process of engineering servers, storage and network hardware. That means paying the price, but that also means that you generally get systems that are intentionally engineered for performance and availability, rather than randomly engineered one component at a time. A tier one vendor with a 3 year warranty has a financial incentive to build hardware that doesn't fail.

Buying into tier one hardware also means that in the case of servers, you get the manufacturer's system management software for monitoring and alerting, you generally get some form of predictive failure, and you get tested software and component compatibility.

Tier one hardware has worked very well for us, with one exception. We have a clustered pair of expensive 'enterprise' class servers that have had more open hardware tickets on them than any 30 of our other vendors' servers.

Good logging, monitoring and management tools. Modern hardware and software components try really hard to tell you that they are going to fail, sometimes even a long time before they fail. ECC memory will keep trying to correct bit errors, hard disks will try to correct read/write errors, and all those attempts get logged somewhere. Operating systems and applications tend to log interesting things when they think they have problems. Detecting the interesting events and escalating them to e-mail, SMS or pagers gives you a fair shot at resolving a problem before it causes an outage. I'd much rather get paged by a server that thinks it is going to die, rather than from one that is already dead.

Reasonable Security. A security event is an unavailability event. The Slammer Worm was an availability problem, not a security problem. A security event that results in lost or stolen data will also be an availability event.

Systematic Test and QA. You have the ability to test patches, upgrades and changes, right? Any offline test environment is better than no test environment, even if the test environment isn't exactly identical to production. But as your availability requirements go up, your test and QA become far more important, and at higher availability requirements, test and QA that matches production as closely as possible is critical.

Simple Change Management. This can be really simple, but it is essential. A notice to affected parties, even if the change is supposed to be non-disruptive, a checklist of exactly what you are going to do, and a checklist of exactly how you are going to undo what you just did are the first steps. You need to know who changed what and when and why they changed it. If you have no change process, a simple system will improve availability. If you already have a system, making it more complex might or might not improve availability.

Neat cabling and racks. You cannot reliably touch a rack that is a mess. You'll break something that wasn't supposed to be down, you'll disconnect the wrong wire and generally wreak havoc on your production systems.

Installation Testing. You know that server has functional network multipathing and redundant SAN connections because you tested them when you installed them, and you made sure that when you forced a path failure during your test you got alerted by your monitoring system, right?

Reducing MTTR

Once the system is down, the clock starts ticking. Obviously your system is unavailable until you resolve the problem, so any time shaved off the problem resolution part of the equation increases availability.

Structured Systems Management and Change Management. You know where your docs are, right? And you can access them from home, in the middle of the night, right? You also know what last changed and who changed it, right? You know what server is on what jack on what switch, right? Finding that out after you are down adds to your MTTR.

Clustering, active/passive fail over. In many cases, the fastest MTTR is a simple active/passive fail over. Pairs of redundant firewalls, load balanced app servers, and clustered database and file servers do not improve your MTBF. They may, in fact, because of the complexity factor, increase your likelihood of failure. But when properly configured, they greatly decrease the time it takes to resolve the failure. A failed component on a non-redundant server is probably an 8 hour outage, when you consider the time to detect the problem, alert on-call staff in the middle of the night, call a vendor, wait for parts and installation, and restart the server. That would likely be a 3 minute cluster fail over on a simple active/passive fail over pair.
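To put rough numbers on that difference, using the steady-state availability ratio MTBF / (MTBF + MTTR) and an invented-for-illustration MTBF of one failure per year, cutting MTTR from 8 hours to 3 minutes moves you from about three nines to about five nines:

```python
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 8760  # hours; assume one unexpected failure per year

standalone = availability(mtbf, 8)       # 8-hour repair, single server
clustered = availability(mtbf, 3 / 60)   # 3-minute active/passive failover

print(f"standalone {standalone:.5f}  clustered {clustered:.5f}")
```

The MTBF is the same in both cases; only the repair time changed.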

Failover doesn't always work though. We have had situations where the problem was ambiguous: service is affected, but the clustering heartbeat is not, so the inactive device or server doesn't come online. In that case, the MTTR is the time it takes for your system manager to figure out that the cluster isn't going to fail over on its own, remote console in to the lights-out board on the misbehaving server, and disable it. The passive node then has an unambiguous decision to make, and odds are it will make the decision to become active.

Monitoring and logging. Your predictive failure messages, syslog messages, netflow logs, firewall logs, event logs and perfmon data will all be essential to quick problem resolution. If you don't have logging and alerting on simple parameters like high CPU, disks full, memory ECC error counts, process failures, or if you can't readily access that information, your response time will be increased by the amount of time it takes to gather that simple data.

Paging, on-call. Your people need to know when your stuff fails, and if they are on call, they need a good set of tools at home, ready to use. Your MTTR clock starts when the system fails. You start resolving the problem when your people are connected to their toolkit, ready to start troubleshooting.

Remote Consoles and KVM's. You really need remote access to your servers, even if you work in a cube right next to the data center. You don't live next to the server, right? And you probably only work 12 hours a day, so for the other 12 hours a day you need remote access.

Service Contracts and Spares. Your ability to quickly resolve an outage likely will depend on how soon you can bring your vendor's tech support online. That means having up to date support contracts, with phone numbers and escalation processes documented and readily available. It also means that you need to have appropriate contracts in place at levels of support that match your availability requirements. Your vendors need to be on the hook at least as badly as you are, and you need to have worked out the process for escalating to them before you have a failure.

Tested Recovery. You need to know, because you have tested and documented it, how you are going to recover a failed server or device. You cannot figure it out on the fly. Your MTTR window isn't long enough.

Related Posts:
Estimating Availability of Simple Systems – Introduction
Estimating Availability of Simple Systems - Non-redundant
Estimating Availability of Simple Systems - Redundant
Availability, Complexity and the Person-Factor
(2008-07-19 -Updated links, minor edits)

Sunday, March 23, 2008

A Redundant Array of Inexpensive Flash Drives

A redundant ZFS volume on flash drives? A RAID set on flash drives obviously has no practical value right now, but perhaps someday small redundant storage devices will be common. So why not experiment a bit and see what they look like?

Start with an old Sun desktop, Solaris 10 08/07, and a cheap USB 2.0 PCI card. Add a powered USB hub and a handful of cheap 1GB flash drives.

Plug it in, watch it light up.



If you've got functioning USB drivers on your Solaris install you should see the drives recognized by the kernel, and the links in /dev should automatically get created. USB events get logged to /var/adm/messages, so a quick check there should tell you if the flash drives are recognized. If you can't figure out what the drives got named in /dev, you should be able to match the device names up with the descriptions in /var/adm/messages. In my case, they ended up as c2t0d0, c3t0d0, c4t0d0, c5t0d0.

For this project, I didn't want the volume manager to automatically mount the drives, so I temporarily stopped volume management.

# svcs volfs
STATE          STIME    FMRI
online         21:50:41 svc:/system/filesystem/volfs:default

# svcadm disable volfs

# svcs volfs
STATE          STIME    FMRI
disabled       22:38:15 svc:/system/filesystem/volfs:default


I labeled each of the drives:


# fdisk -E /dev/rdsk/c2t0d0s2
# fdisk -E /dev/rdsk/c3t0d0s2
# fdisk -E /dev/rdsk/c4t0d0s2
# fdisk -E /dev/rdsk/c5t0d0s2


Then I created a RAIDZ pool called 'test' using all four drives.

# zpool create test raidz c2t0d0s2 c3t0d0s2 c4t0d0s2 c5t0d0s2

The pool got built and mounted in a couple seconds.



# zpool status
  pool: test
 state: ONLINE
 scrub: scrub completed with 0 errors on Sun Mar 23 21:55:32 2008
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c2t0d0s2  ONLINE       0     0     0
            c3t0d0s2  ONLINE       0     0     0
            c4t0d0s2  ONLINE       0     0     0
            c5t0d0s2  ONLINE       0     0     0

errors: No known data errors


The file system shows up as available:


# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   105K  2.81G  36.7K  /test


I'm not sure what I'm going to do with it yet. It certainly isn't something I'll carry around in my pocket.

But if someone would make a USB stick that was the size of a current USB drive, and let me plug 4 or 5 8GB micro SD's in to it, and build the RAID magic into the drive, I'd have a whole bunch of storage in a pocketable form factor and redundancy as a bonus.

Availability, Complexity and the Person-Factor

I am trying out a new hypothesis:

When person-resources are constrained, highest availability is achieved when the system is designed with the minimum complexity necessary to meet availability requirements.

My hypothesis, that minimizing complexity maximizes availability, assumes that in an environment where the number of persons is constrained or fixed, as systems become more complex the human factors in system failure and resolution become more important than technology factors.

This hypothesis also assumes that increased system availability generally presumes an increase in complexity. I am basing this on a simple analysis of availability combined with extensive experience managing technology.

Person Resources vs Complexity

  • As availability requirements are increased, the technology required becomes more complex.
  • As the technology gets more complex, the person-resources required to manage the technology increase.
  • Resources are generally constrained, so the ideal resource allocation is unlikely to occur in the real world.



In an ideal organization, as system availability requirements go up, both person-resources and technology resources would increase as necessary to support the increased availability requirement. The relationship between availability and person resources should look something like the first chart. In theory the initial investment in structured system management and simple redundancy will result in a large improvement in availability relative to the resources spent. Moving from ad-hoc system management toward structured system management will result in fewer unplanned downtimes and less time spent troubleshooting problems, so both MTBF and MTTR should improve. Moving from non-redundant systems to simple redundancy (load balanced app servers, active/passive firewall and network failover, active/passive clustering, etc) will result in faster recovery time on failures, so even though the MTBF will not improve, MTTR will improve, therefore availability will improve.

When simple active/passive redundancy is no longer adequate to achieve required availability, system complexity is greatly increased. Availability targets that require active/active clustering, multi-homed servers, redundant data centers, and layer-2 network redundancy with sub-minute recovery times require more person-resources relative to the resulting increase in availability. If person-resources are added along with the necessary technology resources, the availability will continue to increase.

If, however, person-resources are not available to support the increased complexity brought on by the increased availability requirements, the availability curve will look something like the second chart. The systems will be complex to manage, but the existing person-resources, if not supplemented, will be unable to adequately design, test and deploy the more complex environment. Most importantly though, in the event of a failure of the more complex environment, more time will be spent troubleshooting and resolving problems, potentially increasing MTTR and decreasing availability.

This may be nothing more than a restatement of the K.I.S.S. principle.

Related Posts:

Estimating Availability of Simple Systems – Introduction

Estimating Availability of Simple Systems - Non-redundant

Estimating Availability of Simple Systems – Redundant

Availability, Longer MTBF and shorter MTTR

Availability - MTBF, MTTR and the Human Factor

(2008-07-19 -Updated links, minor edits)

Tuesday, March 18, 2008

Tethered or Untethered?

I don't want to be tethered.

From The Free Dictionary:
A rope, chain, or similar restraint for holding an animal in place, allowing a short radius in which it can move about.

I don't think I am an animal, at least in the sense that animals are somehow distinguished from humans.

How about:
A similar ropelike restraint used as a safety measure, especially for young children and astronauts.
Nope - I'm not a young child, nor an astronaut. Maybe I wanted to be an astronaut when I was a young child, but that doesn't count.

Try:
The extent or limit of one's resources, abilities, or endurance: drought-stricken farmers at the end of their tether.
I might be at the end of my tether. But only because I'm tethered when I want to be untethered.

When we first connected two computers together to form a network, we created a connection, or a tether, that mechanically bound the computers to each other. That binding lasted until early wireless technology permitted a computer to be on a network without being wired to that network. Now we freely roam about on wireless networks, untethered. Sort of. We still have to be near a wireless network access point, but that's not too hard to do nowadays.

So we are untethered, right? Not quite - we still need our power cords, wall warts, power bricks and international adapter plugs. But to be tethered to a wall only by power requirements is a huge advancement in the overall state of personal freedom. We can go for a couple hours at a time freely moving about the home, workplace or nerd conference, and enjoy the ability to move about unhindered by a 'rope, chain or similar restraint'.

Until the energy runs out. Then we frantically re-attach ourselves to our tethers, this time the -/+ 12vdc ones. And as the day goes on, and we burn our battery reserves, we find ourselves budgeting our wall-connect time, scrambling for conference tables near the wall jack, pulling extension cords out of oversized briefcases (or undersized steamer trunks on wheels) in a desperate attempt to stay attached to the wireless network until the last BOF of the day. Yep - we frantically scramble to wire ourselves to a wall so that we can stay connected to a wireless network.

Of course as the day goes on, we find that the ratio of wall-tethered to untethered time isn't quite enough to keep us alive all day, and we inevitably enter the downhill spiral of ever shorter periods of untethered freedom followed by ever longer attachments to wall plugs, until we finally can leave our leash and our oversized briefcases (or undersized steamer trunks on wheels) in the hotel room and enjoy the conference.

Somewhere along the line we've decided to trade power or performance for battery life, almost as though we are afraid to be untethered for more than 2 or 4 hours. The thought of being completely disconnected, free to move more than 3 meters from a wall for more than 2 hours, must scare us. Or else we'd have figured out a solution by now.

There simply are no contemporary laptop computers that have reasonable battery life, where reasonable is defined as somewhere close to a normal day. The best we've come up with is 2 hours of freedom for $1000, or 4 hours of freedom for $2000. Or in the case of my home notebook, one hour of freedom for $500.

Annoying.

Monday, March 17, 2008

Security and Availability

A theory from Amrit Williams:

Companies spend money on security when:

  • They have a security incident.
  • They have to comply with a regulation or mandate.
  • The lack of security affects availability.

I don't disagree with this at all. But if true, I must be lucky. Where I work, most of us believe that as custodians of other people's data, we have a professional and moral obligation to protect that data from exposure or alteration. We also believe that as custodians of other people's tax dollars, we have an obligation to make wise, frugal choices on what resources we spend protecting other people's data.

I like it that way.

Introducing a new technology to an enterprise (ZFS)

The introduction of something as critical as a new file system results in an interesting exercise in introducing and managing new technology. Like most small or medium sized shops, we have limitations on our ability to experiment, test and QA new technology. Our engineering and operations staff together is a small handful of persons per technology. Dedicated test labs barely exist and all of our people have daily operational and on-call roles with no formal 'play' time. Spending large blocks of time on things that are too far ahead of where we are today isn't feasible. Yet the pace of technology introduction dictates that we do not slide too far behind the curve on things that are critical to our enterprise.

So how do you go about introducing something this critical to an enterprise under that sort of constraint? We try to find a mix of caution, mitigated risk taking and methodical deployment. Our resources do not permit dedicated test staff or formal test plans, so we compensate and reduce risk by methodical and measured deployment.

Introducing ZFS:

I'm assuming that most would agree that file systems are probably the most critical technology that IT professionals manage. Networks tend to be tolerant of occasional loss of packets or scrambled bits. The IP protocol stack tends to be tolerant of that sort of thing, having been designed with enough resiliency built in to recover from all sorts of errors. A file system isn't quite like that. Failure or corruption of a critical file system is certain to be an event that you'll not want to have happen too often in your career. New file systems don't come along very often, and because we tend to be rather risk averse on things like file systems, they tend to be difficult to introduce into an enterprise.

ZFS promises to be significant, but because it is radically different from previous Sun file systems, we have to assume that it will have bugs & need time to get sorted out, and we assume that we need time to get our skill set and operational proficiency focused on the new technology.

The process that we use to introduce the technology will be critical to future availability and performance of our systems.

Here's the path we took:
  • Sanity check
  • Test/lab environment
  • Limited deploy, non critical, non-customer, low I/O load
  • Limited deploy, non critical, non-customer, high I/O load
  • Limited deploy, critical, customer, low I/O load
  • Limited deploy, critical, customer, high I/O load
  • General deployment

Sanity Check:


We started out with a simple sanity check on the technology.

  • Does it offer significant advantages over current technology?
  • Does it appear to solve an identified operational or security problem?
  • Could a rough cost/benefit be calculated based on an initial review of the technology?
  • Is this the strategic direction for the vendor, and are we aligned with that vendor's strategic direction?
  • Are the vendor's claims reasonable and verifiable?
  • Will we be able to manage the technology?
  • Will we be able to replace or deprecate some other technology, or is it a duplicate of existing technology?

A pass through the sanity check indicated that we ought to at least spend a few spare cycles looking at ZFS. We spend significant effort in managing UFS file systems using traditional logical volume managers, and we have pain points around dynamically adding & removing disk space for databases and applications. Our current LVM model looks very much like pooled storage, but with the overhead of having to manually manage extents within logical volumes. The promise of pool based storage, similar to our EVA's but at the operating system layer, looked interesting. Sun claimed commitment to ZFS, and we have a significant investment in Sun technology.

So we started to 'play' around with ZFS on a low-priority, off hours basis, to determine if the excitement surrounding the technology was justified, and more importantly, whether the technology would fit with, and add value to, our enterprise hosting service. All the exercises outlined here were informal & ad hoc.

Lab #1:

Our initial exposure to ZFS was a simple series of informal tests on test servers running early access or developer preview ZFS code. We built pools and file systems, first on pools backed by ordinary files on UFS file systems, then on real disk slices. The initial tests were mostly just replicating the simple examples that Sun engineers and others posted about in their blogs. We built & destroyed the pools and file systems, intentionally failed disks and lun's, snapped & cloned, and otherwise explored the basic feature set. Based on these initial tests, we concluded that even this early, the technology was roughly as manageable as our existing technology, and that the potential for simplifying disk management might make the cost of implementation recoverable. In short, it was interesting enough to take a look at in a bit more detail.

Lab #2:

If lab #1 was a simple review of what the reviewers already blogged about, lab #2 was intended to explore the edges a little bit more, primarily looking at how gracefully the file system would fail. We wanted to know whether the technology had well defined edges and would fail cleanly, or at least in a predictable and recoverable manner.

Enter the 'RAIF' (Redundant Array of Inexpensive Flash devices). USB flash drives have some interesting properties. They are cheap, easy to plug in, configure and shuffle around, and they can easily be moved to other computers for testing write failures and to introduce data corruption. We looked at building temporary SCSI arrays, but for various time, space and power reasons, we picked a USB based test platform. The bill of materials was something like:

  • Two cheap USB adapters
  • Two powered 4 port USB hubs
  • 8 USB flash drives of various size (whatever was cheap.)

That was enough to build a handful of different ZFS pools in various configurations, and easily test physical and logical failure modes. The USB drives were pretty good at inducing failures in the file system, so they made a good platform for testing the general resiliency of ZFS and gave us a pretty good idea how well Sun thought through the edges of the technology and how well the file system managed its own failure modes and corner cases. The file system recovered when we expected it to, and failed when we expected it to, and vendor claims were generally verified by our tests. (The performance of the USB driver stack was not considered part of the test.)

Our conclusion was that we should keep looking at the technology on a low priority basis.

Lab #3

From the RAIF we went to a more conventional file system test on ordinary SCSI drives. The goal of this test was to simulate disk I/O load with ordinary test & benchmark suites and compare ZFS to UFS under something that resembles ordinary applications. A series of benchmark-like tests indicated to us that the technology lived up to its claims at least as well as any other new technology, and we agreed that the technology might be valuable to us if we could manage it and if it could be used to replace the UFS file systems that are under a logical volume manager.

Deployment #1, low I/O, low impact.

Eventually, as time permitted, we decided to try a file system on a server that was active, in production, but wasn't failure critical. We essentially gave ZFS a 'test run' by using it on a few servers that are part of our server management infrastructure. Any pain would be felt by our peers, not our customers. If we felt no pain, we could keep moving. By this time Sun had added ZFS to Solaris 10, so we could move ahead on a low impact server and be fully supported by Sun.

The first production obstacle was vendor support for 3rd party tools and utilities on ZFS. Legato qualified ZFS just about the time that we needed it, and our management infrastructure is mostly home-grown, so we didn't have other significant software compatibility issues. The file system performed as designed under the various low use, low impact environments.

Deployment #2, high I/O, low impact.

Our next step was to start using ZFS in places where we have interesting I/O loads, but where we don't have data that is absolutely irreplaceable. At the time that we were ready to move forward with another ZFS implementation, we were also re-engineering our enterprise backup to use a disk pool as a staging area for the tape backups. This gave us an opportunity to test ZFS in parallel to the technology it replaced at greatly reduced risk.

Our first large, high I/O ZFS implementation was a FATA disk pool on an EVA8000 that we use as the staging area for server backup jobs. Because we were in a position where if it didn't work we could back out and re-configure fairly easily, we took advantage of the opportunity and went with ZFS. We started out with a single 2TB LUN, so ZFS sees the LUN as one large disk, not many small disks. The performance was excellent, and because it was trouble free, we used ZFS for the entire disk pool.

This pool is now more or less made up of five 2TB LUNs in a single pool, for a total of 10TB. We initially write all Legato save sets to this disk pool, then clone them out to other media, both real tape & virtual tape. So far, that ZFS file system has performed very well. We routinely read & write well over 100MBps to the pool, or somewhere around 10TB per weekend, with no significant issues directly related to the file system. (There are kernel issues indirectly related to the file system that sometimes affect performance, but the file system itself works.)

ZFS pool #2 is a syslog server. We spool tens of thousands of syslog, apache and netflow log entries per second to a 2+TB disk pool on older Sun storage on a first generation T2000. That pool works as expected.

Both of these systems would be non-customer affecting if they failed. Backups can be re-run, and logs can be recovered from tape.

Deployment #3, low I/O, high impact.

We have other ZFS file systems in various spots where we have the opportunity to experiment, but now we are also using ZFS for production, customer-impact systems, though not on production databases. The customer facing implementations are all covered by load balancing or some other non-file-system-dependent redundancy. So far they all work as expected. We are also working through the details of hosting zones on ZFS file systems, with the intent of giving us more flexibility in hosting zoned applications. We have not yet put large Oracle instances on ZFS. For us, database file systems are the most critical, so we are the most cautious.

Future: high I/O, high impact.

We are exercising ZFS and gaining enough operational experience that we should soon be comfortable moving toward general, unrestricted deployment of ZFS. Our next implementation should be a customer facing, high I/O application, but probably not a database server. Right now we do not have an application that fits those requirements, so this phase is delayed.

General Deployment

Barring any major problems with the current ZFS implementations, and the above-mentioned kernel issue aside, it looks to us like ZFS will eventually be our default file system, to be used generally across Solaris servers. Our rollout has spanned almost two years, but we have had no setbacks, compatibility problems, outages or data loss related to any of the ZFS implementations.

Friday, March 14, 2008

Unconstraining a constrained resource

When a technology that is normally constrained is presented with unlimited resources, what will it do?

We've had a couple of interesting examples of what happens when software that is normally constrained by memory has that constraint removed. The results were interesting, or amusing, depending on your point of view.

Too much memory

The first time we ran into something like this was when we installed a new SQL server cluster on IBM x460 hardware. We bought the server to resolve a serious database performance problem on an application that was growing faster than we could purchase and install new hardware. To get ahead of the growth curve, we figured that we had to more than double the performance of the database server. And of course while we were at it, we'd cluster it also. And because HP wouldn't support Windows 2003 Datacenter Edition on an HP EVA, we ended up buying an IBM DS4800 also. And because we had neither the time nor the resources to spend endless hours tuning the I/O, we'd get enough memory that the database working set was always in memory. And no, the application vendor wouldn't support any form of horizontal scalability whatsoever.

So after a month of research, a month to get a decision, a month to get through our purchasing process, and a month for IBM to build the server, we got two new x460 16-CPU Xeon servers and a DS4800, all pre-racked and ready to boot. The servers hit the new datacenter floor at noon on Dec 24th. The application had to be moved to the new datacenter on the new servers and be ready for production on January 6th. We were stressed. The schedule was something like Day 1: build servers. Day 2: build SAN. Day 3: cluster and failure testing (and so on ....)

When we got to 'Day 4: Install SQL server 2000' we ran into a pretty significant issue. It wouldn't run. SQL server 2000 simply would not start. It gave a cryptic error message and stopped. Microsoft Tier 3 SQL support to the rescue. They had never heard of THAT error message. Escalate...wait...escalate...wait...bring IBM Advanced Support into the conference call...escalate...wait...now there are two of us and 6 of them on the call...finally:
'We've got a bug ID on SQL server 2005, that sounds similar. But it only occurs when you run a 32 bit database on a server with more than 64GB of memory.'
Hmm.... we are on 32 bit 2000, not 2005, but we do have 128GB of memory, so maybe?

The workaround suggested was to remove memory. But now it is late on a Friday, and we are going live next Wednesday, and spending hours carefully removing memory sounded like a headache. Fortunately /burnmem worked, the server only saw 64GB of memory, and SQL2000 was happy to run, though slightly confused. The cut over was successful, and we survived spring semester start up with a database that ran 16 CPU's at 70% busy on our peak day (instead of an 8 CPU database server at 140% CPU).

It probably never occurred to the SQL server developers, back when the database memory model was designed and database boot/init/load code was written, that a customer would try to run a 32-bit database with 128GB of memory. No way. What did the software do? It croaked. It rolled over and died. It didn't even try.

(That server lasted a semester. The application utilization continued to grow, and the two 16 CPU servers became 16 dual cores @ 50% utilization before the start of the next semester.)

Too much memory - Part 2

Fast forward (or move the slider bar on your OS X Time Machine) a year and a half. The app is still growing. A couple rounds of tuning, vendor code re-writes, and a couple of Microsoft engineering on-sites tells us that we really, really want to be running this thing with all 64 bits & SQL server 2005.

So we take the plunge. Windows 2003 Datacenter, x64, this time with SQL server 2005, all 64 bits, and all 128GB of memory. Life is good, right?

Not quite.

An interesting question: What does SQL server 2005 64-bit do when you feed it an application that forces it to parse, compile and cache hundreds of new queries every second, when the queries are such that they can't be parameterized? It parses, compiles and caches all of them. As long as the procedure cache is constrained, all you get is an inefficient database server and a high cache churn rate. But it works. When there is no constraint? Its procedure cache gets too big, and it gets really, really unhappy. And so do the hundred thousand students who thought they were going to use the application that day.

As best as we can figure, based on a 13 hour tech support call and handful of PSSDiags & such, we fed new, unparameterizable queries into the database engine, it cached them, and when the procedure cache got up somewhere around 8GB, the server pretty much spent all its cycles mucking around with managing its cache at the expense of doing useful database-like work. The cache never got trimmed or purged, probably because whatever process does that didn't see a reason to expire old cache entries. There was plenty of memory.

Fix the app, patch the database servers, and monitor the procedure cache, and all is well.

When a technology that is normally constrained is presented with unlimited resources, what will it do?

It'll probably croak.

Thursday, March 13, 2008

Autodeploying Servers - A Proof of Concept

A couple years ago, during the brief pause in the middle of a semester, I figured that some time spent re-thinking how we deploy remote servers might be time well spent. Fortunately we have sharp sysadmins who like challenges.

Our network is run by a very small group that has to cover a rather large state, 7x24. Driving a 600 mile round trip to swap some gear out isn't exactly fun, so we tend to be very cautious about what we deploy & where we deploy it. I've always had a strong preference for hardware that runs forever, and my background in mechanical sorts of things tells me that moving parts are generally bad. So if I have a choice, and if the device is a long ways away, I'll pick a device that boots from flash over an HDD any time.

That ties in with my thoughts on installing devices in general: we ought to design & build to the 'least bit' principle, where we minimize the software footprint, file system permissions, ports, protocols, etc., as much as possible while still maintaining required functionality.

Anyway, at the time, we'd had Snort IDS's out state for quite a while, and we were thinking that they needed a hardware swap. But I wasn't enthused about installing more spinning media 5 hours from home.

So that led to a proof-of-concept.
  • How close could we come, with no money, to making an IDS that had no moving parts?
  • How minimal could the software footprint be?
  • How would we reliably upgrade, patch and maintain the remote IDS's without travel?
  • Could we make the remote IDS's 'self deploying' (with no persistent per-device configuration)?
  • Could we run an Intel PC headless (no keyboard/monitor) and manage it via 9600,n,8,1?
Knoppix

We started by taking a look at the CD-bootable Knoppix. At the time, it didn't like being run headless. There was no access to the serial port early in the boot process, so if it didn't boot, we wouldn't know why. That doesn't work too well for remote devices, but fortunately there was enough prior work in that space to get us a bootable Knoppix that lit up the serial port early enough to make us happy. Booting from flash would have been better, but a spinning CD is way closer to 'no moving parts' than a spinning HDD, and we didn't have any no-cost flash-bootable computers laying around.

It was still way too fat to fit the 'least bit' principle, and there still was no easy way of maintaining the Snort binaries and configs without touching each server. I didn't want to have to do that. So we stripped Knoppix down to something reasonably sized, convinced it to look at the serial port instead of the keyboard, burned a new image, and took a look at how to maintain Snort.

We had just prototyped a process that used SVN to manage and deploy the binaries for a J2EE application, so why not look at SVN for Snort? Check the Snort RPMs, compiled binaries, libraries and rules into SVN, and when they change, deploy from SVN to the PC acting as the IDS. Snort generates logs, but it can log to a remote database, so persistent, writable storage is really not necessary. The spinning media problem more or less goes away. So far, so good.

Self Deploy

A long time ago, Cisco made routers that were smart enough to do an ARP/RARP/BOOTP & tftp to find a config. That meant that if the router, brand new and unconfigured, were attached to a network and found a tftp server with a 'router.conf' on it, the router would boot that config. That made it possible to buy a new router, ship it directly to a remote site, and with no router-savvy person on site, install the router unconfigured, have it 'self configure' and become a useful router. All you needed was an upstream network device (like the router on the near side of the T1) that was smart enough to feed a couple of bits of information to the new, unconfigured router, and the router would 'self deploy'.

So I wanted our new toys to 'self deploy'. Send a new, unconfigured, generic PC to a remote site, plug it in and have it learn enough from the network to figure out who it is and what it's supposed to do.

Having a minimal Knoppix-like boot/run CD that talked to its serial port in a reasonable manner in hand, we figured that the network could feed the server enough information to make it useful, like an IP address and gateway. But getting Snort and related bits onto the new server needed some more thinking. If we put Snort on the CD, we'd always have to burn new CDs when Snort needed updating, and we'd still have to manage the constantly changing Snort rules. That would either require lots of CD burning or persistent writable storage.

What we came up with was - put the Snort packages in SVN at the home base, add an SVN client to the boot CD, and install Snort and related dependencies from SVN on the fly on each boot. The remote server would boot from CD, learn who it is from the network (DHCP), figure out where its SVN repository was (either DNS or DHCP options), look up its MAC address in the SVN repository, and check out whatever it saw. If what it saw were Snort binaries, it checked them out & installed them.

Boot -> Learn -> Install -> Run.
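That Boot -> Learn -> Install -> Run sequence is simple enough to sketch. A rough Python illustration of the boot-time logic (the repository layout, URL scheme and install.sh hook here are hypothetical assumptions, not our actual implementation):

```python
import subprocess

def repo_url_for(mac: str, repo_base: str) -> str:
    """Map a sensor's MAC address to its checkout URL. The
    hosts/<mac>/trunk layout is an assumed convention, keyed on
    the one identifier the hardware already knows about itself."""
    normalized = mac.lower().replace(":", "").replace("-", "")
    return f"{repo_base}/hosts/{normalized}/trunk"

def bootstrap(mac: str, repo_base: str, checkout_dir: str = "/opt/payload") -> None:
    """Boot-time sequence: learn who we are, check out whatever the
    central repository says this host should run, then run it."""
    url = repo_url_for(mac, repo_base)
    subprocess.run(["svn", "checkout", url, checkout_dir], check=True)
    # Whatever was checked out decides what this box becomes:
    subprocess.run([f"{checkout_dir}/install.sh"], check=True)
```

The point is that nothing on the CD is per-device; change what sits in SVN under a given MAC address and the box becomes something else on its next boot.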

To upgrade the boot image, we'd still have to ship a CD to the remote site. But that doesn't happen too often. To upgrade Snort, we'd check a new version of Snort into the master SVN repository at the home base & reboot the remote servers. They'd learn about it on bootup. If we wanted to tweak the config on a sensor, we'd check a new config into SVN and reboot the remote.

Pretty simple.

And - If we didn't want the remote to be a Snort sensor anymore, we could check some other bits into SVN, reboot the remotes & they'd happily check out, install & run the new bits. Turn it into a web server? Just check in new binaries & reboot the remote.

It worked.

But then real projects came along, so we never deployed it.

Saturday, March 8, 2008

On Least Bit Installations and Structured Application Deployment

I ran across this post by 'ADK'.

It's not done until it's deployed and working!

This seems so obvious, yet my experience hosting a couple large ERP-like OLTP applications indicates that sometimes the obvious needs to be made obvious. Because obviously it isn't obvious.

A few years ago, when I inherited the app server environment for one of the ERP-like applications (from another workgroup), I decided to take a look around and see how things were installed, secured, managed and monitored. The punch list of things to fix got pretty long, pretty quickly. Other than the obvious 'everyone is root' and 'everything runs as root' type of problems, one big red flag was the deployment process. We had two production app servers. They were not identical. Not even close. The simplest things, like the operating system rev & patch level, the location of configuration files, file system rights and application installation locations were different, and in probably the most amusing and dangerous application configuration I've ever seen, the critical config files on one of the app servers were in the samples directory.

So how do we fix a mess like that? We started over, twice (or three times).

Our first pass through was intended to fix obvious security problems and to make all servers identical. We deployed a standard vendor operating system installation on four identical servers, 1 QA, 3 production, deployed the necessary JBoss builds on each of the servers in a fairly standard install, and migrated the application to the new servers. This fixed some immediate and glaring problems and gave us the experience we needed to take on the next couple of steps.

The second pass through the environment was designed to get us closer to the 'least bit' installation best practice, clean up some ugly library and application dependencies, and develop a standardized, scripted deployment, guaranteed to be identical on each servers, and guaranteed to be identical to test/QA.

The first part, the 'least bit' installation, simply means that in the JBoss directory structure, if one were to remove a single bit from file system permissions, config files, or the application itself, the app would break. This ensures that file system permissions are optimal and that the application has no extra junk (sample configs, sample JDBC connections) laying around that can cause security or availability problems.

The application deployment process that we developed was very interesting. We decided that we wanted a completely version controlled deployment that included every file the application needs for functionality, other than the vendor provided operating system files. We checked the entire JBoss application, including the binaries, configs, war's, jar's & whatever, into Subversion (SVN). The deployment is essentially a process that checks out the entire application from SVN, drops it onto a production app server and removes the entire old application. The idea is that we not only know what is deployed on each server and that all servers are identical, we also know exactly what was deployed on all servers at any arbitrary point in the past.
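Because the whole application tree is one versioned artifact, the swap step itself can stay simple. A minimal sketch of the idea (the paths and archive naming are illustrative, not our actual scripts):

```python
import shutil
import time
from pathlib import Path

def swap_in(new_build: Path, app_root: Path, archive_root: Path) -> Path:
    """Archive the running application tree and move a freshly
    checked-out build into its place. Returns the archive location."""
    archive_root.mkdir(parents=True, exist_ok=True)
    archived = archive_root / f"jboss-{time.strftime('%Y%m%d-%H%M%S')}"
    if app_root.exists():
        shutil.move(str(app_root), str(archived))  # keep the old tree for rollback
    shutil.move(str(new_build), str(app_root))     # the entire app, binaries and configs
    return archived
```

Since every deploy replaces the whole tree, 'what exactly is running on server N?' is always answerable with an SVN revision number.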

The process to get the application ready to deploy is also version controlled and scripted. Our dev team builds a JBoss instance and their application binaries and configs using their normal dev tool kit. When they think that they have a deployable application, including the code they wrote, the JBoss application and any configuration files, they package up the application, check it into a special deployment SVN repository, and deploy the entire application to a QA/test environment.  They've got the ability to deploy to the test/QA servers any time they want, but they do not have the ability to modify the QA/test environment other than through the scripted, version controlled deployment process. If the app doesn't work, they re-configure, bug fix, check it back into the deployment SVN repository and re-deploy to QA/test.  Once the app is QA'd and tested, a deployment czar in the dev team modifies any config files that need to be different between QA and prod (like database connect strings) and commits the application back into the deployment SVN repository.

Then we get a note to 'deploy version 1652 to prod next Monday'. On Monday, we take each app server offline, in turn,  run scripts that archive the old installation, copy the entire new application to the production server, test it, and bring the app server back on line. Repeat once for each app server & we are done. 

We made a third pass through the app servers. This time we changed the OS version, the platform and we implemented basic Solaris 'Zones'  to give us a sort of pseudo-virtualization, and we applied the 'least bit' principle to the operating system itself (reducing its disk footprint from 6GB down to 1GB). We also, as a result of the systematic deployment process, have the ability to light up new app servers fairly easily. A Solaris WAN boot from a standard image, a bit of one-time tweaking of the OS (hostname, IP address) and a deploy of the JBoss app from SVN gets us pretty close to a production app server. 

We have a ways to go yet. Some parts of the deploy process are clunky, and our testing methodology is pretty painful and time consuming, and we are not fully 'containerized' into Solaris zones.  The dev team wants at least one more pre-production environment, and we need to make the deployment process less time consuming. 

We aren't done yet.


Tuesday, March 4, 2008

Availability - MTBF, MTTR and the Human Factor

We've got a couple of data centers. By most measures, they are small, at just over 100 servers between the two of them. We've been working for the last year to build the second data center with the specific design goal of being able to fail over mission critical applications to the secondary data center during extended outages. Since the goal of the second data center is specifically to improve application availability, I decided to rough in some calculations on what kind of design we'd need to meet our availability and recovery goals.

Start with MTBF (Mean Time Between Failure). In theory, each component that makes up the infrastructure that supports an application has an MTBF, typically measured in thousands of hours. The second factor in calculating availability is MTTR (Mean Time To Recovery or Repair), measured in hours or minutes. Knowing the average time between failures, and the average time to repair or recover from the failure, one should be able to make an approximate prediction of the availability of a particular component.

Systems, no matter how complex, are made up of components. In theory, one can calculate the MTBF and MTTR of each component and derive the availability of an entire system. This can get to be a fairly involved calculation, and I'm sure somewhere there are people who make a living doing MTBF/MTTR calculations.

If this is at all interesting, read eventhelix.com's description of MTBF/MTTR and availability. Then follow along with my calculations, and amuse yourself at the conclusion. Remember, all the calculations are real rough. I was trying to get an order of magnitude estimate based on easily obtainable data, not an exact number.

Raw MTBF Data:

Single Hard Drive, high end manufacturer: 1.4 million hours
Single DIMM, moderately priced manufacturer: 4 million hours
Single CPU: 1 million hours
Single network device (switch): 200,000 hours
Single person, non-redundant: 2000 hours

OK - the Single Person, non-redundant I just made up. But I figure that each employee, if left alone with no constraints (change review, change management), will probably screw up once a year.

MTTR is simply an estimate of how long it takes to restore service in the event of a failure. For example, in the case of a failed non-redundant HDD that has a server OS and lots of data on it, the MTTR would be the time it takes to call the vendor, have a spare delivered, rebuild the server and recover everything from tape backup. Figure 8 hours for a simple server. Or, in the case of a memory DIMM, I figure one hour to get through the tech support queue at the manufacturer, one hour to argue with them about your contract terms & conditions, 4 hours to deliver the DIMM, and an hour to install and re-boot, or about 7 hours.

In the case of redundant or clustered devices, the MTTR is effectively the time it takes for the clustering or high availability software to figure out what has failed and take the failed device off line or route around it. Figure somewhere between 1 and 3 minutes.

To convert MTBF and MTTR to availability, some math is needed.

If systems have more than one component, and if all components are required for the system to operate, then the availability of the system is calculated by multiplying the availability of each component together to obtain the system availability. If the components are redundant, and if each of the redundant components is fully functional alone, then the availability of the system is (1 - (1-A)^n). (Or - use an online calculator.)

Obviously in the case of redundant components, the system is only failed when both redundant components fail at the same time. That tends to be rare, so redundant systems can have much better availability.
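Those two rules are easier to see in code than in prose. A quick sketch of the standard model (textbook formulas, not a vendor tool):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of one component: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*component_availabilities: float) -> float:
    """All components required: the individual availabilities multiply."""
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result

def parallel(a: float, n: int) -> float:
    """n redundant components, any one sufficient: 1 - (1 - A)^n."""
    return 1 - (1 - a) ** n
```

For example, a single switch at 200,000 hours MTBF and 8 hours MTTR comes out around 99.996% available on its own, while a redundant pair of them is better than six 9's.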

How does this work out using some rough numbers?

First of all, an interesting problem immediately becomes evident. For maximum performance, I'd build a server with more CPUs, more memory DIMMs, more network cards, and more disk controllers. The component count goes up. If I build the server without redundancy, the server is less reliable. But faster. Most server vendors compensate for this by building some components with error correction capability, so that even though a component fails, the system still runs. But more components, in theory, will result in a less reliable server. So in rough numbers, using only memory DIMMs in the calculation, and assuming no error correction or redundancy:

DIMM Count MTBF (Hours) MTTR (Hours) Availability
1 4 million 8 hours 99.9998%
4 4 million 8 hours 99.9992%
8 4 million 8 hours 99.9984%
16 4 million 8 hours 99.9968%
32 4 million 8 hours 99.9936%
64 4 million 8 hours 99.9872%

Adding non-redundant components doesn't help availability. A high end server with lots of DIMMs is already at four 9's, without even considering all the other components in the system.

In clustered, or high availability systems, the availability calculation changes dramatically. In the simple case of active/passive failover, like Cisco firewalls or Microsoft clustering, the MTBF doesn't change, but the MTTR does. The MTTR is essentially the cluster failover time, usually a minute or so. Now the time that it takes the tech to show up with parts is no longer relevant.

Take the same numbers above, but make the MTTR 3 minutes (cluster failover time) and the theoretical availability goes way up.

DIMM Count MTBF (Hours) MTTR (Hours) Availability
1 4 million .05 hours 99.9999%
4 4 million .05 hours 99.9999%
8 4 million .05 hours 99.9999%
16 4 million .05 hours 99.9999%
32 4 million .05 hours 99.9999%
64 4 million .05 hours 99.9999%


Active/passive clustering doesn't change MTBF, but it does change MTTR, and therefore availability. I ran through a handful of calculations, and figured out that if all we have is a reasonable level of active/passive redundancy on devices and servers, and a reasonably well designed network and storage, we should be able to meet our availability goals.
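The two tables above are just the series calculation run with different MTTRs. A quick check of the arithmetic (note that the 64-DIMM row at 8 hours MTTR works out to roughly 99.9872%):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Single component availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Every DIMM is required, so the per-DIMM availabilities multiply.
for mttr in (8, 0.05):  # wait-for-the-vendor repair vs. cluster failover
    print(f"MTTR {mttr} hours:")
    for count in (1, 4, 8, 16, 32, 64):
        a = availability(4_000_000, mttr) ** count
        print(f"  {count:2d} DIMMs: {a * 100:.4f}%")
```

Shrinking the MTTR from 8 hours to 3 minutes buys back everything the extra component count costs, which is the whole argument for failover.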

Except......

The Human Factor:

So I was well on my way down this path, trying to determine if Layer-3 WAN redundancy was adequate to meet availability goals, and if active/passive clusters combined with dual switch fabrics and HA load balancers and firewalls will result in MTBF and MTTR's that make sense, when I decided to figure in the human factor.

I figure that the more complex the system, the higher the probability of a human induced failure, and that humans have a failure rate that is based not only on skill and experience, but also on the structure of the processes that are used to manage the systems.

Assume, as above, that a person makes one error per year, and that the error results in 30 minutes of downtime. That already drops you down to 99.99% availability. That's without considering any hardware, software, power or other outages. If you figure that you have a handful of persons that are in a position to make an error that results in downtime, you've got a problem.
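The same arithmetic applies to the made-up human numbers (a sketch; the error rate and outage length are guesses, as above):

```python
HOURS_PER_YEAR = 8766  # 365.25 days

def human_availability(errors_per_year: float, downtime_hours_per_error: float) -> float:
    """Treat a person like any other component: availability is just
    uptime over total time."""
    return 1 - (errors_per_year * downtime_hours_per_error) / HOURS_PER_YEAR

one_admin = human_availability(1, 0.5)  # one mistake a year, 30 minutes each
five_admins = one_admin ** 5            # five people who can each break production
print(f"one admin: {one_admin * 100:.3f}%, five: {five_admins * 100:.3f}%")
```

Five unconstrained admins land you around 99.97% before a single piece of hardware fails, which is why person-level redundancy and change management matter as much as the clusters.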

Fortunately, persons can be made redundant, and processes (Change Management) can be designed to enforce person-level redundancy.

So now I'm thinking about Change Management, QA environments, test labs.

Related Posts:
Estimating Availability of Simple Systems – Introduction
Estimating Availability of Simple Systems - Non-redundant
Estimating Availability of Simple Systems - Redundant
Availability, Complexity and the Person-Factor
Availability, Longer MTBF and shorter MTTR

(2008-07-19 -Updated links, minor edits)

Sunday, March 2, 2008

The influence of Unix on Windows 2008

Sam Ramji has an excellent technet.com blog post on the influence of 'Open Source' on Windows 2008.

Technet.com also has an interesting interview with Andrew Mason on Windows Server Core that explains some of the fundamentals of Windows 2008.

System admin scripting is finally being elevated to a first class tool for administering servers. This is a fundamental change in direction for Windows system admins, and for me, changes the equation as far as determining the best operating system for deploying an application.

I'm really looking forward to being able to write tools, scripts and utilities to manage Windows servers, instead of the incredibly error-prone human mouse-click & check-box methods that we currently use. Mouse-click management is the worst possible way to ensure that our servers are configured identically and securely.

I'm also looking forward to seeing how close Windows Server Core comes to making it possible to build appliance-like stripped down systems on the Windows platform. I've always favored Unix-like OS's for building DNS, DHCP and similar light weight, single purpose servers. Right now, our standard Solaris build for a server that is fully functional as an Apache web server or DNS is about a 500MB bootable image. That 500MB gets us a fully functional, fully manageable server, suitable for hosting many or most of our applications. If we need a server that has to run Java, the image size almost doubles, but that is still pretty light weight as far as I am concerned. When we churn through the weekly Solaris security patch reports, we have the wonderful ability to draw a line through the vast majority of them with the simple notation: 'Not vulnerable, package not installed....'.

Today, with a Windows 2003 web or application server, we start out with a 20GB boot disk with about 4+ GB in use, not including the swap file, just to get a bootable server. That bootable server is fully loaded, with far more functionality, features, and vulnerable software than we'll ever use. We spend a day per month analyzing the latest patch Tuesday vulnerability list and figuring out if we have mitigation in place for any of the vulnerabilities. It would sure be nice to check 3/4 of them off the list with a simple -'Not vulnerable, package not installed....'.

It looks like Microsoft is starting to think differently about servers. I'm happy about that.