Showing posts from March, 2008

Using RSS for System Status

It's time to re-think how status and availability information gets communicated to system managers. System status is sort of like news, so why not use a news reader? I don't mean using RSS for posting blog-like information to users, like "Our systems were borked last night from 03:00 to 05:00." That is a great idea, and many people already do a fine job in that space. I mean something more like "server16 HTTP response time greater than 150ms". The nerdy stuff.

Most IT organizations have monitoring systems that attempt to tell us the status of individual system components like servers, databases, routers, etc. The general idea is to take what we think are relevant measurements of component performance or availability, record them, and hopefully alert on measurements that fall outside of some boundaries. Some places attempt to aggregate that information into some kind of higher-level view that depicts the status of an application or grou…
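
To make "the nerdy stuff" concrete, here is a minimal sketch of what a status feed might look like, assuming a small Python script sits between the monitoring system and the web server. The feed URL, channel title and measurement details are invented for illustration, not taken from any real monitoring setup.

    # A minimal sketch: turn one monitoring measurement into an RSS 2.0 item
    # that any news reader could poll. Names and numbers are made up.
    from datetime import datetime, timezone
    from email.utils import format_datetime
    from xml.sax.saxutils import escape

    def status_item(title, description):
        """Build a single RSS <item> for one status event."""
        return (
            "<item>"
            f"<title>{escape(title)}</title>"
            f"<description>{escape(description)}</description>"
            f"<pubDate>{format_datetime(datetime.now(timezone.utc))}</pubDate>"
            "</item>"
        )

    feed = (
        '<?xml version="1.0"?>'
        '<rss version="2.0"><channel>'
        "<title>System Status</title>"
        "<link>http://status.example.org/feed</link>"
        "<description>Raw status events for system managers</description>"
        + status_item("server16 HTTP response time greater than 150ms",
                      "Measured 212ms averaged over the last five minutes.")
        + "</channel></rss>"
    )

    print(feed)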

Availability, Longer MTBF and shorter MTTR

A simple analysis of system availability can be broken down into two ideas. First, how long will I go, on average, before I have an unexpected system outage or period of unavailability (MTBF)? Second, when I have an outage, how long will it take to restore service (MTTR)?

Any discussion of availability, MTBF and MTTR can quickly descend into endless talk about exact measurements of availability and response time. That sort of discussion is appropriate in cases where you have availability clauses in contractual obligations or SLAs. What I'll try to do is frame this as a guide to maintaining system availability for an audience that isn't spending consulting dollars and is interested in available systems, not SLA contract language.

Less failure means longer MTBF

What can I do to decrease the likelihood of unexpected system downtime? Here's my list, in the approximate order in which I believe each item affects availability.

Structured System Management. You gain availability if you…

A Redundant Array of Inexpensive Flash Drives

A redundant ZFS volume on flash drives? A RAID set on flash drives obviously has no practical value right now, but perhaps someday small redundant storage devices will be common. So why not experiment a bit and see what they look like.

Start with an old Sun desktop, Solaris 10 08/07, and a cheap USB 2.0 PCI card. Add a powered USB hub and a handful of cheap 1GB flash drives.

Plug it in, watch it light up.



If you've got functioning USB drivers on your Solaris install, you should see the drives recognized by the kernel, and the links in /dev should get created automatically. USB events get logged to /var/adm/messages, so a quick check there should tell you whether the flash drives were recognized. If you can't figure out what the drives got named in /dev, you should be able to match them up with the descriptions in /var/adm/messages. In my case, they ended up as c2t0d0, c3t0d0, c4t0d0, c5t0d0.

For this project, I didn't want the volume manager to automatically mount the drives, so I temporarily stopp…
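
Picking up roughly where the excerpt trails off, here is a hedged sketch of what building the redundant volume might look like once the volume manager is out of the way. The pool name and the exact zpool invocation are my guesses rather than the post's own commands; the device names are the ones listed above.

    # A hypothetical sketch, not the post's exact steps: wrap the Solaris
    # zpool command to build a single-parity raidz pool from the four
    # flash drives. Assumes vold is stopped so the drives aren't auto-mounted.
    import subprocess

    drives = ["c2t0d0", "c3t0d0", "c4t0d0", "c5t0d0"]

    # Create a raidz vdev spanning all four USB flash drives.
    subprocess.run(["zpool", "create", "flashpool", "raidz"] + drives, check=True)

    # Show the resulting pool layout and health.
    subprocess.run(["zpool", "status", "flashpool"], check=True)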

Availability, Complexity and the Person-Factor

I am trying out a new hypothesis:

When person-resources are constrained, the highest availability is achieved when the system is designed with the minimum complexity necessary to meet availability requirements. My hypothesis, that minimizing complexity maximizes availability, assumes that in an environment where the number of persons is constrained or fixed, the human factors in system failure and resolution become more important than the technology factors as systems become more complex.

This hypothesis also assumes that increased system availability generally presumes an increase in complexity. I am basing this on a simple analysis of availability combined with extensive experience managing technology.

Person Resources vs Complexity

As availability requirements increase, the technology required becomes more complex. As the technology gets more complex, the person-resources needed to manage it increase. Resources are generally constrained, so the ideal resour…

Tethered or Untethered?

I don't want to be tethered.

From The Free Dictionary:
A rope, chain, or similar restraint for holding an animal in place, allowing a short radius in which it can move about.

I don't think I am an animal, at least in the sense that animals are somehow distinguished from humans.
How about:
A similar ropelike restraint used as a safety measure, especially for young children and astronauts.
Nope - I'm not a young child, nor an astronaut. Maybe I wanted to be an astronaut when I was a young child, but that doesn't count.

Try:
The extent or limit of one's resources, abilities, or endurance: drought-stricken farmers at the end of their tether.
I might be at the end of my tether. But only because I'm tethered when I want to be untethered.

When we first connected two computers together to form a network, we created a connection, or a tether, that mechanically bound the computers to each other. That binding lasted until early wireless technology permitted a computer to be on a ne…

Security and Availability

A theory from Amrit Williams:
Companies spend money on security when:

They have a security incident.
They have to comply with a regulation or mandate.
The lack of security affects availability.
I don't disagree with this at all. But if true, I must be lucky. Where I work, most of us believe that as custodians of other people's data, we have a professional and moral obligation to protect that data from exposure or alteration. We also believe that as custodians of other people's tax dollars, we have an obligation to make wise, frugal choices about the resources we spend protecting other people's data.

I like it that way.

Introducing a new technology to an enterprise (ZFS)

The introduction of something as critical as a new file system results in an interesting exercise in introducing and managing new technology. Like most small or medium sized shops, we have limitations on our ability to experiment, test and QA new technology. Our engineering and operations staff together is a small handful of persons per technology. Dedicated test labs barely exist and all of our people have daily operational and on-call roles with no formal 'play' time. Spending large blocks of time on things that are too far ahead of where we are today isn't feasible. Yet the pace of technology introduction dictates that we do not slide too far behind the curve on things that are critical to our enterprise.

So how do you go about introducing something this critical to an enterprise under that sort of constraint? We try to find a mix of caution, mitigated risk taking and methodical deployment. Our resources do not permit dedicated test staff or formal test plans, so we comp…

Unconstraining a constrained resource

When a technology that is normally constrained is presented with unlimited resources, what will it do?

We've had a couple of examples of what happens when software that is normally constrained by memory has that constraint removed. The results were interesting, or amusing, depending on your point of view.

Too much memory

The first time we ran into something like this was when we installed a new SQL server cluster on IBM x460 hardware. We bought the server to resolve a serious database performance problem on an application that was growing faster than we could purchase and install new hardware. To get ahead of the growth curve, we figured that we had to more than double the performance of the database server. And of course while we were at it, we'd cluster it also. And because HP wouldn't support Windows 2003 Datacenter Edition on an HP EVA, we ended up buying an IBM DS4800 also. And because we had neither the time nor the resources to spend endless hours tuning t…

Autodeploying Servers - A Proof of Concept

A couple of years ago, during the brief pause in the middle of a semester, I figured that some time spent re-thinking how we deploy remote servers might be time well spent. Fortunately we have sharp sysadmins who like challenges.

Our network is run by a very small group that has to cover a rather large state, 7x24. Driving a 600-mile round trip to swap some gear out isn't exactly fun, so we tend to be very cautious about what we deploy & where we deploy it. I've always had a strong preference for hardware that runs forever, and my background in mechanical sorts of things tells me that moving parts are generally bad. So if I have a choice, and if the device is a long ways away, I'll pick a device that boots from flash over an HDD any time.
That ties in with my thoughts on installing devices in general: we ought to design & build to the 'least bit' principle, where we minimize the software footprint, file system permissions, ports, proto…

On Least Bit Installations and Structured Application Deployment

I ran across this post by 'ADK'.

It's not done until it's deployed and working!

This seems so obvious, yet my experience hosting a couple large ERP-like OLTP applications indicates that sometimes the obvious needs to be made obvious. Because obviously it isn't obvious.

A few years ago, when I inherited the app server environment for one of the ERP-like applications (from another workgroup), I decided to take a look around and see how things were installed, secured, managed and monitored. The punch list of things to fix got pretty long, pretty quickly. Other than the obvious 'everyone is root' and 'everything runs as root' type of problems, one big red flag was the deployment process. We had two production app servers. They were not identical. Not even close. The simplest things, like the operating system rev & patch level, the location of configuration files, file system rights and application installation locations were different, and in probabl…

Availability - MTBF, MTTR and the Human Factor

We've got a couple of data centers. By most measures, they are small, at just over 100 servers between the two of them. We've been working for the last year to build the second data center with the specific design goal of being able to fail over mission critical applications to the secondary data center during extended outages. Since the goal of the second data center is specifically to improve application availability, I decided to rough in some calculations on what kind of design we'd need to meet our availability and recovery goals.

Start with MTBF (Mean Time Between Failure). In theory, each component that makes up the infrastructure that supports an application has an MTBF, typically measured in thousands of hours. The second factor in calculating availability is MTTR (Mean Time to Recovery or Repair), measured in hours or minutes. Knowing the average time between failures, and the average time to repair or recover from the failure, one should be able to make an a…
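
For a rough feel of how those two numbers combine, here is a back-of-the-envelope sketch using invented MTBF and MTTR figures rather than anything from our environment. Per-component availability is MTBF / (MTBF + MTTR), and components that all have to be up at the same time multiply together.

    # A back-of-the-envelope sketch with made-up numbers: estimate end-to-end
    # availability for a chain of components that must all be up at once.
    def availability(mtbf_hours, mttr_hours):
        """Steady-state availability of a single component."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical infrastructure chain: (MTBF hours, MTTR hours).
    components = {
        "network": (20000, 4),
        "server": (10000, 8),
        "storage": (50000, 6),
        "database": (5000, 2),
    }

    overall = 1.0
    for name, (mtbf, mttr) in components.items():
        a = availability(mtbf, mttr)
        overall *= a
        print(f"{name:10s} availability = {a:.5f}")

    downtime_hours = (1 - overall) * 365 * 24
    print(f"end-to-end availability = {overall:.5f}")
    print(f"expected downtime = roughly {downtime_hours:.1f} hours per year")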

The influence of Unix on Windows 2008

Sam Ramji has an excellent technet.com blog post on the influence of 'Open Source' on Windows 2008.

Technet.com also has an interesting interview with Andrew Mason on Windows Server Core that explains some of the fundamentals of Windows 2008.

System admin scripting is finally being elevated to a first-class tool for administering servers. This is a fundamental change in direction for Windows system admins, and for me it changes the equation when determining the best operating system for deploying an application.

I'm really looking forward to being able to write tools, scripts and utilities to manage Windows servers, instead of the incredibly error-prone human mouse-click & check-box methods that we currently use. Mouse-click management is the worst possible way to ensure that our servers are configured identically and securely.

I'm also looking forward to seeing how close Windows Server Core comes to making appliance-like stripped-down systems possible to build us…