Using RSS for System Status

It's time to rethink how status and availability information gets communicated to system managers. System status is sort of like news, so why not use a news reader? I don't mean using RSS for posting blog-like information to users, like "Our systems were borked last night from 03:00 to 05:00". That is a great idea, and many people already do a fine job in that space. I mean something more like "server16 HTTP response time greater than 150ms". The nerdy stuff.

Most IT organizations have monitoring systems that attempt to tell us the status of individual system components like servers, databases, routers, etc. The general idea is to make what we think are relevant measurements of the performance or availability of components, record the measurements, and hopefully alert on measurements that fall outside of some boundaries. Some places attempt to aggregate those measurements into a higher-level view that depicts the status of an application or a group of related applications or networks.

That status information used to be presented on anything from dedicated X-windows sessions on high-end workstations to simple Windows application interfaces. If you wanted to see what was happening, you logged into an application or console of some sort and looked for blinking icons or red dots, and perhaps a scrolling log window with so much crud in it that you missed the important stuff anyway.

Somewhere along the line, some vendors moved the monitoring application interface to some form of HTTP-like interface, perhaps with a big ugly blob of Java or an ActiveX control or two, and made it possible to look at status, availability and performance information from an ordinary desktop. And perhaps, if the vendor had a bold vision of the future, they may have even made it possible for more than one client to view the same information at the same time, and maybe even from an ordinary browser.

All of that works, or worked, but none of it solved my problem. System managers need to know what interesting things happened in the last few hours, or better yet, what interesting things have happened since the last time they checked for interesting things. I'm tired of having to access a dedicated application monitoring interface just to make a quick check of system status.

To see if there are better, simpler methods for checking system status, I prototyped a system that automatically creates an RSS feed for each hosted application. The concept is simple. Slurp up interesting status information, like response time, CPU percent, I/O per second, etc. Then organize the information in some reasonably logical format and present it as an RSS or Atom news feed.

Here's roughly how it looks:

  • One RSS or Atom feed for each application.
  • One article per host or device in the application's dependency tree
  • With a primitive dump of host status as CDATA in the article body
  • With a link to the host status page as the article title.
  • Update the feeds every minute or so.
  • Re-publish the host/device individual article any time the host status changes, using the status change time as the article publish time.
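
The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's actual code: the function name `build_feed`, the dictionary keys, the `[STATUS] host` title format and the example URLs are all assumptions made for the sketch. It also relies on the standard library escaping the article body instead of wrapping it in CDATA, which a feed reader should treat the same way.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_feed(app_name, feed_link, hosts):
    """Return an RSS 2.0 document (as a string) with one item per host.

    `hosts` is a list of dicts with hypothetical keys: name, status,
    status_url, changed_at (timezone-aware datetime) and detail.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"{app_name} status"
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = (
        f"Host and device status for {app_name}"
    )

    for host in hosts:
        item = ET.SubElement(channel, "item")
        # The title carries the status; the link makes it point at the
        # host status page.
        ET.SubElement(item, "title").text = f"[{host['status']}] {host['name']}"
        ET.SubElement(item, "link").text = host["status_url"]
        # A stable, non-URL identifier so the reader can track the host.
        guid = ET.SubElement(item, "guid", isPermaLink="false")
        guid.text = f"{app_name}/{host['name']}"
        # Publish time is the time of the last status change.
        ET.SubElement(item, "pubDate").text = format_datetime(host["changed_at"])
        # ElementTree escapes the text; this stands in for a CDATA section.
        ET.SubElement(item, "description").text = host["detail"]

    return ET.tostring(rss, encoding="unicode")


example_hosts = [
    {
        "name": "server16",
        "status": "BAD",
        "status_url": "http://monitor.example/hosts/server16",
        "changed_at": datetime(2008, 1, 1, 3, 0, tzinfo=timezone.utc),
        "detail": "HTTP response time > 150ms",
    }
]
feed_xml = build_feed("webapp", "http://monitor.example/apps/webapp", example_hosts)
```

Regenerating this document every minute or so and serving it as a static file would be enough for a news reader to poll.
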

So how does it work? From what I can tell, the news readers aren't really set up for real-time monitoring. The minimum feed refresh time tends to be around 10 minutes, which for a real-time application is pretty much a lifetime. But for non-real-time status, or for recent historical status (like the last few hours), it seems to work pretty well.

The key seems to be to only update the article that corresponds to a host or device when the device status changes. That allows the news feed readers to bubble up to the top any status changes, even if the status changed from good to bad and back to good since the last time you viewed the feed. The reader sees the article (device) as re-published, so it presents it as a new article. The act of marking the article as read removes it or unhighlights it in the reader, effectively backgrounding it until the next time the device status changes. When the status changes, the reader sees the article as recently published and highlights it accordingly.
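
One way to get that behavior is to remember the last known status of each device and move the publish time only when the status differs from what was seen before. The sketch below is an assumption about how that could be done, with a hypothetical `StatusTracker` class; it is not taken from the prototype.

```python
from datetime import datetime, timezone


class StatusTracker:
    """Remember each host's last status and when it last changed."""

    def __init__(self):
        self._last = {}  # host name -> (status, changed_at)

    def observe(self, host, status, now=None):
        """Record a measurement and return the publish time to use.

        The returned time only moves forward when the status changes,
        so an unchanged host keeps its old publish time and stays
        marked as read in the news reader.
        """
        if now is None:
            now = datetime.now(timezone.utc)
        previous = self._last.get(host)
        if previous is not None and previous[0] == status:
            return previous[1]  # no change: keep the old publish time
        self._last[host] = (status, now)
        return now  # new host or changed status: republish


tracker = StatusTracker()
t0 = datetime(2008, 1, 1, 3, 0, tzinfo=timezone.utc)
t1 = datetime(2008, 1, 1, 3, 1, tzinfo=timezone.utc)
t2 = datetime(2008, 1, 1, 3, 2, tzinfo=timezone.utc)

first = tracker.observe("server16", "GOOD", now=t0)   # first sighting
second = tracker.observe("server16", "GOOD", now=t1)  # unchanged
third = tracker.observe("server16", "BAD", now=t2)    # status changed
```

Here `first` and `second` are both `t0`, while `third` is `t2`, so only the last observation would surface as a new article.
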

The reader has to be smart enough to drop off or un-highlight 'read' articles. If you know that server16 had slow response time an hour ago, you need to be able to mark the article as 'read', effectively suppressing that information until its status changes and the article gets republished.

The title of the device's article can be suitably mangled to present Good/Bad/Ugly status, so readers see the important information without opening the article, and the article body can contain the details of why a device has or had a particular status and appropriate timestamps for status changes. The GUID in RSS 2.0 has to uniquely identify the host or device, so the reader can accurately track the associated article.
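
As a rough sketch of the title and GUID handling, the helpers below turn a response-time measurement into a status word, put that word at the front of the title, and build a GUID that stays the same no matter what the status is. The function names, the severity ordering (Good, then Bad, then Ugly) and the 500ms threshold are assumptions for illustration; only the 150ms figure echoes the example earlier in this post.

```python
def classify(response_ms, bad_ms=150, ugly_ms=500):
    """Map a response time to a status word (thresholds are illustrative)."""
    if response_ms > ugly_ms:
        return "UGLY"
    if response_ms > bad_ms:
        return "BAD"
    return "GOOD"


def item_title(host, status):
    """Put the status first so it is visible without opening the article."""
    return f"[{status}] {host}"


def item_guid(app_name, host):
    """Identify the host within the application, independent of status."""
    return f"{app_name}/{host}"


status = classify(180)
title = item_title("server16", status)
guid = item_guid("webapp", "server16")
```

Because the GUID never includes the status or a timestamp, the reader keeps treating every republished article for server16 as the same article.
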

So far, it works.
