Using RSS for System Status

It's time to rethink how status and availability information gets communicated to system managers. System status is sort of like news, so why not use a news reader? I don't mean using RSS for posting blog-like information to users, like "Our systems were borked last night from 03:00 to 05:00." That is a great idea, and many people already do a fine job in that space. I mean something more like "server16 HTTP response time greater than 150ms". The nerdy stuff.

Most IT organizations have monitoring systems that attempt to tell us the status of individual system components like servers, databases, routers, etc. The general idea is to make what we think are relevant measurements of the performance or availability of components, record the measurements, and hopefully alert on measurements that fall outside of some boundaries. Some places attempt to aggregate that information into a higher-level view that depicts the status of an application or a group of related applications or networks.

That status information used to be presented on anything from dedicated X-windows sessions on high-end workstations to simple Windows application interfaces. If you wanted to see what was happening, you logged into an application or console of some sort and looked for blinking icons or red dots, and perhaps a scrolling log window with so much crud in it that you missed the important stuff anyway.

Somewhere along the line, some vendors moved the monitoring application interface to some form of HTTP-like interface, perhaps with a big ugly blob of Java or an ActiveX control or two, and made it possible to look at status, availability and performance information from an ordinary desktop. And perhaps, if the vendor had a bold vision of the future, they may have even made it possible for more than one client to view the same information at the same time, and maybe even from an ordinary browser.

All of that works, or worked, but none of it solved my problem. System managers need to know what interesting things happened in the last few hours, or better yet, what interesting things have happened since the last time they checked for interesting things. I'm tired of having to access a dedicated application monitoring interface just to make a quick check of system status.

To see if there are better, simpler methods for checking system status, I prototyped a system that automatically creates an RSS feed for each hosted application. The concept is simple. Slurp up interesting status information, like response time, CPU percent, I/O per second, etc. Then organize the information in some reasonably logical format and present it as an RSS or Atom news feed.

Here's roughly how it looks:

One RSS or Atom feed for each application.
One article per host or device in the application's dependency tree.
With a primitive dump of host status as CDATA in the article body.
With a link to the host status page as the article title.
Update the feeds every minute or so.
Republish the individual host/device article any time the host status changes, using the status change time as the article publish time.
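
For the curious, the layout above can be sketched in a few lines of Python. This is a minimal sketch, not the prototype's actual code: the HostStatus record, its field names, the render_feed and render_item helpers, and the monitor.example URLs are all made up for illustration. It only uses the standard library and emits plain RSS 2.0.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime
from xml.sax.saxutils import escape


@dataclass
class HostStatus:
    # Hypothetical record; the field names are illustrative only.
    name: str             # host or device name, e.g. "server16"
    status_url: str       # link to the host status page
    detail: str           # primitive dump of host status
    changed_at: datetime  # time of the last status change


def cdata(text: str) -> str:
    # A CDATA section cannot contain "]]>", so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def render_item(host: HostStatus) -> str:
    # One article per host: linked title, status dump as CDATA in the body,
    # and the status change time as the publish time (RFC 822 style date).
    return (
        "<item>"
        f"<title>{escape(host.name)}</title>"
        f"<link>{escape(host.status_url)}</link>"
        f'<guid isPermaLink="false">{escape(host.name)}</guid>'
        f"<pubDate>{format_datetime(host.changed_at)}</pubDate>"
        f"<description>{cdata(host.detail)}</description>"
        "</item>"
    )


def render_feed(app_name: str, app_url: str, hosts: list[HostStatus]) -> str:
    # One feed per application, one item per host in its dependency tree.
    items = "".join(render_item(h) for h in hosts)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        f"<title>{escape(app_name)} status</title>"
        f"<link>{escape(app_url)}</link>"
        f"<description>Host status for {escape(app_name)}</description>"
        f"{items}"
        "</channel></rss>"
    )


if __name__ == "__main__":
    server16 = HostStatus(
        name="server16",
        status_url="http://monitor.example/hosts/server16",
        detail="HTTP response time greater than 150ms",
        changed_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    )
    print(render_feed("webshop", "http://monitor.example/apps/webshop", [server16]))
```

Regenerating the whole feed every minute is cheap as long as the pubDate of each item only moves when that host's status actually changes.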

So how does it work? From what I can tell, the news readers aren't really set up for real-time monitoring. The minimum feed refresh time tends to be around 10 minutes, which for a real-time application is pretty much a lifetime. But for non-real-time status, or for recent historical status (like the last few hours), it seems to work pretty well.

The key seems to be to only update the article that corresponds to a host or device when the device status changes. That allows the news feed readers to bubble up to the top any status changes, even if the status changed from good to bad and back to good since the last time you viewed the feed. The reader sees the article (device) as re-published, so it presents it as a new article. The act of marking the article as read removes it or unhighlights it in the reader, effectively backgrounding it until the next time the device status changes. When the status changes, the reader sees the article as recently published and highlights it accordingly.
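
The "only republish on change" rule is easy to get wrong, so here is a minimal sketch of it in Python. The PublishTracker class and its observe method are hypothetical names, not part of any monitoring product. The idea is just to remember the last status seen per host and keep returning the old publish time until the status differs.

```python
from datetime import datetime, timedelta, timezone


class PublishTracker:
    """Remembers, per host, the last status and when it last changed."""

    def __init__(self) -> None:
        # host name -> (last status, time that status was first seen)
        self._last: dict[str, tuple[str, datetime]] = {}

    def observe(self, host: str, status: str, now: datetime) -> datetime:
        """Record a status sample and return the article's publish time.

        The publish time only moves forward when the status differs from
        the previous sample, so an unchanged host never looks "new".
        """
        previous = self._last.get(host)
        if previous is None or previous[0] != status:
            self._last[host] = (status, now)
        return self._last[host][1]


if __name__ == "__main__":
    tracker = PublishTracker()
    start = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
    samples = ["good", "good", "bad", "bad", "good"]
    for minute, status in enumerate(samples):
        now = start + timedelta(minutes=minute)
        published = tracker.observe("server16", status, now)
        print(now.time(), status, "published at", published.time())
```

A host that flips from good to bad and back to good between two reader refreshes still ends up with a fresh publish time, which is what makes it bubble up in the reader.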

The reader has to be smart enough to drop off or un-highlight 'read' articles. If you know that server16 had slow response time an hour ago, you need to be able to mark the article as 'read', effectively suppressing that information until its status changes and the article gets republished.

The title of the device's article can be suitably mangled to present Good/Bad/Ugly status, so readers see the important information without opening the article, and the article body can contain the details of why a device has or had a particular status and appropriate timestamps for status changes. The GUID in RSS 2.0 has to uniquely identify the host or device, so the reader can accurately track the associated article.
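
As a rough illustration of the title and GUID rules, here is a small Python sketch. The function names, the bracketed status label, and the "status-feed:" GUID prefix are all assumptions made up for this example. The only real constraint is that the GUID stays the same for a given host no matter what its status is.

```python
def article_title(host: str, status: str, summary: str = "") -> str:
    # Put the status first so it is visible in the reader's headline list.
    label = f"[{status.upper()}] {host}"
    return f"{label}: {summary}" if summary else label


def article_guid(app: str, host: str) -> str:
    # Built only from stable identifiers, never from status or timestamps,
    # so the reader keeps treating it as the same article.
    return f"status-feed:{app}/{host}"


if __name__ == "__main__":
    print(article_title("server16", "bad", "HTTP response time greater than 150ms"))
    print(article_guid("webshop", "server16"))
```

If the GUID included the status or a timestamp, every change would show up as a brand new article and the old ones would pile up unread.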

So far, it works.