
Naked Without Strip Charts

The strip chart. Can't live without it.

The classic strip chart is the MRTG network utilization graph. MRTG and its companion RRDtool have to rank as some of the most useful system and network administration software ever written. The world is full of interesting uses for RRDtool and MRTG.

As part of normal application, server and network monitoring, we generate about 2,500 strip charts every 5 minutes.
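For scale: each of those charts is just an RRD file plus a cron-driven update. Here's a rough sketch of the collection side, with made-up file, data source and variable names, and assuming the counters come from an SNMP poll somewhere upstream:

    # Sketch only - create an RRD for one interface, 5-minute step,
    # keeping roughly two years of 5-minute averages.
    rrdtool create wan-link.rrd --step 300 \
      DS:octets_in:COUNTER:600:0:U \
      DS:octets_out:COUNTER:600:0:U \
      RRA:AVERAGE:0.5:1:210240

    # Run from cron every 5 minutes; IN and OUT would come from the
    # SNMP poll of the router interface.
    rrdtool update wan-link.rrd N:${IN}:${OUT}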

Here are some examples of how we use them:

Long term trends

[Chart: bandwidth-trend]

Yep - the network load at this site follows an annual calendar, and it appears to have grown quite a bit last fall. But then it leveled off this spring. The bandwidth management appliances must be doing their job.

Application load

[Chart: load]

Application load, as measured by HTTP hits/second, peaks on Mondays, declines dramatically on Saturday, and starts to ramp back up Sunday night. That's good to know. Sunday night is almost as busy as Friday afternoon. And of course this isn't a 'follow the sun' application; it's really only used in a single time zone.

Detecting problems

Does anyone feel like helping track down a connection pool leak?

[Chart: tcpip_sessions-week]

This is awful. TCP connections from an app server to a database shouldn't sawtooth like that. We have 1,700 open TCP sockets on a single server. Probably 100 of them are in use. The rest are stale, dead, hung or something.
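For what it's worth, the raw count behind a chart like this is easy to pull on the app server itself. A rough sketch, assuming a Linux box with netstat, and with 10.0.0.5:1521 standing in for the real database listener address:

    # Count TCP sockets from this app server to the database listener,
    # grouped by state (ESTABLISHED, CLOSE_WAIT, etc.).
    # 10.0.0.5:1521 is a placeholder for the real database address.
    netstat -ant | awk '$5 == "10.0.0.5:1521" { count[$6]++ }
                        END { for (s in count) print s, count[s] }'

Push the totals into an RRD every five minutes and you get a chart like the one above.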

Something has changed

[Chart: RTT-yearly]

Round-trip time to fetch a dynamic HTTP page from this application more than doubled a few months ago. Presumably we could go back through a change log and determine what we might have done that caused the response time to change so dramatically. Let's see... the second week of March... Hmmm...

Detecting Anomalies

[Chart: DOS-attack]

That 500Mbps spike at the beginning of week 19? Most likely a denial of service attack. At least we know that the routers on each end of that circuit can handle 500Mbps.

Reconstructing Events

We know that application response time was bad on Monday. Users told us. Let's dissect the event.

[Chart: ApplicationRTT]

Yep, it was bad. 3.0k ms = 3 seconds. You can see that normal response time (RTT) is something closer to 250 ms. For us, 3 seconds is bad. For my bank or my cell phone company, 3 seconds is about normal.
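As an aside, the RTT sample behind a chart like this needs nothing fancier than curl. A sketch, with a placeholder URL standing in for the real dynamic page:

    # Time one fetch of a dynamic page, in milliseconds.
    # The URL is a placeholder for the application's real landing page.
    RTT_MS=$(curl -s -o /dev/null -w '%{time_total}' https://app.example.com/login |
             awk '{ printf "%d", $1 * 1000 }')
    echo "page RTT: ${RTT_MS} ms"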

Let's see if it was related to web application load. Maybe we had unusually high user generated load.

[Chart: hits]

Nope - not user load related. Monday was just another day.

Let's check the back-end database servers.

[Chart: cpu]

Dang - that's bad. The green server was buried. And yep - the times line up. From 10am to 4pm.

I wonder what process on the server was using up the CPU?

[Chart: databasecpu]

Looks to me like the red database was the culprit. Of course, an Oracle AWR report will let us drill down into the period in question. (Notice also that the blue database has a periodic half-hour CPU spike. There is probably something in either crontab or the Oracle scheduler that will explain that - perhaps a materialized view that someone needs refreshed twice per hour.)
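Checking the scheduler theory takes one query on the database server. A sketch, assuming local sysdba access; the DBA_SCHEDULER_JOBS view is standard Oracle, everything else depends on your environment:

    # List enabled scheduler jobs and how often they repeat - a job with a
    # half-hourly repeat_interval would explain the periodic CPU spike.
    echo "select job_name, repeat_interval from dba_scheduler_jobs where enabled = 'TRUE';" |
      sqlplus -s "/ as sysdba"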

Conclusion

Strip charts don't help much for up-to-the-second performance or troubleshooting data. The operating system or database built-in tools are much better for that. But for the types of uses outlined here, strip charts can't be beat.

MRTG and RRDtool are limited only by Perl, which has no limits.

Comments

  1. Michael,

    It looks like you guys are mostly using RRDtool. Have you switched to it entirely, or do you still have MRTG as well?

    I had to turn off my MRTG host. It was killing all the other processes on that machine. Unlike a lot of other RRDtool utilities, it makes the graphs when it gets the data, instead of on the fly, and that just killed the poor box.

    What reporting tool are you using? I'm torn between Cacti/Nagios and Zenoss.

  2. We've got RRDtool and MRTG, with MRTG mostly being phased out.

    I've settled on generating all the charts at the time that we gather data, even with RRD. That makes the load predictable. If I use dynamically generated charts, I've got problems keeping up with the load when we are having server or network issues. (Everybody clicking on & refreshing the charts takes out the monitoring server.)
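    Roughly, each poll cycle updates the RRD and renders the chart in the same pass - something along these lines, with simplified paths, data source names and colors:

        # One pass per 5-minute poll: update, then render the PNG right away,
        # so chart-generation load is constant no matter who is looking.
        rrdtool update /data/rrd/wan-link.rrd N:${IN}:${OUT}
        rrdtool graph /www/charts/wan-link-day.png \
          --start -86400 --title "WAN link - last 24 hours" \
          DEF:in=/data/rrd/wan-link.rrd:octets_in:AVERAGE \
          DEF:out=/data/rrd/wan-link.rrd:octets_out:AVERAGE \
          CDEF:inbits=in,8,* CDEF:outbits=out,8,* \
          AREA:inbits#00CC00:"bits in" LINE1:outbits#0000FF:"bits out"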

    The reporting is done with ksh, sed, awk and perl. The system mostly predates the packages you mention, and has extensive customization.

    Here are a few links:

    http://www.net.mnscu.edu/sysmon/status.html

    http://www.net.mnscu.edu/host/APPS/index.html

    http://www.net.mnscu.edu/app/d2l/

    http://www.net.mnscu.edu/net/netscaler/netscaler_metro/index.html

    http://www.net.mnscu.edu/net/netscaler/active.html



In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. The…