Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. 

Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has its own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved those platforms to something resembling structured system management, I've concluded that implementing basic structure around system management is the best and fastest path to improved system performance and availability. I currently place sound fundamental system management ahead of redundancy on the path to increased availability. In other words, a poorly managed redundant high availability system will have lower availability than a well managed non-redundant system.

This structure doesn't need to be a full ITIL framework that a million dollars' worth of consultants dropped in your lap, but it has to exist, even if only in simple, straightforward wiki- or paper-based systems.

You know you have structured system management when:

You manage with scripts, not mouse clicks. Management by mouse clicking around is certain to introduce human error into a process that needs to be error free. Configuring a dozen servers or switches with a GUI, where each config item takes a handful of mouse clicks, will inevitably introduce inconsistency into the configuration. All humans are human. All humans err. Even the best system manager will eventually err in any manual process. GUIs can be great for monitoring and troubleshooting, but for configuration, use scripts, not clicks.
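
The idea is no more complicated than this kind of sketch, here in Python with made-up hostnames and a made-up command, pushing one change to every server the same way, in the same order:

    import subprocess

    SERVERS = ["web01", "web02", "web03"]        # hypothetical hostnames
    COMMAND = "ntpdate -u ntp.example.com"       # the one change, stated exactly once

    for host in SERVERS:
        # Every server gets exactly the same command; no per-box mouse clicks.
        result = subprocess.run(["ssh", host, COMMAND],
                                capture_output=True, text=True)
        print(host, "OK" if result.returncode == 0 else result.stderr.strip())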

You manage consistently across platforms. The fundamentals of managing Windows, Unix, Linux, and other operating systems are essentially the same. Whatever processes you have for building, monitoring, managing, logging and auditing systems must be applied across all platforms. Databases are essentially all the same from the point of view of a DBA. Backup, recovery, logging, auditing and security can be consistent across database platforms, even if the details of implementation are not. In ad hoc system management, you have no fundamentals, or your various platforms are managed to different, unrelated standards, or to no standard at all.

You deploy servers from images, not install DVDs. Server installation from scratch is necessary for the first installation of a new major version of an operating system. That first installation gets documented, tested, and QA'd with a fine-tooth comb. All your other servers of that platform get imaged from that master. And when you have a major upgrade cycle, you re-master your golden images. I once took over the management of a platform where the twenty-odd servers clearly had been installed from whatever CD happened to be lying around, with what seemed to be random choices for the installation wizards. That wasn't fun at all.
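
Once you have a golden image, drift from it is easy to spot with a trivial script. A sketch in Python, assuming you keep a package manifest (one package name per line, say from 'rpm -qa') for the golden image and pull the same list from each server; the file names are made up:

    GOLDEN_MANIFEST = "golden-image.packages"    # hypothetical golden image manifest
    SERVER_LIST = "web01.packages"               # package list pulled from a server

    golden = set(open(GOLDEN_MANIFEST).read().split())
    server = set(open(SERVER_LIST).read().split())

    print("Missing from server:", sorted(golden - server))
    print("Not in the golden image:", sorted(server - golden))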

You can re-install a server and its applications from your documentation. That implies that you understand where each server deviates from the golden image, it implies that you know what the application did to your server when you installed it, and it implies that you can re-create every change to the application since it was first installed. If the only way you can move the application to a new server is to tar or zip up an unknown, undocumented directory structure, or worse yet, you have to upgrade servers in place because you can't re-install the application and get it to work, you are pretty much in ad hoc land. This one is a tough one.

You install and configure to the least bit principle for all your devices, servers, operating systems and applications. (Read the post!)

You have full remote management of all your servers and devices. This means that your router and switch serial consoles are all connected to a console server, that your server 'lights out' boards are installed, licensed and working, that you have remote IP-based KVMs on all servers that need a three-fingered salute, and that your network people, when they toast your network, have out-of-band management sufficient to recover what they'd toasted. And it means that the people who need to get at the remote consoles can do it from home, in their sleep.

You build router, switch and firewall configurations from templates, not from scratch. Your network and firewall configs should be so consistent that a 'diff' of any two configs produces readable output. Network and firewall configuration is not a case where you want entropy.
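
Templating doesn't need a fancy tool. A minimal Python sketch, with hypothetical field names and addresses, that renders per-device configs which differ only in the fields that should differ:

    from string import Template

    TEMPLATE = Template(
        "hostname $hostname\n"
        "interface Vlan$mgmt_vlan\n"
        " ip address $mgmt_ip 255.255.255.0\n"
        "logging host $syslog_server\n"
    )

    switches = [
        {"hostname": "sw-core-01", "mgmt_vlan": "10", "mgmt_ip": "10.0.10.2",
         "syslog_server": "10.0.10.50"},
        {"hostname": "sw-core-02", "mgmt_vlan": "10", "mgmt_ip": "10.0.10.3",
         "syslog_server": "10.0.10.50"},
    ]

    for sw in switches:
        # A 'diff' of any two rendered configs shows only these fields.
        open(sw["hostname"] + ".cfg", "w").write(TEMPLATE.substitute(sw))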

You have version control and auditing on critical system and application configuration files. On the Windows platform, this is tough to do. On Unix-like operating systems, network devices, firewalls and load balancers, it is trivial to do. (CVS or SVN and a couple of scripts. It is really simple, and there is no excuse for not doing it.) On databases, this means that you are using command line based tools and writing scripts to manage your databases, not using the friendly, shiny, mouse-clicky, un-auditable GUI. You have version control and auditing when you can tell your security and forensics team exactly how a server, database or firewall was configured at 14:32 UTC on a Tuesday a year and a half ago. And you can demonstrate with certainty that that is how it was configured.
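
The 'couple of scripts' really is a couple of scripts. Here is a Python sketch of the nightly snapshot half, assuming an SVN client and an existing working copy (the paths are made up); the svn log then answers the who/what/when:

    import shutil, subprocess

    WORKING_COPY = "/var/configs/wc"                 # hypothetical svn working copy
    WATCHED = ["/etc/ssh/sshd_config", "/etc/syslog.conf"]

    for path in WATCHED:
        shutil.copy(path, WORKING_COPY)

    # 'svn commit' only creates a revision when something actually changed.
    subprocess.run(["svn", "commit", "-m", "nightly config snapshot", WORKING_COPY])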

You automatically monitor, strip chart and alert on at least memory, network and disk I/O, and CPU. And you've identified and are charting at least a handful of other platform-specific measurements. MRTG is your friend.
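
The collection side can be almost nothing. A Python sketch for one metric on Linux (the threshold and file paths are arbitrary placeholders); MRTG, RRDtool or even a spreadsheet can strip chart the output file:

    import time

    ALERT_THRESHOLD = 8.0                              # hypothetical alert level
    load1 = float(open("/proc/loadavg").read().split()[0])

    with open("/var/log/loadavg.dat", "a") as out:
        out.write("%d %.2f\n" % (int(time.time()), load1))

    if load1 > ALERT_THRESHOLD:
        print("ALERT: load average %.2f" % load1)      # or send mail / an SNMP trap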

You automatically monitor, strip chart and alert on application response time and availability. Go get a web page, measure how long it takes, and strip chart it. Connect to your database, do a simple SELECT, measure how long it takes, and strip chart it.
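
The 'go get a web page' check is a handful of lines. A Python sketch with a placeholder URL and output file; run it from cron every few minutes and strip chart the second column:

    import time
    import urllib.request

    URL = "http://www.example.com/"

    start = time.time()
    try:
        urllib.request.urlopen(URL, timeout=10).read()
        elapsed = time.time() - start
    except Exception:
        elapsed = -1                                   # -1 flags the page as unavailable

    with open("/var/log/response-time.dat", "a") as out:
        out.write("%d %.3f\n" % (int(time.time()), elapsed))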

You have a structure and process for your documentation, even if it is a directory on a shared drive, or better yet, a wiki. Your documentation starts out with a document on 'how to document'.

You have change management and change auditing sufficient to determine who/what/when/why on any change to any system or application critical file.

You have a patch strategy, consistently applied across applications and platforms. You know what patch & rev each device and server is at, and you have a minimal range of variation between patch revs on a platform. You understand your platform vendors' recommendations, you know how well they regression test, and you know how much risk is associated with patching each platform. You've had the 'old and stable' vs. 'patch regularly for security' arguments, and you have picked a side. It doesn't matter which side. It matters that you've had the arguments.
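
Knowing what rev everything is at can also be a script. A Python sketch (hypothetical hostnames, assuming ssh key access to Unix-ish servers) that shows the spread of kernel revs on a platform at a glance:

    import subprocess
    from collections import Counter

    SERVERS = ["web01", "web02", "db01"]          # hypothetical hostnames

    revs = Counter()
    for host in SERVERS:
        result = subprocess.run(["ssh", host, "uname", "-r"],
                                capture_output=True, text=True)
        revs[result.stdout.strip() or "unreachable"] += 1

    for rev, count in revs.most_common():
        print("%3d  %s" % (count, rev))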

You have neat cabling and racks. You cannot reliably touch a rack that is a mess. You'll break something that wasn't supposed to be down, you'll disconnect the wrong wire and generally wreak havoc on your production systems. Tie wraps, coiled cables, labeled cables, color-coded Cat 5, color-coded power cables, and cable databases are simple, easy and free to implement. I'll bet a pitcher of brew that if your racks are a mess, so are your servers. Read this post from Data Center Design.

You determine root cause of failures, outages and performance slowdowns more often than not. In the cases where you determine root cause, you take action to mitigate or eliminate future occurrences of that event. In cases where you do not determine root cause, you implement logging, diagnostics or measurements so that the next time the event occurs, you will have sufficient information to determine root cause. Every failure that is not tracked to a cause should result in changed or improved instrumentation or logging. Rebooting is NOT a valid troubleshooting technique. When you restart or reboot, you lose all possibility of gathering useful data that can be used to determine root cause and prevent future occurrences. If you are rebooting, you'd better have gotten a kernel dump to send to the vendor. Otherwise you wasted a reboot and a root cause opportunity.

You have centralized logging, alerting, log rotation and basic log analysis tools. Your centralized logging can be as simple as a free SNMP collector, Snare, syslog, and a few perl scripts. Your log rotation, log archiving and log analysis tools can be the tools that come with your operating system, or even grep. Netflow and netflow collectors are free. Syslog is free.
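
Basic log analysis can start as a few lines that make the noisy host stand out. A Python sketch against a central syslog file (the path and search string are placeholders), counting matching lines per host:

    from collections import Counter

    LOGFILE = "/var/log/syslog"                   # hypothetical central syslog file
    PATTERN = "error"

    hits = Counter()
    for line in open(LOGFILE, errors="replace"):
        if PATTERN in line.lower():
            # Standard syslog lines: "Mon DD HH:MM:SS hostname program: message"
            fields = line.split()
            if len(fields) > 3:
                hits[fields[3]] += 1

    for host, count in hits.most_common(10):
        print("%6d  %s" % (count, host))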

You have installed, configured and are actually using your vendor-provided platform management software. (HP SIM, IBM Director, etc.) Your servers and SANs should be phoning home to HP or IBM when they are about to die, they should be sending you SMS messages or traps, and you should be planning for a maintenance window to replace the DIMM or drive that your platform's predictive failure alerts have warned you about. You should be able to query your platform management database and determine versions, patches, CPU temperatures, and the price of coffee at Starbucks. Your vendor provided you with a wonderful toolkit. It's free. Use it.

Last rule:

Keep it simple at first. Once you've done it the simple way, and you know what you want, you can talk to vendors. Until then, stick to free, simple and open source. Stay away from expensive, difficult-to-implement tools until you've mastered the built-in tools your platform provides and the free and open source tools.

--Mike

Update 2011-08-04: Tom Limoncelli has a great post on a similar topic: The Limoncelli Test: 32 Questions for Your Sysadmin Team.