Skip to main content

Thoughts on Application Logging

As a follow on to:

I have a few semi-random thoughts on application logging.

Things I like to see in logs are:

Machine parseable (yet human readable) format. I need to be able to write a regex that cleanly separates interesting messages and pipe them into sed/awk and extract critical fields from the messages. I typically use sed/awk/perl to strip out uninteresting parts of the message and sort/count pipe-to-Excel the rest of the fields. I also use logsurfer to catch real time events and alert interested parties. Even organizations with sophisticated tools still need to be able to parse the logs. Bonus points if all messages of a particular type have the same number of fields - or if variable word fields are at the end of the message.

Single line events. No XML. I'm not going to write a custom multi-line XML parser for every random app. Not a chance.

Date/Time stamp on every message. Really. I need to know, at least to the nearest second, when every line of a log was generated. Establishing time lines is critical to troubleshooting and forensics. Synchronized clocks and stamped messages are what makes time lines possible. I like it when the source system timestamps the log and the log catcher also timestamps the log. Then we know if a clock is off.

Rational message prioritization. I need to be able to reliably detect critical messages and do something with them. The ability to filter on some sort of priority is key. A simple regular expression should be able to extract interesting messages.

Unique identifiers for log message types. Cisco's firewalls generate syslogs with an identifier that uniquely identifies the log message type. No message are emitted from a firewall that are not uniquely identified and documented. (Don't open that link on a mobile connection - you'll use up your quota...). With PIX/ASA logs, I can write a regex to catch %ASA-4-411004, find all cases where network interfaces were administratively enabled, match them up with access logs & figure out who enabled the interface. Then I can check change management and figure out if it was part of a plan or part of a rootkit.

Sufficient information to link the logs from the application to upstream or downstream systems. I'd think that some form of user session identifier logging would be useful on most apps. User IP address is a candidate for an upstream identifier. Web, firewall and netflow logs can be correlated by IP & date/time. Web app session ID or similar perhaps would link the application logs together, and if stored along with a userid could link application activity to database and web server activity. Once the problem is narrowed down to a specific user, being able to track that users session would be very useful.

I once needed to link application activity to an IP address. I couldn't do it directly as the app didn’t log IP addresses. The only thing that saved me was that the particular activity was only possible through a given URL, and that URL had only been called from one IP address in the window surrounding the event. Had there been more than one instance of that URL in that time window, we would not have been able to correlate the event with an IP address. I sometimes am able to correlate application activity to an IP address (application, firewall & load balancer logs), the IP address to a MAC address (DHCP logs), the MAC address to a userid (AD domain logs), the MAC address to a physical computer (switch CAM tables), the physical computer to a person (security cameras).

This sort of information occasionally gets used by law enforcement as a part of various investigations, so the preference is to be able to link various log sources together unambiguously.

Logging of failures is as important as logging of successes. Failed privilege escalation attempts (for example) are always interesting. Any failure of that type is either a broken/misconfigured app, a hack attempt, or a sysadmin is ef’ing around on a production system. Firewall denies for packets source from inside the firewall are similar. If a server inside a data center is attempting to connect to a blocked port/IP on another system or subnet, something is wrong. Either the server/application is configured wrong, or the firewall is configured wrong, or the rootkit is making a reconnoiter pass through the data center (or sysadmins are ef’ing around on production systems).

Timeliness. Logs must leave the system in near real time. Logs that have been on a compromised system for more than a few minutes (or seconds?) after the compromise are presumed tainted.

Bonus points: Serialized numbering on log messages. That way I know if I'm missing any messages. If I am missing messages, either something is broke or the rootkit deleted some of my messages. Heck - I could even write a log catcher that logged something like 'Expecting message #86679514, received message #86679517, 3 messages missing'. (I have a Netflow collector that does that).

I’ve often  run into DBA’s and application folks who are afraid to generate large volumes of logs. I’m not afraid to generate large volumes of logs, provided that the logs are light weight, clear, concise and readable.  We currently log tens of thousands of firewall, DNS & netflow records each second. If they show up as light weight syslog-like events, they are not hard to handle. On our largest application the vendor has a separate database just for user activity logging. With a reasonable purge strategy in place, it’s still larger than the production database. That’s OK though. It’s good data, and it’s not a difficult database to maintain.

Update: Gunnar has a series of posts on application logging, A quote from the first post:

By climbing the stack and monitoring the application, you collect data located closer to the core enterprise assets like transactions, business logic, rules, and policies. This proximity to valuable assets make the application an ideal place to see and report on what is happening at the level of user and system behavior, which can (and does) establish patterns of good and bad behavior that can provide additional indications of attacks.


Popular posts from this blog

Cargo Cult System Administration

Cargo Cult: …imitate the superficial exterior of a process or system without having any understanding of the underlying substance --Wikipedia During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.
The question is, how amusing are we?We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?Here’s some common system administration failures that might be ‘cargo cult’:
Failing to understand the difference between necessary and sufficient. A daily backup …

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to…

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These …