As a follow on to:
- How to Do Application Logging Right by Anton Chuvakin and Gunnar Peterson
- Application Security Logging by Colin Watson
- and the Common Event Expression (CEE) Architecture Overview [PDF],
I have a few semi-random thoughts on application logging.
Things I like to see in logs are:
Machine parseable (yet human readable) format. I need to be able to write a regex that cleanly separates interesting messages and pipe them into sed/awk and extract critical fields from the messages. I typically use sed/awk/perl to strip out uninteresting parts of the message and sort/count pipe-to-Excel the rest of the fields. I also use logsurfer to catch real time events and alert interested parties. Even organizations with sophisticated tools still need to be able to parse the logs. Bonus points if all messages of a particular type have the same number of fields - or if variable word fields are at the end of the message.
Single line events. No XML. I'm not going to write a custom multi-line XML parser for every random app. Not a chance.
Date/Time stamp on every message. Really. I need to know, at least to the nearest second, when every line of a log was generated. Establishing time lines is critical to troubleshooting and forensics. Synchronized clocks and stamped messages are what makes time lines possible. I like it when the source system timestamps the log and the log catcher also timestamps the log. Then we know if a clock is off.
Rational message prioritization. I need to be able to reliably detect critical messages and do something with them. The ability to filter on some sort of priority is key. A simple regular expression should be able to extract interesting messages.
Unique identifiers for log message types. Cisco's firewalls generate syslogs with an identifier that uniquely identifies the log message type. No message are emitted from a firewall that are not uniquely identified and documented. (Don't open that link on a mobile connection - you'll use up your quota...). With PIX/ASA logs, I can write a regex to catch %ASA-4-411004, find all cases where network interfaces were administratively enabled, match them up with access logs & figure out who enabled the interface. Then I can check change management and figure out if it was part of a plan or part of a rootkit.
Sufficient information to link the logs from the application to upstream or downstream systems. I'd think that some form of user session identifier logging would be useful on most apps. User IP address is a candidate for an upstream identifier. Web, firewall and netflow logs can be correlated by IP & date/time. Web app session ID or similar perhaps would link the application logs together, and if stored along with a userid could link application activity to database and web server activity. Once the problem is narrowed down to a specific user, being able to track that users session would be very useful.
I once needed to link application activity to an IP address. I couldn't do it directly as the app didn’t log IP addresses. The only thing that saved me was that the particular activity was only possible through a given URL, and that URL had only been called from one IP address in the window surrounding the event. Had there been more than one instance of that URL in that time window, we would not have been able to correlate the event with an IP address. I sometimes am able to correlate application activity to an IP address (application, firewall & load balancer logs), the IP address to a MAC address (DHCP logs), the MAC address to a userid (AD domain logs), the MAC address to a physical computer (switch CAM tables), the physical computer to a person (security cameras).
This sort of information occasionally gets used by law enforcement as a part of various investigations, so the preference is to be able to link various log sources together unambiguously.
Logging of failures is as important as logging of successes. Failed privilege escalation attempts (for example) are always interesting. Any failure of that type is either a broken/misconfigured app, a hack attempt, or a sysadmin is ef’ing around on a production system. Firewall denies for packets source from inside the firewall are similar. If a server inside a data center is attempting to connect to a blocked port/IP on another system or subnet, something is wrong. Either the server/application is configured wrong, or the firewall is configured wrong, or the rootkit is making a reconnoiter pass through the data center (or sysadmins are ef’ing around on production systems).
Timeliness. Logs must leave the system in near real time. Logs that have been on a compromised system for more than a few minutes (or seconds?) after the compromise are presumed tainted.
Bonus points: Serialized numbering on log messages. That way I know if I'm missing any messages. If I am missing messages, either something is broke or the rootkit deleted some of my messages. Heck - I could even write a log catcher that logged something like 'Expecting message #86679514, received message #86679517, 3 messages missing'. (I have a Netflow collector that does that).
I’ve often run into DBA’s and application folks who are afraid to generate large volumes of logs. I’m not afraid to generate large volumes of logs, provided that the logs are light weight, clear, concise and readable. We currently log tens of thousands of firewall, DNS & netflow records each second. If they show up as light weight syslog-like events, they are not hard to handle. On our largest application the vendor has a separate database just for user activity logging. With a reasonable purge strategy in place, it’s still larger than the production database. That’s OK though. It’s good data, and it’s not a difficult database to maintain.
Update: Gunnar has a series of posts on application logging, A quote from the first post:
By climbing the stack and monitoring the application, you collect data located closer to the core enterprise assets like transactions, business logic, rules, and policies. This proximity to valuable assets make the application an ideal place to see and report on what is happening at the level of user and system behavior, which can (and does) establish patterns of good and bad behavior that can provide additional indications of attacks.