
The quarter-million dollar query

What does a query cost? In one recent case, about a quarter-million dollars.

The story (suitably anonymized): User requirements resulted in a new feature being added to a busy application. Call it Widget 16. Widget 16 was complex enough that the database side of the widget put a measurable load on the server. That alone wouldn't have been a problem, but Widget 16 turned out to be a useful widget. So useful, in fact, that users clicked on Widget 16 far more often than anticipated. Complex query + lots of clicks = load problem. How big was the load problem? In this case, the cost of the query (in CPU seconds, logical reads, etc.) and the number of times the query was executed were both relevant. After measuring the per-widget load on the server and multiplying by the number of widget clicks per second, we figured that the widget cost at least a quarter million dollars per year in hardware and database license costs. Especially database licensing costs.

That was an interesting number.

Obviously we want to use that number to help make the application better, faster and cheaper.

The developers - who have to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - will likely favor shipping new features over tuning existing ones. We expect that to happen. Unfortunately, if it happens too often, the hosting team is stuck with either poor performance or a large hardware and software bill. To properly prioritize the development effort, some rational measurement must be made of the cost of re-working existing functionality to reduce load versus the value of using that same effort to add user-requested features. Calculating the cost of running a feature or widget makes that prioritization possible. In this case, the cost of running the query, compared to the person-time required to design, code, test and deploy a solution, made the decision to optimize the widget (or cache it) pretty easy to make.
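As a rough illustration, the trade-off boils down to a simple payback calculation. The Python sketch below uses hypothetical placeholder figures, not measurements from the case above:

    # Rough payback estimate for optimizing a costly widget.
    # All figures are hypothetical placeholders.
    annual_run_cost = 250_000        # estimated yearly hardware + license cost of the widget ($)
    expected_savings_fraction = 0.8  # fraction of that cost an optimization is expected to remove
    rework_cost = 40_000             # person-time to design, code, test, and deploy a fix ($)

    annual_savings = annual_run_cost * expected_savings_fraction
    payback_months = 12 * rework_cost / annual_savings
    print(f"Annual savings: ${annual_savings:,.0f}, payback in {payback_months:.1f} months")

With numbers in that ballpark the payback period is a few months at most, which is why the decision was easy to make.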

DBAs already have the tools (Profiler, AWR) to determine the utilization of a feature as measured in CPU, memory and I/O. Hosting managers have the dollar cost of the CPUs and I/O figured out. What the DBAs and managers need to do is merge the data, format it into something readable, and feed it back to the developers. Closing the loop via feedback to developers is essential.

The relevant data may vary depending on your application, but the data that will almost certainly be interesting includes:

  • Number of executions per time period (second, minute, hour)
  • CPU cycles per execution
  • Logical and physical I/Os per execution

A rough approximation of CPU load on the database server is the number of executions multiplied by the CPU time per execution. The I/Os per execution multiplied by the number of executions gives you a rough estimate of SAN or disk load. Obviously you only have a finite number of CPU cycles and I/Os available each second, and those CPUs and their related database licenses have dollar costs associated with them. Measure the application's actual CPU and I/O consumption against the total CPU and I/O available from your system, apply the annual hardware and software cost of that system, and you have an estimate of the overall cost in dollars to run the query.
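Here is a minimal Python sketch of that arithmetic. The inputs are illustrative placeholders; substitute the execution counts, CPU time, core count and annual system cost from your own environment:

    # Minimal sketch of the load and cost arithmetic described above (illustrative numbers).
    executions_per_sec = 50          # how often the widget's query runs
    cpu_ms_per_exec = 40             # CPU milliseconds consumed per execution
    cpu_cores = 16                   # cores available on the database server
    annual_system_cost = 1_000_000   # yearly hardware + database license cost ($)

    # CPU seconds consumed per wall-clock second, versus CPU seconds available.
    cpu_sec_used_per_sec = executions_per_sec * cpu_ms_per_exec / 1000.0
    cpu_fraction = cpu_sec_used_per_sec / cpu_cores

    # Rough dollar cost: the query's share of the system's annual cost.
    annual_query_cost = cpu_fraction * annual_system_cost
    print(f"CPU share: {cpu_fraction:.1%}, estimated annual cost: ${annual_query_cost:,.0f}")

The same kind of ratio can be computed for I/Os against whatever throughput the SAN can sustain.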

Notice that I didn't mention query response time. From the point of view of the developer, response time is interesting. It's what the user feels when they navigate the site, and it is easy to measure. From a capacity/load point of view, however, response time by itself doesn't indicate the back-end CPU and I/O cost of achieving that response time. If the query or stored procedure returned in 200ms, but during that 200ms it went parallel across all your CPUs and used up all available CPU time, you'll obviously only be able to do a handful of those each second, and if you try to do more than a handful per second, your application isn't going to be happy. Or if in that 200ms it used 200ms of CPU time, you'll only be able to execute a handful of that query per CPU each second. In other words, focusing only on response time isn't adequate, because it doesn't relate back to load-related bottlenecks on the application and database servers.
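A quick worked example in Python of why that matters: if an execution burns 200ms of CPU time, each core can do at most about five executions per second, no matter how fast the call feels to the user. The numbers are illustrative only:

    # Why response time alone doesn't predict capacity: throughput is bounded by CPU time.
    cpu_ms_per_exec = 200    # CPU time burned per execution (can exceed elapsed time when parallelized)
    cpu_cores = 16

    max_per_core_per_sec = 1000 / cpu_ms_per_exec              # ~5 executions per core per second
    max_per_server_per_sec = cpu_cores * max_per_core_per_sec  # ~80/sec before the CPUs saturate
    print(f"Ceiling: about {max_per_server_per_sec:.0f} executions/sec on {cpu_cores} cores")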

For those who haven't seen an AWR report, Oracle has an example here. An AWR report allows your DBAs and dev team to slice, dice, sort and analyze the heck out of an application. For SQL Server we built a system that runs periodic Profiler traces, uploads the traces to a database, and dumps out reports similar to Oracle's AWR reports.
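To give a rough idea of what that kind of report boils down to, here is a minimal Python sketch that rolls exported trace rows up into a 'top consumers' report. It assumes the trace has been exported to a CSV with the standard Profiler columns TextData, CPU and Reads; the real system loads the traces into a database and reports from there:

    # Aggregate exported Profiler trace rows into a top-CPU-consumers report.
    # Assumes a CSV export with TextData, CPU (ms), and Reads columns.
    import csv
    from collections import defaultdict

    totals = defaultdict(lambda: {"execs": 0, "cpu": 0, "reads": 0})

    with open("trace_export.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (row.get("TextData") or "").strip()[:200]  # crude grouping by query text
            totals[key]["execs"] += 1
            totals[key]["cpu"] += int(row.get("CPU") or 0)
            totals[key]["reads"] += int(row.get("Reads") or 0)

    # The queries worth costing out first are the top CPU consumers.
    for query, t in sorted(totals.items(), key=lambda kv: kv[1]["cpu"], reverse=True)[:10]:
        print(f"{t['execs']:>8} execs  {t['cpu']:>10} ms CPU  {t['reads']:>12} reads  {query[:60]}")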

The bottom line: in order for application developers to successfully design and build efficient, scalable applications, they need comprehensive performance-related data. They need to be able to 'see' all the way through from the web interface to the database. If they do not have that data, they cannot build efficient, scalable applications.

Comments

  1. Mike: I can relate to your story from a couple of significant changes in environments I have worked in. IMHO, the challenge is the availability of relevant perf data at the time the new widget is designed. If it is new and usage is not known, then the only data available is for common queries [against the db] and some other common components [cache, web server, etc.] - in the rush to the finish line, the time necessary to model this in a more realistic way is not to be found. So in my experience, the "successful widget" launches have caused a bunch of operational distress to right the ship; and that is the norm.
    Curious to know if you/others have some scaffold of a model that takes into account CPU usage, interface stats, query stats, memory/cache usage and spits out a compute-cost metric.
    Good post

  2. Prasana -

    if you/others have some scaffold of a model

    We 'sort of' have a scaffold. We built a system of stored procedures and scripts that allows individual developers to trigger a trace and automatically get the query plan e-mailed back to them a few minutes later. The trace is against a test database that either has the statistics from the production database or is a full copy of production. They invoke it from a dev instance of the application. So in theory, the person who wrote the widget knows the CPU ticks and the logical and physical I/O prior to production rollout.

    What they don't know, and can only estimate, is how often users will hit the widget. They make a best guess as to what they think users will do. Sometimes right, sometimes wrong. So as you indicated, this is often 'remedial' rather than proactive.

    As for a model - I only estimate numbers, as in 'that server costs a million a year to run, and the widget is using a quarter of it'. All we are really interested in is a rough estimate that we can use to help us decide if it is worth fixing the widget. For us, +/-50% is close enough. The numbers are big enough that the answer tends to be obvious.

  3. This is excellent! I will try to calculate a Cost per X for major cost areas and give this info to the developers. This should help them decide whether a feature should be implemented or needs to be rethought. Perhaps a small design change, or one less minor feature, can dramatically reduce the cost of a feature.

    Thanks!

