Friday, May 16, 2008

The quarter-million dollar query

What does a query cost? In one recent case, about a quarter-million dollars.

The story (suitably anonymized): User requirements resulted in a new feature added to a busy application. Call it Widget 16. Widget 16 was complex enough that the database side of the widget put a measurable load on the server. That alone wouldn't have been a problem, but Widget 16 turned out to be a useful widget. So useful, in fact, that users clicked on Widget 16 far more often than anticipated. Complex query + lots of clicks = load problem. How big was the load problem? In this case, the cost of the query (in CPU seconds, logical reads, etc.) and the number of times the query was executed were both relevant. After measuring the per-widget load on the server and multiplying by the number of widget clicks per second, we figured the widget cost at least a quarter million dollars per year in hardware and database license costs. Especially database licensing costs.

That was an interesting number.

Obviously we want to use that number to help make the application better, faster and cheaper.

The developers - who are faced with balancing impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - will likely favor new features over performance work. We expect that to happen. Unfortunately, if it happens too often, the hosting team is stuck with either poor performance or a large hardware and software bill. To properly prioritize the development effort, some rational measurement must be made of the cost of re-working existing functionality to reduce load versus the value of spending that same effort on user-requested features. Calculating the cost of running a feature or widget makes that prioritization possible. In this case, comparing the cost of running the query to the person-time required to design, code, test and deploy a solution made the decision to optimize (or cache) the widget pretty easy.

DBAs already have the tools (Profiler, AWR) to determine the utilization of a feature as measured in CPU, memory and I/O. Hosting managers have the dollar cost of the CPUs and I/O figured out. What the DBAs and managers need to do is merge that data, format it into something readable, and feed it back to the developers. Closing the loop with that feedback is essential.

The relevant data may vary depending on your application, but it will almost certainly include:

  • Number of executions per time period (second, minute, hour)
  • CPU cycles per execution
  • Logical and physical I/Os per execution

A rough approximation of CPU load on the database server is the number of executions multiplied by the CPU cycles per execution. Likewise, I/Os per execution multiplied by the number of executions gives a rough estimate of SAN or disk load. You only have a finite number of CPU cycles and I/Os available each second, and those CPUs and the related database licenses have dollar costs associated with them. Measuring the application's actual CPU and I/O consumption against the total CPU and I/O available from the system, and against the system's annual hardware and software cost, gives you an estimate of the overall cost in dollars to run the query.
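To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. Every input (executions per second, CPU milliseconds per execution, core count, annual server cost) is a hypothetical placeholder, not a measured value; plug in your own numbers from Profiler or AWR.

    # Hypothetical back-of-the-envelope query cost estimate.
    # All inputs are placeholders -- substitute your own measured values.

    def annual_query_cost(execs_per_sec: float,
                          cpu_ms_per_exec: float,
                          server_cores: int,
                          annual_server_cost: float) -> float:
        """Estimate the yearly dollar cost of a query from its CPU footprint."""
        # CPU seconds this query consumes during every second of the day.
        cpu_sec_per_sec = execs_per_sec * (cpu_ms_per_exec / 1000.0)
        # Fraction of the server's total CPU capacity that represents.
        capacity_fraction = cpu_sec_per_sec / server_cores
        # Apply that fraction to the annual hardware + license cost.
        return capacity_fraction * annual_server_cost

    if __name__ == "__main__":
        # Made-up example: 50 executions/sec at 40 ms CPU each, on an 8-core
        # server that costs $1,000,000 per year in hardware and licenses.
        cost = annual_query_cost(execs_per_sec=50, cpu_ms_per_exec=40,
                                 server_cores=8, annual_server_cost=1_000_000)
        print(f"Estimated cost of the widget: ${cost:,.0f} per year")

The same shape works for I/O: swap the CPU milliseconds for logical or physical I/Os per execution and the core count for the I/O capacity of the SAN.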

Notice that I didn't mention query response time. From the developer's point of view, response time is interesting: it's what the user feels when they navigate the site, and it is easy to measure. From a capacity/load point of view, however, response time by itself doesn't indicate the back-end CPU and I/O cost required to achieve it. If a query or stored procedure returns in 200ms, but during that 200ms it parallelized across all your CPUs and consumed all available CPU time, you can only run a handful of those each second, and if you try to run more than that, your application isn't going to be happy. Likewise, if in that 200ms it used 200ms of CPU time, you can only execute a handful of that query per CPU each second. In other words, focusing only on response time isn't adequate because it doesn't relate back to load-related bottlenecks on the application and database servers.
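As a tiny illustration of that point (all numbers hypothetical), the throughput ceiling comes from CPU time per execution, not from response time:

    # Hypothetical throughput ceiling set by CPU time, not response time.
    cores = 8                    # CPUs available to the database server
    cpu_sec_per_exec = 0.2       # 200 ms of CPU consumed per execution

    max_execs_per_sec = cores / cpu_sec_per_exec   # 40/sec at 100% CPU
    print(f"CPU-bound ceiling: {max_execs_per_sec:.0f} executions/sec")

A 200ms response time looks fine on a single test click; it's the 200ms of CPU behind it, multiplied by the click rate, that fills up the server.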

For those who haven't seen an AWR report, Oracle has an example here. An AWR report allows your DBAs and dev team to slice, dice, sort and analyze the heck out of an application. For SQL Server we built a system that runs periodic Profiler traces, uploads the traces to a database, and dumps out reports similar to Oracle's AWR reports.
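As an illustration only, here is a minimal sketch of that kind of aggregation, assuming the trace rows have already been exported to a file named trace.csv containing the standard Profiler columns TextData, CPU, Reads and Writes. The file name, statement bucketing and report layout are placeholders, not a description of our actual system.

    # Sketch of a "poor man's AWR": roll Profiler trace rows up into a
    # per-statement load summary. Assumes trace.csv holds the standard
    # TextData, CPU (ms), Reads and Writes columns.
    import csv
    from collections import defaultdict

    totals = defaultdict(lambda: {"execs": 0, "cpu": 0, "reads": 0})

    with open("trace.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (row["TextData"] or "").strip()[:100]     # crude statement bucket
            totals[key]["execs"] += 1
            totals[key]["cpu"] += int(row["CPU"] or 0)      # CPU milliseconds
            totals[key]["reads"] += int(row["Reads"] or 0)  # logical reads

    # Report the heaviest statements by total CPU, AWR-style.
    top = sorted(totals.items(), key=lambda kv: kv[1]["cpu"], reverse=True)[:20]
    for stmt, t in top:
        print(f"{t['execs']:>8} execs  {t['cpu']:>10} ms CPU  {t['reads']:>12} reads  {stmt}")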

The bottom line: for application developers to design and build efficient, scalable applications, they need comprehensive performance-related data. They need to be able to 'see' all the way through from the web interface to the database. Without that data, they cannot build efficient, scalable applications.

3 comments:

  1. Mike: I can relate to your story from a couple of significant changes in environments I have been at. IMHO, the challenge is the availability of relevant perf data at the time the new widget is designed. If it is new and usage is not known, then the only data available is for common queries [against the db] and some other common components [cache, web server, etc.]; in the rush to the finish line, the time necessary to model this in a more realistic way is not to be found. So in my experience, the "successful widget" launches have caused a bunch of operational distress to right the ship; and that is the norm.
    Curious to know if you/others have some scaffold of a model that takes into account CPU usage, interface stats, query stats, memory/cache usage and spits out a compute-cost metric.
    Good post.

  2. Prasana -

    if you/others have some scaffold of a model

    We 'sort of' have a scaffold. We built a system of stored procedures and scripts that allows individual developers to trigger a trace and automatically get the query plan e-mailed back to them a few minutes later. The trace runs against a test database that either has the statistics from the production database or is a full copy of production. They invoke it from a dev instance of the application. So in theory, the person who wrote the widget knows the CPU ticks and the logical and physical I/O prior to production rollout.

    What they don't know, and can only estimate, is how often users will hit the widget. They make a best guess as to what they think users will do. Sometimes right, sometimes wrong. So as you indicated, this is often 'remedial' rather than proactive.

    As for a model - I only estimate numbers, as in: 'that server costs a million a year to run, and the widget is using a quarter of it'. All we are really interested in is a rough estimate that we can use to help us decide if it is worth fixing the widget. For us +/-50% is close enough. The numbers are big enough that the answer tends to be obvious.

  3. This is excellent! I will try to calculate a Cost per X for the major cost areas and give this info to the developers. This should help them decide whether a feature should be implemented or needs to be rethought. Perhaps a small design change, or dropping a minor feature, can dramatically reduce the cost of a feature.

    Thanks!
