Friday, March 14, 2008

Unconstraining a constrained resource

When a technology that is normally constrained is presented with unlimited resources, what will it do?

We've had a couple of examples of what happens when software that is normally constrained by memory has that constraint removed. The results were interesting, or amusing, depending on your point of view.

Too much memory

The first time we ran into something like this was when we installed a new SQL Server cluster on IBM x460 hardware. We bought the servers to resolve a serious database performance problem on an application that was growing faster than we could purchase and install new hardware. To get ahead of the growth curve, we figured that we had to more than double the performance of the database server. And of course while we were at it, we'd cluster it also. And because HP wouldn't support Windows 2003 Datacenter Edition on an HP EVA, we ended up buying an IBM DS4800 also. And because we had neither the time nor the resources to spend endless hours tuning the I/O, we'd get enough memory that the database working set was always in memory. And no, the application vendor wouldn't support any form of horizontal scalability whatsoever.

So after a month of research, a month to get a decision, a month to get through our purchasing process, and a month for IBM to build the servers, we got two new x460 16-CPU Xeon servers and a DS4800, all pre-racked and ready to boot. The servers hit the new datacenter floor at noon on Dec 24th. The application had to be moved to the new datacenter on the new servers and be ready for production on January 6th. We were stressed. The schedule was something like Day 1: build servers. Day 2: build SAN. Day 3: cluster and failure testing (and so on...)

When we got to 'Day 4: Install SQL Server 2000' we ran into a pretty significant issue. It wouldn't run. SQL Server 2000 simply would not start. It gave a cryptic error message and stopped. Microsoft Tier 3 SQL support to the rescue. They had never heard of THAT error message. Escalate...wait...escalate...wait...bring IBM Advanced Support into the conference call...escalate...wait...now there are two of us and six of them on the call...finally:
'We've got a bug ID on SQL Server 2005 that sounds similar. But it only occurs when you run a 32-bit database on a server with more than 64GB of memory.'
Hmm.... we are on 32-bit SQL Server 2000, not 2005, but we do have 128GB of memory, so maybe?

The suggested workaround was to remove memory. But now it is late on a Friday, and we are going live next Wednesday, and spending hours carefully removing memory sounded like a headache. Fortunately /burnmem worked, the server only saw 64GB of memory, and SQL Server 2000 was happy to run, though slightly confused. The cutover was successful, and we survived spring semester startup with a database that ran 16 CPUs at 70% busy on our peak day (instead of an 8-CPU database server at 140% CPU).

It probably never occurred to the SQL Server developers, back when the database memory model was designed and the database boot/init/load code was written, that a customer would try to run a 32-bit database with 128GB of memory. No way. What did the software do? It croaked. It rolled over and died. It didn't even try.

(That server lasted a semester. The application utilization continued to grow, and the two 16 CPU servers became 16 dual cores @ 50% utilization before the start of the next semester.)

Too much memory - Part 2

Fast forward (or move the slider bar on your OS X Time Machine) a year and a half. The app is still growing. A couple of rounds of tuning, vendor code re-writes, and a couple of Microsoft engineering on-sites tell us that we really, really want to be running this thing with all 64 bits & SQL Server 2005.

So we take the plunge. Windows 2003 Datacenter, x64, this time with SQL Server 2005, all 64 bits, and all 128GB of memory. Life is good, right?

Not quite.

An interesting question: What does 64-bit SQL Server 2005 do when you feed it an application that forces it to parse, compile and cache hundreds of new queries every second, when the queries are such that they can't be parameterized? It parses, compiles and caches all of them. As long as the procedure cache is constrained, all you get is an inefficient database server and a high cache churn rate. But it works. When there is no constraint? Its procedure cache gets too big, and it gets really, really unhappy. And so do the hundred thousand students who thought they were going to use the application that day.
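To make the "one cache entry per query" problem concrete, here is a toy sketch in Python. It assumes a plan cache keyed by the exact query text, which is a simplification for illustration and not a description of SQL Server internals. The table name, column name and function names are made up. The parameterized variant is there only for contrast; our application's queries couldn't be written that way.

```python
# Toy model of a plan cache keyed by exact query text.
# Assumption for illustration only: this is not how SQL Server
# actually hashes or stores its plans.
plan_cache = {}


def compile_and_cache(query_text):
    """Return the cached 'plan' for this text, faking a compile on a miss."""
    if query_text not in plan_cache:
        plan_cache[query_text] = f"plan for: {query_text}"
    return plan_cache[query_text]


def run_literal_queries(student_ids):
    # Each literal value produces a distinct query text,
    # so every call adds a new cache entry.
    for sid in student_ids:
        compile_and_cache(f"SELECT * FROM enrollment WHERE student_id = {sid}")


def run_parameterized_queries(student_ids):
    # The query text is identical on every call; only the bound value
    # (not modeled here) would change, so one entry is reused.
    for _sid in student_ids:
        compile_and_cache("SELECT * FROM enrollment WHERE student_id = ?")


ids = range(1000)

plan_cache.clear()
run_literal_queries(ids)
literal_entries = len(plan_cache)

plan_cache.clear()
run_parameterized_queries(ids)
parameterized_entries = len(plan_cache)

print(literal_entries, parameterized_entries)  # prints: 1000 1
```

A thousand distinct query texts means a thousand cached plans. Scale that to hundreds of new queries per second and the cache has no reason to stop growing.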

As best as we can figure, based on a 13-hour tech support call and a handful of PSSDiags & such, we fed new, unparameterizable queries into the database engine, it cached them, and when the procedure cache got up somewhere around 8GB, the server pretty much spent all its cycles mucking around with managing its cache at the expense of doing useful database-like work. The cache never got trimmed or purged, probably because whatever process does that didn't see a reason to expire old cache entries. There was plenty of memory.
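The difference between the constrained and unconstrained cases can be sketched with a toy least-recently-used cache. This is a hypothetical model, not SQL Server's actual eviction logic (which we never saw); the class name, the entry limit of 100 and the query count are all invented for the example.

```python
from collections import OrderedDict


class ToyCache:
    """Small LRU cache. max_entries=None means no size constraint.

    Hypothetical model for illustration; not SQL Server's algorithm.
    """

    def __init__(self, max_entries=None):
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.evictions = 0

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)  # mark as most recently used
        if self.max_entries is not None and len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # drop the least recently used
            self.evictions += 1


def feed(cache, n_queries):
    # Every query text is unique, as with unparameterizable queries.
    for i in range(n_queries):
        cache.put(f"SELECT ... WHERE id = {i}", "plan")
    return cache


constrained = feed(ToyCache(max_entries=100), 10_000)
unconstrained = feed(ToyCache(max_entries=None), 10_000)

print(len(constrained.entries), constrained.evictions)      # prints: 100 9900
print(len(unconstrained.entries), unconstrained.evictions)  # prints: 10000 0
```

The constrained cache churns constantly but stays small. The unconstrained one never evicts anything, so its size is bounded only by how long the application keeps sending new queries.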

Fix the app, patch the database servers, and monitor the procedure cache, and all is well.

When a technology that is normally constrained is presented with unlimited resources, what will it do?

It'll probably croak.
