Unconstraining a constrained resource

When a technology that is normally constrained is presented with unlimited resources, what will it do?

We've had a couple of interesting examples of what happens when software that is normally constrained by memory has that constraint removed. The results were interesting, or amusing, depending on your point of view.

Too much memory

The first time we ran into something like this was when we installed a new SQL Server cluster on IBM x460 hardware. We bought the server to resolve a serious database performance problem on an application that was growing faster than we could purchase and install new hardware. To get ahead of the growth curve, we figured that we had to more than double the performance of the database server. And of course while we were at it, we'd cluster it also. And because HP wouldn't support Windows 2003 Datacenter Edition on an HP EVA, we ended up buying an IBM DS4800 also. And because we had neither the time nor the resources to spend endless hours tuning the I/O, we'd get enough memory that the database working set was always in memory. And no, the application vendor wouldn't support any form of horizontal scalability whatsoever.

So after a month of research, a month to get a decision, a month to get through our purchasing process, and a month for IBM to build the server, we got two new x460 16-CPU Xeon servers and a DS4800, all pre-racked and ready to boot. The servers hit the new datacenter floor at noon on Dec 24th. The application had to be moved to the new datacenter on the new servers and be ready for production on January 6th. We were stressed. The schedule was something like Day 1: build servers. Day 2: build SAN. Day 3: cluster and failure testing (and so on...)

When we got to 'Day 4: Install SQL Server 2000' we ran into a pretty significant issue. It wouldn't run. SQL Server 2000 simply would not start. It gave a cryptic error message and stopped. Microsoft Tier 3 SQL support to the rescue. They had never heard of THAT error message. Escalate...wait...escalate...wait...bring IBM Advanced Support into the conference (there are two of us and six of them on the call)...finally:
'We've got a bug ID on SQL Server 2005 that sounds similar. But it only occurs when you run a 32-bit database on a server with more than 64GB of memory.'
Hmm.... we are on 32-bit 2000, not 2005, but we do have 128GB of memory, so maybe?

The suggested workaround was to remove memory. But by now it was late on a Friday, we were going live the next Wednesday, and spending hours carefully removing memory sounded like a headache. Fortunately /burnmem worked, the server only saw 64GB of memory, and SQL 2000 was happy to run, though slightly confused. The cutover was successful, and we survived spring semester startup with a database that ran 16 CPUs at 70% busy on our peak day (instead of an 8-CPU database server at 140% CPU).

It probably never occurred to the SQL Server developers, back when the database memory model was designed and the database boot/init/load code was written, that a customer would try to run a 32-bit database with 128GB of memory. No way. What did the software do? It croaked. It rolled over and died. It didn't even try.

(That server lasted a semester. The application utilization continued to grow, and the two 16 CPU servers became 16 dual cores @ 50% utilization before the start of the next semester.)

Too much memory - Part 2

Fast forward (or move the slider bar on your OS X Time Machine) a year and a half. The app is still growing. A couple of rounds of tuning, vendor code rewrites, and a couple of Microsoft engineering on-sites tell us that we really, really want to be running this thing with all 64 bits & SQL Server 2005.

So we take the plunge. Windows 2003 Datacenter, x64, this time with SQL Server 2005, all 64 bits, and all 128GB of memory. Life is good, right?

Not quite.

An interesting question: What does SQL Server 2005 64-bit do when you feed it an application that forces it to parse, compile and cache hundreds of new queries every second, when the queries are such that they can't be parameterized? It parses, compiles and caches all of them. As long as the procedure cache is constrained, all you get is an inefficient database server and a high cache churn rate. But it works. When there is no constraint? Its procedure cache gets too big, and it gets really, really unhappy. And so do the hundred thousand students who thought they were going to use the application that day.

As best as we can figure, based on a 13-hour tech support call and a handful of PSSDiags & such, we fed new, unparameterizable queries into the database engine, it cached them, and when the procedure cache got up somewhere around 8GB, the server pretty much spent all its cycles mucking around with managing its cache at the expense of doing useful database-like work. The cache never got trimmed or purged, probably because whatever process does that didn't see a reason to expire old cache entries. There was plenty of memory.
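To make the difference between a constrained and an unconstrained cache concrete, here is a toy sketch in Python. It is not how SQL Server manages its procedure cache; the `PlanCache` class, the `students` query, and the entry limit are all made up for illustration. It only shows the bookkeeping: query text with literals baked in produces a new cache entry on every call, a bounded cache churns, an unbounded cache just grows, and a parameterized query shares a single entry.

```python
from collections import OrderedDict


class PlanCache:
    """Toy plan cache keyed by exact query text (hypothetical model)."""

    def __init__(self, max_entries=None):
        self.max_entries = max_entries  # None means "no constraint"
        self.plans = OrderedDict()      # query text -> compiled plan
        self.compiles = 0
        self.evictions = 0

    def execute(self, query_text):
        if query_text in self.plans:
            # Cache hit: reuse the plan and mark it most recently used.
            self.plans.move_to_end(query_text)
            return
        # Cache miss: "compile" a plan and store it.
        self.compiles += 1
        self.plans[query_text] = "compiled plan"
        # Only a constrained cache ever evicts its oldest entry.
        if self.max_entries is not None and len(self.plans) > self.max_entries:
            self.plans.popitem(last=False)
            self.evictions += 1


def run(cache, n_queries, parameterized):
    """Send n_queries lookups, either parameterized or with literals inlined."""
    for student_id in range(n_queries):
        if parameterized:
            text = "SELECT name FROM students WHERE id = ?"
        else:
            text = f"SELECT name FROM students WHERE id = {student_id}"
        cache.execute(text)
    return cache


bounded = run(PlanCache(max_entries=1000), 50_000, parameterized=False)
unbounded = run(PlanCache(), 50_000, parameterized=False)
shared = run(PlanCache(), 50_000, parameterized=True)

for name, c in [("bounded", bounded), ("unbounded", unbounded), ("shared", shared)]:
    print(f"{name}: entries={len(c.plans)} compiles={c.compiles} evictions={c.evictions}")
```

The bounded cache compiles every query and throws most of the plans away, which is the inefficient-but-working case. The unbounded cache compiles just as often and keeps everything. The sketch does not reproduce the slowdown we saw; it only shows where the growth comes from.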

Fix the app, patch the database servers, and monitor the procedure cache, and all is well.

When a technology that is normally constrained is presented with unlimited resources, what will it do?

It'll probably croak.

