Last In - First Out: Thirty-four years in IT - The Application That Almost Broke Me (Part 9)

The last half of 2011 was for me an my team a really, really tough time.

As I hinted to in this post, by August 2011 we were buried in Oracle 11 & application performance problems. By the time we were back into a period of relative stability that December, we had:

Six Oracle Sev 1's open at once, the longest open for months. The six incidents were updated a combined total of 800 times before they finally were all resolved.
Multiple extended database outages, most during peak activity at the beginning of the semester.
Multiple 24-hour+ Oracle support calls.
An on-site Oracle engineer.
A corrupt on-disk database forcing a point-in-time recovery from backups of our student primary records/finance/payroll database.
Extended work hours and database patches and configuration changes more weekends than not.
A forced re-write of major sections of the application to mitigate extremely poor design choices.

The causes were several:

Our applications, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing variables as parameters.
The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards.
The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once.
A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.
Listener crashes that caused the app - lacking basic error handling - to fail and required an hour or so to recover.

Also:

We missed one required Solaris patch that may have impacted the database.
We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.

None of this was obvious from AWR's, and it was only after several outages and after we built tools to query the SGA that we saw where the problem might be. What finally got us going in a good direction was seeing a library cache with a few hundred thousand of queries like this:

select from student where student_id - '9876543';

select from student where student_id - '4982746';

select from student where student_id - '4890032';

select from student where student_id - '4566621';

[...]

Our app killed the database - primarily because of poor application design, but also because of Oracle bugs.

An analysis of the issue by and Oracle engineer, from one of the SR's:

... we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.

Along with the poor application design, we also hit a handful of mutex-related bugs specific to 11.2.0.x that were related to applications with our particular design. We patched those as soon as we could. We also figured out that network cards on SPARC E25k's can only do about 50,000 interrupts per second, and that adding more network cards would finally resolve some of the issues we were having with the database listeners.

Pythian has a good description of a similar issue - which had it been written a year earlier, would have saved us a lot of pain.

Why didn't this happen on Oracle 10?

I suspect that in Oracle 10, the SGA size was physically limited and that the database engine just simple churned through literal queries, hard-parsed them, tossed them out of memory, and drove up the CPU. But it never ran into mutex issues. It was in 'hard-parse-hell' but other than high CPU, worked OK. In Oracle 11, the SGA must have ben significantly re-written, as it was clear that the SGA was allowed to grow very large in memory, which (by our analysis) resulted in many tens of thousands of queries in the SGA, being churned through at a rate of many thousands per second.

Along the way we also discovered COBOL programs that our system admins had been complaining about for 15 years - such as the program that scanned millions of individual records in the person table, one at a time, looking for who needs to get paid this week. Never mind that they could have answered that question with a single query. And of course the program did this scan twenty-six times, once for each pay period in the last year - just in case an old timecard had been modified.

Brutal.

I insisted that our developers re-code the worst parts of the application - arguing that any other fix would at best kick the can down the road.

In any case, by the time we reached our next peak load at semester start January '12, enough had been fix that the database ran fine - probably better than ever.

But it cost us dearly. We worked most weekends that fall to rushed changes/patches/re-configurations, one of my staff ended up in the hospital, and I aged 5 years in as many months.

In my next post I'll outline the other significant events in 2011/2012, which altered my job and forced me to re-evaluate my career.

Part 8 - Swimming with the Itanic
Part 10 - Leadership Chaos