Thirty-four years in IT - The Application That Almost Broke Me (Part 9)

The last half of 2011 was a really, really tough time for me and my team.

As I hinted at in this post, by August 2011 we were buried in Oracle 11 & application performance problems. By the time we were back into a period of relative stability that December, we had:

  • Six Oracle Sev 1's open at once, the longest open for months. The six incidents were updated a combined total of 800 times before they finally were all resolved. 
  • Multiple extended database outages, most during peak activity at the beginning of the semester. 
  • Multiple 24-hour+ Oracle support calls.
  • An on-site Oracle engineer.
  • A corrupt on-disk database forcing a point-in-time recovery from backups of our primary student records/finance/payroll database. 
  • Extended work hours, with database patches and configuration changes, more weekends than not. 
  • A forced re-write of major sections of the application to mitigate extremely poor design choices.
The causes were several:
  1. Our application, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing values as bind variables (see the sketch below). 
  2. The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards. 
  3. The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once. 
  4. A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.
  5. Listener crashes that caused the app - lacking basic error handling - to fail and required an hour or so to recover.
Also:
  1. We missed one required Solaris patch that may have impacted the database.
  2. We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.
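
To make the first cause concrete, here is a minimal sketch of the difference, using the student lookup from the example further down as the pattern:

-- Literal-in-the-query style: every execution is a brand-new statement
-- that must be hard-parsed and cached separately in the shared pool.
select * from student where student_id = '9876543';
select * from student where student_id = '4982746';

-- Bind-variable style: one shareable statement, parsed once, executed many times.
select * from student where student_id = :student_id;
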
None of this was obvious from AWR's, and it was only after several outages and after we built tools to query the SGA that we saw where the problem might be. What finally got us going in a good direction was seeing a library cache with a few hundred thousand queries like this:

select * from student where student_id = '9876543';
select * from student where student_id = '4982746';
select * from student where student_id = '4890032';
select * from student where student_id = '4566621';
[...]
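
Something like the following - a rough sketch against the standard v$sql view, with an arbitrary threshold, not our exact tooling - is enough to expose the pattern:

-- force_matching_signature is the value Oracle would compute if every literal
-- were replaced by a bind, so a high count per signature means the same
-- statement is being hard-parsed over and over with different literals.
select force_matching_signature,
       count(*)      as versions,
       min(sql_text) as sample_sql
  from v$sql
 where force_matching_signature <> 0
 group by force_matching_signature
having count(*) > 100
 order by versions desc;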

Our app killed the database - primarily because of poor application design, but also because of Oracle bugs. 

An analysis of the issue by an Oracle engineer, from one of the SR's:
... we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.
Along with the poor application design, we also hit a handful of mutex-related bugs specific to 11.2.0.x that were related to applications with our particular design. We patched those as soon as we could. We also figured out that network cards on SPARC E25k's can only do about 50,000 interrupts per second, and that adding more network cards would finally resolve some of the issues we were having with the database listeners.
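
If you're chasing something similar, the contention is easy to spot once you know where to look: sessions pile up on library cache and cursor mutex waits. A rough sketch against the standard v$session view (the event names here are the usual 11g ones, not quotes from our SR's):

-- Count active sessions by wait event; a pile-up on 'library cache: mutex X'
-- or 'cursor: pin S wait on X' style events points at shared pool contention.
select event, count(*) as sessions_waiting
  from v$session
 where event like 'library cache%'
    or event like 'cursor:%'
 group by event
 order by sessions_waiting desc;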

Pythian has a good description of a similar issue - which, had it been written a year earlier, would have saved us a lot of pain. 

Why didn't this happen on Oracle 10? 

I suspect that in Oracle 10, the SGA size was physically limited and the database engine simply churned through literal queries, hard-parsed them, tossed them out of memory, and drove up the CPU. But it never ran into mutex issues. It was in 'hard-parse-hell', but other than high CPU it worked OK. In Oracle 11, the SGA handling must have been significantly re-written, as it was clear that the SGA was allowed to grow very large in memory, which (by our analysis) resulted in many tens of thousands of queries in the SGA being churned through at a rate of many thousands per second. 

Along the way we also discovered COBOL programs that our system admins had been complaining about for 15 years - such as the program that scanned millions of individual records in the person table, one at a time, looking for who needs to get paid this week. Never mind that they could have answered that question with a single query. And of course the program did this scan twenty-six times, once for each pay period in the last year - just in case an old timecard had been modified. 
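
The set-based alternative is a single statement. A hypothetical sketch - the timecard table and column names below are made up for illustration, not our actual schema:

-- One query instead of twenty-six full passes over millions of person rows:
select distinct p.person_id, p.last_name, p.first_name
  from person p
  join timecard t on t.person_id = p.person_id
 where t.pay_period_end_date = :pay_period_end_date
   and t.status = 'APPROVED';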

Brutal.

I insisted that our developers re-code the worst parts of the application - arguing that any other fix would at best kick the can down the road. 

In any case, by the time we reached our next peak load at the January '12 semester start, enough had been fixed that the database ran fine - probably better than ever.

But it cost us dearly. We worked most weekends that fall on rushed changes/patches/re-configurations, one of my staff ended up in the hospital, and I aged five years in as many months. 

In my next post I'll outline the other significant events in 2011/2012, which altered my job and forced me to re-evaluate my career. 

Thirty-four years in IT - Swimming with the Itanic (Part 8)

For historical reasons, we were a strong VMS shop. Before they imploded, Digital Equipment treated EDU's very kindly, offering extremely good pricing on software in exchange for hardware adoption. In essence, a college could get an unlimited right to use a whole suite of Digital Equipment software for a nominal annual fee, and Digital had a very complete software catalog. So starting in the early 1990's, our internally developed student records system (ERP) ended up on the VMS/VAX/RDB stack.

Digital imploded and got bought by Compaq, which in turn got bought by HP. Somewhere along the line, the RDB database product ended up at Oracle.

For most of our time on VMS & RDB we suffered from severe performance problems. Our failure to address them was two-fold: the infrastructure team didn't have good performance data to feed back to the developers, and the development team considered performance to be an infrastructure/hardware problem. This resulted in a series of frantic and extremely expensive scrambles to upgrade VAX/Alpha server hardware. It did not, however, result in any significant effort to improve the application design.

Between 1993 and 2005, we cycled through each of:
  1. Standalone VAX 4000's
  2. Clustered AlphaServer 4100's
  3. Standalone AlphaServer GS140's
  4. Standalone AlphaServer GS160's
And of course mid-life upgrades to each platform. 

Each upgrade cost $millions in hardware, and each upgrade only solved performance problems for a brief period of time. The GS160's lasted the longest and performed the best, but at an extremely high cost. At no point in time did we drill deeply into application architecture and determine where the performance problems originated.

During that time frame we got advice from Gartner that suggested that moving from VMS to Unix was desirable, but moving from RDB to Oracle was critical, as they did not expect Oracle to live up to their support commitments for the RDB database product. So in 2009 we moved from 35 individual RDB databases spread across four GS160's to one Oracle 10g database on a Sun Microsystems E25k, in a single, extremely well implemented weekend-long database migration, platform migration, and 35:1 database merger. Kudos to the development team for pulling that off.

Unfortunately we carried forward large parts of the poor application design and transferred the performance problems from RDB to Oracle. At the time, though, the DBA's were part of my team. I had a very good Oracle DBA and Unix sysadmin, both of whom were able to dig into performance problems and communicate back to developers. We were pretty good at detailing the performance problems and offering remedies and suggested design changes. 

Though performance slowly got better, the full impact of poor application design was yet to be felt.

As soon as the databases were combined and hosted on SPARC hardware, continuing with the GS160's made no sense. They were costing $600k/yr in hardware and software maintenance, were now significantly oversized, and were still running the dead-end OpenVMS operating system. This put us in a tough spot. The development team was focused on minimizing its commitment to any re-platforming and was only interested in a move from AlphaServer to Itanium. For me, Itanium (or Itanic, as I called it at the time) was a dead end, and our only move should be to Unix (Solaris). But because the cost to migrate to Itanic was much lower - the application would only have to be recompiled, not re-platformed - the Itanic advocates won the argument. We ended up purchasing Itanium blade servers at a 3-year cost roughly equal to 18 months of support on the GS160's.

By that time HP's support for OpenVMS had eroded badly. Support for Oracle clients, Java, and other commonly used software was poor or non-existent. That OpenVMS was dead was visible to all but the few for whom OpenVMS was a religious experience.

As we were bashing the decision around in 2009, I strongly suggested that if we purchased Itanium we'd be on the dead-end OpenVMS platform for five more years. I was wrong. We were on Itanium blades and OpenVMS for nine years, until 2018. The (only) good part of that decision was that the Itanium blade servers ran very well and were inexpensive to maintain. And as OpenVMS was pretty dead by then, we did not spend very much time on patches and upgrades, as few were forthcoming from HP.

This is a case where our reluctance to take on some short-term pain resulted in our having to maintain a dead-end obsolete system for many years. 

Thirty-four Years in IT - Addressing Application Security (Part 7)

In the 2008-2009 period, we finally started to seriously address application layer security in our development group.

By that time it was clear that the threat to hosted applications had moved up the stack, and that the center of gravity had shifted toward compromising the web applications rather than the hosting infrastructure. This meant that our applications, for which essentially no serious security-related effort had been made, had to finally receive some attention. Our development teams were not tuned in to the security landscape and thus were paying scant attention to web application security. As our home-grown applications' exposure to the Internet was mostly limited to simple, student-facing functionality such as course registration and grading, the lack of attention was perceived as appropriate by all but a few of us infrastructure and security geeks.

In other words, the dev teams were at the 'unconscious incompetence' level of the Conscious Competence Matrix.