Thirty-four Years in IT - Leadership Chaos, Career Derailed (Part 10)

This post is the hardest one to write. I've been thinking about it for years without being able to put words to paper. With the COVID-19 stay-at-home directive, I can't procrastinate anymore, so here goes.

As outlined in Part 9, Fall 2011 was a tough period. To make it tougher, the CIO decided to hire two new leadership-level positions - a new CISO over the security group, and a new Associate Vice Chancellor (AVC) over the Infrastructure group. The infrastructure AVC would be my new boss.

The CISO position was really interesting to me. The infrastructure position was not as interesting, as it would have been more of the same but with more stress and more headaches. I applied and interviewed for both, and was rejected for both. I'm sure that part of the problem was that, with the chaos of our poorly written ERP application and the Oracle database issues that Fall, I really didn't prepare for either interview. Not having interviewed for a job in more than a decade didn't help either.

Both hires ended up being bad for the organization and for my career. I'm pretty sure that both knew that I had been a candidate for the positions and both were threatened by me.

The new CISO was determined to sideline me and break down the close cooperation between my team and the security team. Whereas we had been working together for years, the security team was now restricted from communicating with me without the new CISO's permission. I was blackballed - cut out of all security related incidents, conversations, and meetings. Anything that had my fingerprints on it was trashed, either literally or figuratively. Staff who had worked closely with me in the past were considered disloyal to him and were sidelined and harassed.

The new CISO also declared that we were 'too secure' and tried to get a consultant to write up a formal document to that effect. Whatever security related projects we had in the pipeline were killed off. Rigorous processes around firewall rules, server hardening and data center security were ignored. Security would no longer impact the ability to deploy technology.

The new Infrastructure AVC started out by pulling projects from me without telling me, meeting with my staff without me in the room and telling them I was 'doing it wrong'. Staff were still loyal to me and kept me informed as to what was transpiring. It was clear that I was viewed as a threat and was not welcome.

I confronted my new boss and advised that if he were going to manage my staff without me in the room, he might as well move them directly under him on the org chart. He had a bit of a shocked look on his face, and then obliged. I also advised that as I now had no staff and no role in the organization, he needed to find me something to do.

I knew that he'd have a hard time firing me - I was protected by Civil Service rules, but I also knew that my work environment would be poor until either he and I figured out how to work together or one of us left. My choice was to try to stick it out and make the best of it or move on. I probably had options either within the State University system or with the State of Minnesota. I really am a Higher Ed. guy though, so I was reluctant to move. I decided to wait it out - and meanwhile get my financial ducks in order and put out job feelers.

He responded by blackballing me from any conversation of significance, by trashing me in e-mails to colleagues, by making it clear to my former staff that I was not to have any work related conversation with them without him, and that referencing anything that I had said or done the last dozen years was unwelcome. At one point I had to advise my former staff that they should not be seen with me, as it might impact their relationship with the new bosses. In an effort to convince me to leave (or perhaps out of sympathy), he even called me into his office and showed me a job posting at another State agency that he thought might be interesting to me.

He also moved me out of the IT area and across the hall into finance, where I would not be available to my former staff (and where I made a couple of great friends).

The environment was chaotic and toxic. Teams got rearranged and disrupted with no clear idea why or what outcome was expected. Morale was poor, tempers were high. A new director/manager who was extremely toxic was hired into my old position. As one could predict, some of our best staff left and others lost enthusiasm and dedication. I ended up fielding requests to be a job reference for many of my former staff.

After about six months he and I smoothed things out to the point where we could work together, as long as I stayed away from his (my former) staff and offered no thoughts on anything he was doing to anyone other than him. I had no clear responsibilities and as long as I stayed out of his sandbox I could do pretty much whatever I wanted. So I used that time to re-think quite a bit of what I had been doing, and in particular to lay groundwork for work that paid off a few years down the road, work that I'm quite proud of and will write about at a later date.

After about a year and a half we had another CIO change and both the CISO and AVC left. Ironically, on the AVC's last day I was the one who helped him clear his office, walked him out to the parking ramp and saw him off.

About that time the 'toxic' director also left. A couple of us who were black sheep ended up back in the thick of things when we found out how badly our technology and security had degraded. That's also when we found out that the 'completed' plans for moving a data center two months out did not exist.

The nightmare was over, but much damage had been done.

In retrospect, should I have left the organization? I'm not sure. For me it was very difficult to watch what my team and I had built over the previous fifteen years get torn apart, especially when what came out of the teardown was, I believed, inferior to what we had. If what resulted had been an improvement, it would have been easy. Very little of the technology that ran the system was built by anyone other than us. My fingerprint was on everything - good or bad, right or wrong. And to the new CISO and AVC, everything was bad and wrong.

But the data center got moved on time.


Thirty-four years in IT - The Application That Almost Broke Me (Part 9)

The last half of 2011 was, for me and my team, a really, really tough time.

As I hinted at in this post, by August 2011 we were buried in Oracle 11 & application performance problems. By the time we were back into a period of relative stability that December, we had:

  • Six Oracle Sev 1's open at once, the longest open for months. The six incidents were updated a combined total of 800 times before they finally were all resolved. 
  • Multiple extended database outages, most during peak activity at the beginning of the semester. 
  • Multiple 24-hour+ Oracle support calls.
  • An on-site Oracle engineer.
  • A corrupt on-disk database forcing a point-in-time recovery from backups of our primary student records/finance/payroll database.
  • Extended work hours, with database patches and configuration changes more weekends than not.
  • A forced re-write of major sections of the application to mitigate extremely poor design choices.
The causes were several:
  1. Our application, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing variables as bind parameters. 
  2. The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards. 
  3. The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once. 
  4. A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.
  5. Listener crashes that caused the app - lacking basic error handling - to fail, requiring an hour or so to recover.
Also:
  1. We missed one required Solaris patch that may have impacted the database.
  2. We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.
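The first cause is worth making concrete. Our application was COBOL against Oracle, but the anti-pattern is language-agnostic; here is a minimal sketch in Python against SQLite (table and column names are illustrative, not ours) showing the difference between baking a literal into each statement and using a bind variable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (student_id TEXT, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [("9876543", "Alice"), ("4982746", "Bob")])

# Anti-pattern: the literal is part of the SQL text, so every distinct
# id produces a distinct statement. The engine must hard-parse and
# cache each one separately -- this is what flooded our shared pool.
def lookup_literal(student_id):
    return conn.execute(
        f"SELECT name FROM student WHERE student_id = '{student_id}'"
    ).fetchone()

# Fix: a bind variable. The SQL text is identical on every call, so one
# cached plan is parsed once and reused.
def lookup_bound(student_id):
    return conn.execute(
        "SELECT name FROM student WHERE student_id = ?", (student_id,)
    ).fetchone()
```

Both return the same rows; the difference only shows up in the parse and cache behavior of the database engine, which is exactly where it hurt us.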
None of this was obvious from AWRs, and it was only after several outages and after we built tools to query the SGA that we saw where the problem might be. What finally got us going in a good direction was seeing a library cache with a few hundred thousand queries like this:

select * from student where student_id = '9876543';
select * from student where student_id = '4982746';
select * from student where student_id = '4890032';
select * from student where student_id = '4566621';
[...]

Our app killed the database - primarily because of poor application design, but also because of Oracle bugs. 

An analysis of the issue by an Oracle engineer, from one of the SRs:
... we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.
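Counting "same except for literals" statements, as the engineer did, is straightforward once you can dump statement texts from the library cache. The core trick in the tools we built amounted to normalizing away the literals and counting what remained; here is a hedged sketch of that idea in Python (the sample statements and the regexes are illustrative, not our actual tooling):

```python
import re
from collections import Counter

# Hypothetical sample of statement texts, like those pulled from the
# library cache. Table/column names are made up for illustration.
statements = [
    "select * from student where student_id = '9876543'",
    "select * from student where student_id = '4982746'",
    "select * from student where student_id = '4890032'",
    "select balance from account where acct_no = '1001'",
]

def normalize(sql):
    # Collapse quoted strings and bare numbers to a placeholder so that
    # statements differing only in literal values compare equal.
    sql = re.sub(r"'[^']*'", ":b", sql)
    return re.sub(r"\b\d+\b", ":b", sql)

counts = Counter(normalize(s) for s in statements)
for shape, n in counts.most_common():
    print(n, shape)
```

Any statement shape with thousands of copies in the cache is a candidate for conversion to bind variables.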
Along with the poor application design, we also hit a handful of mutex-related bugs specific to 11.2.0.x that were related to applications with our particular design. We patched those as soon as we could. We also figured out that network cards on SPARC E25k's can only do about 50,000 interrupts per second, and that adding more network cards would finally resolve some of the issues we were having with the database listeners.

Pythian has a good description of a similar issue, which, had it been written a year earlier, would have saved us a lot of pain. 

Why didn't this happen on Oracle 10? 

I suspect that in Oracle 10, the SGA size was physically limited and the database engine simply churned through literal queries, hard-parsed them, tossed them out of memory, and drove up the CPU. But it never ran into mutex issues. It was in 'hard-parse hell' but, other than high CPU, worked OK. In Oracle 11, the SGA must have been significantly re-written, as it was clear that the SGA was allowed to grow very large in memory, which (by our analysis) resulted in many tens of thousands of queries in the SGA, being churned through at a rate of many thousands per second. 

Along the way we also discovered COBOL programs that our system admins had been complaining about for 15 years - such as the program that scanned millions of individual records in the person table, one at a time, looking for who needs to get paid this week. Never mind that they could have answered that question with a single query. And of course the program did this scan twenty-six times, once for each pay period in the last year - just in case an old timecard had been modified. 
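The payroll program's pattern, row-at-a-time scanning versus a single set-based query, is easy to demonstrate. A minimal sketch in Python against SQLite (the real program was COBOL, and the schema here is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, needs_pay INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(i, i % 100 == 0) for i in range(10_000)])

# Row-at-a-time, Hollerith-card style: fetch every id, then issue one
# query per row -- 10,001 round trips to answer one question.
def payees_row_at_a_time():
    ids = [r[0] for r in conn.execute("SELECT person_id FROM person")]
    result = []
    for pid in ids:
        row = conn.execute(
            "SELECT needs_pay FROM person WHERE person_id = ?", (pid,)
        ).fetchone()
        if row[0]:
            result.append(pid)
    return result

# Set-based: one query, one pass; the database does the filtering.
def payees_set_based():
    return [r[0] for r in conn.execute(
        "SELECT person_id FROM person WHERE needs_pay = 1")]
```

Both produce the same answer, but the first generates thousands of trivial statements per run - and with literals instead of bind variables, each of those became a hard parse against the shared pool.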

Brutal.

I insisted that our developers re-code the worst parts of the application - arguing that any other fix would at best kick the can down the road. 

In any case, by the time we reached our next peak load at semester start January '12, enough had been fixed that the database ran fine - probably better than ever.

But it cost us dearly. We worked most weekends that fall on rushed changes, patches, and re-configurations; one of my staff ended up in the hospital, and I aged five years in as many months. 

In my next post I'll outline the other significant events in 2011/2012, which altered my job and forced me to re-evaluate my career. 

Thirty-four years in IT - Swimming with the Itanic (Part 8)

For historical reasons, we were a strong VMS shop. Before they imploded, Digital Equipment treated EDU's very kindly, offering extremely good pricing on software in exchange for hardware adoption. In essence, a college could get an unlimited right to use a whole suite of Digital Equipment software for a nominal annual fee, and Digital had a very complete software catalog. So starting in the early 1990's, our internally developed student records system (ERP) ended up on the VMS/VAX/RDB stack.

Digital imploded and got bought by Compaq, who got bought by HP. Somewhere along the line the RDB database line ended up at Oracle.

For most of our time on VMS & RDB we suffered from severe performance problems. Our failure in addressing the problems was two-fold: the infrastructure team didn't have good performance data to feed back to the developers, and the development team considered performance to be an infrastructure/hardware problem. This resulted in a series of frantic and extremely expensive scrambles to upgrade VAX/Alpha server hardware. It did not, however, result in any significant effort to improve the application design.

Between 1993 and 2005, we cycled through each of:
  1. Standalone VAX 4000's
  2. Clustered AlphaServer 4100's
  3. Standalone AlphaServer GS140's
  4. Standalone AlphaServer GS160's
And of course mid-life upgrades to each platform. 

Each upgrade cost $millions in hardware, and each upgrade only solved performance problems for a brief period of time. The GS160's lasted the longest and performed the best, but at an extremely high cost. At no point in time did we drill deeply into application architecture and determine where the performance problems originated.

During that time frame we got advice from Gartner that suggested that moving from VMS to Unix was desirable, but moving from RDB to Oracle was critical, as they did not expect Oracle to live up to their support commitments for the RDB database product. So in 2009 we moved from 35 individual RDB databases spread across four GS160's, to one Oracle 10G database on a Sun Microsystems E25k, in a single, extremely well implemented weekend-long database migration, platform migration, and 35:1 database merger. Kudos to the development team for pulling that off.

Unfortunately we carried forward large parts of the poor application design and transferred the performance problems from RDB to Oracle. At the time, though, the DBAs were part of my team. I had a very good Oracle DBA and Unix sysadmin, both of whom were able to dig into performance problems and communicate back to developers. We were pretty good at detailing the performance problems and offering remedies and suggested design changes. 

Though performance slowly got better, the full impact of poor application design was yet to be felt.

As soon as the databases were combined and hosted on SPARC hardware, continuing with the GS160's made no sense. They were costing $600k/yr in hardware and software maintenance, were now significantly oversized, and were still running the dead-end OpenVMS operating system. This put us in a tough spot. The development team was focused on minimizing its commitment to any re-platforming and was only interested in a move from AlphaServer to Itanium. For me, Itanium (or Itanic, as I called it at the time) was a dead end, and our only move should be to Unix (Solaris). But because the cost to migrate to Itanic was much lower - the application would only have to be recompiled, not re-platformed - the Itanic advocates won the argument. We ended up purchasing Itanium blade servers at a 3-year cost roughly equal to 18 months of support on the GS160's.

By that time HP's support for OpenVMS had eroded badly. Support for Oracle clients, Java, and other commonly used software was poor or non-existent. That OpenVMS was dead was visible to all but the few for whom OpenVMS was a religious experience.

As we were bashing the decision around in 2009, I strongly suggested that if we purchased Itanium we'd be on the dead-end OpenVMS platform for five more years. I was wrong. We were on Itanium blades and OpenVMS for nine years, until 2018. The (only) good part of that decision was that the Itanium blade servers ran very well and were inexpensive to maintain. And since OpenVMS was pretty dead by then, we did not spend much time on patches and upgrades, as few were forthcoming from HP.

This is a case where our reluctance to take on some short-term pain resulted in our having to maintain a dead-end obsolete system for many years. 

Thirty-four Years in IT - Addressing Application Security (Part 7)

In the 2008-2009 period, we finally started to seriously address application layer security in our development group.

By that time it was clear that the threat to hosted applications had moved up the stack, and that the center of gravity had shifted toward compromising web applications rather than the hosting infrastructure. This meant that our applications, for which essentially no serious security-related effort had been made, had to finally receive some attention. Our development teams were not tuned in to the security landscape and thus were paying scant attention to web application security. As our home-grown applications' exposure to the Internet was mostly limited to simple, student-facing functionality such as course registration and grading, the lack of attention was perceived as appropriate by all but a few of us infrastructure and security geeks.

In other words, the dev teams were at the unconscious-incompetence level of the Conscious Competence matrix.