Thirty-Four Years in IT - Why not Thirty-Five?

After I was sidelined (Part 10) we had another leadership turnover. This time the turnover was welcome. I ended up in a leadership position under a new CIO, which let me apply some of what I had studied while I was sidelined. My new team took on a couple of challenges: (1) introducing cloud computing to the organization, and (2) attempting to add a bit of architectural discipline to the development and infrastructure teams and processes. The first was somewhat successful; the latter was not.

Cloud

I had been slowly working to get a master agreement with Amazon - a long, slow process when you are a public sector agency. When our new CIO mentioned 'cloud' I did a bit of digging and found out that Microsoft had added the phrase 'and Azure' to our master licensing agreement. Microsoft's foresight saved me months of contract negotiations. They made it trivial to set up an enterprise Azure account. So Azure became our default 'cloud'.

I had been running the typical nerdville home servers. Moving them from in-house Macs to Linux in Azure was trivial - a weekend of messing around. I affirmed to our CIO that we had a fair number of apps that could be hosted in IaaS, and picked a couple of crash-test dummy apps for early migration.

One of my staff and I spent a few months creating and destroying various assets in Azure, and came to the conclusion that the barriers to cloud adoption would be found mostly in our own staff, not in the technology stack. Infrastructure staff would have to re-think their jobs and their roles in the organization, and development staff would have to re-think application design. Both would challenge the organization.

I also did a few quick-and-dirty demonstrations to get some ideas on how we might architect an enterprise framework for moving to Azure - such as hiding an Azure instance behind a firewall in our test lab to show that we could create virtual data centers that appeared to be in our RFC-1918 address space but were actually in Azure IaaS. We also presented quite a bit of what we learned to our campus IT staff at various events and get-togethers, hoping to build a bit of momentum at the campuses.

On the down side, I ran into significant barriers among our own managers and their staff. A quorum of managers and staff were cloud-averse and/or firmly committed to technologies and vendors that had no cloud play. We had to fight FUD from within.

Architecture

The Architecture activity was not successful. We had been running 'seat-of-the-pants' for years, resulting in many ad-hoc and orphaned tools, technologies and languages, and we were thinly staffed. So the idea that by adding rigor and overhead up front we'd end up with better technology that was less work to maintain was not well accepted. The entire concept of design first, then build, was a tough sell, as the norm had been to start building first and figure out the design on the fly (if at all). Modern architectures, such as presenting an API to our campuses, were rejected outright. And of course the idea that two development teams or two infrastructure workgroups would agree on a tool, language, or library - much less an architecture - was an even tougher sell.

The team (and any semblance of a formal architecture) was disbanded through attrition, and the body of standards, guidelines, processes, and practices is no doubt still in a SharePoint site, unmaintained and unloved.

Why did I leave when I did?

As time went on, I found myself in fundamental disagreement with how the organization treated its people. Leadership was making personnel decisions that I could not support, that caused the loss of several of our best people, and that placed other staff in positions where they could not succeed or be happy.

That leadership would move staff into positions in which they had no interest, and do it without the concurrence of their manager (me), was unacceptable. To pile on work that was outside an employee's core skillset, and then try to destroy their career when they failed, is unacceptable. I don't want to work for an organization like that, and because of financial decisions I made years ago, I do not have to work for an organization like that.

I did the math, got my ducks in a row, and retired. 

My only regret is that I was unable to influence the disposition of the staff that I left behind. 

Previous: Part 10 Leadership Chaos, Career Derailed

Thirty-four Years in IT - Leadership Chaos, Career Derailed (Part 10)

This post is the hardest one to write. I've been thinking about it for years without being able to put words to paper. With the COVID-19 stay-at-home directive, I can't procrastinate anymore, so here goes.

As outlined in Part 9, Fall 2011 was a tough period. To make it tougher, the CIO decided to hire two new leadership-level positions - a new CISO over the security group, and a new Associate Vice Chancellor (AVC) over the Infrastructure group. The infrastructure AVC would be my new boss.

The CISO position was really interesting to me. The infrastructure position was not as interesting, as it would have been more of the same but with more stress and more headaches. I applied, was interviewed and rejected for both. I'm sure that part of the problem was that with the chaos of our poorly written ERP application and the Oracle database issues that Fall, I really didn't prepare for either interview. Not having interviewed for a job in more than a decade didn't help either.

Both hires ended up being bad for the organization and for my career. I'm pretty sure that both knew that I had been a candidate for the positions and both were threatened by me.

The new CISO was determined to sideline me and break down the close cooperation between my team and the security team. Whereas we had been working together for years, the security team was now restricted from communicating with me without the new CISO's permission. I was blackballed - cut out of all security related incidents, conversations, and meetings. Anything that had my fingerprints on it was trashed, either literally or figuratively. Staff who had worked closely with me in the past were considered disloyal to him and were sidelined and harassed.

The new CISO also declared that we were 'too secure' and tried to get a consultant to write up a formal document to that effect. Whatever security related projects we had in the pipeline were killed off. Rigorous processes around firewall rules, server hardening and data center security were ignored. Security would no longer impact the ability to deploy technology.

The new Infrastructure AVC started out by pulling projects from me without telling me, and by meeting with my staff without me in the room and telling them I was 'doing it wrong'. Staff were still loyal to me and kept me informed as to what was transpiring. It was clear that I was viewed as a threat and was not welcome.

I confronted my new boss and advised that if he was going to manage my staff without me in the room, he might as well move them directly under him on the org chart. He had a bit of a shocked look on his face, and then obliged. I also advised that, as I now had no staff and no role in the organization, he needed to find me something to do.

I knew that he'd have a hard time firing me - I was protected by Civil Service rules - but I also knew that my work environment would be poor until either he and I figured out how to work together or one of us left. My choice was to try to stick it out and make the best of it, or to move on. I probably had options either within the State University system or with the State of Minnesota. I really am a Higher Ed. guy though, so I was reluctant to move. I decided to wait it out - and meanwhile get my financial ducks in order and put out job feelers.

He responded by blackballing me from any conversation of significance, by trashing me in e-mails to colleagues, by making it clear to my former staff that I was not to have any work-related conversation with them without him present, and that referencing anything I had said or done over the last dozen years was unwelcome. At one point I had to advise my former staff that they should not be seen with me, as it might impact their relationship with the new bosses. In an effort to convince me to leave (or perhaps out of sympathy), he even called me into his office and showed me a job posting at another State agency that he thought might interest me.

He also moved me out of the IT area and across the hall into finance, where I would not be available to my former staff (and where I made a couple of great friends).

The environment was chaotic and toxic. Teams got rearranged and disrupted with no clear idea why or what outcome was expected. Morale was poor, tempers were high. An extremely toxic new director/manager was hired into my old position. As one could predict, some of our best staff left and others lost enthusiasm and dedication. I ended up fielding requests to be a job reference for many of my former staff.

After about six months he and I smoothed things out to the point where we could work together, as long as I stayed away from his (my former) staff and offered no thoughts on anything he was doing to anyone other than him. I had no clear responsibilities and as long as I stayed out of his sandbox I could do pretty much whatever I wanted. So I used that time to re-think quite a bit of what I had been doing, and in particular to lay groundwork for work that paid off a few years down the road, work that I'm quite proud of and will write about at a later date.

After about a year and a half we had another CIO change and both the CISO and AVC left. Ironically, on the AVC's last day I was the one who helped him clear his office, walked him out to the parking ramp and saw him off.

About that time the 'toxic' director also left. A couple of us who were black sheep ended up back in the thick of things when we found out how badly our technology and security had degraded. That's also when we found out that the 'completed' plans for moving a data center two months out did not exist.

The nightmare was over, but much damage had been done.

In retrospect, should I have left the organization? I'm not sure. For me it was very difficult to watch what my team and I had built over the previous fifteen years get torn apart, especially when what came out of the teardown was, I believed, inferior to what we had. If what resulted had been an improvement, it would have been easier. Very little of the technology that ran the system was built by anyone other than us. My fingerprints were on everything - good or bad, right or wrong. And to the new CISO and AVC, everything was bad and wrong.

But the data center got moved on time.

Part 9 - The Application That Almost Broke Me

Thirty-four years in IT - The Application That Almost Broke Me (Part 9)

The last half of 2011 was, for me and my team, a really, really tough time.

As I hinted at in this post, by August 2011 we were buried in Oracle 11 and application performance problems. By the time we were back in a period of relative stability that December, we had:

  • Six Oracle Sev 1s open at once, the longest open for months. The six incidents were updated a combined total of 800 times before they were all finally resolved.
  • Multiple extended database outages, most during peak activity at the beginning of the semester. 
  • Multiple 24-hour+ Oracle support calls.
  • An on-site Oracle engineer.
  • A corrupt on-disk database forcing a point-in-time recovery of our primary student records/finance/payroll database from backups.
  • Extended work hours, with database patches and configuration changes more weekends than not.
  • A forced re-write of major sections of the application to mitigate extremely poor design choices.
The causes were several:
  1. Our application, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing variables as parameters (see the sketch below). 
  2. The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards. 
  3. The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once. 
  4. A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.
  5. Listener crashes that caused the app - lacking basic error handling - to fail, and that required an hour or so of recovery.
Also:
  1. We missed one required Solaris patch that may have impacted the database.
  2. We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.
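
To make the first cause concrete, here is a minimal sketch of the difference (the table and column names are illustrative, not our actual schema). Every literal-valued statement is unique SQL text, so the database hard-parses and caches each one separately; the bind-variable form is parsed once and reused:

-- Literal form: each execution is a distinct statement,
-- hard-parsed and cached on its own.
select * from student where student_id = '9876543';
select * from student where student_id = '4982746';

-- Bind-variable form: one shared statement, parsed once and reused,
-- with only the bound value changing per execution.
select * from student where student_id = :student_id;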
None of this was obvious from AWRs, and it was only after several outages and after we built tools to query the SGA that we saw where the problem might be. What finally got us going in a good direction was seeing a library cache with a few hundred thousand queries like this:

select * from student where student_id = '9876543';
select * from student where student_id = '4982746';
select * from student where student_id = '4890032';
select * from student where student_id = '4566621';
[...]

Our app killed the database - primarily because of poor application design, but also because of Oracle bugs. 
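
A sketch of the kind of query that surfaces this pattern - not our exact tooling, but the standard approach against Oracle's V$SQL view, where FORCE_MATCHING_SIGNATURE groups statements that differ only in their literals:

-- Count library cache entries that are identical except for literal values;
-- a huge count for a single signature is the smoking gun.
select force_matching_signature,
       count(*)        as versions,
       sum(executions) as total_execs
  from v$sql
 group by force_matching_signature
having count(*) > 1000
 order by versions desc;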

An analysis of the issue by an Oracle engineer, from one of the SRs:
... we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.
Along with the poor application design, we also hit a handful of mutex-related bugs specific to 11.2.0.x that affected applications with our particular design. We patched those as soon as we could. We also figured out that the network cards on SPARC E25ks can only do about 50,000 interrupts per second, and that adding more network cards would finally resolve some of the issues we were having with the database listeners.

Pythian has a good description of a similar issue - which, had it been written a year earlier, would have saved us a lot of pain.

Why didn't this happen on Oracle 10? 

I suspect that in Oracle 10, the SGA size was physically limited and the database engine simply churned through the literal queries, hard-parsed them, tossed them out of memory, and drove up the CPU. But it never ran into mutex issues. It was in 'hard-parse hell' but, other than high CPU, worked OK. In Oracle 11, the SGA management must have been significantly re-written, as it was clear that the SGA was allowed to grow very large in memory, which (by our analysis) resulted in many tens of thousands of queries in the SGA being churned through at a rate of many thousands per second.

Along the way we also discovered COBOL programs that our system admins had been complaining about for 15 years - such as the program that scanned millions of individual records in the person table, one at a time, looking for who needed to get paid that week. Never mind that they could have answered that question with a single query (sketched below). And of course the program did this scan twenty-six times, once for each pay period in the last year - just in case an old timecard had been modified.

Brutal.
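
The single-query alternative is not complicated. Something along these lines - the table and column names are illustrative, not our actual schema - would have replaced all twenty-six row-at-a-time scans with one set-based pass:

-- One pass: everyone with an outstanding timecard in any pay period
-- from the last year, instead of twenty-six sequential scans.
select distinct t.person_id
  from timecard t
 where t.status = 'UNPAID'
   and t.pay_period_end >= add_months(sysdate, -12);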

I insisted that our developers re-code the worst parts of the application - arguing that any other fix would at best kick the can down the road. 

In any case, by the time we reached our next peak load at semester start in January '12, enough had been fixed that the database ran fine - probably better than ever.

But it cost us dearly. We worked most weekends that fall on rushed changes, patches, and re-configurations; one of my staff ended up in the hospital; and I aged five years in as many months.

In my next post I'll outline the other significant events in 2011/2012, which altered my job and forced me to re-evaluate my career.