Building Non-Functional Requirements Framework - Requirements Categories

I'm planning on documenting a framework that we built for managing non-functional requirements. This is post #2 of the series. 

In Post #1, "Last In - First Out: Building a Non-Functional Requirements Framework - Overview," I outlined the template and definitions for our Non-Functional Requirements.

We also had to address outstanding audit findings that pointed out the lack of enterprise-wide security standards. Blank templates weren't going to cut it. The next steps were to create a generic set of Non-Functional Requirements within each category, applicable to any system we were likely to encounter, and then to follow up with a structured, objective framework for applying the requirements to a particular system. The next few posts will cover these topics.

To make the NFR's re-usable and applicable to as many systems as possible, we created multiple Metrics within each NFR. Systems with relatively simple needs would be required to meet a lower Metric, while systems with higher or stricter needs would meet the higher Metrics in the NFR. The Metrics were designed so that the very lowest level would be applicable to a single personal computing device with no stored confidential data, the highest Metric would be applicable to our largest system with the most confidential or financial data, and the in-between Metrics would cover systems with security and availability requirements between those extremes. This allowed us to create a single Requirement applicable to many (or any) systems, proportional to their relative value, without subjecting low-value systems to rigorous requirements.
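
To make the tiering concrete, here is a minimal sketch in Python of how the Metrics for one NFR might be expressed as data. The levels, scale, and values here are invented for illustration; they are not the actual Minnesota State Metrics.

# Hypothetical tiered Metrics for a single NFR; levels and values
# are invented for this sketch, not taken from the real framework.
RESILIENCY_HARDWARE_METRICS = {
    # Level 1: a single personal device, no stored confidential data
    1: {"minimum": "no redundancy required", "target": "vendor repair"},
    # Level 2: a departmental system with some confidential data
    2: {"minimum": "redundant disks (RAID)", "target": "redundant servers"},
    # Level 3: the largest systems, confidential or financial data
    3: {"minimum": "clustered or load-balanced servers",
        "target": "no single point of hardware failure"},
}

def required_metric(system_level: int) -> dict:
    """Return the Metric a system must meet at its assessed level."""
    return RESILIENCY_HARDWARE_METRICS[system_level]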

Note that the Availability, Performance and Reliability requirements found in other models are not requirements categories in our model. We determined that if a system met a set of Resiliency, Recoverability and Security requirements, it would also meet an appropriate level of availability and reliability as a byproduct. Likewise, the system would be able to meet Performance requirements as a byproduct of the Scalability and Maintainability requirements.

Usability, Portability and Compatibility are common requirement families in other models, but as our model was driven by short-term infrastructure and security needs, they were left out of the early phases.

Keep in mind that these categories and requirements were designed to be usable in our environment - a public College and University system.

The categories and a high level description of the requirements in each category follow:

Category: Resiliency

Resiliency requirements describe the ability of the system to continue to function during common failure modes. A resilient system continues to work after routine failures (disk, server, OS or process). Resiliency is necessary to meet availability requirements and usability requirements. A resilient system may use technologies such as redundancy, clustering, load balancing, error handling, and error recovery to function after component failure. Resiliency encompasses the concepts of availability, reliability, robustness, fault tolerance and exception handling as described by other authors. 

Our model references three Resiliency requirements - Hardware Resiliency, Software Resiliency, and Environmental Resiliency. Each requirement may have multiple Levels, each with its own Metric.

Resiliency-Hardware Requirement: The ability of the system to continue business functionality upon physical failure of hardware components that make up the system. 

Incorporates traditional concepts of Redundancy, Clustering, Load Balancing and Fault Tolerance. A system's 'Availability', RPO and RTO are derived from this and other requirements.

This requirement is intended to force the designer to leverage high availability technologies for systems in which the impact of an unavailable system reaches certain thresholds. 

Resiliency-Software Requirement: The ability of the system to continue business functionality upon logical failure of software components that make up the system.

Incorporates traditional concepts of Redundancy, Clustering, Load Balancing and Fault Tolerance. A system's 'Availability', RPO and RTO are derived from this and other requirements.

In general, the designer should consider Resiliency – Software, and Resiliency – Hardware NFR’s as a unit and engineer for both NFR’s in concert. In particular, the software must be designed so as to gracefully manage both software and hardware failures using robust transaction management and error handling. Failure modes and failure domains must be well understood.

Resiliency - Environmental Requirement: The ability of the system to continue business functionality upon physical failure of site environmentals, including power, cooling, and related components.

Incorporates redundant power and cooling, uninterruptible power supplies, and generator backup. A system's 'Availability', RPO and RTO are derived from this and other requirements.

This NFR specifies that the facilities-related components that support the system have the appropriate level of recoverability and resiliency. 

Designers should engineer for routine power and cooling failures and provide appropriate backup power and alternate cooling as necessary. Facilities failure domains such as power supplies, power distribution units, air conditioning units, etc. should be considered.

Category: Recoverability

Recoverability requirements describe the ability to recover from failed states and return the system to its as-built condition. Using the example of a failed unit of hardware, a resilient system will continue to function after the failure, while a recoverable system will have a simple and predictable method for recovering from the hardware failure. Data backups, data replication, hot-swap hard drives, and automated operating system and application deployment tools may be technologies or techniques used to recover a failed component.

Our model references four Recoverability requirements: Component Recovery, Site Recoverability, Configuration Recovery and Logical Recovery. Each requirement may have multiple Levels, each with its own Metric.

Recoverability-Component Requirement: The ability to repair or replace system components predictably, with minimum work effort, and with no loss or disruption of business functionality.

Incorporates traditional concepts of Configuration Management and Maintainability. Assures that components can be brought online without maintenance windows.

While the resiliency NFR’s cover the behavior of systems when components fail, the recoverability NFR’s assure that the design of systems includes the ability to restore the system to its original, pre-failure state in a predictable manner. 

To assure component recoverability, the designer needs to assure that the configuration of all system components is known, and that a means exists to create new components that are identical to existing components.

Recoverability-Site Requirement: The ability of the system to resume business functionality upon physical or logical failure of the site housing components of the system.

Incorporates traditional concepts of Disaster Recovery, site failover, site replication, and off-site backups. A system's 'Availability', RPO and RTO are derived from this and other requirements.

This NFR sets the minimum Recovery Point Objective (RPO) and Recovery Time Objective (RTO) that systems must meet under site-level failures affecting data centers, buildings or campuses.

Recoverability - Configuration Requirement: The ability of the system to resume business functionality upon logical failure of system metadata or system configuration information.

Incorporates traditional concepts of change management (in part), configuration management, and test and back-out plans for planned configuration changes.

The intent of this NFR is to provide assurance that the system is designed and managed such that if any portion of its configuration is modified for any reason, intentionally or not, the system can be recovered back to its pre-modification state. This is intended to discourage systems whose configuration is ad-hoc, unstructured, or 'mouse driven', as opposed to template- or script-driven, as in the toy sketch below.
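
A toy sketch of that distinction, with hypothetical names and parameters: a script-driven configuration is generated from versioned inputs, so any prior state can be regenerated exactly.

# Toy sketch: configuration rendered from versioned parameters, so any
# prior revision can be regenerated exactly. Names are hypothetical.
CONFIG_TEMPLATE = "max_connections={max_conn}\nlisten_port={port}\n"

revisions = {
    1: {"max_conn": 100, "port": 8080},
    2: {"max_conn": 250, "port": 8080},  # a change is a new revision,
}                                        # not an in-place 'mouse' edit

def render(revision: int) -> str:
    """Regenerate the exact configuration for any recorded revision."""
    return CONFIG_TEMPLATE.format(**revisions[revision])

assert render(1) == "max_connections=100\nlisten_port=8080\n"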

Recoverability - Logical Requirement: The ability of the system to resume business functionality upon logical failure of application-managed business data.

Incorporates traditional concepts of database 'point in time recovery', file system snapshots and daily backups. A system's RPO is derived from this and other requirements.

This NFR is intended to assure that the system is designed so that after the data in a system has been modified outside of normal business practices (e.g. logical file system or database corruption, poor configuration management, or unauthorized data modification by internal or external entities), the data managed by the system can be recovered to its state at a point in time prior to the modification.

Category: Scalability

Our model has a single Scalability Requirement. The requirement may have multiple Levels, each with its own Metric.

Scalability requirements describe the ability to add and remove system capacity without affecting the availability of the system, while maximizing maintainability and constraining costs.

Scalability - Component Requirement: The ability to dynamically and cost effectively add or remove capacity by adding or removing hardware or software components. 

Incorporates the traditional concepts of 'Horizontal Scalability', load balancing and dynamic capacity management. Assures that systems are compatible with cloud technologies.

The intent of this NFR is to force systems into a horizontally scalable architecture, and to limit or prohibit designs that depend on large-scale hardware upgrades for additional capacity, i.e. systems must be designed to scale out, not up.

Category: Maintainability

Our model has a single Maintainability Requirement. The requirement may have multiple Levels, each with its own Metric.

Maintainability requirements describe the ability to maintain the system over its operational life. Among other attributes, a maintainable system can have routine hardware upgrades and application deployments without user-affecting outages, it will have monitoring, logging and auditing sufficient for routine troubleshooting, and it will have a low operational cost. Maintainability encompasses manageability, upgradability, deployability and flexibility as described by other authors.

Maintainability-Component Requirement: The ability to maintain the hardware, software and environmental components of a system without disrupting business functionality, and with minimal or no planned system outages.

Incorporates traditional concepts of Service Management, Change Management (in part), Maintenance Windows and Continuous Maintenance. Assures that the effect of system maintenance on users is minimized.

This requirement forces the designer to consider the maintainability of the system as a part of the design process. The designer should select and configure components such that:

  • Routine maintenance can be conducted on-line, using common technologies such as load balancing and clustering or equivalent.
  • Application patches and upgrades can be implemented on-line.
  • The release of new application functionality, including database schema changes, can be done on-line in many or most cases.

Category: Security

The ability to maintain the confidentiality and integrity of a system and the data contained in or controlled by the system. Requirements relate to system access, system integrity, system confidentiality and system configuration.

Our model references six Security Requirements - Configuration Integrity, Configuration Assessment, Data Classification, Data Encryption, Data Access, and Awareness and Training.

Security - Configuration Integrity Requirement: The ability to determine the source of modifications to the logical and physical configuration of a system. Logging and auditing of configuration information and changes. The ability to prevent or detect unauthorized changes to configuration or data. The ability to respond to unauthorized access or modification of system configuration or data. The ability to determine the configuration of a system at an arbitrary point in time in the past. 

Incorporates the traditional concepts of Configuration Management, Change Management (portions of), security auditing, Business Activity Logging, Intrusion Detection/Prevention and Malware Detection/Prevention, and security incident handling.

The intent of this requirement is to ensure that the system is designed so that:

  • The system can support/enable least privilege and role based system configuration.
  • Configuration changes are detectable. This implies technologies such as routine, scheduled, continuous, or near-continuous configuration auditing (see the sketch after this list).
  • Auditing of changes in configuration creates an immutable audit trail, and the audit trail is properly secured.
  • The configuration of a system can be recovered back to the state that the system was in prior to the modification. 
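
To illustrate the detection idea only (this is not our actual tooling), a minimal sketch of scheduled configuration auditing might hash each configuration file and compare it against a secured baseline, recording differences to an audit trail. Paths and file names are hypothetical.

# Minimal configuration-change detection sketch; paths are hypothetical
# and the audit log stands in for an immutable, separately secured store.
import hashlib
import json
import time
from pathlib import Path

BASELINE = Path("/var/audit/config-baseline.json")  # known-good hashes
AUDIT_LOG = Path("/var/audit/config-audit.log")

def file_hash(path: Path) -> str:
    """SHA-256 of a configuration file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def audit(paths: list[Path]) -> None:
    baseline = json.loads(BASELINE.read_text())
    for p in paths:
        current = file_hash(p)
        if baseline.get(str(p)) != current:
            # Record the change; a real system would also alert on it.
            with AUDIT_LOG.open("a") as log:
                log.write(f"{time.time()} CHANGED {p} {current}\n")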

Security - Configuration Assessment Requirement: The assurance that the initial configuration of the system is appropriately secure, that the system configuration is maintained in an appropriately secure state over the life of the system and that the state is verified and tested. 

Incorporates the traditional concepts of system hardening, code review, Vulnerability Management, Pen Tests, Patch Management and least privilege for access and modification of system configuration.

The intent of this requirement is to ensure that systems are initially configured to a secure state, and that they remain in that state over the life of the system.

  • The initial condition of the system is ‘hardened’ consistent with this requirement. 
  • A process or method must be implemented to ensure that the system is maintained in that state over its lifetime.
  • The condition of the system is verified periodically, depending on the Level within the requirement, for example by using vulnerability scans of systems and application code. 
  • The application code is written and tested in accordance with a formal software development practice.
  • Technologies, tools, frameworks and libraries are implemented in a consistently secure manner.

Security - Data Classification Requirement: The classification of data consistent with State and Federal regulations and the assignment of data ownership.

Security - Data Encryption Requirement: The conditions under which data must be transported, transmitted and stored in an unreadable, encrypted format.

Incorporates the traditional concepts of protecting data using encryption such that the data is only readable by authorized individuals.

The intent of this requirement is to ensure transport layer security is implemented for data that is transmitted over a less trusted network, and that encryption is implemented for data at rest. Encryption of data at rest may include full disk encryption, database encryption, and/or encryption of backup media.
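
As an illustration only (not the framework's prescribed mechanism), here is a minimal sketch of at-rest encryption with a symmetric key, using Python's cryptography package. Key storage and rotation, which are the hard part, are omitted, and the record contents are hypothetical.

# At-rest encryption sketch using the 'cryptography' package's Fernet.
# Key management is omitted; the record contents are hypothetical.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice: fetched from a key vault
cipher = Fernet(key)

record = b"student_id=9876543"   # hypothetical sensitive record
token = cipher.encrypt(record)   # store only the ciphertext at rest
assert cipher.decrypt(token) == record  # readable only with the key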

Security - Data Access Requirement: The ability to limit logical and physical access to systems and data to authorized individuals, the ability to limit modification of systems and data to authorized individuals, the logging and auditing of system and data access, and the ability to alert on unauthorized access.

Includes traditional concepts such as account provisioning and management, account credentials, authorization, least-privilege-based data access, business activity logging and audit logging, and security perimeters and perimeter controls.

The intent of this requirement is to limit access to data based on need-to-know to perform job duties and to alert on inappropriate access, and/or have an audit trail of access or activities (i.e. read, write, modify, delete) that can be traced to an individual. 

Security - Awareness and Training Requirement: The assurance that system administrators are adequately skilled and knowledgeable in information security and the implementation, management and maintenance of systems for which they are responsible. 

The intent of this requirement is to ensure system administrative personnel have the skills, knowledge and/or experience to effectively implement requirements defined by Federal or State law, regulations, contractual agreements, Policies, Procedures or other non-functional requirements. 

Checkpoint:

I've described templates, categories and a high level view of our Non-Functional Requirements. Next up - a series of posts describing each requirement, followed by a framework for applying the NFR's to an IT system. 

Building a Non-Functional Requirements Framework - Overview

I'm planning on documenting a framework that we built for managing non-functional requirements. This is post #1 of the series. 

A pain point for our infrastructure and security teams was a lack of usable, consistent availability and security requirements for our internally developed applications. The business analysts worked with the organization to create requirements for the functionality of the application but ignored most of what infrastructure, identity management, and security would need until the end of the development process. By the time these teams got insight into the application it was too late to wedge in new requirements. The net was that the organization was promised applications or enhancements, but because no consideration had been made for non-functional requirements, deadlines were often missed. The worst example was the pending release of a major new application that allowed manipulation of financial information, but for which no consideration had been made for authentication, authorization requirements, or database & application hosting security. Retrofitting that project added a year to the timeline.

Additionally, we had a series of outstanding audit findings related to the lack of enterprise-wide standards for securing systems. We tended to build secure and available systems because we knew what we were doing - not because we built to an objective, measurable standard. Auditors would prefer that we built to a standard that ensured a secure, available system - and of course we agreed.

When I had a few months of down time (approx. 2012-2013) I decided to see what the state of the art was in creating and maintaining non-functional requirements (NFR's). I looked at the obvious - FURPS+, ISO-9126, ISO-25010 and a handful of university-published research papers. My biggest issue with the various existing models was that they were software specific. I felt that NFR's should apply to entire systems, not just the software running on the system.

As far as I could tell at the time, the various sources, authors, consultants and Gartner didn't really agree on much other than that NFR's are not Functional Requirements and that you need to have some. I found that:
  • Many web sites have lists and examples of NFR's. 
  • Some try to define NFR's, few succeed. 
  • Others admit that NFR's are difficult to gather. 
  • Few apply NFR's to systems (vs. software).
  • FURPS+, ISO-9126, ISO-25010 and similar didn't treat security as a first-class citizen, nor did they address legal requirements.
What I did find though, were a couple of sources that I thought I could use to build a set of generic non-functional requirements.
  • Erik Simmons and John Terzakis (Intel) each have a fair bit of good information in various presentations that are readily searchable.
  • Tom Gilb's 'Planguage' seemed like a valuable tool, and both Simmons and Terzakis describe how to use Planguage for requirements writing. 
See: 
Specifying Effective Non-Functional Requirements, John Terzakis Intel Corporation June 24, 2012 ICCGI Conference Venice, Italy
21st Century Requirements Engineering: A Pragmatic Guide to Best Practices, Erik Simmons, Intel Corporation 

These sources were close to being adaptable, but rather than try to adopt an existing framework as-is, I thought that it'd be best for us to come up with something usable by borrowing from various existing sources, primarily borrowing bits and pieces from Simmons, Terzakis, and Gilb.

Into the Non-Functional Requirement Abyss

We agreed that Requirements are not designs and should not specify a particular technology or configuration. Requirements should specify an end result, not the path to achieve that result.  We tried to keep this in mind as we worked out our framework. 

Our starting point (and first disagreement…) was on the definition of non-functional requirements. Here's what we used:
  • Functional Requirements describe the intended behavior of the system (or software), or what a system should do.
  • Non-functional Requirements describe how well the system does whatever it does and under what constraints the system must operate. NFR's describe operational characteristics, performance, availability, etc. 
We decided to leverage a permutation of the common 'S.M.A.R.T' framework as a requirement for writing the requirements. By placing bounds on the requirements-writing process, we hoped that we'd end up with requirements that would have a chance of being valuable to the organization.

S.M.A.R.T.

Our version of 'S.M.A.R.T':

Specific: Requirements will be clear, concise, unambiguous, with consistent terminology, and with detail sufficient such that designs based on the requirements will meet operational goals.

Measurable: A test can be devised that verifies the requirement using a bounded measurement.

Attainable: The requirement is technically feasible within the constraints of current technology, and at least one design and implementation exists.

Realizable: The requirement is fiscally and manageably implementable within the constraints of organizational budget and staffing.

Unambiguous: The requirement will have a single, non-conflicting interpretation.

Traceable: The source of a requirement will be traceable to stakeholder need. The requirement is traceable to business strategy or roadmap. The life cycle of the requirement is traceable from its conception to its current state.

Specificity and Measurability were considered important because we hoped they would keep us from writing vague requirements or requirements for which there was no means of measuring attainment.

Attainability and Realizability were intended to prevent the implementation of requirements for which there was no solution possible, or no solution that was actually implementable in our environment with our limited capabilities. 

Traceability was desired to prevent the imposition of requirements for which there was no business need (requirements for the sake of requirements, or requirements to give us an excuse to buy shiny new resume-building technology) or requirements that appeared out of nowhere or were modified outside of a formal process.

Requirement Categories

Because we like putting things in neat buckets, we created broad categories of NFR's for which we thought we'd have an immediate need. The various industry models have categories (Maintainability, Reliability, Portability, etc.) but our thinking at the time was that those categories didn't work for us. So we started from scratch and ended up with the following:

Resiliency - The requirements that describe the ability of the system to continue to function during common failure modes. A resilient system continues to work after routine failures (disk, server, OS or process). Resiliency is necessary to meet availability requirements and usability requirements. A resilient system may use technologies such as redundancy, clustering, load balancing, error handling, and error recovery to function after component failure. Resiliency encompasses the concepts of availability, reliability, robustness, fault tolerance and exception handling as described by other authors. 

Recoverability - The requirements that describe the ability to recover from failed states and return the system to its as-built condition. Using the example of a failed unit of hardware, a resilient system will continue to function after the failure, while a recoverable system will have a simple and predictable method for recovering from the hardware failure. Data backups, data replication, hot-swap hard drives, and automated operating system and application deployment tools may be technologies or techniques used to recover a failed component.

Maintainability - The requirements that describe the ability to maintain the system over its operational life. Among other attributes, a maintainable system can have routine hardware upgrades and application deployments without user-affecting outages, it will have monitoring, logging and auditing sufficient for routine troubleshooting, and it will have a low operational cost. Maintainability encompasses manageability, upgradability, deployability and flexibility as described by other authors.

Scalability - The requirements that describe the ability to add and remove system capacity without affecting the availability of the system, while maximizing maintainability and constraining costs.

Security - The ability to maintain the confidentiality and integrity of a system and the data contained in or controlled by the system. Requirements related to system access, system integrity, system confidentiality and system configuration. 

These can be mapped back into FURPS+, ISO-9126 & ISO-25010 and ISO-27002, NIST 800-53, etc. 

Note that Availability, Performance and Reliability are not requirements categories in our model. We determined that if a system met a set of Resiliency, Recoverability and Security requirements, it would also meet an appropriate level of availability and reliability as a byproduct. Likewise, the system would be able to meet Performance requirements as a byproduct of the Scalability and Maintainability requirements.

Usability, Portability and Compatibility are common requirement families in other models, but as our model was driven by short-term infrastructure and security needs, they were left out of the early phases.

Non-Functional Requirements Form & Format

Following the work done by Simmons & Terzakis (Intel) we decided to implement a modified template and Planguage-like structured language for the NFR's. Each NFR exists as a single document.

The Non-functional requirements template and definitions that we settled on are:

Category: A text field representing the category that the requirement is classified under in the Minnesota State Model. The Category and Context are equivalent to the 'ID:' in Planguage or 'Ambition' in (Simmons/Intel 2011).

Context: A text field representing the requirement, unique within a category. The Category and Context are equivalent to the 'ID:' in Planguage or 'Ambition' in (Simmons/Intel 2011). 

Goals: Natural language description of the intent of the requirement and how it supports one or more of the general goals. The Goal is equivalent to 'Gist:' in Planguage or 'Ambition' in (Simmons/Intel 2011). 

Rationale: The reason that the requirement exists. Expressed in natural language. 

Requirement: The requirement to which the system will be held, written in a constrained natural language meeting the Minnesota State Non-Functional Requirements Attributes.

Metric: The measurement used to determine whether the requirement has been met, and the process or device used to locate the measurement on the scale. The Metric must include 'Minimum', the minimum acceptable measurement, and may include 'Target', the measurement to which the system must be designed.

Scale: The scale of measure used to quantify the requirement.

Stakeholders: Persons who stand to gain or lose by implementation of requirements. Expressed as roles, not individuals. 

Implications: Implications to the stakeholders if these requirements are not met.

Applicability: Systems or categories of systems to which requirement applies.

Status: One of Draft, Approved, Revised, or other constrained choice of statuses matching the requirements implementing process.

Author: Person responsible for authoring and maintaining requirement.

Revision: Sequential number representing approved revision of requirement.

Date: Date of last revision of requirement.

The NFR's have a structure and format that could be adapted to metadata driven requirements tooling.
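
As a sketch of what metadata-driven tooling could consume, the template maps naturally onto a structured record. Field names below mirror the template above; the types and defaults are assumptions for illustration, not the actual tooling.

# Sketch of the NFR template as structured data. Field names mirror the
# template above; types and defaults are assumptions for illustration.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class NonFunctionalRequirement:
    category: str                 # e.g. "Resiliency"
    context: str                  # unique within the category
    goals: str
    rationale: str
    requirement: str              # constrained natural language
    metric: dict                  # {"minimum": ..., "target": ...}
    scale: str
    stakeholders: list[str] = field(default_factory=list)  # roles, not people
    implications: str = ""
    applicability: str = ""
    status: str = "Draft"         # Draft | Approved | Revised | ...
    author: str = ""
    revision: int = 1
    revision_date: date = field(default_factory=date.today)  # 'Date' field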

Checkpoint

At this stage we had a handful of Non-Functional Requirements categories and a template for writing the NFR's, but no actual requirements.

Next up: Part #2 - A high level description of each Non-Functional Requirement

Thirty-Four Years in IT - Why not Thirty-Five?

After I was sidelined (Part 10) we had another leadership turnover. This time the turnover was welcome. I ended up in a leadership position under a new CIO. This allowed me to take advantage of some topics that I had studied while sidelined. My new team took on a couple of challenges: (1) introducing cloud computing to the organization, and (2) attempting to add a bit of architectural discipline to the development and infrastructure teams and processes. The first was somewhat successful, the latter was not.

Cloud

I had been slowly working to get a master agreement with Amazon - a long, slow process when you are a public sector agency. When our new CIO mentioned 'cloud' I did a bit of digging and found out that Microsoft had added the phrase 'and Azure' to our master licensing agreement. Microsoft's foresight saved me months of contract negotiations. They made it trivial to set up an enterprise Azure account. So Azure became our default 'cloud'.

I had been running the typical nerdville home servers. Moving them from in-house Mac's to Linux in Azure was trivial - a weekend of messing around. I affirmed to our CIO that we had a fair number of apps that could be hosted in IaaS, and picked a couple of crash-test dummy apps for early migration. 

One of my staff and I spent a few months creating and destroying various assets in Azure, and came to the conclusion that the barriers to cloud adoption would be found mostly in our own staff, not in the technology stack. Infrastructure staff would have to re-think their jobs and their roles in the organization, and development staff would have to re-think application design. Both would challenge the organization.

I also did a few quick-and-dirty demonstrations to get some ideas on how we might architect an enterprise framework for moving to Azure - such as hiding an Azure instance behind a firewall in our test lab to show that we could create virtual data centers that appeared to be in our RFC-1918 address space, but were actually in Azure IaaS. We also presented quite a bit of what we learned to our campus IT staff at various events and get-togethers, hoping to build a bit of momentum at the campuses.

On the down side, I ran into significant barriers within our own managers and their staff. A quorum of managers and staff were cloud-averse and/or firmly committed to technologies and vendors that had no cloud play. We had to fight FUD from within.

Architecture

The Architecture activity was not successful. We had been running 'seat-of-the-pants' for years, resulting in many ad-hoc and orphaned tools, technologies and languages, and we were thinly staffed. So the idea that by adding rigor and overhead up front we'd end up with better technology that was less work to maintain was not well accepted. The entire concept of design first, then build was a tough sell, as the norm had been to start building first and figure out the design on the fly (if at all). Modern architectures, such as presenting an API to our campuses, were rejected outright. And of course the idea that two development teams or two infrastructure workgroups would agree on a tool, language, or library - much less an architecture - was an even tougher sell.

The team (and any semblance of a formal architecture) was disbanded through attrition, and the body of standards, guidelines, processes, and practices is no doubt still in a SharePoint site, unmaintained and unloved.

Why did I leave when I did?

As time went on, I found myself in fundamental disagreement with how the organization treated its people. Leadership was making personnel decisions that I could not support, that caused the loss of several of our best people, and that placed other staff in positions where they could not succeed or be happy.

That leadership would move staff into positions in which they had no interest, and do it without the concurrence of their manager (me), was unacceptable. To pile on work that was outside the core skillset of an employee, and then try to destroy their career when they struggled, is unacceptable. I don't want to work for an organization like that, and because of financial decisions I made years ago I do not have to work for an organization like that.

I did the math, got my ducks in a row, and retired. 

My only regret is that I was unable to influence the disposition of the staff that I left behind. 

Previous: Part 10 Leadership Chaos, Career derailed

Thirty-four Years in IT - Leadership Chaos, Career Derailed (Part 10)

This post is the hardest one to write. I've been thinking about it for years without being able to put words to paper. With the COVID-19 stay-at-home directive, I can't procrastinate anymore, so here goes.

As outlined in Part 9, Fall 2011 was a tough period. To make it tougher, the CIO decided to hire two new leadership-level positions - a new CISO over the security group, and a new Associate Vice Chancellor (AVC) over the Infrastructure group. The infrastructure AVC would be my new boss.

The CISO position was really interesting to me. The infrastructure position was not as interesting, as it would have been more of the same but with more stress and more headaches. I applied, was interviewed and rejected for both. I'm sure that part of the problem was that with the chaos of our poorly written ERP application and the Oracle database issues that Fall, I really didn't prepare for either interview. Not having interviewed for a job in more than a decade didn't help either.

Both hires ended up being bad for the organization and for my career. I'm pretty sure that both knew that I had been a candidate for the positions and both were threatened by me.

The new CISO was determined to sideline me and break down the close cooperation between my team and the security team. Whereas we had been working together for years, the security team was now restricted from communicating with me without the new CISO's permission. I was blackballed - cut out of all security related incidents, conversations, and meetings. Anything that had my fingerprints on it was trashed, either literally or figuratively. Staff who had worked closely with me in the past were considered disloyal to him and were sidelined and harassed.

The new CISO also declared that we were 'too secure' and tried to get a consultant to write up a formal document to that effect. Whatever security related projects we had in the pipeline were killed off. Rigorous processes around firewall rules, server hardening and data center security were ignored. Security would no longer impact the ability to deploy technology.

The new Infrastructure AVC started out by pulling projects from me without telling me, meeting with my staff without me in the room and telling them I was 'doing it wrong'. Staff were still loyal to me and kept me informed as to what was transpiring. It was clear that I was viewed as a threat and was not welcome.

I confronted my new boss and advised that if he were going to manage my staff without me in the room, that he might as well move them directly under him on the org chart. He had a bit of a shocked look on his face, and then obliged. I also advised that as I now had no staff and no role in the organization, he needed to find me something to do.

I knew that he'd have a hard time firing me - I was protected by Civil Service rules, but I also knew that my work environment would be poor until either he and I figured out how to work together or one of us left. My choice was to try to stick it out and make the best of it or move on. I probably had options either within the State University system or with the State of Minnesota. I really am a Higher Ed. guy though, so I was reluctant to move. I decided to wait it out - and meanwhile get my financial ducks in order and put out job feelers.

He responded by blackballing me from any conversation of significance, by trashing me in e-mails to colleagues, by making it clear to my former staff that I was not to have any work related conversation with them without him, and that referencing anything that I had said or done the last dozen years was unwelcome. At one point I had to advise my former staff that they should not be seen with me, as it might impact their relationship with the new bosses. In an effort to convince me to leave (or perhaps out of sympathy), he even called me into his office and showed me a job posting at another State agency that he thought might be interesting to me.

He also moved me out of the IT area and across the hall into finance, where I would not be available to my former staff (and where I made a couple of great friends).

The environment was chaotic and toxic. Teams got rearranged and disrupted with no clear idea why or what outcome was expected. Morale was poor, tempers were high. A new director/manager, who was extremely toxic, was hired into my old position. As one could predict, some of our best staff left and others lost enthusiasm and dedication. I ended up fielding requests to be a job reference for many of my former staff.

After about six months he and I smoothed things out to the point where we could work together, as long as I stayed away from his (my former) staff and offered no thoughts on anything he was doing to anyone other than him. I had no clear responsibilities and as long as I stayed out of his sandbox I could do pretty much whatever I wanted. So I used that time to re-think quite a bit of what I had been doing, and in particular to lay groundwork for work that paid off a few years down the road, work that I'm quite proud of and will write about at a later date.

After about a year and a half we had another CIO change and both the CISO and AVC left. Ironically, on the AVC's last day I was the one who helped him clear his office, walked him out to the parking ramp and saw him off.

About that time the 'toxic' director also left. A couple of us who were black sheep ended up back in the thick of things when we found out how badly our technology and security had degraded. That's also when we found out that the 'completed' plans for moving a data center two months out did not exist.

The nightmare was over, but much damage had been done.

In retrospect, should I have left the organization? I'm not sure. For me it was very difficult to watch what my team and I had built over the previous fifteen years get torn apart, especially when what came out of the teardown was, I believed, inferior to what we had. If what resulted was an improvement, it would have been easy. Very little of the technology that ran the system was built by anyone other than us. My fingerprint was on everything - good or bad, right or wrong. And to the new CISO and AVC, everything was bad and wrong.

But the data center got moved on time.

Part 9 - The Application that Almost Broke Me

Thirty-four years in IT - The Application That Almost Broke Me (Part 9)

The last half of 2011 was, for me and my team, a really, really tough time.

As I hinted in this post, by August 2011 we were buried in Oracle 11 and application performance problems. By the time we were back into a period of relative stability that December, we had:

  • Six Oracle Sev 1's open at once, the longest open for months. The six incidents were updated a combined total of 800 times before they finally were all resolved. 
  • Multiple extended database outages, most during peak activity at the beginning of the semester. 
  • Multiple 24-hour+ Oracle support calls.
  • An on-site Oracle engineer.
  • A corrupt on-disk database forcing a point-in-time recovery from backups of our primary student records/finance/payroll database. 
  • Extended work hours, with database patches and configuration changes more weekends than not.
  • A forced re-write of major sections of the application to mitigate extremely poor design choices.
The causes were several:
  1. Our application, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing variables as bind parameters. 
  2. The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards. 
  3. The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once. 
  4. A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.
  5. Listener crashes that caused the app - lacking basic error handling - to fail and required an hour or so to recover.
Also:
  1. We missed one required Solaris patch that may have impacted the database.
  2. We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.
None of this was obvious from AWR's, and it was only after several outages and after we built tools to query the SGA that we saw where the problem might be. What finally got us going in a good direction was seeing a library cache with a few hundred thousand queries like this:

select * from student where student_id = '9876543';
select * from student where student_id = '4982746';
select * from student where student_id = '4890032';
select * from student where student_id = '4566621';
[...]

Our app killed the database - primarily because of poor application design, but also because of Oracle bugs. 
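
The flaw and its standard remedy - bind variables, so the database hard-parses one statement and reuses it for every execution - side by side. This is a sketch in Python for brevity (our code was COBOL); the connection details are hypothetical, and the table and column are as in the sample above.

# Literal SQL vs. bind variables, sketched with python-oracledb.
# Connection details are hypothetical.
import oracledb

conn = oracledb.connect(user="app", password="...", dsn="dbhost/orcl")
cursor = conn.cursor()

for sid in ("9876543", "4982746", "4890032"):
    # Anti-pattern: each iteration is a unique statement that must be
    # hard-parsed and cached in the shared pool, then never reused.
    cursor.execute(f"select * from student where student_id = '{sid}'")

    # Remedy: one shared statement, parsed once, executed with a bind.
    cursor.execute("select * from student where student_id = :id", id=sid)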

An analysis of the issue by an Oracle engineer, from one of the SR's:
... we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.
Along with the poor application design, we also hit a handful of mutex-related bugs specific to 11.2.0.x that were related to applications with our particular design. We patched those as soon as we could. We also figured out that network cards on SPARC E25k's can only do about 50,000 interrupts per second, and that adding more network cards would finally resolve some of the issues we were having with the database listeners.

Pythian has a good description of a similar issue - which, had it been written a year earlier, would have saved us a lot of pain.

Why didn't this happen on Oracle 10? 

I suspect that in Oracle 10, the SGA size was physically limited and the database engine simply churned through literal queries, hard-parsed them, tossed them out of memory, and drove up the CPU. But it never ran into mutex issues. It was in 'hard-parse-hell' but, other than high CPU, worked OK. In Oracle 11 the SGA must have been significantly re-written, as it was clear that the SGA was allowed to grow very large in memory, which (by our analysis) resulted in many tens of thousands of queries in the SGA being churned through at a rate of many thousands per second.

Along the way we also discovered COBOL programs that our system admins had been complaining about for 15 years - such as the program that scanned millions of individual records in the person table, one at a time, looking for who needed to get paid that week. Never mind that they could have answered that question with a single query (a sketch follows below). And of course the program did this scan twenty-six times, once for each pay period in the last year - just in case an old timecard had been modified.

Brutal.
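
For contrast, here is a hedged sketch of the set-based alternative, against a hypothetical person/timecard schema: let the database answer the question in one statement instead of fetching and testing each person row in application code.

# Set-based sketch against a hypothetical person/timecard schema,
# replacing the row-at-a-time scan (and its twenty-six passes).
from datetime import date, timedelta
import oracledb

conn = oracledb.connect(user="app", password="...", dsn="dbhost/orcl")
cursor = conn.cursor()
one_year_ago = date.today() - timedelta(days=365)

cursor.execute("""
    select distinct p.person_id
      from person p
      join timecard t on t.person_id = p.person_id
     where t.pay_period_end >= :cutoff
        or t.modified_date >= :cutoff
""", cutoff=one_year_ago)
people_to_pay = [row[0] for row in cursor]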

I insisted that our developers re-code the worst parts of the application - arguing that any other fix would at best kick the can down the road. 

In any case, by the time we reached our next peak load at semester start in January '12, enough had been fixed that the database ran fine - probably better than ever.

But it cost us dearly. We worked most weekends that fall on rushed changes/patches/re-configurations, one of my staff ended up in the hospital, and I aged five years in as many months.

In my next post I'll outline the other significant events in 2011/2012, which altered my job and forced me to re-evaluate my career. 

Thirty-four years in IT - Swimming with the Itanic (Part 8)

For historical reasons, we were a strong VMS shop. Before they imploded, Digital Equipment treated EDU's very kindly, offering extremely good pricing on software in exchange for hardware adoption. In essence, a college could get an unlimited right to use a whole suite of Digital Equipment software for a nominal annual fee, and Digital had a very complete software catalog. So starting in the early 1990's, our internally developed student records system (ERP) ended up on the VMS/VAX/RDB stack.

Digital imploded and got bought by Compaq, who got bought by HP. Somewhere along the line the RDB database line ended up at Oracle.

For most of our time on VMS & RDB we suffered from severe performance problems. Our failure in addressing the problems was two-fold - the infrastructure team didn't have good performance data to feed back to the developers, and the development team considered performance to be an infrastructure/hardware problem. This resulted in a series of frantic and extremely expensive scrambles to upgrade VAX/Alpha server hardware. It did not however, result in any significant effort to improve the application design.

Between 1993 and 2005, we cycled through each of:
  1. Standalone VAX 4000's
  2. Clustered AlphaServer 4100's
  3. Standalone AlphaServer GS140's
  4. Standalone AlphaServer GS160's
And of course mid-life upgrades to each platform. 

Each upgrade cost $millions in hardware, and each upgrade only solved performance problems for a brief period of time. The GS160's lasted the longest and performed the best, but at an extremely high cost. At no point in time did we drill deeply into application architecture and determine where the performance problems originated.

During that time frame we got advice from Gartner that suggested that moving from VMS to Unix was desirable, but moving from RDB to Oracle was critical, as they did not expect Oracle to live up to their support commitments for the RDB database product. So in 2009 we moved from 35 individual RDB databases spread across four GS160's, to one Oracle 10G database on a Sun Microsystems E25k, in a single, extremely well implemented weekend-long database migration, platform migration, and 35:1 database merger. Kudos to the development team for pulling that off.

Unfortunately we carried forward large parts of the poor application design and transferred the performance problems from RDB to Oracle. At the time, though, the DBA's were part of my team. I had a very good Oracle DBA and Unix sysadmin, both of whom were able to dig into performance problems and communicate back to developers. We were pretty good at detailing the performance problems and offering remedies and suggested design changes.

Though performance slowly got better, the full impact of poor application design was yet to be felt.

As soon as the databases were combined and hosted on SPARC hardware, continuing with the GS160's made no sense. They were costing $600k/yr in hardware and software maintenance, were now significantly oversized, and were still running the dead-end OpenVMS operating system. This put us in a tough spot. The development team was focused on minimizing its commitment to any re-platforming and was only interested in a move from AlphaServer to Itanium. For me, Itanium (or Itanic, as I called it at the time) was a dead end, and our only move should have been to Unix (Solaris). But because the cost to migrate to Itanic was much lower - the application would only have to be recompiled, not re-platformed - the Itanic advocates won the argument. We ended up purchasing Itanium blade servers at a 3-year cost roughly equal to 18 months of support on the GS160's.

By that time HP's support for OpenVMS had eroded badly. Support for Oracle clients, Java, and other commonly used software was poor or non-existent. That OpenVMS was dead was visible to all but the few for whom OpenVMS was a religious experience.

As we were bashing the decision around in 2009, I strongly suggested that if we purchased Itanium we'd be on the dead-end OpenVMS platform for five more years. I was wrong. We were on Itanium blades and OpenVMS for nine years, until 2018. The (only) good part of that decision was that the Itanium blade servers ran very well and were inexpensive to maintain. And as OpenVMS was pretty dead by then, we did not spend very much time on patches and upgrades, as few were forthcoming from HP.

This is a case where our reluctance to take on some short-term pain resulted in our having to maintain a dead-end obsolete system for many years. 

Thirty-four Years in IT - Addressing Application Security (Part 7)

In the 2008-2009 period, we finally started to seriously address application layer security in our development group.

By that time it was clear that the threat to hosted applications had moved up the stack, and that the center of gravity had shifted towards compromising the web applications rather than the hosting infrastructure. This meant that our applications, for which essentially no serious security-related effort had been made, had to finally receive some attention. Our development teams were not tuned in to the security landscape and thus were paying scant attention to web application security. As our home-grown applications' exposure to the Internet was mostly limited to simple, student-facing functionality such as course registration and grading, the lack of attention was perceived as appropriate by all but a few of us infrastructure and security geeks.

In other words, the dev teams were at the unconscious-incompetence level of the Conscious Competence Matrix.