tag:blogger.com,1999:blog-48065028046471197662024-03-13T04:17:22.345-05:00Last In - First OutRetired from the day job. Chasing fresh air & sunshine.
Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comBlogger202125tag:blogger.com,1999:blog-4806502804647119766.post-85544191410345754212020-12-23T20:48:00.001-06:002021-02-13T13:23:43.557-06:00Building Non-Functional Requirements Framework - Requirements Categories<p style="text-align: left;"><i>I'm planning on documenting a framework that we built for managing non-functional requirements. This is post #2 of the series.</i></p><p>In Post #1, <a href="https://blog.lastinfirstout.net/2020/12/building-non-functional-requirements.html" target="_blank">Last In - First Out: Building a Non-Functional Requirements Framework - Overview</a>, I outlined the template and definitions for our Non-Functional Requirements. </p><p style="text-align: left;">We also had to address outstanding audit findings that pointed out the lack of enterprise-wide security standards. Blank templates weren't going to cut it. The next step was to create a generic set of Non-Functional Requirements within each category, applicable to any system that we'd likely encounter. We then followed up with a structured, objective framework for applying the requirements to a particular system. The next few posts will cover these topics.</p><p style="text-align: left;">To make the NFR's re-usable and applicable to as many systems as possible, we created multiple <i>Metrics </i>within each NFR. Systems with relatively simple requirements would be required to meet a lower Metric, while systems with higher/stricter requirements would meet the higher Metrics in the NFR. 
The Metrics were designed so that the very lowest Metric would be applicable to a single personal computing device with no stored confidential data, the highest Metric would be applicable to our largest system with the most confidential or financial data, and the in-between Metrics would be applicable to systems with security and availability requirements between the extremes. This allowed us to create a single Requirement applicable to many (or any) systems, with rigor proportional to their relative value, without subjecting low-value systems to overly strict requirements. </p><p style="text-align: left;">Note that Availability, Performance and Reliability requirements found in other models are not requirements categories in our model. We determined that if a system met a set of Resiliency, Recoverability and Security requirements, the system would also meet an appropriate level of availability and reliability as a byproduct of the Resiliency, Recoverability and Security Requirements. Likewise, the system would be able to meet Performance requirements as a byproduct of scalability and maintainability requirements.</p><div style="text-align: left;"><p style="text-align: left;">Usability, Portability and Compatibility are common requirement families in other models, but as the model was driven by short-term infrastructure and security needs, they were left out in the early phases.<br /></p><p style="text-align: left;">Keep in mind that these categories and requirements were designed to be usable in our environment - a public College and University system.</p><p style="text-align: left;">The categories and a high-level description of the requirements in each category follow:</p></div><h3 style="text-align: left;">Category: Resiliency</h3><p>Resiliency requirements describe the ability of the system to continue to function during common failure modes. A resilient system continues to work after routine failures (disk, server, OS or process). 
Resiliency is necessary to meet availability requirements and usability requirements. A resilient system may use technologies such as redundancy, clustering, load balancing, error handling, and error recovery to function after component failure. Resiliency encompasses the concepts of availability, reliability, robustness, fault tolerance and exception handling as described by other authors. </p><p>Our model references three Resiliency requirements - Hardware Resiliency, Software Resiliency, and Environmental Resiliency. Each requirement may have multiple Metric levels.</p><p><b><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-hardware.html" target="_blank">Resiliency-Hardware Requirement:</a></b> The ability of the system to continue business functionality upon physical failure of hardware components that make up the system. </p><p><i>Incorporates traditional concepts of Redundancy, Clustering, Load Balancing and Fault Tolerance. A system's 'Availability', RPO and RTO are derived from this and other requirements.</i></p><p>This requirement is intended to force the designer to leverage high-availability technologies for systems in which the impact of an unavailable system reaches certain thresholds. </p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-software.html" target="_blank"><b>Resiliency-Software</b><b> Requirement</b><b>:</b></a> The ability of the system to continue business functionality upon logical failure of software components that make up the system.</p><p><i>Incorporates traditional concepts of Redundancy, Clustering, Load Balancing and Fault Tolerance. A system's 'Availability', RPO and RTO are derived from this and other requirements.</i></p><p>In general, the designer should consider the Resiliency – Software and Resiliency – Hardware NFR’s as a unit and engineer for both NFR’s in concert. 
In particular, the software must be designed so as to gracefully manage both software and hardware failures using robust transaction management and error handling. Failure modes and failure domains must be well understood.</p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement.html" target="_blank"><b>Resiliency - Environmental</b><b> Requirement</b></a><b><a href="https://blog.lastinfirstout.net/p/non-functional-requirement.html" target="_blank">:</a> </b>The ability of the system to continue business functionality upon physical failure of site environmentals, including power, cooling, and related components.</p><p><i>Incorporates redundant power, cooling, uninterruptible power, generator backup. A system's 'Availability', RPO and RTO are derived from this and other requirements.</i></p><p>This NFR specifies that the facilities-related components that support the system have the appropriate level of recoverability and resiliency. </p><p>Designers should engineer for routine power and cooling failures and have appropriate backup power and alternate cooling as necessary. Facilities failure domains such as power supplies, power distribution units, air conditioning units, etc. should be considered. </p><div><h3 style="text-align: left;">Category: Recoverability</h3><p style="text-align: left;">Recoverability requirements describe the ability to recover from failed states and return the system to its as-built condition. Using the example of a failed unit of hardware, a resilient system will continue to function after failure, while a recoverable system will have a simple and predictable method for recovering from the hardware failure. Data backups, data replication, hot-swap hard drives, and automated operating system and application deployment tools may be technologies or techniques to recover a failed component. 
</p><p style="text-align: left;">Our model references four Recoverability requirements: Component Recovery, Site Recoverability, Configuration Recovery and Logical Recovery. Each requirement may have multiple Metric levels.</p><p style="text-align: left;"><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-component.html" target="_blank"><b>Recoverability-Component</b><b> Requirement</b></a><b><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-component.html" target="_blank">:</a> </b>The ability to repair or replace system components predictably, with minimum work effort, and with no loss or disruption of business functionality.</p><p style="text-align: left;"><i>Incorporates traditional concepts of Configuration Management and Maintainability. Assures that components can be brought on-line without maintenance windows.</i></p><p style="text-align: left;">While the resiliency NFR’s cover the behavior of systems when components fail, the recoverability NFR’s assure that the design of systems includes the ability to restore the system to its original, pre-failure state in a predictable manner. </p><p style="text-align: left;">To assure component recoverability, the designer needs to assure that the configuration of all system components is known, and that a means exists to create new components that are identical to existing components.</p><p style="text-align: left;"><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-site.html" target="_blank"><b>Recoverability-Site</b><b> Requirement</b><b>:</b></a> The ability of the system to resume business functionality upon physical or logical failure of the site housing components of the system.</p><p style="text-align: left;"><i>Incorporates traditional concepts of Disaster Recovery, site failover, site replication, off-site backups. 
A system's 'Availability', RPO and RTO are derived from this and other requirements.</i></p><p style="text-align: left;">This NFR sets the minimum Recovery Point Objective (RPO) and Recovery Time Objective (RTO) that systems must meet under site-related failures affecting data centers, buildings and campuses.</p><p style="text-align: left;"><a href="https://blog.lastinfirstout.net/p/non-functional-requirement_13.html" target="_blank"><b>Recoverability - Configuration</b><b> Requirement</b><b>:</b> </a>The ability of the system to resume business functionality upon logical failure of system metadata or system configuration information.</p><p style="text-align: left;"><i>Incorporates traditional concepts of change management (portions of), configuration management, test and back-out plans for planned configuration changes.</i></p><p style="text-align: left;">The intent of this NFR is to provide assurance that the system is designed and managed such that if any portion of the configuration of the system is modified for any reason, intentionally or not, the system can be recovered back to the state that it was in pre-modification. This is intended to discourage systems in which the configuration is ad-hoc, unstructured, or 'mouse-driven', as compared to template- or script-driven configurations. </p><p style="text-align: left;"><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-logical.html" target="_blank"><b>Recoverability - Logical</b><b> Requirement</b><b>: </b></a>The ability of the system to resume business functionality upon logical failure of application-managed business data.</p><p style="text-align: left;"><i>Incorporates traditional concepts of database 'point in time recovery', file system snapshots and daily backups. 
A system's RPO is derived from this and other requirements.</i></p><p style="text-align: left;">This NFR is intended to assure that the system is designed so that after the data in a system has been modified outside of normal business practices (e.g., logical file system or database corruption, poor configuration management, unauthorized data modification by either internal or external entities) the data managed by the system can be recovered to a state at a point in time prior to the modification. <br /></p><h3 style="text-align: left;">Category: Scalability</h3><div>Our model has a single Scalability Requirement. The requirement may have multiple Metric levels.</div><p style="text-align: left;">Scalability requirements describe the ability to add and remove capacity to the system without affecting the availability of the system, while maximizing maintainability and constraining costs.</p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-scalability.html" target="_blank"><b>Scalability - Component</b><b> Requirement</b><b>:</b></a> The ability to dynamically and cost-effectively add or remove capacity by adding or removing hardware or software components. </p><p><i>Incorporates the traditional concepts of 'Horizontal Scalability', load balancing and dynamic capacity management. Assures that systems are compatible with cloud technologies.</i></p><p style="text-align: left;">The intent of this NFR is to force systems into a horizontally scalable architecture, and to limit or prohibit designs that depend on large-scale hardware upgrades to scale to additional capacity. That is, systems must be designed to scale out, not scale up. </p><h3 style="text-align: left;">Category: Maintainability</h3></div><p style="text-align: left;">Our model has a single Maintainability Requirement. 
The requirement may have multiple Metric levels.</p><p style="text-align: left;">Maintainability requirements describe the ability to maintain the system over its operational life. Among other attributes, a maintainable system can have routine hardware upgrades and application deployments without user-affecting outages, will have monitoring, logging and auditing sufficient for routine troubleshooting, and will have a low operational cost. Maintainability encompasses manageability, upgradability, deployability and flexibility as described by other authors. </p><p style="text-align: left;"><a href="https://blog.lastinfirstout.net/p/non-functional-requirement_47.html" target="_blank"><b>Maintainability-Component</b><b> Requirement</b></a><b><a href="https://blog.lastinfirstout.net/p/non-functional-requirement_47.html" target="_blank">:</a> </b>The ability to maintain the hardware, software and environmental components of a system without disrupting business functionality, and with minimal or no planned system outages.</p><p style="text-align: left;"><i>Incorporates traditional concepts of Service Management, Change Management (portions of), Maintenance Windows and Continuous Maintenance. Assures that the effect of system maintenance on users will be minimized.</i></p><p style="text-align: left;">This requirement forces the designer to consider the maintainability of the system as a part of the design process. 
The designer should select and configure components such that:<br /></p><ul style="text-align: left;"><li>Routine maintenance can be conducted on-line, using common technologies such as load balancing and clustering or equivalent.</li><li>Application patches and upgrades can be implemented on-line.</li><li>The release of new application functionality, including database schema changes, can be done on-line in many or most cases.</li></ul><p></p><h3 style="text-align: left;">Category: Security</h3><p>The ability to maintain the confidentiality and integrity of a system and the data contained in or controlled by the system. Requirements related to system access, system integrity, system confidentiality and system configuration. </p><p>Our model references six Security Requirements - Configuration Integrity, Configuration Assessment, Data Classification, Data Encryption, Data Access, and Awareness and Training.</p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement_5.html" target="_blank"><b>Security - Configuration Integrity</b><b> Requirement</b><b>:</b></a> The ability to determine the source of modifications to the logical and physical configuration of a system. Logging and auditing of configuration information and changes. The ability to prevent or detect unauthorized changes to configuration or data. The ability to respond to unauthorized access or modification of system configuration or data. The ability to determine the configuration of a system at an arbitrary point in time in the past. 
</p><p><i>Incorporates the traditional concepts of Configuration Management, Change Management (portions of), security auditing, Business Activity Logging, Intrusion Detection/Prevention and Malware Detection/Prevention, and security incident handling.</i></p><p>The intent of this requirement is to ensure that the system is designed so that:</p><p></p><ul style="text-align: left;"><li>The system can support/enable least privilege and role-based system configuration.</li><li>Configuration changes are detectable. This implies the use of technologies such as routine, scheduled, continuous, or near-continuous configuration auditing. </li><li>Auditing of changes in configuration creates an immutable audit trail, and the audit trail is properly secured.</li><li>The configuration of a system can be recovered back to the state that the system was in prior to the modification. </li></ul><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement_56.html" target="_blank"><b>Security - Configuration Assessment</b><b> Requirement</b><b>:</b></a> The assurance that the initial configuration of the system is appropriately secure, that the system configuration is maintained in an appropriately secure state over the life of the system, and that the state is verified and tested. </p><p><i>Incorporates the traditional concepts of system hardening, code review, Vulnerability Management, Pen Tests, Patch Management and least privilege for access and modification of system configuration.</i></p><p>The intent of this requirement is to ensure that systems are initially configured to a secure state, and that they remain in that state over the life of the system.</p><p></p><ul style="text-align: left;"><li>The initial condition of the system is ‘hardened’ consistent with this requirement. 
</li><li>A process or method must be implemented to ensure that the system is maintained in that state over its lifetime.</li><li>The condition of the system is verified periodically, depending on the Level within the requirement, for example by using vulnerability scans of systems and application code. </li><li>The application code is written and tested in accordance with a formal software development practice.</li><li>Technologies, tools, frameworks and libraries are implemented in a consistently secure manner. </li></ul><p></p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-data.html" target="_blank"><b>Security - Data Classification</b><b> Requirement</b></a><b><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-data.html" target="_blank">:</a> </b>The classification of data consistent with State and Federal regulations and the assignment of data ownership.</p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-data_13.html" target="_blank"><b>Security - Data Encryption</b><b> Requirement</b><b>:</b></a> The conditions under which data must be transported, transmitted and stored in an unreadable, encrypted format.</p><p><i>Incorporates the traditional concepts of protecting data using encryption such that the data is only readable by authorized individuals.</i></p><p>The intent of this requirement is to ensure transport layer security is implemented for data that is transmitted over a less-trusted network, and that encryption is implemented for data at rest. 
Encryption of data at rest may include full disk encryption, database encryption, and/or encryption of backup media.</p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-data-access.html" target="_blank"><b>Security - Data Access</b><b> Requirement</b></a><b><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-data-access.html" target="_blank">:</a><span style="white-space: pre;"> </span></b>The ability to limit logical and physical access to systems and data to authorized individuals, the ability to limit modification of systems and data to authorized individuals, the logging and auditing of system and data access, and the ability to alert on unauthorized access.</p><p><i>Includes traditional concepts such as account provisioning and management, account credentials, authorization, least-privilege-based data access, business activity logging and audit logging, security perimeters and perimeter controls.</i></p><p>The intent of this requirement is to limit access to data based on the need-to-know required to perform job duties, to alert on inappropriate access, and/or to have an audit trail of access or activities (i.e. read, write, modify, delete) that can be traced to an individual. </p><p><a href="https://blog.lastinfirstout.net/p/non-functional-requirement-awareness.html" target="_blank"><b>Security - Awareness and Training</b><b> Requirement</b><b>:</b> </a>The assurance that system administrators are adequately skilled and knowledgeable in information security and the implementation, management and maintenance of systems for which they are responsible. </p><p>The intent of this requirement is to ensure system administrative personnel have the skills, knowledge and/or experience to effectively implement requirements defined by Federal or State law, regulations, contractual agreements, Policies, Procedures or other non-functional requirements. 
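The tiered-Metric idea runs through all of the categories above: the Metric level a system must meet is proportional to the system's value and the sensitivity of its data. A minimal sketch of that kind of selection logic follows; the classification names, level numbers, and user-count threshold are all invented for illustration and are not the levels or criteria from our actual model.

```python
# Hypothetical sketch: selecting the required Metric level for a system.
# Classification names, level numbers, and thresholds are illustrative only.

def required_metric_level(data_classification: str, users_affected: int) -> int:
    """Pick the NFR Metric level a system must meet, proportional to its value."""
    levels = {"public": 1, "internal": 2, "restricted": 3, "highly_restricted": 4}
    level = levels[data_classification]
    # A large user population pushes a system toward stricter Metrics,
    # even when its data classification is low.
    if users_affected > 10_000:
        level = max(level, 3)
    return level
```

With a structure like this, a single personal device holding public data lands at the lowest level, while a large financial system lands at the highest - the pattern described at the start of this post.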
</p><h3 style="text-align: left;"><span style="white-space: pre;">Checkpoint:</span></h3><div>I've described templates, categories and a high level view of our Non-Functional Requirements. Next up - a series of posts describing each requirement, followed by a framework for applying the NFR's to an IT system. </div><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-65765024897647872012020-12-22T14:44:00.002-06:002023-03-18T13:33:27.570-05:00Building a Non-Functional Requirements Framework - Overview<div><i>I'm planning on documenting a framework that we built for managing non-functional requirements. This is post #1 of the series. </i></div><div><br /></div><div>A pain point for our infrastructure and security teams was a lack of usable, consistent availability and security requirements for our internally developed applications. The business analysts worked with the organization to create requirements for the functionality of the application but ignored most of what infrastructure, identity management, and security would need until the end of the development process. By the time these teams got insight into the application it was too late to wedge in new requirements. The net was that the organization was promised applications or enhancements, but because no consideration had been made for non-functional requirements, deadlines were often missed. The worst example was the pending release of a major new application that allowed manipulation of financial information, but for which no consideration had been made for authentication, authorization requirements, or database & application hosting security. Retrofitting that project added a year to the timeline.</div><div><br /></div><div>Additionally, we had a series of outstanding audit findings related to the lack of enterprise-wide standards for securing systems. We tended to build secure and available systems because we knew what we were doing - not because we built to an objective, measurable standard. Auditors would prefer that we built to a standard that ensured a secure, available system - and of course we agreed.</div><div><br /></div><div>When I had a few months of down time (approx. 
2012-2013) I decided to see what the state of the art was in creating and maintaining non-functional requirements (NFR's). I looked at the obvious - FURPS+, ISO-9126, ISO-25010 and a handful of University-published research papers. My biggest issue with the various existing models was that they were software-specific. I felt that NFR's should apply to entire systems, not just the software running on the system.</div><div><br /></div><div>As far as I could tell at the time, the various sources, authors, consultants and Gartner didn't really agree on much other than that NFR's are not Functional Requirements and that you need to have some. I found that:</div><div><ul style="text-align: left;"><li>Many web sites have lists and examples of NFR's. </li><li>Some try to define NFR's; few succeed. </li><li>Others admit that NFR's are difficult to gather. </li><li>Few apply NFR’s to systems (vs. software).</li><li>FURPS+, ISO-9126, ISO-25010 and similar didn't treat security as a first-class citizen, nor did they address legal requirements.</li></ul>What I did find, though, were a couple of sources that I thought I could use to build a set of generic non-functional requirements.</div><div><ul style="text-align: left;"><li>Erik Simmons and John Terzakis (Intel) each have a fair bit of good information in various presentations that are readily searchable.</li><li>Tom Gilb's 'Planguage' seemed like a valuable tool, and both Simmons and Terzakis describe how to use Planguage for requirements writing. 
</li></ul></div><div>See: </div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><a href="http://www.iaria.org/conferences2012/filesICCGI12/Tutorial%20Specifying%20Effective%20Non-func.pdf" rel="nofollow" target="_blank">Specifying Effective Non-Functional Requirements</a>, John Terzakis, Intel Corporation, June 24, 2012, ICCGI Conference, Venice, Italy</div><div>21st Century Requirements Engineering: A Pragmatic Guide to Best Practices, Erik Simmons, Intel Corporation </div></blockquote><div><br /></div><div>These sources were close to being adaptable, but rather than try to adopt an existing framework as-is, I thought that it'd be best for us to come up with something usable by borrowing bits and pieces from various existing sources, primarily Simmons, Terzakis, and Gilb.</div><div><br /></div><h3 style="text-align: left;">Into the Non-Functional Requirement Abyss</h3><div>We agreed that Requirements are not designs and should not specify a particular technology or configuration. Requirements should specify an end result, not the path to achieve that result. We tried to keep this in mind as we worked out our framework. </div><div><br /></div><div>Our starting point (and first disagreement…) was on the definition of non-functional requirements. Here's what we used:</div><div><ul style="text-align: left;"><li>Functional Requirements describe the intended behavior of the system (or software), or what a system should do.</li><li>Non-functional Requirements describe how well the system does whatever it does and under what constraints the system must operate. NFR's describe operational characteristics, performance, availability, etc. </li></ul></div><div>We decided to leverage a permutation of the common 'S.M.A.R.T' framework as a set of rules for writing the requirements. 
By placing bounds on the requirements-writing process, we hoped that we'd end up with requirements that would have a chance of being valuable to the organization. </div><div><br /></div><h3 style="text-align: left;">S.M.A.R.T.</h3><div>Our version of 'S.M.A.R.T':</div><div><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><b>Specific:</b> Requirements will be clear, concise, unambiguous, with consistent terminology, and with detail sufficient such that designs based on the requirements will meet operational goals. </div><div><b><br /></b></div><div><b>Measurable:</b> A test can be devised that verifies the requirement using a bounded measurement.</div><div><b><br /></b></div><div><b>Attainable:</b> The requirement is technically feasible within the constraints of current technology, and there is at least one known design and implementation. </div><div><b><br /></b></div><div><b>Realizable:</b> The requirement is fiscally and manageably implementable within the constraints of organizational budget and staffing. </div><div><b><br /></b></div><div><b>Unambiguous:</b> The requirement will have a single, non-conflicting interpretation.</div><div><b><br /></b></div><div><b>Traceable: </b>The source of a requirement will be traceable to stakeholder need. The requirement is traceable to business strategy or roadmap. The life cycle of the requirement is traceable from its conception to its current state.</div></blockquote><div><br /></div><div>Specificity and Measurability were considered important because we hoped they would keep us from writing vague requirements or requirements for which there were no means of measuring attainment. </div><div><br /></div><div>Attainability and Realizability were intended to prevent the implementation of requirements for which there was no solution possible, or no solution that was actually implementable in our environment with our limited capabilities. 
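A couple of these attributes can even be checked mechanically once requirements are stored as structured records. The sketch below is a hypothetical illustration of that idea, not our actual tooling; every field name in it is invented.

```python
# Hypothetical sketch: a "lint" pass over a requirement record, checking the
# S.M.A.R.T. attributes that lend themselves to mechanical verification
# (Measurable and Traceable). Field names are invented for illustration.

def smart_lint(req: dict) -> list[str]:
    """Return a list of S.M.A.R.T. violations found in a requirement record."""
    problems = []
    # Measurable: the requirement must carry a scale and a bounded minimum.
    if not req.get("scale"):
        problems.append("Measurable: no scale of measure defined")
    if req.get("metric_minimum") is None:
        problems.append("Measurable: no minimum acceptable measurement")
    # Traceable: the requirement must point back at a stakeholder need.
    if not req.get("stakeholder_need"):
        problems.append("Traceable: no stakeholder need recorded")
    return problems
```

A requirement with a scale, a minimum, and a recorded stakeholder need passes cleanly; the genuinely hard attributes (Specific, Unambiguous, Attainable, Realizable) still need human review.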
</div><div><br /></div><div>Traceability was desired to prevent the imposition of requirements for which there was no business need (requirements for the sake of requirements, or requirements to give us an excuse to buy shiny new resume-building technology) or requirements that appeared out of nowhere or were modified outside of a formal process.</div><div><br /></div><h3 style="text-align: left;">Requirement Categories</h3><div>Because we like putting things in neat buckets, we created broad categories of NFR's for which we thought we'd have an immediate need. The various industry models have categories (Maintainability, Reliability, Portability, etc.), but our thinking at the time was that those categories didn't work for us. So we started from scratch and ended up with the following:</div><div><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"><b>Resiliency </b>- The requirements that describe the ability of the system to continue to function during common failure modes. A resilient system continues to work after routine failures (disk, server, OS or process). Resiliency is necessary to meet availability requirements and usability requirements. A resilient system may use technologies such as redundancy, clustering, load balancing, error handling, and error recovery to function after component failure. Resiliency encompasses the concepts of availability, reliability, robustness, fault tolerance and exception handling as described by other authors. </div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><b>Recoverability </b>- The requirements that describe the ability to recover from failed states and return the system to its as-built condition. 
Using the example of a failed unit of hardware, a resilient system will continue to function after failure, while a recoverable system will have a simple and predictable method for recovering from the hardware failure. Data backups, data replication, hot-swap hard drives, and automated operating system and application deployment tools may be technologies or techniques to recover a failed component. </p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><b>Maintainability </b>- The requirements that describe the ability to maintain the system over its operational life. Among other attributes, a maintainable system can have routine hardware upgrades and application deployments without user-affecting outages, will have monitoring, logging and auditing sufficient for routine troubleshooting, and will have a low operational cost. Maintainability encompasses manageability, upgradability, deployability and flexibility as described by other authors. </p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><b>Scalability </b>- The requirements that describe the ability to add and remove capacity to the system without affecting the availability of the system, while maximizing maintainability and constraining costs. </p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"><b>Security -</b> The ability to maintain the confidentiality and integrity of a system and the data contained in or controlled by the system. Requirements related to system access, system integrity, system confidentiality and system configuration. 
</div></blockquote><div><br /></div><div>These can be mapped back into FURPS+, ISO-9126 & ISO-25010 and ISO-27002, NIST 800-53, etc. </div><div><br /></div><div>Note that Availability, Performance, and Reliability are not requirements categories in our model. We determined that if a system met a set of Resiliency, Recoverability and Security requirements, the system would also meet an appropriate level of availability and reliability as a byproduct of the Resiliency, Recoverability and Security Requirements. Likewise, the system would be able to meet Performance requirements as a byproduct of scalability and maintainability requirements.</div><div><br /></div><div>Usability, Portability and Compatibility are common requirement families in other models, but as the model was driven by short-term infrastructure and security needs, they were left out in the early phases.</div><div><br /></div><h3 style="text-align: left;">Non-Functional Requirements Form & Format</h3><div>Following the work done by Simmons & Terzakis (Intel), we decided to implement a modified template and Planguage-like structured language for the NFR's. Each NFR exists as a single document.</div><div><br /></div><div>The Non-functional requirements template and definitions that we settled on are:</div><div><br /></div><div><span style="font-family: courier;"><b>Category</b>: A text field representing the category that the requirement is classified under in the Minnesota State Model. The Category and Context are equivalent to the 'ID:' in Planguage or 'Ambition' in (Simmons/Intel 2011).</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Context</b>: A text field representing the requirement, unique within a category. The Category and Context are equivalent to the 'ID:' in Planguage or 'Ambition' in (Simmons/Intel 2011). 
</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Goals</b>: Natural language description of the intent of the requirement and how it supports one or more of the general goals. The Goal is equivalent to 'Gist:' in Planguage or 'Ambition' in (Simmons/Intel 2011). </span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Rationale</b>: The reason that the requirement exists. Expressed in natural language. </span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Requirement</b>: The requirement to which the system will be held, expressed in constrained natural language. Requirement will be written in a constrained natural language meeting Minnesota State Non-Functional Requirements Attributes.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Metric</b>: Measurement used to determine if requirement has been met and the process or device used to locate the measurement on the scale. Metric must include 'Minimum', the minimum acceptable measurement, and may include 'Target', the measurement to which the system must be designed.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Scale</b>: The scale of measure used to quantify the requirement.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Stakeholders</b>: Persons who stand to gain or lose by implementation of requirements. Expressed as roles, not individuals. 
</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Implications</b>: Implications to the stakeholders if these requirements are not met.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Applicability</b>: Systems or categories of systems to which requirement applies.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Status</b>: One of Draft, Approved, Revised, or other constrained choice of statuses matching the requirements implementing process.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Author</b>: Person responsible for authoring and maintaining requirement.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Revision</b>: Sequential number representing approved revision of requirement.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><b>Date</b>: Date of last revision of requirement.</span></div><div><br /></div><div>The NFR's have a structure and format that could be adapted to metadata-driven requirements tooling.</div><div><br /></div><h3 style="text-align: left;">Checkpoint</h3><div>At this stage we had a handful of Non-Functional Requirements categories and a template for writing the NFR's, but no actual requirements.</div><div><br /></div><div>Next up: <a href="https://blog.lastinfirstout.net/2020/12/building-non-functional-requirements_23.html" target="_blank">Part #2</a> - A high-level description of each Non-Functional Requirement</div><div class="blogger-post-footer"><p>
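The template fields above map naturally onto a simple record type. A minimal Python sketch, assuming nothing beyond the field names in the template (the class, the default values, and the example NFR are illustrative, not part of the actual Minnesota State tooling):

```python
from dataclasses import dataclass, field

@dataclass
class NonFunctionalRequirement:
    """One NFR document, mirroring the template fields above."""
    category: str          # e.g. "Resiliency" - category in the model
    context: str           # unique within a category
    goals: str             # natural-language intent
    rationale: str         # why the requirement exists
    requirement: str       # constrained natural language
    metric_minimum: str    # minimum acceptable measurement
    scale: str             # scale of measure used to quantify
    metric_target: str = ""            # optional design target
    stakeholders: list = field(default_factory=list)  # roles, not individuals
    implications: str = ""
    applicability: str = ""
    status: str = "Draft"  # Draft, Approved, Revised, ...
    author: str = ""
    revision: int = 1
    date: str = ""

# A hypothetical example record, not a real Minnesota State NFR:
nfr = NonFunctionalRequirement(
    category="Recoverability",
    context="Database backup",
    goals="Limit data loss after component failure.",
    rationale="Audit findings require enterprise-wide standards.",
    requirement="Backups must be restorable within the stated window.",
    metric_minimum="Restore verified within 24 hours",
    scale="hours to restore",
)
print(nfr.status)  # a new NFR starts life as a Draft
```

Records like this serialize trivially to YAML or JSON, which is one way the "single document per NFR" convention could feed metadata-driven tooling.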
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comSt Paul, MN, USA44.9537029 -93.089957816.643469063821158 -128.24620779999998 73.263936736178849 -57.933707799999993tag:blogger.com,1999:blog-4806502804647119766.post-71718458514261288772020-11-18T19:43:00.000-06:002020-11-18T19:43:03.399-06:00Thirty-Four Years in IT - Why not Thirty-Five?<p style="text-align: left;">After I was sidelined <a href="https://blog.lastinfirstout.net/2020/04/thirty-four-years-in-it-leadership.html" target="_blank">(Part 10)</a> we had another leadership turnover. This time the turnover was welcome. I ended up in a leadership position under a new CIO. This allowed me to take advantage of some topics that I studied while I was sidelined. My new team took on a couple of challenges: (1) introducing cloud computing to the organization, and (2) attempting to add a bit of architectural discipline to the development and infrastructure teams and processes. The first was somewhat successful; the latter was not.</p><h4 style="text-align: left;">Cloud</h4><p style="text-align: left;">I had been slowly working to get a master agreement with Amazon - a long, slow process when you are a public sector agency. When our new CIO mentioned 'cloud' I did a bit of digging and found out that Microsoft had added the phrase 'and Azure' to our master licensing agreement. Microsoft's foresight saved me months of contract negotiations. They made it trivial to set up an enterprise Azure account. So Azure became our default 'cloud'.</p><p style="text-align: left;">I had been running the typical nerdville home servers. Moving them from in-house Macs to Linux in Azure was trivial - a weekend of messing around. I affirmed to our CIO that we had a fair number of apps that could be hosted in IaaS, and picked a couple of crash-test dummy apps for early migration. 
</p><p style="text-align: left;">One of my staff and I spent a few months creating and destroying various assets in Azure, and came to the conclusion that the barriers to cloud adoption would be found mostly in our own staff, and not the technology stack. Infrastructure staff would have to re-think their jobs and their roles in the organization, and development staff would have to re-think application design. Both would challenge the organization. </p><p style="text-align: left;">I also did a few quick-and-dirty demonstrations to get some ideas on how we might architect an enterprise framework for moving to Azure - such as hiding an Azure instance behind a firewall in our test lab to show that we could create virtual data centers that appeared to be in our RFC-1918 address space, but were actually in Azure IaaS. We also presented quite a bit of what we learned to our campus IT staff at various events and get-togethers, hoping to build a bit of momentum at the campuses. </p><p style="text-align: left;">On the down side, I ran into significant barriers within our own managers and their staff. A quorum of managers and staff were cloud-averse and/or firmly committed to technologies and vendors that had no cloud play. We had to fight FUD from within.</p><h4 style="text-align: left;">Architecture</h4><p style="text-align: left;">The Architecture activity was not successful. We had been running 'seat-of-the-pants' for years, resulting in many ad-hoc and orphaned tools, technologies and languages, and we were thinly staffed. So the idea that by adding rigor and overhead up front we'd end up with better technology that was less work to maintain was not well accepted. The entire concept of design first, then build was a tough sell, as the norm had been to start building first and figure out the design on the fly (if at all). Modern architectures such as presenting an API to our campuses were rejected outright. 
And of course the idea that two development teams or two infrastructure workgroups would agree on a tool, language, library - much less an architecture - was an even tougher sell.</p><p style="text-align: left;">The team (and any semblance of a formal architecture) was disbanded through attrition, and the body of standards, guidelines, processes, and practices is no doubt still in a SharePoint site, unmaintained and unloved. </p><h4 style="text-align: left;">Why did I leave when I did?</h4><p style="text-align: left;">As time went on, I found myself in fundamental disagreement with how the organization treated its people. Leadership was making personnel decisions that I could not support, that caused the loss of several of our best people, and that placed other staff in places where they could not succeed or be happy.</p><p style="text-align: left;">That leadership would move staff into positions in which they had no interest, and do it without the concurrence of their manager (me), was unacceptable. To pile on work that was outside the core skillset of an employee, and then try to destroy their career when they were failing, is unacceptable. I don't <i>want</i> to work for an organization like that, and because of financial decisions I made years ago I do not <i>have</i> to work for an organization like that. </p><p style="text-align: left;">I did the math, got my ducks in a row, and retired. </p><p style="text-align: left;">My only regret is that I was unable to influence the disposition of the staff that I left behind. </p><p style="text-align: left;">Previous: <a href="https://blog.lastinfirstout.net/2020/04/thirty-four-years-in-it-leadership.html" target="_blank">Part 10 Leadership Chaos, Career derailed</a></p><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-60132887639422921692020-04-04T15:46:00.001-05:002020-11-18T19:45:09.251-06:00Thirty-four Years in IT - Leadership Chaos, Career Derailed (Part 10)This post is the hardest one to write. I've been thinking about it for years without being able to put words to paper. With the COVID-19 stay-at-home directive, I can't procrastinate anymore, so here goes.<br />
<br />
As <a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-application-that.html" target="_blank">outlined in Part 9</a>, Fall 2011 was a tough period. To make it tougher, the CIO decided to hire two new leadership-level positions - a new CISO over the security group, and a new Associate Vice Chancellor (AVC) over the Infrastructure group. The infrastructure AVC would be my new boss.<br />
<br />
The CISO position was really interesting to me. The infrastructure position was not as interesting, as it would have been more of the same but with more stress and more headaches. I applied, was interviewed and rejected for both. I'm sure that part of the problem was that with the chaos of our poorly written ERP application and the Oracle database issues that Fall, I really didn't prepare for either interview. Not having interviewed for a job in more than a decade didn't help either.<br />
<br />
Both hires ended up being bad for the organization and for my career. I'm pretty sure that both knew that I had been a candidate for the positions and both were threatened by me.<br />
<br />
The new CISO was determined to sideline me and break down the close cooperation between my team and the security team. Whereas we had been working together for years, the security team was now restricted from communicating with me without the new CISO's permission. I was blackballed - cut out of all security related incidents, conversations, and meetings. Anything that had my fingerprints on it was trashed, either literally or figuratively. Staff who had worked closely with me in the past were considered disloyal to him and were sidelined and harassed.<br />
<br />
The new CISO also declared that we were 'too secure' and tried to get a consultant to write up a formal document to that effect. Whatever security related projects we had in the pipeline were killed off. Rigorous processes around firewall rules, server hardening and data center security were ignored. Security would no longer impact the ability to deploy technology.<br />
<br />
The new Infrastructure AVC started out by pulling projects from me without telling me, meeting with my staff without me in the room and telling them I was 'doing it wrong'. Staff were still loyal to me and kept me informed as to what was transpiring. It was clear that I was viewed as a threat and was not welcome.<br />
<br />
I confronted my new boss and advised that if he were going to manage my staff without me in the room, he might as well move them directly under him on the org chart. He had a bit of a shocked look on his face, and then obliged. I also advised that as I now had no staff and no role in the organization, he needed to find me something to do.<br />
<br />
I knew that he'd have a hard time firing me - I was protected by Civil Service rules, but I also knew that my work environment would be poor until either he and I figured out how to work together or one of us left. My choice was to try to stick it out and make the best of it or move on. I probably had options either within the State University system or with the State of Minnesota. I really am a Higher Ed. guy though, so I was reluctant to move. I decided to wait it out - and meanwhile get my financial ducks in order and put out job feelers.<br />
<br />
He responded by blackballing me from any conversation of significance, by trashing me in e-mails to colleagues, by making it clear to my former staff that I was not to have any work related conversation with them without him, and that referencing anything that I had said or done the last dozen years was unwelcome. At one point I had to advise my former staff that they should not be seen with me, as it might impact their relationship with the new bosses. In an effort to convince me to leave (or perhaps out of sympathy), he even called me into his office and showed me a job posting at another State agency that he thought might be interesting to me.<br />
<br />
He also moved me out of the IT area and across the hall into finance, where I would not be available to my former staff (and where I made a couple of great friends).<br />
<br />
The environment was chaotic and toxic. Teams got rearranged and disrupted with no clear idea why or what outcome was expected. Morale was poor and tempers were high. A new director/manager was hired into my old position who was extremely toxic. As one could predict, some of our best staff left and others lost enthusiasm and dedication. I ended up fielding requests to be a job reference for many of my former staff.<br />
<br />
After about six months he and I smoothed things out to the point where we could work together, as long as I stayed away from his (my former) staff and offered no thoughts on anything he was doing to anyone other than him. I had no clear responsibilities and as long as I stayed out of his sandbox I could do pretty much whatever I wanted. So I used that time to re-think quite a bit of what I had been doing, and in particular to lay groundwork for work that paid off a few years down the road, work that I'm quite proud of and will write about at a later date.<br />
<br />
After about a year and a half we had another CIO change and both the CISO and AVC left. Ironically, on the AVC's last day I was the one who helped him clear his office, walked him out to the parking ramp and saw him off.<br />
<br />
About that time the 'toxic' director also left. A couple of us who were black sheep ended up back in the thick of things when we found out how badly our technology and security had degraded. That's also when we found out that the 'completed' plans for moving a data center two months out did not exist.<br />
<br />
The nightmare was over, but much damage had been done.<br />
<br />
In retrospect, should I have left the organization? I'm not sure. For me it was very difficult to watch what my team and I had built over the last fifteen years get torn apart, especially when what came out of the teardown was what I believed to be inferior to what we had. If what resulted was an improvement, it would have been easy. Very little of the technology that ran the system was built by anyone other than us. My fingerprint was on everything - good or bad, right or wrong. And to the new CISO and AVC, everything was bad and wrong.<br />
<br />
But the data center got moved on time.<br />
<br />
<a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-application-that.html" target="_blank">Part 9 - The Application that Almost Broke Me</a><div><a href="https://blog.lastinfirstout.net/2020/11/thirty-four-years-in-it-why-not-thirty.html" target="_blank">Final - Why not Thirty-five?</a></div><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-88758386629814955532020-01-21T16:51:00.000-06:002020-04-04T15:46:27.679-05:00Thirty-four years in IT - The Application That Almost Broke Me (Part 9)The last half of 2011 was, for me and my team, a really, really tough time.<br />
<br />
As I hinted in <a href="https://blog.lastinfirstout.net/2011/08/oracle-1120n-sev-1-sev-1-sev-1-and-sev.html" target="_blank">this post</a>, by August 2011 we were buried in Oracle 11 & application performance problems. By the time we were back into a period of relative stability that December, we had:<br />
<br />
<ul>
<li>Six Oracle Sev 1's open at once, the longest open for months. The six incidents were updated a combined total of 800 times before they finally were all resolved. </li>
<li>Multiple extended database outages, most during peak activity at the beginning of the semester. </li>
<li>Multiple 24-hour+ Oracle support calls.</li>
<li>An on-site Oracle engineer.</li>
<li>A corrupt on-disk database forcing a point-in-time recovery from backups of our student primary records/finance/payroll database.</li>
<li>Extended work hours and database patches and configuration changes more weekends than not.</li>
<li>A forced re-write of major sections of the application to mitigate extremely poor design choices.</li>
</ul>
<div>
The causes were several:</div>
<div>
<ol>
<li>Our application, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing variables as parameters. </li>
<li>The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards. </li>
<li>The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once. </li>
<li>A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.</li>
<li>Listener crashes that caused the app - lacking basic error handling - to fail and required an hour or so to recover.</li>
</ol>
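<div>
The literal-string problem in causes #1 and #3 can be illustrated without an Oracle instance. A minimal Python sketch using sqlite3 as a stand-in (the table and column names are hypothetical): every literal produces a unique statement text that must be hard-parsed and cached separately, while a bind variable yields a single reusable statement.
</div>

```python
import sqlite3

# With literals, 10,000 lookups produce 10,000 distinct statement texts -
# each one hard-parsed, cached once, and never reused.
literal_sql = {f"select name from student where student_id = '{i}'"
               for i in range(10_000)}

# With a bind variable there is exactly one statement text to parse and cache.
bound_sql = {"select name from student where student_id = ?"
             for _ in range(10_000)}

print(len(literal_sql), len(bound_sql))  # 10000 distinct texts vs. 1

# The bind-variable form, executed against a toy table:
conn = sqlite3.connect(":memory:")
conn.execute("create table student (student_id text, name text)")
conn.executemany("insert into student values (?, ?)",
                 [(str(i), f"name{i}") for i in range(100)])
row = conn.execute("select name from student where student_id = ?",
                   ("42",)).fetchone()
print(row[0])  # name42
```

<div>
sqlite3's parameter substitution plays the same role as Oracle bind variables here; the point is the statement-count arithmetic, not the engine.
</div>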
<div>
Also:</div>
<ol>
<li>We missed one required Solaris patch that may have impacted the database.</li>
<li>We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.</li>
</ol>
<div>
None of this was obvious from AWR's, and it was only after several outages and after we built tools to query the SGA that we saw where the problem might be. What finally got us going in a good direction was seeing a library cache with a few hundred thousand queries like this:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">select &lt;blah&gt; from student where student_id = '9876543';</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">select &lt;blah&gt; from student where student_id = '4982746';</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">select &lt;blah&gt; from student where student_id = '4890032';</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">select &lt;blah&gt; from student where student_id = '4566621';</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">[...]</span></div>
<div>
<br /></div>
</div>
</div>
</div>
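<div>
We don't have the original SGA-query tools, but the idea behind them is simple to sketch: strip the literals out of each statement so queries that differ only in their literal values collapse into one "shape" and can be counted (Oracle does this server-side via the FORCE_MATCHING_SIGNATURE column in V$SQL). A rough Python approximation, with a toy library cache standing in for the real one:
</div>

```python
import re
from collections import Counter

def normalize(sql: str) -> str:
    """Replace quoted strings and bare numbers with a placeholder so
    queries that differ only in literals collapse to one signature."""
    sql = re.sub(r"'[^']*'", ":lit", sql)   # quoted literals
    sql = re.sub(r"\b\d+\b", ":lit", sql)   # numeric literals
    return sql.strip().lower()

# A tiny stand-in for the library cache contents:
library_cache = [
    "select name from student where student_id = '9876543'",
    "select name from student where student_id = '4982746'",
    "select name from student where student_id = '4890032'",
    "select count(*) from course",
]

shapes = Counter(normalize(s) for s in library_cache)
worst, n = shapes.most_common(1)[0]
print(n, worst)  # the duplicated shape dominates the cache
```

<div>
Against the real SGA this grouping showed one statement shape with tens of thousands of literal-only variants - the 67,629-version SQL the Oracle engineer described.
</div>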
<div>
<div>
Our app killed the database - primarily because of poor application design, but also because of Oracle bugs. </div>
<div>
<br /></div>
<div>
An analysis of the issue by an Oracle engineer, from one of the SR's:</div>
<blockquote class="tr_bq">
... we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.</blockquote>
<div>
Along with the poor application design, we also hit a handful of mutex-related bugs specific to 11.2.0.x that were related to applications with our particular design. We patched those as soon as we could. We also figured out that network cards on SPARC E25k's can only do about 50,000 interrupts per second, and that adding more network cards would finally resolve some of the issues we were having with the database listeners.</div>
<div>
<br /></div>
<div>
<div>
Pythian has a <a href="https://blog.pythian.com/cursor-pin-s-wait-on-x-in-the-top-5-wait-events/" target="_blank">good description of a similar issue</a> - which had it been written a year earlier, would have saved us a lot of pain. </div>
<div>
<br /></div>
</div>
<div>
Why didn't this happen on Oracle 10? </div>
</div>
<div>
<br /></div>
<div>
I suspect that in Oracle 10, the SGA size was physically limited and that the database engine simply churned through literal queries, hard-parsed them, tossed them out of memory, and drove up the CPU. But it never ran into mutex issues. It was in 'hard-parse-hell' but other than high CPU, worked OK. In Oracle 11, the SGA must have been significantly re-written, as it was clear that the SGA was allowed to grow very large in memory, which (by our analysis) resulted in many tens of thousands of queries in the SGA, being churned through at a rate of many thousands per second. </div>
</div>
<div>
<br /></div>
<div>
Along the way we also discovered COBOL programs that our system admins had been complaining about for 15 years - such as the program that scanned millions of individual records in the person table, one at a time, looking for who needs to get paid this week. Never mind that they could have answered that question with a single query. And of course the program did this scan twenty-six times, once for each pay period in the last year - just in case an old timecard had been modified. </div>
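<div>
The contrast between that row-at-a-time COBOL pattern and a set-based query is easy to sketch. A Python/sqlite3 illustration with a hypothetical person table (the schema and numbers are made up): both approaches find everyone who should be paid, but one scans the whole table once per pay period and filters in application code, while the other asks the database a single question.
</div>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table person (id integer, hours real, pay_period integer)")
conn.executemany("insert into person values (?, ?, ?)",
                 [(i, float(i % 5), i % 26) for i in range(1000)])

# Row-at-a-time (the COBOL pattern): fetch every row, test it in code,
# and repeat the full table scan once for each of 26 pay periods.
paid_slow = []
for period in range(26):
    for pid, hours, pp in conn.execute("select * from person"):
        if pp == period and hours > 0:
            paid_slow.append(pid)

# Set-based: let the database answer the question in one statement.
paid_fast = [r[0] for r in conn.execute(
    "select id from person where hours > 0 order by id")]

print(len(paid_slow) == len(paid_fast))  # same answer, 26x fewer scans
```

<div>
On a multi-million-row table the difference is not 26x but catastrophic, since every fetched row also becomes its own hard-parsed literal query in the pattern described above.
</div>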
<div>
<br /></div>
<div>
Brutal.</div>
<div>
<br /></div>
<div>
I insisted that our developers re-code the worst parts of the application - arguing that any other fix would at best kick the can down the road. </div>
<div>
<br /></div>
<div>
In any case, by the time we reached our next peak load at semester start January '12, enough had been fixed that the database ran fine - probably better than ever. </div>
<div>
<br /></div>
<div>
But it cost us dearly. We worked most weekends that fall on rushed changes/patches/re-configurations, one of my staff ended up in the hospital, and I aged 5 years in as many months. </div>
<div>
<br /></div>
<div>
In my next post I'll outline the other significant events in 2011/2012, which altered my job and forced me to re-evaluate my career. </div>
<div>
<br /></div>
<div>
<a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-in-it-swimming-with.html" target="_blank">Part 8 - Swimming with the Itanic</a><br />
<a href="https://blog.lastinfirstout.net/2020/04/thirty-four-years-in-it-leadership.html" target="_blank">Part 10 - Leadership Chaos</a></div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-36905394557714972242020-01-14T11:36:00.003-06:002020-01-21T16:53:39.125-06:00Thirty-four years in IT - Swimming with the Itanic (Part 8)For historical reasons, we were a strong VMS shop. Before they imploded, Digital Equipment treated EDU's very kindly, offering extremely good pricing on software in exchange for hardware adoption. In essence, a college could get an unlimited right to use a whole suite of Digital Equipment software for a nominal annual fee, and Digital had a very complete software catalog. So starting in the early 1990's, our internally developed student records system (ERP) ended up on the VMS/VAX/RDB stack.<br />
<div>
<br /></div>
<div>
Digital imploded and got bought by Compaq, who got bought by HP. Somewhere along the line the RDB database line ended up at Oracle.</div>
<div>
<br /></div>
<div>
For most of our time on VMS & RDB we suffered from severe performance problems. Our failure in addressing the problems was two-fold - the infrastructure team didn't have good performance data to feed back to the developers, and the development team considered performance to be an infrastructure/hardware problem. This resulted in a series of frantic and extremely expensive scrambles to upgrade VAX/Alpha server hardware. It did not however, result in any significant effort to improve the application design.</div>
<div>
<br /></div>
<div>
Between 1993 and 2005, we cycled through each of:</div>
<div>
<ol>
<li>Standalone VAX 4000's</li>
<li>Clustered AlphaServer 4100's</li>
<li>Standalone AlphaServer GS140's</li>
<li>Standalone AlphaServer GS160's</li>
</ol>
<div>
And of course mid-life upgrades to each platform. </div>
<div>
<br /></div>
<div>
Each upgrade cost $millions in hardware, and each upgrade only solved performance problems for a brief period of time. The GS160's lasted the longest and performed the best, but at an extremely high cost. At no point in time did we drill deeply into application architecture and determine where the performance problems originated.</div>
<div>
<br /></div>
<div>
During that time frame we got advice from Gartner that suggested that moving from VMS to Unix was desirable, but moving from RDB to Oracle was critical, as they did not expect Oracle to live up to their support commitments for the RDB database product. So in 2009 we moved from 35 individual RDB databases spread across four GS160's, to one Oracle 10G database on a Sun Microsystems E25k, in a <a href="https://blog.lastinfirstout.net/2009/03/erp-database-conversion.html" target="_blank">single, extremely well implemented weekend-long database migration, platform migration, and 35:1 database merger</a>. Kudos to the development team for pulling that off.<br />
<br />
Unfortunately we carried forward large parts of the poor application design and transferred the performance problems from RDB to Oracle. At the time, though, the DBA's were part of my team. I had a very good Oracle DBA and Unix sysadmin, both of whom were able to dig into performance problems and communicate back to developers. We were pretty good at detailing the performance problems and offering remedies and suggested design changes. </div>
</div>
<div>
<br /></div>
<div>
Though performance slowly got better, the full impact of poor application design was yet to be felt.</div>
<div>
<br /></div>
<div>
As soon as the databases were combined and hosted on SPARC hardware, continuing with the GS160's made no sense. They were costing $600k/yr in hardware and software maintenance, now were significantly oversize, and were still running the dead-end OpenVMS operating system. This put us in a tough spot. The development team was focused on minimizing their commitment to any re-platforming and was only interested in a move from AlphaServer to Itanium. For me, Itanium (or Itanic, as I called it at the time) was a dead end, and our only move should be to Unix (Solaris). But because the cost to migrate to Itanic was much lower - the application would only have to be recompiled, not re-platformed - the Itanic advocates won the argument. We ended up purchasing Itanium blade servers at a 3-year cost roughly equal to 18 months of support on the GS160's.</div>
<div>
<br /></div>
<div>
By that time HP's support for OpenVMS had eroded badly. Support for Oracle clients, Java, and other commonly used software was poor or non-existent. That OpenVMS was dead was visible to all but the few for whom OpenVMS was a religious experience.</div>
<div>
<br /></div>
<div>
As we were bashing the decision around in 2009, I strongly suggested that if we purchased Itanium we'd be on the dead-end OpenVMS platform for five more years. I was wrong. We were on Itanium blades and OpenVMS for nine years, until 2018. The (only) good part of that decision was that the Itanium blade servers ran very well and were inexpensive to maintain. And as OpenVMS was pretty dead by then, we did not spend very much time on patches and upgrades, as few were forthcoming from HP.</div>
<div>
<br /></div>
<div>
This is a case where our reluctance to take on some short-term pain resulted in our having to maintain a dead-end obsolete system for many years. </div>
<div>
<br /></div>
<div>
<a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-addressing.html" target="_blank">Part 7 - Addressing Application Security</a><br />
<a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-application-that.html" target="_blank">Part 9 - The Application that Almost Broke Me</a></div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-56024219146568216792020-01-03T14:16:00.000-06:002020-01-14T11:43:28.544-06:00Thirty-four Years in IT - Addressing Application Security (Part 7)In the 2008-2009 period, we finally started to seriously address application layer security in our development group.<br />
<br />
By that time it was clear that the threat to hosted applications had <a href="https://blog.lastinfirstout.net/2008/08/crud-moved-up-stack.html">moved up the stack</a>, and that the center of gravity had shifted towards compromising the web applications rather than the hosting infrastructure. This meant that our applications, for which essentially no serious security-related effort had been made, had to finally receive some attention. Our development teams were not tuned in to the security landscape and thus were paying scant attention to web application security. As our home-grown applications' exposure to the Internet was mostly limited to simple, student-facing functionality such as course registration and grading, the lack of attention was perceived as appropriate by all but a few of us infrastructure and security geeks.<br />
<div>
<br /></div>
<div>
In other words, the dev teams were at the unconscious/incompetent level of the <a href="http://www.google.com/search?q=Conscious+Competence+Matrix">Conscious Competence Matrix</a>.</div>
<a name='more'></a><br />
The catalyst for me was the planned deployment of functionality that permitted student aid and payroll-like data to be modified from an Internet-facing web application, including the ability to maintain direct deposit accounts (i.e., banking information). The development team performed a perfunctory risk analysis that recommended no significant changes to the current web application, but rather suggested that implementing Oracle TDE (database encryption) would somehow mitigate risk.<br />
<br />
I found out what was being planned when my team was asked to implement TDE and our security team asked the obvious question - Why TDE, and why now? The answer - that application security was difficult, would delay the implementation, and was therefore out of scope; and that TDE was all we needed - was unsettling, to say the least.<br />
<br />
A couple of the security team members and I were shocked. At the time the web applications were protected by insignificant security controls, no identity protection whatsoever, 1995-appropriate authentication and authorization, and trivial database security. The web apps were written with no structure or standards for appdev security, and the teams were mostly unaware of common attack vectors such as SQL injection and XSS, or even the work done by OWASP. Yet our dev teams were intending to allow that application to modify direct deposit account information. And of course if the web application was not secure, TDE would not add any security either.<br />
<br />
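The class of bug the teams were missing is easy to demonstrate. A minimal sketch - in Python with SQLite for brevity, though the real stack was Java/J2EE/Oracle, and the table and values are invented - showing how string-built SQL lets attacker input escape the quoted literal, while a bound parameter never does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user TEXT, routing TEXT)")
conn.execute("INSERT INTO accounts VALUES ('alice', '091000019')")

user_input = "alice' OR '1'='1"  # attacker-controlled form field

# Vulnerable: concatenation lets the quote break out of the string literal,
# turning the WHERE clause into "user = 'alice' OR '1'='1'"
vulnerable = f"SELECT routing FROM accounts WHERE user = '{user_input}'"
leaked = conn.execute(vulnerable).fetchall()   # rows it should not return

# Safe: a bound parameter is treated strictly as data, never as SQL
safe = conn.execute(
    "SELECT routing FROM accounts WHERE user = ?", (user_input,)
).fetchall()                                   # no rows match
```

Note that this failure happens entirely at the application layer - which is exactly why database-level encryption like TDE could not mitigate it.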
As I was in charge of the team that configured the load balancers, and a load balancer config was necessary for the deployed app to be visible on the Internet, I decided to block the deployment of the app and made it clear that I would stand firm until the application had been appropriately secured. I figured that either our CIO would tell me in writing to deploy the app, or I'd get sent to HR for insubordination, or I'd offer my resignation, or I'd bring Internal Audit into the loop and let them take the heat.<br />
<br />
The security team and I laid out a minimum set of requirements for the application and hosting infrastructure and formally presented the requirements to leadership. The application was eventually re-written and deployed - two years later - after most of the recommendations were implemented. The fallout from my and the security team's actions was that we started a semi-serious app-sec program that resulted in app-sec training for all developers and the eventual implementation of basic application development security practices.<br />
<div>
<br /></div>
<div>
We moved the teams up a few levels in the <a href="http://www.google.com/search?q=Conscious+Competence+Matrix">Conscious Competence Matrix</a>.</div>
<br />
<div>
In retrospect, the disconnects were:</div>
<ul>
<li>we had security and infrastructure teams that were very well acquainted with state-of-the-art threats and mitigations but unaware of our app dev practices </li>
<li>we had an application development team that was blissfully unaware of Internet-borne threats and web-app security practices </li>
<li>I ran the infrastructure (server, network, database) teams and we worked closely with the Security team, but we both were somewhat isolated from application development. They didn't learn from us. </li>
<li>we in security and infrastructure had been detecting and cleaning up after web-based compromises for a decade - something to which the application developers had no exposure </li>
<li>the dev teams were caught up in vendor propaganda that asserted that the Java/J2EE/Oracle stack was somehow inherently secure, such that a sane SDLC was not necessary</li>
</ul>
Part of the disconnect was leadership's reluctance to share security-related information, particularly information related to successful compromises of our colleges, other EDUs, and other providers.<br />
<br />
<a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-building-out-disaster.html" target="_blank">Part 6 - Building Out Disaster Recovery</a><br />
<a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-in-it-swimming-with.html" target="_blank">Part 8 - Swimming with the Itanic</a><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-54557189745760014942019-07-09T13:29:00.000-05:002020-01-14T10:16:10.947-06:00Thirty-four years in IT - Building out Disaster Recovery (Part 6)<br />
<div style="border-width: 100%; direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<span style="font-family: inherit;">In the mid-2000's,
our organization started to get serious about disaster recovery. By that time
our core application was an e-learning application that was heavily used (a
hundred thousand students on a typical day). That app became critical to our
mission.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">To bootstrap a DR capability we paid
consultants for what was at best a craptastic DR plan. The plan was not
implementable under any realistic scenario. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The consultants ignored our total lack of a DR site, insisted that we could buy servers overnight, and claimed that because every server had its own tape drive, we could hire an army of techs from Geek Squad and recover all servers simultaneously from individual tape backups. Of course we had no failover site, no hardware, and we had tape changers and a Legato infrastructure that streamed and interleaved multiple backups onto a single tape rather than individual tape drives in each server. I couldn't imagine buying dozens of servers and successfully recovering in any reasonable time frame. The consultants formally presented a 56-hour RTO to our leadership, when my own Gantt charts showed a 3-week RTO <span style="font-style: italic;">after</span> we had a DR site leased, a data center network built, and hardware purchased and racked. So I pushed back hard - and stopped getting invited to the meetings. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">They used nice fonts
though. Give them credit for that. </span></div>
<div style="margin: 0in;">
<br />
<a name='more'></a><br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">After seeing where the consultants were taking us, I pushed our organization toward full hardware and application redundancy and full data center failover capability for all data center hosted systems. My goal was to have two fully functional data centers, identically configured, with identical hardware, full redundancy at the failover site, and near real-time data replication between them, all matched to realistic and achievable RPO and RTO. My rationale was that an organization as small and under-resourced as ours would not be able to build, maintain, and routinely test a disaster recovery site that was not already built, running, and replicated; and that the failover hardware would be usable for pre-production, staging, or some other purpose.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The way we accomplished this was to tackle the longest lead-time constraints first, starting with space. We learned that our partners at the State of Minnesota had several thousand square feet of data center space sitting empty, as they had just consolidated down to smaller mainframes. I offered to lease that space, and then worked with their electricians to preposition the correct power under the floor, having them build out PDUs and pigtails for the servers and storage that we'd parachute in if we had a disaster. That took care of the longest lead-time items - space and power. We then built out a data center network - stubbed out at first, but eventually fully configured and routed to the backbone.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We then invested
heavily in failover hardware. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The 'full failover' strategy meant that if a back-end database required 'N' CPUs in production, we had to purchase and maintain '2N' - 'N' in each of the primary and secondary data centers - and in most cases a fraction of N in one or more QA and development instances. The QA and development instances were configured behind a fully redundant network stack that was used by the network team to QA network, firewall and load balancer technologies. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">As we cycled through normal hardware replacement and rotation, we first filled out the failover data center with one-generation-old hardware, figuring that half a loaf was better than none. Later we started buying and configuring identical hardware in both data centers - ideally upgrading failover first, so we were never in a spot where failover was behind production. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Hardware vendors
loved us. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I felt strongly that if we didn't use the failover environment regularly, it would fall behind production and become unusable for failover - primarily because of configuration rot. This meant that wherever possible we needed to automate the configuration of devices and systems. It simply is not possible to ensure that two systems are identical in any case where they are manually configured. In other words, you must have <a href="https://blog.lastinfirstout.net/2008/04/ad-hoc-verses-structured-system.html">Structured System Management</a> - scripts, not clicks. For UNIX systems this was fairly straightforward. For Windows, the options were few and painful. On the Windows systems we had far more clicks than scripts.</span></div>
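One way to keep two manually-drifting sites honest is to fingerprint their config trees and diff the fingerprints. This is a hypothetical sketch of that idea (in Python, not our actual Perl/shell tooling), assuming the config files from each site have been pulled to local directories:

```python
import hashlib
from pathlib import Path

def fingerprint(root: Path, patterns=("*.conf",)) -> dict[str, str]:
    """Hash every matching config file under root, keyed by relative path."""
    out = {}
    for pattern in patterns:
        for f in sorted(root.rglob(pattern)):
            out[str(f.relative_to(root))] = hashlib.sha256(f.read_bytes()).hexdigest()
    return out

def drift(prod: dict[str, str], failover: dict[str, str]) -> list[str]:
    """Report files missing on either side, or present on both but differing."""
    issues = []
    for path in sorted(set(prod) | set(failover)):
        if path not in failover:
            issues.append(f"missing on failover: {path}")
        elif path not in prod:
            issues.append(f"extra on failover: {path}")
        elif prod[path] != failover[path]:
            issues.append(f"content differs: {path}")
    return issues
```

Run from cron, a report like this turns configuration rot from a silent failure into a daily to-do list - which is the whole point of scripts over clicks.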
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We achieved usable
DR capability for our primary e-learning application in 2006, full capability
for that application in about 2009, and for our ERP some years later. The team
that ran the e-learning environment conducted a run-from-failover exercise annually,
so we were assured that we could meet our published RPO and RTO for that
application and its supporting technology. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Selling Disaster Recovery is hard. Most teams did not buy into the 'Failover is a first-class citizen' mantra that I'd been preaching. For example, even though we had identical failover hardware for the ERP, the ERP team did not maintain failover in a fully configured state - often not even acknowledging the existence of the failover servers - and hence was not capable of conducting a failover within a reasonable RTO. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We did, however - after a 6-month reconfiguration and testing effort - fail the ERP over to a new data center, so we knew it was possible. That effort required reverse-engineering an app (that we had written ourselves) sufficiently that we understood exactly how it was configured. We were then able to re-configure both production and failover identically, and successfully fail over the application. The team that ran the app didn't think it could be done. My team proved them wrong.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-system-administration.html" target="_blank">Part 5 - System Administration, Backups, and Data Centers</a></span><br />
<br />
<a href="https://blog.lastinfirstout.net/2020/01/thirty-four-years-addressing.html" target="_blank">Part 7 - Addressing Application Security </a></div>
</div>
</div>
</div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-84988950394119449502019-07-09T11:48:00.003-05:002020-01-14T10:15:59.272-06:00Thirty-four years in IT - System Administration, Backups, and Data Centers (Part 5)<br />
<div style="margin: 0in;">
<span style="font-family: inherit;">As a side effect of building and running the backbone, I introduced UNIX systems into what was then a wholly VMS organization. We initially used Linux - roughly 1994-1997 - then over the next 20+ years briefly migrated to Solaris x86, then to Solaris SPARC, back to Solaris x86/x64, and finally back to Linux.
</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Our CIO at the time recognized that a pure VMS/RDB shop was not a valid long-term strategy, and as a result had us host a UNIX/Oracle application on behalf of another organization as part of building out a new capability that he recognized we'd need someday. As our VMS/RDB team didn't appreciate (or were genuinely hostile toward) non-VMS platforms, they declined to take on building and managing the UNIX/Oracle stack. So my team and I did. </span></div>
<div style="margin: 0in;">
<br />
<a name='more'></a><br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">My team eventually picked up responsibility for most of the enterprise-wide application hosting. This included Windows and UNIX system administration, SQL Server and MySQL administration, and (eventually) Oracle database administration. As we started managing those technologies we went through a process similar to our early network management work, where we systematized and rationalized the servers and applications, brought them all up to common operating system versions and patch levels, consistently configured file systems, created common application installations, etc. In many cases the simplest, most straightforward path was to reinstall the apps on new operating system instances, then migrate the application data. Oftentimes the servers and applications that we inherited from other teams had so much entropy <a href="https://blog.lastinfirstout.net/2008/03/knownbugs-its-not-done-until-its.html" target="_blank">that more than one fresh install was necessary</a> - the first to learn how the application worked, the second to systematize and optimize the servers and applications for long-term reliability and maintenance. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Inheriting poorly managed, unreliable systems was something that we did several times. In every case, the team from which we inherited the technology or platform had trouble with the basics - knowing which servers were where, what OS version/patch was installed, where and how they were backed up, etc. Step one for us was discovery. On day one I asked my staff to tell me where every inherited server was, what it did, and whether or not it was backed up.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">My team also took on the responsibility for managing data center infrastructure. At the time, the infrastructure was all one-off and ad hoc, with little or no predictability or commonality between racks, servers, etc. We came up with a simple data center network design and common standards for building and powering racks, racking servers, routing power, network and fiber channel, etc. At all times we emphasized redundancy, consistency, simplicity, and <a href="https://blog.lastinfirstout.net/2008/04/ad-hoc-verses-structured-system.html" target="_blank">structure</a>.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">At one point I also moved our enterprise backup system into my team. I felt that the team that had the responsibility for the RPO and RTO of the app should also have responsibility for backup and recovery. And of course I could sleep nights knowing that <span style="font-family: inherit; font-style: italic;">my</span> team was running the backups. We redesigned the Legato Networker based system from the ground up, wrapped it in Perl scripts that covered us in places where Networker fell short, and took on the painful task of managing tape-based backups. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">To ensure that our
backups were reliable, we preferred to incorporate data recovery into routine
processes. For example, one of our apps needed a development database refreshed
periodically. Even though we could have refreshed from production, we did not.
Instead we refreshed from a randomly chosen recent backup and a randomly chosen
point-in-time, thereby exercising full database recovery every few weeks or
months. When we had instances where we had to perform point-in-time
recovery for production systems, we were able to recover easily. </span></div>
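The selection logic for that rotation can be sketched as follows. This is an illustration only - the actual restores ran through Legato Networker and Oracle tooling, and the function and variable names here are hypothetical:

```python
import random
from datetime import datetime, timedelta

def pick_restore_point(backups: list[datetime], now: datetime) -> tuple[datetime, datetime]:
    """Choose a random recent backup, then a random point-in-time after it.

    `backups` holds start times of known-good full backups; in a real
    environment these would be queried from the backup catalog.
    """
    recent = [b for b in backups if now - b <= timedelta(days=30)]
    base = random.choice(recent)
    # Recover to a random moment between the backup and now, forcing the
    # restore to replay logs - a genuine point-in-time recovery exercise.
    offset = random.uniform(0, (now - base).total_seconds())
    return base, base + timedelta(seconds=offset)
```

Because both the backup and the recovery target are randomized, the routine dev refresh ends up exercising the same full restore-and-roll-forward path that a real production incident would demand.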
<div style="margin: 0in;">
<br />
I also drove home the importance of recoverable backups by regularly, first thing in the morning, asking my sysadmins and DBAs how the backups went last night. I set an expectation that by the time I came in, they'd better know. To further emphasize the importance of backups, I used to ask my sysadmins to delete a mailbox or directory structure while I watched, and then recover whatever they'd deleted using last night's backup. If they hesitated, I knew that they were not confident in their backups.<br />
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We also built robust
remote management into every device in both the data center and backbone. Every
serial console was attached to a network-enabled terminal server, every
keyboard and monitor was attached to a network-enabled KVM, and every server
chassis had its lights-out board fully functional. </span><span style="font-family: inherit;">The network
interfaces to the remote consoles were attached to partner networks - so that
even if we borked up our data center or backbone network completely, we could
probably recover it without going on site.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">My goal was to minimize the necessity of visiting the data center and maximize our ability to work remotely, including from home. Once we got fully remote-capable, we were able to perform major upgrades, database and server migrations, and data center failovers while working remotely.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">During the period when Solaris was making great strides in advancing the state of the art in UNIX systems, we fully exercised the advanced features of Solaris 10. We fully adopted ZFS, zones, live migrations, resource management, etc. For this I credit our top-notch lead UNIX sysadmin, whose skills are equal to anyone's, anywhere. We also pushed Solaris hard enough to uncover a couple of catastrophic ZFS bugs - resulting in corrupted ZFS file systems and full point-in-time database recoveries.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Once Oracle bought Sun Microsystems, I stopped investing in Solaris.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">As a side effect of hosting a particular application, we introduced content-aware load balancing. We ended up with NetScaler load balancers - which turned out to be a very good choice. We quickly implemented a standard that required all applications to be layer-4+ load balanced, even if they were single-server and non-redundant. The load balancers were implemented as reverse proxies with SSL termination, content awareness, and URL filtering. Our goal was that no application or server could be visible to the Internet without a load balancer configuration for that application or server. </span></div>
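The "nothing visible without a load balancer configuration" rule meant every exposed application needed an explicit VIP-to-service mapping. A hypothetical NetScaler-style configuration for one such app might look like the following - the object names and addresses are invented for illustration, not taken from our environment:

```
add service regapp_svc 10.10.1.21 HTTP 8080        # back-end app server, private address
add lb vserver regapp_vs SSL 203.0.113.40 443      # Internet-facing VIP with SSL termination
bind lb vserver regapp_vs regapp_svc               # map the VIP to the service
bind ssl vserver regapp_vs -certkeyName regapp_cert
```

Anything without such a mapping was simply unreachable from the Internet, which is what made the reverse proxy layer an effective control plane rather than just a scaling tool.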
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The load balancers therefore provided a strict control plane that managed access to the application, and an extremely useful layer of abstraction and isolation between users and the server(s) that hosted the applications. At first - in the early 2000's - most applications balked at being hosted behind a proxy. We often had to reverse-engineer the vendor application sufficiently to make it work in our environment. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The combination of outbound default-deny on the data center firewalls and the reverse proxy layer was instrumental in helping secure the applications. In many cases we were able to analyze the latest vulnerabilities and determine that in our strictly controlled environment, the attack vector was not viable. That allowed us to be far more thoughtful and rational about when and how to accelerate patching and vulnerability management. </span></div>
<br />
<a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-security-and.html" target="_blank">Part 4 - Security and Firewalling</a><br />
<a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-building-out-disaster.html" target="_blank">Part 6 - Building out Disaster Recovery</a><br />
<br />
<div>
<br /></div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-5779635550252141322019-07-03T19:05:00.002-05:002020-01-14T10:15:46.223-06:00Thirty-four years in IT - Security and firewalling (Part 4)<br />
<div style="border-width: 100%; direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<span style="font-family: inherit;">As a natural fit with running the network, my team took on the task of securing the campuses and data centers, starting with firewalling the data centers from the rest of the network. We started fairly simply, segmenting enterprise-wide servers from networks with users and students and eliminating unfettered access to enterprise servers, databases, and systems. This gave us the ability to control access to the core servers and systems. As expected, this initial segmentation was resisted most by the system managers and DBAs who managed the individual servers and databases. They were convinced that the only way they could possibly do their job was if they had full access to everything, all the time, from everywhere - even if they had no idea how they were accessing the system. This was a pretty typical attitude at the time, and to me an indicator that they didn't actually know how their systems worked.</span></div>
<div style="margin: 0in;">
<br />
<a name='more'></a><br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">In about 2001 we installed firewalls between each campus and the Internet. This firewall project was one of the first times that I was exposed to vendor and consultant FUD. We had vendors telling us that only certain models of firewall could actually secure our networks. We had consultants tell us that we were not capable of firewalling our own network and that only they had the necessary skills. The consultants working most closely with various public sector agencies were also the ones most clearly full of FUD.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">One of the leading service providers in the area quoted a minimum of $5,000 per site up front, plus an ongoing operating cost of $2,500 per site per month. We had 55 sites, so we'd have had a project cost of over $1.5M per year. Instead we decided to do the project internally with minimal consulting help and low-cost Cisco PIX firewalls. We hired a contractor to help us configure the first couple of firewalls, then rolled out and managed the firewalls ourselves. I wrote a series of shell scripts that would automatically configure firewalls, switches, routers and terminal servers. We came up with a standard campus network edge design that was detailed enough to specify every connection, port, and even the color of every cable. We went around the state working with campus staff to make sure that they understood how firewalls worked, and how to work with us to manage the firewall rules.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Total cost of the project ended up far less than budgeted and far less than any vendor or consultant quote. Years later we still had not spent as much as the first-year cost would have been had we gone with vendors and consultants.</span></div>
</div>
</div>
</div>
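The scripted rollout worked roughly like the following sketch: a standard edge template expanded once per site. The site data and PIX-style commands are illustrative (and in Python rather than the original shell), not the real templates:

```python
# Hypothetical site parameters - the real build covered 55 campuses.
SITES = [
    {"name": "campus01", "outside_ip": "198.51.100.10", "inside_ip": "10.1.0.1"},
    {"name": "campus02", "outside_ip": "198.51.100.14", "inside_ip": "10.2.0.1"},
]

# A deliberately tiny PIX-flavored template; the production version also
# covered switches, routers, and terminal servers.
TEMPLATE = """hostname {name}-fw
ip address outside {outside_ip} 255.255.255.252
ip address inside {inside_ip} 255.255.0.0
access-list inbound deny ip any any
access-group inbound in interface outside
"""

def render_configs(sites):
    """Expand the standard edge template once per site, keyed by site name."""
    return {s["name"]: TEMPLATE.format(**s) for s in sites}

configs = render_configs(SITES)
```

Because every site was generated from the same template, the 55 edges stayed consistent by construction - the same property the standard rack and cabling design provided for the physical build.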
<br />
<div style="margin: 0in;">
<span style="font-family: inherit;">When we firewalled the campuses we also had to go against the common wisdom that educational institutions were impossible to firewall - a stance that some educational institutions maintained for many years. We obviously were able to prove conventional wisdom wrong. As described below, not only were we able to firewall campuses in 2001, we were also able to implement strict network segmentation policy in our data centers with outbound default deny as early as the mid-2000's.</span></div>
<div style="border-width: 100%; direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Over time, we hired dedicated security staff, and with their help improved on the data center security model by segmenting servers within the data center based on the relative importance of the data. We built on that model for many years, eventually adding fine-grained network segmentation, dedicated jump server networks, dedicated management networks, and dedicated console networks. The intent was to isolate data center networks from desktops as much as possible and to prevent propagation of security incidents through the data center networks and across unrelated applications. As the data center networks contained only known servers for known applications, we were able to implement a bidirectional 'default deny' network security policy. In other words, servers within the data center could not connect to addresses on the Internet unless specifically permitted by firewall policy.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The strict firewall policy mitigated many of the common attack vectors to which other organizations had succumbed. By restricting an application's ability to connect out to random Internet IP addresses, we also lessened our dependency on application security - something at which applications were (and still are) notorious for failing. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We developed a strong operating principle: <span style="font-style: italic;">"If it can surf the Internet it cannot be secured"</span>. In other words, when securing our applications and data we did not trust our own desktops. This principle was - and still is - validated by even the most casual attention to desktop and application security.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">By following this principle we were able to move nearly all critical user data off of desktops and onto data center servers, where we felt reasonably confident in our ability to secure the data. We came up with methods for allowing remote access to data center servers and applications from what were relatively insecure desktops. We were able to shut off all direct desktop access to database listeners by installing every application that required listener access onto remotely accessible servers, which ran the desktop application and managed the data that would normally have been downloaded to the desktop. The data never left the data centers, so it was far easier to secure than it would have been once downloaded to desktops.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">This was fairly complex and expensive to run, as it required a thorough understanding of exactly how every application and technology worked - in many cases something that not even the vendors who wrote the application understood. We oftentimes ran into vendors who told us that their application or technology could not be firewalled, or that if we were to attempt to firewall it they would not support us, or who told us how to firewall the app or technology but were wrong - they simply didn't know how their own app or technology worked. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">This also required a significant effort to convince users that the inconvenience of remotely accessing their data in the data centers reduced security risk enough that their obligation and responsibility toward the owners of the data would be well met. In most cases the user aspect of the problem was harder to solve than the technical aspect.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-system-office-novell.html">Part
3 - The System Office, Novell Directories and Building a State Backbone</a></span><br />
<a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-system-administration.html" target="_blank">Part 5 - System Administration, Backups and Data Centers</a></div>
</div>
</div>
</div>
<br /><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-72087184278417353832019-07-02T17:24:00.000-05:002020-01-14T10:15:32.445-06:00Thirty-four years in IT - The System Office, Novell Directories, and Building a State Backbone (Part 3)<span style="font-family: inherit;">Unfortunately nearly
all the work we put into administrative and academic technology had to be
abandoned. As a part of a larger initiative across the state, the various
colleges and universities were being merged together into a single system that
today is known as Minnesota State. In that process our college president
retired, and the new college leadership de-emphasized the use of technology in
business practices. Additionally, I recognized that at merger time most of the
software that I had written would not be usable, so I spent some time getting us
off the software I wrote and on to other software that I knew would be used
post-merger.</span><br />
<div style="border-width: 100%; direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">In a lot of ways
that was a setback for both the college and the students. It was many years
before faculty and students would have the functionality we had in 1993.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"></span><br />
<a name='more'></a><span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">In the summer of
1995, as the colleges were being merged into what is now Minnesota State, I
decided that since the college was no longer emphasizing technology and since I was ready to leave the small-town environment, I would move to the system office and see what opportunities would be available
there.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I became a Novell
System Administrator and built out an NDS tree that turned out to be far more
similar to modern directories than it was to the best practices of the
time. Conventional wisdom was that you divided your directory up by business
unit or business function and created the user, printer, file share, and other
directory objects related to each business unit in that unit's OU. I
quickly realized that conventional wisdom and the Novell Directory design
guides were wrong. The 'organization hierarchy'
style directory architecture didn't make a whole lot of sense, so I
implemented a relatively flat directory where the only OU's were at a very high
level and were used only to categorize types of objects rather than business units,
departments or divisions. A flat directory is much closer to what
you'd see today in something like Azure Active Directory.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">That directory had continuous uptime - i.e. you could log into the directory - for over a decade. Of course the servers providing services were not always up, but the directory was. :)</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Conventional wisdom
isn't always right. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">While
troubleshooting a statewide IPX routing problem I realized that in the post-merger
chaos, the wide area network was essentially unmanaged. There were
routers all across the state on either 56 kbit circuits or T1's running along
without any active oversight. So I grabbed the Cisco manuals, taught myself IP
routing, asked a co-worker for the passwords, and started managing the wide
area network.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">That meant
physically locating all the routers around the state, documenting circuit
locations and circuit ID's, cleaning up routing tables, cleaning up
configurations, and getting the hardware on common operating system versions.
We also created a common campus network edge design, wrote our own database
driven network monitoring package, combined monitoring and CMDB databases, and
built out a system of thresholds, triggers and alerts to help us keep tabs on
the hundreds of devices and dozens of sites. Written in Perl and leveraging
MRTG, of course. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">To help us monitor
the network we purchased Netscout probes for every site. Because we had
more than 50 probes, and GUI's don't scale well, I dug into the Netscout
management software and figured out that the graphical user interface was just
a skin over the top of some very powerful command line programs. I wrote a package around those programs that
automated the maintenance and monitoring of the probes and automatically gathered
'Top N' IP host, TCP host and port counters from the probes. I exposed that data to campuses
so that they had a pretty complete view into their network traffic readily
available to them. My idea was that if they knew more about what was happening
on their network, they'd be able to do a better job of running it,
and most importantly - they'd be able to resolve more of their own problems
themselves. And call me less often.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">As the wide-area
network evolved and became more critical to our operation, we had to make
significant investments in bandwidth and availability. Making those investments
on our own without a partner turned out to be very difficult, so we partnered
with the State of Minnesota to leverage our resources and their skills to build
a common statewide backbone usable by all State agencies and our system -
Minnesota State. For nearly twenty years, the State of Minnesota has been the
backbone provider for Minnesota State Colleges and Universities. As the State
had already partnered with the University of Minnesota, the three largest
public entities in the state all<span style="mso-spacerun: yes;"> </span>share a
common state backbone and Internet connection. In the partnership, Minnesota
State benefits from the resources and expertise at the University of Minnesota
and State of Minnesota, allowing us to concentrate our skills on other aspects
of running a state-wide enterprise.</span></div>
<div style="margin: 0in;">
<br />
<span style="font-family: inherit;">Prior: </span><a href="https://blog.lastinfirstout.net/2019/06/thirty-four-years-instructor-machinist.html" style="font-family: inherit;" target="_blank">Part 1</a><span style="font-family: inherit;"> and </span><a href="https://blog.lastinfirstout.net/2019/06/thirty-four-years-networking-and.html" style="font-family: inherit;" target="_blank">Part2</a><span style="font-family: inherit;">.</span><br />
<span style="font-family: inherit;">Next: <a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-security-and.html" target="_blank">Part 4 -Thirty-four years - Security and firewalling</a> </span><br />
<div style="direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
</div>
</div>
</div>
</div>
</div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
</div>
</div>
</div>
<br /><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-4772777416402965712019-06-30T14:44:00.001-05:002020-01-14T10:15:06.994-06:00Thirty-four Years in IT - Networking and Software Development (Part 2)<span style="font-family: inherit;">At the college we
were extremely fortunate to have a president who had a very forward-looking
view of technology. In the mid 1980s he was already using personal computers
regularly and had written some of his own software. Sometime around 1988 or so
he described what he thought would be appropriate use of technology in
education. He wanted all student records and curriculum to be electronic, all student testing to be electronic, and all grading to be
electronic. He envisioned that students could walk up to a computer, login and
access the curriculum, access and complete tests and quizzes, look up their
progress toward graduation and any fees they may owe, and generate a
transcript.</span><br />
<div style="border-width: 100%; direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">And of course he
wanted it all tied together on a network.</span></div>
<div style="margin: 0in;">
<br />
<a name='more'></a><br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">About that time
(1989) I was experimenting with RDBMS software (R-Base) to put together a
simple system for recording student assignments and scores. The college had
installed a couple of local area networks (Netware 2.0a on 286's, ARCNET and
IBM baseband), and was starting the process of replacing an IBM System36 and
RPG with Netware, DOS and Ansa Paradox 2.0. He saw my prototype and offered me
a move from teaching to full time IT - in a department of one. Me. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">So I built a shiny
new campus-wide routed <a href="https://en.wikipedia.org/wiki/ARCNET">ARCNET</a>
network, built a couple of Netware 2.0a servers, and started writing the
software that would execute his vision using the multi-user DOS-based Paradox 3.0 as the
RDBMS development platform. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Within a few years
we had a fully networked campus with nearly every computer and lab on the
ARCNET network, multiple Novell servers, and real-time relational database
software that covered most of the administrative and academic computing for the
college. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I learned the
fundamentals of managing relational data, normalization, foreign keys, indexes,
etc. The college ended up with desktop software that managed student
registration, fees, payments; quiz, assignment and test scores; course grades
and academic transcripts. By the early '90's, students could walk up to
specially configured PC's at any time and look up their grades and see exactly
what they needed to do to complete a course, program, degree or certificate, and graduate. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Everything was
multi-user and real-time. If an instructor entered a test score, the student
would see the score in real time. If the score was the final score needed to
complete a credit, course or entire program, the student would see the final
credit, course or program transcript seconds later. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">No batch jobs. All
DOS based and Netware networked. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">One of the barriers
in the pre-386 days was DOS's inability to multitask. That made some aspects of
the software pretty difficult to use, and any task that required significant
compute could not be backgrounded. That meant, for example, that a registrar who
needed to run a report for a student would have their PC tied up waiting for the report to complete. That slowed down
the registration process significantly. I scratched my head a bit -
multitasking OS's were not readily available. Quarterdeck DESQview and QEMM
were fairly usable but affected desktop performance. Instead I came up with a
novel solution that allowed background processing without affecting the
performance of the user's desktop.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I created a small
multi-user relational database that acted as a job queue. When real-time
users executed functions that would likely take more than a couple of seconds,
the application would insert the function & parameters into the 'job'
database. A rack of old 286's was left running and logged into a 'worker' app that
would scan the job database; if a
record was unlocked, a worker would lock the record, execute the function, return the
results to either the student record RDBMS, a temporary table, or directly to
the Registrar's printer, then unlock and archive the job record. During peak
processing times (registration week) I simply dusted off more old 286's and
net-booted them into the worker app, where they would share the background
processing load. That made the whole thing somewhat scalable. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I think it was
called 'Distributed Network Computing' or something like that. But I used RDBMS
record locks as the semaphore and database records as the messaging path.</span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">After the student
records database was roughed in and working, the President had me look at the
purchasing system. We were using an archaic green-screen app hosted on a
mainframe run by a service provider. The app was obviously ported from punch
cards - the green screen input had to be column aligned, the results of a
'submit' would show up minutes or hours later, and any errors caused an entire
batch to be rejected. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">And of course all
that did was update the chart of accounts. The actual PO's were done on a
typewriter. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Brutal, considering
what we had on the student records side. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We took a look at
the 'modern' version of that provider's software - a PC application that put the
80 column, position sensitive batch processing on a DOS green screen with a
modem to submit the job to the mainframe. Errors were returned the next time
you dialed up. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Brutal, considering
what we had on the student records side. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The President asked
me how long it would take to rough in something using the Paradox RDBMS. I
committed to a prototype in 3 weeks. In a handful of months we had a fully relational
multi-user real-time purchase order and account management system where
departments could key in their own PO's, accounting could approve and print
them, the chart of accounts and budgets would automatically be encumbered, and the
service provider's mainframe would be updated at the end of the day. Departments
could look up their budget at any time and see the balance and every
transaction affecting their accounts. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">And of course the
President had a small app that let him monitor budgets real-time, showing
recent encumbrances bubbled to the top and highlighted red.<span style="mso-spacerun: yes;"> </span>He watched the budget like a hawk. </span></div>
<div style="margin: 0in;">
<br /></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I eventually
integrated our Financial Aid software - a commercial DBase-based package. That integration allowed us to automatically determine how much of each student's financial aid was surplus to
their tuition and fees and automatically cut financial aid checks. Of course
the first time we did the automatic check printing, we <a href="https://blog.lastinfirstout.net/2011/11/how-not-to-disburse-financial-aid.html">messed
it up</a>.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We also struggled with room scheduling, so I built a really simple room scheduling application using Paradox. The core of the application was a single table with building, room and 10-minute time block as a composite primary key. </span>Checking<span style="font-family: inherit;"> room availability was a simple query, and Paradox's built-in key and record management prevented duplicate events.</span></div>
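A present-day sketch of that design, using Python and SQLite in place of Paradox (the names and the time-block format are mine, for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schedule (
        building TEXT,
        room     TEXT,
        block    TEXT,   -- one 10-minute time block, e.g. '09:10'
        event    TEXT,
        PRIMARY KEY (building, room, block))""")

def book(building, room, blocks, event):
    # The composite primary key makes a double-booking a key violation;
    # the whole booking rolls back if any block is already taken.
    with conn:
        conn.executemany(
            "INSERT INTO schedule VALUES (?, ?, ?, ?)",
            [(building, room, b, event) for b in blocks])

def is_free(building, room, block):
    # Checking availability is a simple primary-key lookup.
    return conn.execute(
        "SELECT 1 FROM schedule WHERE building = ? AND room = ? AND block = ?",
        (building, room, block)).fetchone() is None
```

Booking an event inserts one row per 10-minute block inside a transaction, so a clash on any block rejects the whole booking - the same duplicate prevention the Paradox key management provided.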
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">By 1992-1993 I had automated most of the mundane paper-generating processes and several data-entry jobs - which we eliminated through attrition. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span>
<br />
<div style="margin: 0in;">
<a href="https://blog.lastinfirstout.net/2019/06/thirty-four-years-instructor-machinist.html" target="_blank">Part 1- Instructor, Machinist, CNC and CAD/CAM</a><span style="font-family: inherit;">.</span></div>
<br /></div>
<div style="margin: 0in;">
<a href="https://blog.lastinfirstout.net/2019/07/thirty-four-years-system-office-novell.html" target="_blank">Part 3 - Thirty-four years - The System Office, Novell Directories, and Building a State Backbone</a> </div>
</div>
</div>
</div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-76142641052872600782019-06-30T12:40:00.000-05:002020-01-14T10:14:54.344-06:00Thirty-four Years in IT - Instructor, Machinist, CNC and CAD/CAM (Part 1)<span style="font-family: inherit;">As I've now ended 34+ years of public service, I'm going to burn a few posts on where I've been and what I've tried to accomplish.</span><br />
<div style="border-width: 100%; direction: ltr;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="direction: ltr; margin-left: 0in; margin-top: 0in; width: 6.502in;">
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">Like many people my age, my path toward a career in technology was non-linear. My first stop after a Baccalaureate in Physics was a move into teaching Machine Tool trades at a 2-year college. Makes sense, right? Actually I had taken a few programming courses in college (</span>FORTRAN<span style="font-family: inherit;">, Pascal, PDP-8 Assembler, SNOBOL, FORTH), had worked my way through college as a machinist, and taught myself how to program CNC machines. So the trade school route wasn't too much of a stretch. </span><br />
<a name='more'></a></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7oUV-5paS-keCYoC_aNSodjZ8ooCKrIi-Bwr_CXRuBbpS0bb2OmA-_n4RDQGqt_joRfvw8cACcHy1_anGt2SDm3na4J8S3QjGQq9WEp7aqqYEN4DraWOli0EnFhKSoR_XPcwgIEZDIUJo/s1600/180px-Flexowriter%255B3%255D" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="150" data-original-width="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7oUV-5paS-keCYoC_aNSodjZ8ooCKrIi-Bwr_CXRuBbpS0bb2OmA-_n4RDQGqt_joRfvw8cACcHy1_anGt2SDm3na4J8S3QjGQq9WEp7aqqYEN4DraWOli0EnFhKSoR_XPcwgIEZDIUJo/s1600/180px-Flexowriter%255B3%255D" /></a><span style="font-family: inherit;">When I started teaching (in 1984…arghh…) the tools of choice for programming machine tools were either a <a href="https://en.wikipedia.org/wiki/Friden_Flexowriter" target="_blank">Flexowriter</a>, a Model 43 <a href="https://en.wikipedia.org/wiki/Teletype_Corporation" target="_blank">Teletype</a> with a tape punch, or a really expensive CAD/CAM system. The CAD/CAM system that I inherited was a 1970's vintage TTY based system running on a Data General Nova 3. Input was a proprietary language entered via a TTY. Output was either a tape punch or a one-pen HP plotter. To boot the Data General you toggled switches on the front panel to fire off a couple of instructions that would boot from floppy. Note that there was no video/monitor/CRT. If you wanted to see your machine tool path, you put ink in the pen and plotted the path on 11 x 17 paper. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I convinced the college to move from that to a modern (for the day) PC-based system that had a color monitor, 8" floppies and ran the UCSD P-System operating system. And that cost $25,000. The CAD side of the system could draw 3D wire frames and output to the CAM system. Students would learn manual G-code programming and a bit of drafting and CAD/CAM while programming and running their parts on the CNC. They also learned a bit about using a keyboard, digitizer and mouse, editing, saving and retrieving text, managing files, etc. For nearly all students, this was their first exposure to computers.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The college was closely aligned with local business and industry, and as a result we developed a relationship with the tool room at the local 3M plant. They hired our graduates and advised us on curriculum. At the time, our CAD/CAM was far ahead of what the local plant had available to them. The tool room did pretty complex mold repair and had a CNC, but with no reasonable means of programming one-off mold repairs, the machine was underutilized. As a favor to them I used the college's CAD/CAM to generate machine tool paths for their CNC. The most complex and interesting was a machine tool path for one of 3M's most popular tape dispensers, generated using a ton of algebra that the CAD/CAM could convert to a tool path.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">About that same time, our Drafting Department introduced AutoCAD on first-generation IBM PC's that still had the cassette port. At the time it was pretty primitive compared to the CAD/CAM that we had in the machine shop. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">We also had a significant issue teaching students how to program and operate CNC machine tools. As those of you who have done any machine tool programming know, typographical errors can get really expensive really fast. Misplace a decimal point and you can crash a machine and incur thousands of dollars worth of damage. Some schools tried to mitigate this by limiting students' exposure to the CNC machines, something which I thought would defeat the point of hands-on vocational training. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">After having to tear down CNC machine tool spindles, replace spindle bearings and re-align ways, I decided to go a different direction and try and come up with a way of proving that the students programs were valid and would not crash the machine. I did a little bit of experimenting, learned 'C', and wrote software that could read a machine tool G-code program and draw the three dimensional tool path on screen. One could also run the simulation from different 3D viewpoints, edit and save the G-code program, and download it to the CNC machine. We sold a few copies of that program to other schools - just about enough to pay for the ads that we ran in trade publications.</span></div>
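The core of such a simulator is just interpreting moves and accumulating coordinates. A toy illustration in Python, handling only G00/G01 linear moves (the original 'C' program of course handled far more, including the 3D display):

```python
def parse_gcode(lines):
    """Parse a minimal G-code fragment (G00/G01 linear moves only) into
    a list of (x, y, z) tool positions, starting from the origin."""
    x = y = z = 0.0
    path = [(x, y, z)]
    for line in lines:
        words = line.upper().split()
        if not words or words[0] not in ("G00", "G01"):
            continue  # ignore anything but rapid/linear moves in this sketch
        for word in words[1:]:
            axis, value = word[0], float(word[1:])
            if axis == "X":
                x = value
            elif axis == "Y":
                y = value
            elif axis == "Z":
                z = value
        path.append((x, y, z))
    return path
```

With the tool path reduced to a list of points, drawing it from any 3D viewpoint is just a projection of those points onto the screen - and a program that drives an axis past its limits is caught before it ever reaches the machine.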
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The available PC screens at that time were </span>technologies<span style="font-family: inherit;"> like EGA, CGA, and Hercules graphics. The Microsoft 'C' compiler had libraries that helped manage graphics and isolate most of the low-level screen management from the developer, so I didn't have to use much assembler, and could reliably draw onto a canvas and let the libraries worry about the details of the graphics cards. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I also was frustrated at the quality of the CNC textbooks that were commercially available, so I started creating my own content. That content eventually evolved into a CNC textbook that was published by one of the major textbook publishers. Writing a textbook, drawing the line art in AutoCAD, and having a friend take all the photos was a great learning experience. The text was somewhat successful. Even after 27 years you can still buy a copy of the book on Amazon - though the publisher dumped all their copies years ago. Seems like some things never die.</span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">The college's teaching and learning methodology was heavily dependent on both written and video content. To facilitate creating written content, I started using Desktop Publishing systems - really just a PC with a decent monitor and GUI editing software. Once word processors were reasonably capable, the dedicated desktop publishing systems were overkill so I stopped using them. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">By the time I had the textbook ready to publish, I had plenty of experience with desktop publishing, word processing and CAD/CAM. I offered to provide the text and line art electronically, but the publisher clearly wasn't going for that, so I supplied a stack of printed paper, plotted line art, and 8 x 11 photos. The publisher handed the stack off to an independent contractor who started to key and scan the whole mess into a desktop publishing package. I contacted her directly and sent her a few floppies. She thanked me. </span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;"><br /></span></div>
<div style="margin: 0in;">
<span style="font-family: inherit;">I enjoyed teaching - at least for a while. I think my proudest moment was when a student of mine came back a few years after graduation, had started his own mold-making shop, hired a handful of people and offered that 'anytime I needed a job, he'd hire me'. Having helped bootstrap him and many others was very rewarding. </span><br />
<br />
<a href="https://blog.lastinfirstout.net/2019/06/thirty-four-years-networking-and.html" target="_blank">Part 2 - Networks and Software Development</a></div>
</div>
</div>
</div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-64536255706247629512018-10-27T10:56:00.000-05:002020-01-04T19:51:28.001-06:00On aging (software)I’m looking at an old (early 20th century) hand-crank record player that was handed down to me from my great-grandmother. It’s a simple wooden box with a spring & flywheel mechanism that spins the turntable at a somewhat constant speed, a metal needle that rides in the grooves of a record disk and transmits the vibrations onto a small metal drum, and a big metal horn that focuses the sound from the drum and directs it out into the room. The power source is a human winding a spring. The sound and amplification are purely mechanical.<br /><br />It’s simple. If it breaks you can take it apart, look at what’s inside, and with just a bit of tinkering you’ll probably get it to work again. If it somehow survives a few thousand years, the people from that era will look at it, figure out what it is supposed to do and with a bit of tinkering it’ll be made to work again. If one were to draw the mechanicals out on archival paper, a person from the future would be able to create a functioning replica using only late 19th/early 20th century technology. <br /><a name='more'></a><br />The modern equivalent? Instead of wood and metal that with maintenance and preservation can be made to last centuries or can be recreated from scratch with a bit of work, we have disposable silicon & proprietary software – with no realistic means of preserving or maintaining over a long period of time. <br /><br />For me, the non-maintainability and short life of software-centric devices affects how I view long term purchases. I shun software dependency (and internet connectivity) on devices that I expect to last more than a few years. 
For example - my stove, refrigerator, espresso machine, clothes washer and dryer are all more than a decade old and perfectly functional. There is no software to speak of, & the electronic and mechanical parts are still available and replaceable. Likewise, I have two cars that are approaching 15 years old and still working fine. Neither has ever had an issue with the ‘black boxes’ on which they are somewhat dependent, neither has had a software-related defect, and if I keep doing basic maintenance and can keep them from rusting, both could be made to last a few more decades.<br /><br />On the other hand, I have two other vehicles that are new and heavily dependent on software – so much so that in three years I’ve had one recall for a software bug that would have been pretty significant had I hit it while going down the road, and one software related recall that would have toasted my 4wd transfer case had I hit the bug under the right conditions. I also have a vehicle that could potentially last decades, but because some of its functionality is tied into current generation smartphones, I expect that if it’s still around I’ll see those functions stop working at some point in time. <br /><br />There are two factors that I consider when looking at long-term purchases. First, <a href="https://lonesysadmin.net/2017/09/27/software-always-broken/">software is always broken</a>. It’s broken when it ships, it’s broken every time you use it, and it’ll still be broken a decade from now when you need it to make your car or appliance work. All software has bugs, and many of those bugs are not uncovered until long after the vendor stopped maintaining the software. Decade-old bugs are found pretty regularly, and few vendors maintain the ability to re-compile and distribute updates for decade-old devices. 
Any functionality that is dependent on either of the major smartphone OS’s is also time-limited, as vendors simply don’t maintain software <a href="https://www.techdirt.com/articles/20180901/06083040561/united-airlines-made-app-stop-working-my-phone-what-this-says-about-how-broken-mobile-tech-space-is.shtml">compatibility beyond a few OS versions</a>. Nor can we expect that when Android and iOS no longer dominate the smartphone market, as happened to Blackberry and Symbian before them, vendors will forward-port functionality from decade-old devices into the then-current OS. <br /><br />That effectively sets an upper limit on the long-term viability of any software-dependent device, and unlike electro-mechanical devices, the prognosis for long-term repair/maintenance is pretty poor. <br /><br />Like it or not, we are in an era where much of what we have will be lost - not because of rust and decay - but because we will lose the means of maintaining the software that powers the present.<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-64778758289423485342015-11-07T10:07:00.001-06:002018-10-27T10:05:41.528-05:00Blog: Resurrect or Die?This blog has been idle since 2012. Does anyone care?<br />
<br />
Like many, I let this blog die. I think that’s happened for a variety of reasons, both personal and professional.<br />
<br />
Relevance: Most of what I was posting was clearly not going to make any difference to me or anyone else. Posts that announce [random software] has [random vulnerability] could just as well have been machine-generated. The state of software security is, at best, marginally better than it was a handful of years ago, and some random guy blogging about it isn’t going to change that. <br />
<br />
A few posts were and are still relevant. No matter what I do, I’ll maintain access to those.<br />
<br />
Burnout: Having <a href="http://blog.lastinfirstout.net/2011/08/oracle-1120n-sev-1-sev-1-sev-1-and-sev.html">more than one open Oracle Sev 1</a> per team member was brutal. We were in a situation where a small handful of us were working a seven-day pace for months. A few months of that and I was more than ready to back off from my immersion in technology and recuperate a bit. Hobbies got more interesting, and being able to disconnect from 24x7 operational responsibilities was essential.<br />
<br />
Chaos: Having multiple layers of leadership turnover simultaneously created a work environment that was unpredictable and chaotic. Uncertainty affected staff morale, and the working environment went from stressful but fun to simply stressful. I got a new boss, and along with that a significant job change.<br />
<br />
Blogging is dead: Of the hundred-odd blogs that were in my RSS feed, only a handful are being maintained. The center of gravity has shifted. I miss following bloggers – well-formed, thoughtful ‘long form’ writing still interests me far more than short, disconnected 140-character messages.<br />
<br />
So – let die gracefully, resurrect, or something in between?<br />
<br />
--Mike<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-71333098986413472862012-08-07T19:29:00.001-05:002012-08-09T07:32:55.850-05:00The very four digits that Amazon considers unimportant...<blockquote class="tr_bq">
<i>"The very four digits that Amazon considers unimportant enough to
display in the clear on the Web are precisely the same ones that Apple
considers secure enough to perform identity verification..." Honan wrote.</i></blockquote>
Four digits, when combined with my home address and bank account number, were all it took for me to gain online access to a dormant checking account at my bank and enable fund transfers. If I were fond of the various auto-pay options, there would be a dozen or so companies that would have my checking account number, and pretty much anyone in the world can find out my home address (I own a house, so it's in various public records).<br />
<br />
Segmenting one's online life into non-overlapping buckets seems like the best way to break the daisy chain that led to the hack and data loss. I've followed that principle. I try to maintain separate, non-overlapping e-mail addresses and passwords for any online account that either is connected to something that could cost me money if it were compromised, or is used for account verification for any of those accounts.<br />
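One way to keep the bookkeeping for that kind of segmentation manageable, assuming a mail provider that supports plus-addressing ("+tag" aliases), is to derive each account's address from a base mailbox. A minimal sketch in Python - the mailbox, domain, and service names are invented for illustration, and note that a "+tag" alias is trivially stripped by anyone who knows the convention, so fully separate mailboxes, as described above, remain the stronger option:

```python
def segmented_address(mailbox: str, domain: str, service: str) -> str:
    """Derive a unique plus-addressed alias for one online service,
    so that a leak from that service doesn't expose the address used
    for any other service."""
    # Normalize the service label so the alias is predictable.
    tag = "".join(c for c in service.lower() if c.isalnum())
    return f"{mailbox}+{tag}@{domain}"

# e.g. segmented_address("mike", "example.net", "My Bank")
#      yields "mike+mybank@example.net"
```

The derived alias also doubles as a tracer: if mail addressed to the bank's alias shows up somewhere else, you know who leaked it.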
<br />
I have lots of e-mail accounts and addresses. It's a pain in the azz, and it's only a partial solution.<br />
<br />
Read: <a href="http://arstechnica.com/security/2012/08/amazon-fixes-security-flaw-hackers-used-against-wireds-mat-honan/" rel="nofollow" target="_blank">ARS</a>, <a href="http://www.wired.com/gadgetlab/2012/08/apple-amazon-mat-honan-hacking/" rel="nofollow" target="_blank">Wired</a><br />
<br />
More thoughts <a href="http://libarttech.blogspot.com/2012/08/hack-slash-burn-crush.html" target="_blank">here.</a> <br />
<br /><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-34697221094111313922012-07-06T22:04:00.003-05:002020-01-04T19:56:02.270-06:00In MumbleWare versions 8.2 and below, the SA password must be set to propq<br />
An e-mail from a vendor, somewhat anonymized:<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">From: ****<br />
Sent: Wednesday, April 11, 2012 08:22 AM<br />
To: ****<br />
Subject: MumbleWare Case 123456789<br />
<br />
<br />
Hello ****,<br />
Thank you for contacting MumbleWare Product Support. I am writing to you in reference to case number 123456789 regarding your request to change your SA password. In MumbleWare versions 8.2 and below, the SA password must be set to propq. If a different password is used, MumbleWare may not be able to communicate with the database and error messages will be generated. The attached KB article references the fact that the SA password must be set to propq in MumbleWare versions 8.2 and below. The second KB article lists the steps involved in moving from MumbleWare 8.2 to 8.3. </span></blockquote>
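For contrast, the pattern the 8.3-era fix presumably enables is the ordinary one: the application reads its database credentials from configuration or the environment at startup, so the DBA is free to rotate the SA password on the server side (a one-line `ALTER LOGIN sa WITH PASSWORD = ...` in T-SQL) without breaking the product. A minimal sketch in Python; the environment variable names, server, and database names are invented for illustration, not anything from the vendor's product:

```python
import os


def build_connection_string(server: str, database: str) -> str:
    """Assemble a SQL Server connection string from externally supplied
    credentials, instead of relying on a password baked into the product."""
    user = os.environ.get("DB_USER", "sa")
    password = os.environ.get("DB_SA_PASSWORD")
    if password is None:
        # Fail loudly rather than silently falling back to a vendor default.
        raise RuntimeError("DB_SA_PASSWORD is not set")
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID={user};PWD={password}"
    )
```

With credentials externalized like this, rotating the password is a server-side change plus an environment update, and no application release is required.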
<br />It's only been a decade since we first asked the obvious question "Can we change our SQL Server SA password without breaking your application?"<br /><br />I guess we finally can.<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-47126680564750799842012-05-12T07:16:00.000-05:002020-01-05T10:11:48.998-06:00A letter to our Apple Account ExecA couple of days ago a colleague and I ran into our Apple account exec. The conversation ended up in the security space, as is probably appropriate considering Apple's recent performance in that area. Our account exec quickly followed up with a request for our contact information (good), a press-release-style announcement on how much more secure Safari 5.1.7 was going to be (interesting), and a month-old article on how to remove Flashback (amusing).<br />
<a name='more'></a><br />
I figured he was missing the point of our conversation. Here's my reply:<br />
<br />
<hr />
<blockquote class="tr_bq">
<div style="color: black; direction: ltr; font-family: Tahoma; font-size: 10pt;">
<div id="divSent">
<span id="spnSent">Thursday, May 10, 2012 8:31 AM</span></div>
</div>
</blockquote>
<div style="color: black; direction: ltr; font-family: Tahoma; font-size: 10pt;">
<blockquote>
***** -
<br />
<br />
The context of our conversation was really strategic, not tactical. The short-term issue of a specific malware incident isn't important. (We knew about Flashback and how to remove it shortly after it was discovered.)<br />
<br />
What is an interesting discussion is Apple's strategic, corporate-wide attitude towards enterprise desktop security and desktop management, and the question of whether or not Apple, as a corporation, will step up to the plate with world-class proactive and reactive management of the security of OS X when they are subjected to the same sort of focus from dedicated, highly capable attackers that MS has been subjected to over the last decade or so.
<br />
<br />
The things on my radar:<br />
<br />
- Apple is consistently slow to patch compared to their peers. They
were last to the plate with the latest Java fix. That's not a good sign.
<br />
<br />
- Apple still asserts that they are 'more secure' than their peers, yet
offers no specific technical backing for the assertions. That's not a
good sign.
<br />
<br />
- In past security performance, OS X has fared well compared to its peers. However, it doesn't appear as though its performance is due to superiority in design or execution. OS X has fallen first and hardest at browser-hacking contests over the last handful of years, an indication that there is no inherent superiority to OS X in either design or execution. Apple's past performance is likely good because nobody bothered to attack them. That has/is/will change. Apple's ability to manage itself when it is the target of the world's best hackers is untested. <br />
<br />
- Apple fumbled badly when they did have a major incident (the delayed
response and the number of patches that it took for them to clean up
the last malware incident.) That's an indicator of immaturity in the
general space of incident handling.
<br />
<br />
- Apple insists on mixing routine bug fixes with security updates (today's patch, for example, is both security and bug fix). We prefer to be able to separate patches that are security-only (we can rush them out) from patches that may affect the stability of existing applications (we can test them thoroughly). That's really a best practice across any system.
<br />
<br />
- The short product support cycle of OS X. Apple's peers provide security patches for operating systems that are quite old in comparison to Apple's. An OS version that the manufacturer does not support with security updates for roughly five years after last customer ship is hard to manage in the enterprise space. Unless that changes, we'll have lots of unsupported, insecure OS X installs years from now, as support will have ended while the systems are still in use.
<br />
<br />
- I obviously don't have any inside knowledge of OS X browser or
kernel, but the fact that I have to re-boot the kernel when updating the
browser is an indicator that the browser is tightly coupled to the
kernel (for good reason, no doubt) but also is an
indicator that any vulnerabilities in the browser have a fair chance of
affecting the kernel. Recent browser-cracking competitions have shown
that to be true, AFAIK.<br />
<br />
- A really dumb, brain-dead mistake like the latest 'store passwords in the clear when upgrading an encrypted file system' is an indicator of immature processes in Apple's internal software development and deployment. If that happens more than once, not only will we have an indicator of immaturity, but we'll also have an indicator that Apple can't learn from its mistakes. (Adobe, for example, seems to make the same mistakes over & over again.)<br />
<br />
Again - specific incidents are not really interesting unless I perceive them as indicators of larger, more persistent problems.
<br />
<br />
I have to close up my 11" Air home computer, buzz into work and light up my 11" Air work computer. ;-)<br />
<br />
--Mike</blockquote>
<hr tabindex="-1" />
<br />I don't know if Apple is ten years behind Microsoft on desktop security or not. I'm pretty sure, though, that there is nothing about OS X (or any other desktop operating system) that is inherently superior such that we can afford to ignore the fundamentals of desktop security. <a href="https://www.zdnet.com/blog/security/cross-platform-malware-exploits-java-to-attack-pcs-and-macs/11739">This exploit</a>, for example, is platform-neutral. <br /><br />Apple has <a href="https://blog.lastinfirstout.net/2012/04/apple-joins-big-leagues.html">joined the big leagues</a>. We'll soon find out how well they play. <br /><br />I'm also pretty sure that even if we do rigorously follow security best practices, we'll still be doing our banking from botted desktops.<br /><br />If it can surf the Internet, it cannot be secured.</div>
<div style="color: black; direction: ltr; font-family: Tahoma; font-size: 10pt;">
</div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-77305809461742963642012-05-01T20:13:00.001-05:002020-01-04T19:52:47.958-06:00OT: A plan.<br /><br /><div>
Aaron Smith posted <a href="https://plus.google.com/104314192726025938964/posts/4GjXTPTzSJx">this story</a> about the kindness of an NYC cab driver. It's a good read, and it reminds me of something vaguely similar that happened to me a few decades ago.</div>
<div>
<br /></div>
<div>
I had just moved 400 miles from home to a small town in Minnesota near where my grandfather's sister had moved in the 1930s. He didn't get to see her very often, so when I moved near her farm he had an excuse to make the trip.</div>
<div>
<a name='more'></a><br /></div>
<div>
The house I bought needed a ton of work, my youngest brother needed an excuse to skip high school, and my grandfather needed a ride to Minnesota, so I ended up with a couple of hard-working helpers one weekend or so per month. Good deal for me.</div>
<div>
<br /></div>
<div>
One weekend my grandfather insisted on coming out to Minnesota. He had just been out a few weeks earlier, I didn't need any help, and my brother wasn't enthused about another road trip.</div>
<div>
<br /></div>
<div>
He insisted.</div>
<div>
<br /></div>
<div>
They made the trip.</div>
<div>
<br /></div>
<div>
While he was helping me strip wallpaper that weekend, I noticed that he was hitting the <a href="http://www.nlm.nih.gov/medlineplus/druginfo/meds/a601086.html">nitro pills</a> pretty regularly. He'd had a heart attack about ten years earlier but, all things considered, appeared to be in fairly good condition. I asked him about the nitro but didn't get much of a response.</div>
<br />As he normally did, on that weekend he visited the sister who lived nearby. On the way home, though, he asked my brother to detour a couple hours out of the way to visit his other sister. That was new.<br /><div>
<br /></div>
<div>
The day after he got home from the road trip he went outside and did the one thing that he'd not done since his heart attack. He split wood - not with a powered splitter, but with an ax, the old-fashioned way, just as he'd done from the time he was a kid up until his heart attack a decade earlier.</div>
<br />He died splitting that wood.<br /><div>
<br /></div>
<div>
I'm pretty sure that he had a plan.</div>
<br /><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-44687627600432858052012-04-15T10:24:00.001-05:002023-10-17T14:37:27.260-05:00Apple joins the big leagues<div class="p2">
I've been hearing 'OS X is secure' for a decade now. For a decade, I've been challenging that assertion.<br />
<br />
The challenges to that assertion generally end up with a response of<span class="Apple-converted-space"> </span>'because it's Unix' or 'because it's not Microsoft'. I don't recall 'OS X is secure' assertions being backed up by detailed explanations of anything in the kernel, operating system, development tools or coding practices that assures a higher level of security than competing operating systems, and I don't hold that a Unix history automatically ensures a more secure platform. My first forensic examinations were Unix, not Windows, and I can easily assert that the reason we have more compromised Windows servers and desktops is simply that we have more Windows servers and desktops.<span class="Apple-converted-space"> </span></div>
<div class="p1">
<br /></div>
<div class="p2">
Unfortunately the 'OS X is more secure' fantasy has left some (or many) with the impression that they don't need to practice safe computing on Macs. It is OK to run as admin. Anti-virus is not necessary. Drivebys are a Microsoft problem. In my opinion the smoke and mirrors surrounding 'OS X is secure'<span class="Apple-converted-space"> </span>have also led to complacency on Apple's part. They are not as aggressive at implementing security-related operating system improvements (such as ASLR) or routine security patches, nor have they implemented the really basic security controls that I implemented more than twenty years ago on our NetWare servers (remove the execute permission from directories that contain user data, remove the create/write permission from directories that contain executable code). With the latest attacks on OS X applications and with Apple's apparent inability to defend its operating system against <a href="http://www.zdnet.com/blog/bott/new-mac-malware-epidemic-exploits-weaknesses-in-apple-ecosystem/4726" rel="nofollow" target="_blank">drive-by vulnerabilities</a> in third-party software, the 'OS X is secure' attitude <strike>should</strike> must change. A <a href="http://www.zdnet.com/blog/security/over-600000-macs-infected-with-flashback-trojan/11345" target="_blank">half million</a> users can't be wrong, and those users will eventually move past their <a href="http://www.circleid.com/posts/20120412_mac_hit_by_another_wave_of_malware_users_in_denial/" rel="nofollow" target="_blank">denial phase</a> and expect Apple to step up to the plate.<span class="Apple-converted-space"> </span></div>
<div class="p1">
<br /></div>
<div class="p2">
Apple will have to up their game a bit on incident response, too. An <a href="http://krebsonsecurity.com/2012/04/urgent-fix-for-zero-day-mac-java-flaw/" rel="nofollow" target="_blank">urgent fix</a> for a <a href="http://www.oracle.com/technetwork/topics/security/javacpufeb2012verbose-366319.html" rel="nofollow" target="_blank">months-old vulnerability</a> followed by a fast-tracked effort to provide a malware removal tool, resulting in <a href="http://support.apple.com/kb/HT1222" rel="nofollow" target="_blank">three updates in ten days</a>, doesn't leave me with the impression that they have a well-oiled response machine. Apple will feel the heat that has been directed at Microsoft over the last decade (and at Unix systems before that). Hopefully they will learn from their competitors and react to the new landscape better and faster than their peers did.<span class="Apple-converted-space"> </span></div>
<div class="p1">
<br /></div>
<div class="p2">
Apple can't blame Sun either. The vulnerability of Java is well known (as are the vulnerabilities of Flash, Reader, Safari, Firefox…). Apple also has had plenty of opportunity to learn from their own mistakes, having repeatedly offered multiple versions of <a href="http://support.apple.com/kb/HT5016" rel="nofollow" target="_blank">vulnerable desktop software to their customers</a>.</div>
<div class="p2">
<br /></div>
I figure that it'd be pretty boring surfing the web with a platform that isn't exposed to drivebys and remote root exploits, so I never really embraced OS X as my preferred home desktop. Now that OS X is playing in the big leagues, I figure that it is sufficiently challenging for me to use it as my preferred desktop, and I went out and bought an 11" Air for my home computer.<br />
<br />
<b>Update 2012-05-11:</b> Apple accidentally <a href="http://www.zdnet.com/blog/security/apple-security-blunder-exposes-lion-login-passwords-in-clear-text/11963" target="_blank">logs passwords in clear text</a>. In football (soccer) that would be an "<a href="http://en.wikipedia.org/wiki/Own_goal" rel="nofollow" target="_blank">own goal</a>". A major league fail. <br />
<br /><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-4648396427479093962012-03-29T07:02:00.001-05:002020-01-04T19:57:24.331-06:00Twenty percent of all households have at least one bot-infected computer...and 5% of all enterprise 'assets' are infected. <br /><br />From <a href="http://www.circleid.com/members/5583/">Gunter Ollmann</a>, VP of Research at Damballa in <a href="http://www.circleid.com/posts/20120326_household_botnet_infections/">this post on CircleID</a>:<blockquote class="tr_bq">
<i>"...on average, between 3-7% of assets within enterprise networks are identified as being infected..."</i></blockquote>
<blockquote class="tr_bq">
<i>"Within
the ISP/Telco world that have chosen to deploy the Damballa CSP
product, between 18-22% of unique subscriber IP addresses are actively
seeking to connect to known C&C servers."</i></blockquote>
Ouch.<br /><br />Note that this is bot-net infections only, not the broader category of computers infected with malware in general. <br /><br />When I first started securing systems a couple decades ago there were no external threats. We had NetWare, IPX and Arcnet. The only path to a compromise of confidentiality or integrity originated on a keyboard within the campus. The threat to our systems was from the inside, and the risk from insiders was mitigated by the assumption that we'd be able to pin actions initiated at a keyboard inside our buildings to an individual and that the individual would know that the actions would be traceable. It wasn't foolproof - you routinely read about employees misappropriating employers' funds - but as far as I know, it was a manageable problem.<br /><br />Then we connected our wonderful safe little island to the Internet. It didn't take long to figure out that an action by an outsider, external to our island, was a threat to our systems. The solution? Firewalls, of course. If the outsider can't get in, we can focus on the threat from the inside where we know who is at the keyboard, where they know that we know, and where they know that detection and prosecution is a likely outcome.<br /><br />Today? Unlike years ago, we cannot associate the actions of a keyboard with the individual sitting at the keyboard. This effectively means that what used to be external is now internal, and what has always been internal is now external. What used to be a fairly clear delineation between something that happened from the outside and something that happened internally is gone. 
We can no longer assert that we know who is at any particular keyboard, and tracing an event back to an internal keyboard doesn't permit us to presume that the action was initiated by a person internal to the organization.<br /><br />The external threat is inside your enterprise.<br /><br /><div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-71333098986413472862012-08-07T19:29:00.001-05:002012-08-09T07:32:55.850-05:00Microsoft and its partners seize servers...Microsoft press release on their Zeus botnet server seizure:<br />
<br />
"This disruption was made possible through a successful pleading before the U.S. District Court for the Eastern District of New York, which <b><i>allowed Microsoft and its partners to conduct a coordinated seizure </i></b>of command-and-control servers running some of the worst known Zeus botnets."<br />
<a name='more'></a><br />
"As a part of the operation, on March 23, <b><i>Microsoft and its co-plaintiffs, escorted by the U.S. Marshals, seized command and control servers</i></b> in two hosting locations, Scranton, Pa., and Lombard, Ill., to seize and preserve valuable data and virtual evidence from the botnets for the case."<br />
<br />
Emphasis is mine.<br />
<div>
<br /></div>
<div>
From the actual seizure order:</div>
<br />
"There is good cause to believe that the Defendants have engaged in…<b><i>Trademark Infringement, False Designation of Origin, and Trademark Dilution</i></b>…" <br />
<br />
Emphasis is mine.<br />
<br />
So if I'm reading this correctly, Microsoft seized the servers, not federal law enforcement. Individuals who work for a corporation, not law enforcement agents who report to elected officials, executed the seizure. A corporation has, with the permission of a court and while escorted by law enforcement, seized property using (among other things) Trademark Infringement as a justification. <br />
<br />
Kudos to Microsoft for taking bold action. A large corporation like Microsoft can put far more resources into something like this than law enforcement. (The best-funded crime lab in my home state is at the home offices of a large nationwide retailer, not at a government facility.) <br />
<br />
But we should stop and consider if we really want corporations leading a law enforcement action.<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-28156775115840747642012-03-08T19:53:00.002-06:002023-03-18T13:35:48.197-05:00I thought I had this privacy thing figured out, but……maybe not. <br /><br />I’m trying out the Collusion plugin for Firefox and the results are interesting. After a couple evenings of my normal surfing routine, the plugin looks like:<br /><br /><a href="http://lh4.ggpht.com/-p2aiC-CSvEU/T1lih8MHbgI/AAAAAAAABQg/CG29P_k8hgw/s1600-h/Collusion-Plugin%25255B8%25255D.png"><img border="0" src="https://lh6.ggpht.com/-_BPkN6Fx5bI/T1liipZGuLI/AAAAAAAABQo/uarhgU3BF9o/Collusion-Plugin_thumb%25255B4%25255D.png?imgmax=800" /></a><br /><br /><a name='more'></a>Yuk.<br /><br />As expected, Google appears at or near the center of attraction.<br /><br /><a href="http://lh5.ggpht.com/-XQ8SNQyUf10/T1lijGBjkHI/AAAAAAAABQw/DQQjWnruC8g/s1600-h/Collusion-Google%25255B11%25255D.png"><img border="0" src="https://lh3.ggpht.com/-N2Y4ezfAoD0/T1lijoe3kXI/AAAAAAAABQ4/o4QwqmvcFDY/Collusion-Google_thumb%25255B5%25255D.png?imgmax=800" /></a><br /><br />I use the Google suite for anything related to my profession and I use Google’s competition for anything unrelated to my role as an IT professional. My theory is that as a public employee in Minnesota, pretty much everything I do professionally is public anyway, so I figure that there is no net loss to using the Google stack. 
The Collusion plugin shows that I’m merging the two realms far more than I thought.<br /><br />Also unexpected are several domains that I’ve never heard of, including something called imrworldwide:<br /><br /><a href="http://lh5.ggpht.com/-OZyPe2LrpXI/T1likWPWT6I/AAAAAAAABRI/5lSSfE_ICpg/s1600-h/Collusion-IMRworldwide%25255B15%25255D.png"><img border="0" src="https://lh5.ggpht.com/-PJ8M2HHYGLE/T1likxQQyWI/AAAAAAAABRQ/D1Jdr_veIz4/Collusion-IMRworldwide_thumb%25255B7%25255D.png?imgmax=800" /></a><br /><br />I have no idea who they are, but they know more about me than I’d like. <br /><br />I use Adblock Plus and NoScript plugins and I accept third party cookies, but I clear all cookies each time I close Firefox (once every few weeks), so I’ve assumed that I’m less ‘connectable’ than the typical surfer.<br /><br />It looks like I’m not as segmented as I thought. I’ve added ‘Antisocial’ and ‘Adversity’ block lists to Adblock Plus.
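For readers who want to target a specific tracker directly rather than subscribe to a whole list, Adblock Plus also accepts individual filter rules. A rule like the following (written in the standard ABP filter syntax; the domain is the imrworldwide tracker shown above) blocks third-party requests to that domain:

```
||imrworldwide.com^$third-party
```

The `||` prefix matches the domain and its subdomains regardless of scheme, and the `$third-party` option limits the block to requests made from pages on other domains.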
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-38117254387585320092012-02-16T07:25:00.001-06:002020-01-04T20:00:01.908-06:00I’ve always wondered how many vulnerable devices…<br /><br />are out there. <br /><br /><a href="http://lh6.ggpht.com/-HIRc35qlPdg/Tz0D4YaPbPI/AAAAAAAABP0/VGnNmbcPOpk/s1600-h/3-billion-devices%25255B10%25255D.png"><img border="0" src="https://lh4.ggpht.com/-HqzVC2fwcCc/Tz0D43Z_3FI/AAAAAAAABP8/gaUwQ9_Z24M/3-billion-devices_thumb%25255B6%25255D.png?imgmax=800" /></a><br /><br />Now I know.
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.comtag:blogger.com,1999:blog-4806502804647119766.post-69901529028011511412012-01-30T20:28:00.001-06:002020-01-04T20:01:31.823-06:00Oracle Support portal: HTML 5 replaces FlashOracle Support is upgrading their web interface from Flash to HTML5. I’m happy. I no longer have to twiddle my thumbs waiting for Flash to load:<br /><br /><br /><a href="http://lh5.ggpht.com/-EVNC5XJzqF8/TydRuXuY_kI/AAAAAAAABPE/2Aad3sQVOn0/s1600-h/Oracle-Flash%25255B4%25255D.png"><img border="0" src="https://lh3.ggpht.com/-aNy8_nGGLR4/TydRunwt-6I/AAAAAAAABPM/-wDJ2x4rHU8/Oracle-Flash_thumb%25255B2%25255D.png?imgmax=800" /></a><br /><br /><a name='more'></a>That was really annoying. The consolation prize was that the Flash UI was still two orders of magnitude faster than the call back from support on a Sev 1, so the Flash interface really didn’t affect MTTR.<br /><br />My major complaints about the Flash interface were: <br /><div>
<br /></div>
<div>
Managing Flash plugins on critical data center servers & management infrastructure. Adobe simply has not been able to keep Flash from being exploited, so having to rely on an exploitable plugin for daily operations never made me comfortable. It is really nice to be able to gather data on an incident and upload it directly to Oracle but that meant that the database management infrastructure had to have Flash plugins along with the associated risk/cost of an exploitable plugin. </div>
<br />Slow and unreliable. When I log into the Flash-based support site, I typically need to reload the Flash app at least once, usually at the 90% marker. The new HTML5 interface is faster than Flash and doesn’t hang on startup. <br /><br />Not tab-aware. What could be more natural than opening up multiple SRs at once, each in its own tab? How about being able to search & open up each result in a separate tab? Or being able to put an SR and its associated bugs side by side? The Flash UI couldn’t handle more than one tab. It excelled at making every interaction with the interface strictly linear.<br /><div>
<br /></div>
<div>
Unfortunately, what’s out there today still isn’t tab-aware. In IE I don’t get a right-mouse menu at all, and if I try opening new tabs in Firefox, I end up with:</div>
<br /><br /><a href="http://lh6.ggpht.com/-8LR5NLCKfdw/TydRu5OXRMI/AAAAAAAABPU/eAVwZ31PlPc/s1600-h/Oracle-403%25255B2%25255D.png"><img border="0" src="https://lh5.ggpht.com/-wbAQxNlVX-8/TydRvI_9CII/AAAAAAAABPc/3f3PJtOSQRM/Oracle-403_thumb.png?imgmax=800" /></a><br /><br />However – if I’m viewing an SR and I right-click on the printer icon, I can display the SR in a standalone tab. That helps. I still can’t open an SR alongside its associated bugs, though.<br /><br />I suspect that Oracle’s lead UI designers are constrained by strictly linear thinking. It probably never occurs to them that a user might work on more than one problem at a time, or that a user might want to view both SRs and bugs at the same time. Or maybe Oracle has a corporate policy that prohibits two-button mice and browsers with tabs.<br /><br />FWIW - In the process of playing with tabs, I also ended up here:<br /><br /><a href="http://lh5.ggpht.com/-0RzR6pqBYBU/TydRvmgzdLI/AAAAAAAABPk/zDsQb2XjZ1g/s1600-h/Oracle-NullPointer%25255B4%25255D.png"><img border="0" src="https://lh6.ggpht.com/-J-li07zhHJo/TydRv1HnsTI/AAAAAAAABPs/hQDW2uszrXE/Oracle-NullPointer_thumb%25255B2%25255D.png?imgmax=800" /></a><br /><div>
<br /></div>
<div>
Amusing.</div>
<div class="blogger-post-footer"><p>
---
</p></div>Michael Jankehttp://www.blogger.com/profile/00357905802460949707noreply@blogger.com