Non-functional Requirement - Software Resiliency

Category: Resiliency

Context: Software

Goals: When an operating system or runtime instance fails, recovery of the failed component must be treatable as routine system maintenance rather than as a service-affecting outage or emergency.

Rationale: If the availability of the system is sufficiently critical, the MTTR must not depend on the response time of staff, manual recovery processes, or the clock time required to diagnose the failure. The availability of the system must therefore be decoupled from the availability of any single software component or any individual staff member.

Requirement: Failure of a single operating system or runtime instance shall not cause user-detectable loss of business functionality for an elapsed time greater than Metric. After an elapsed time no longer than Metric, the user shall be able to continue using business functionality.

Metric:

Level A:

A1. The user-detectable loss of business functionality will be no more than one minute.
A2. The user will receive a visual indicator of the status of the in-flight transaction.
A3. Business functionality will be available to the user without re-authentication.
A4. The user application context will be preserved, restored or recovered.

Level B:

B1. The user-detectable loss of business functionality will be no more than ten minutes.
B2. No more than the single most recent in-flight transaction will be lost.
B3. Business functionality will be available to the user after re-authentication.
B4. The system will continue to meet non-functional requirements other than resiliency requirements.

Level C:

C1. The user-detectable loss of business functionality will be no more than one business day.
C2. No more than the most recent one business day of data modifications will be lost.

Level D:

D1. The recovered system will meet all pre-failure functional and non-functional requirements.
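
As an illustration only, the duration ceilings in A1, B1 and C1 can be expressed as a small lookup that reports the strictest level a measured outage would satisfy. This is a minimal sketch, not part of the requirement: the names BUSINESS_DAY, LEVEL_LIMITS and strictest_level_met are hypothetical, and the eight-hour business day is an assumption, since the requirement does not define the length of a business day. Only the duration criterion is checked; the remaining criteria of each level (for example A2 to A4) must be verified separately.

```python
from datetime import timedelta

# Assumption: one business day is modelled as 8 working hours.
# The requirement itself does not define this value.
BUSINESS_DAY = timedelta(hours=8)

# Duration ceilings taken from A1, B1 and C1, strictest first.
# Level D states no duration ceiling, so it is the fallback.
LEVEL_LIMITS = [
    ("A", timedelta(minutes=1)),
    ("B", timedelta(minutes=10)),
    ("C", BUSINESS_DAY),
]


def strictest_level_met(outage: timedelta) -> str:
    """Return the strictest level whose duration ceiling covers the outage."""
    if outage < timedelta(0):
        raise ValueError("outage duration cannot be negative")
    for level, limit in LEVEL_LIMITS:
        if outage <= limit:
            return level
    return "D"
```

For example, strictest_level_met(timedelta(minutes=5)) returns "B", because five minutes exceeds the Level A ceiling but is within the Level B ceiling.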

Scale: Duration

Implications: If this requirement is not met, the organization will incur decreased availability of systems, decreased flexibility for hosting and system management, and increased frequency and duration of unplanned outages.

Applicability: See Enterprise Requirements Framework

Tags: Resiliency, Software

Status: Approved, Requirement

Author: <Author>

Revision: <Revision>
 
Note: 

This requirement incorporates the traditional concepts of redundancy, clustering, load balancing and fault tolerance. A system's availability, recovery point objective (RPO) and recovery time objective (RTO) are derived from this and other requirements.

In general, the designer should consider the Resiliency – Software and Resiliency – Hardware NFRs as a unit and engineer for both in concert. In particular, the software must be designed to gracefully manage both software and hardware failures using robust transaction management and error handling. Failure modes and failure domains must be well understood.
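
One common way to handle the failure of a single runtime instance gracefully is to route the request to a redundant instance. The following Python sketch shows that idea under stated assumptions: InstanceUnavailable and call_with_failover are hypothetical names, and each instance is modelled as a plain callable rather than a real network client. It is not a prescribed design.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class InstanceUnavailable(Exception):
    """Hypothetical error raised when a runtime instance cannot serve a request."""


def call_with_failover(instances: Sequence[Callable[[], T]]) -> T:
    """Try each redundant instance in order and return the first successful result."""
    last_error: Optional[InstanceUnavailable] = None
    for call in instances:
        try:
            return call()
        except InstanceUnavailable as exc:
            # Remember the failure and fall through to the next instance.
            last_error = exc
    # Every instance failed (or none was supplied): surface one clear error.
    raise InstanceUnavailable("all instances failed") from last_error
```

Re-sending a request to a second instance is only safe when the operation is idempotent or protected by transaction management, so that a request that partially completed on the failed instance is not applied twice.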

For more information, see NFR Summary