
Wide Area Network Outage Analysis

The following is a brief analysis of unplanned network outages on a large statewide network of approximately 70 sites, with bandwidths ranging from DS3 to GigE. The data may be of interest to anyone who needs to estimate the expected availability of a wide area network.

The network is a standard core/hub/spoke/leaf design. The core is fully redundant. The hubs have redundant circuits connecting to multiple hubs or cores, redundant power, and partially redundant hardware. The spoke and leaf sites are non-redundant.

The source of the data was a shared calendar where outages were recorded as calendar events. The data was gathered at analysis time and is subject to omissions and misinterpretation; errors are likely undercounts.

Raw data, by approximate cause

  • 88 Total Outages
  • 290 Total Hours of Outage
  • 2 years of calendar time
Failures by type and duration

Cause              # of Incidents   Percent   # of Hours   Percent
Circuit Failures         34            39%        168         58%
Equip Failures           24            66%         60         79%
Power Failures           22            91%         53         97%
Unknown                   5            97%          7         99%
Other                     3           100%          2        100%
Total                    88                       290

Column Definitions
# of Incidents = Raw count of outages affecting one or more sites
# of Hours = Sum of duration of outages affecting one or more sites
Percent = Cumulative percentage of the corresponding column

Cause Definitions
Circuit Failures = Failures determined to be circuit related, primarily fiber cuts
Equip Failures = Failures determined to be router, firewall, or similar hardware
Power Failures = Failures where site power was the cause of the outage
Unknown = Failure cause undetermined or information missing
Other = All other failures

Pareto Chart - Number of Incidents

A visual representation of the failures shows the causes ranked by number of outages. If I remember my statistical process control training from 20 years ago, a Pareto chart is the correct representation for this type of data. The chart shows outage cause on the X-axis, outage count on the left Y-axis, and cumulative percent of outages on the right Y-axis.
[Figure: Pareto chart - outages by cause, number of incidents]
Using the Pareto 80/20 rule, solving circuit failures resolves about 40% of outages by count. Solving equipment failures resolves another 27%, and solving power failures resolves another 25% of outages.
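
For anyone who wants to reproduce the charts, here is a minimal sketch in Python (pandas and matplotlib assumed; the function and variable names are mine, not from the original analysis). The counts come from the table above, and the column argument selects incidents or hours, so the same function produces both Pareto charts.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Failure counts taken from the table above (a sketch, not the original analysis script).
data = pd.DataFrame({
    "cause": ["Circuit Failures", "Equip Failures", "Power Failures", "Unknown", "Other"],
    "incidents": [34, 24, 22, 5, 3],
    "hours": [168, 60, 53, 7, 2],
})

def pareto_chart(df, column, title):
    """Bar chart of counts with a cumulative-percent line on a second axis."""
    df = df.sort_values(column, ascending=False)
    cumulative_pct = df[column].cumsum() / df[column].sum() * 100

    fig, ax_count = plt.subplots()
    ax_count.bar(df["cause"], df[column])
    ax_count.set_ylabel(column)
    ax_count.set_title(title)

    ax_pct = ax_count.twinx()              # right-hand axis: cumulative percent
    ax_pct.plot(df["cause"], cumulative_pct, marker="o", color="tab:red")
    ax_pct.set_ylabel("cumulative percent")
    ax_pct.set_ylim(0, 105)

    plt.show()

pareto_chart(data, "incidents", "Outages by Cause - Number of Incidents")
pareto_chart(data, "hours", "Outages by Cause - Hours of Outage")
```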

Power failures are probably the least costly to resolve. Long-running UPSes are inexpensive. The individual sites supply power and UPS for the network equipment at the leaf sites, and those sites have variable configurations for power and UPS run times. The area has frequent severe weather, so power outages are common.

Circuit failures are the most expensive to solve. Circuits have high ongoing costs compared to hardware. The sites are already served by the lowest-cost available carrier, so redundant or protected circuits tend to cost more than the primary circuit. Circuit failures also appear to be more frequent in areas with rapid housing growth, construction, and related activity. For fiber paths provisioned above ground, storm-related failures are common.

Pareto Chart - Hours of Outage

A representation of total outage duration in hours by cause is also interesting.
[Figure: Pareto chart - outages by cause, hours of outage]
When considering the total number of hours without service, the causes occur in the same relative order. Solving circuit failures resolves nearly 60% of the total outage hours. Circuit outages account for a disproportionate share of total outage duration, likely because circuit failures take longer to resolve (MTTR is higher).

Availability Calculations

The network is composed of approximately 70 sites (the number varies over time). The time frame of the raw data is approximately two years. The numbers are approximations.

Outage Frequency:

  • 70 sites * 2 years = 140 site-years.
  • 88 outages / 140 site-years ≈ 0.6 outages per site per year.
  • 140 site-years / 88 outages ≈ 1.6 years MTBF per site.
Sites should expect to have slightly less than one unplanned outage per year on average, over time. Caution is advised, as the nature of this calculation precludes using it to predict the availability of a specific site.
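
The frequency arithmetic above, expressed as a small Python sketch (variable names are illustrative):

```python
# Outage-frequency arithmetic from the raw data above (a sketch).
sites, years, outages = 70, 2, 88

site_years = sites * years                     # 140 site-years
outages_per_site_year = outages / site_years   # ~0.63 outages per site per year
mtbf_years = site_years / outages              # ~1.6 years between outages per site

print(site_years, round(outages_per_site_year, 2), round(mtbf_years, 1))
```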

Outage Duration:

Availability is calculated simply as

(Hours Actually Available)/(Hours Possibly Available)
  • 70 sites * 2 years * 8760 hours/year = 1,226,400 hours possible
  • 1,226,400 hours - 290 hours = 1,226,110 hours actually available
  • 1,226,110 hours / 1,226,400 hours ≈ 99.98% availability.
Availability on average should be three nines or better.
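
The same availability arithmetic, again as a sketch, using the 290 outage hours from the table above:

```python
# Availability arithmetic (a sketch).
sites, years, outage_hours = 70, 2, 290

possible_hours = sites * years * 8760          # 1,226,400 hours possible
available_hours = possible_hours - outage_hours
availability = available_hours / possible_hours

print(f"{availability:.4%}")                   # ~99.9764%
```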

This syncs up fairly well with what we've observed intuitively for sites with non-redundant networks. Our seat-of-the-pants rule is that a non-redundant site should expect about 8 hours of unplanned outage per year. We assume that Murphy's Law will put the failure on the most critical day of the year, and we expect that areas with rapid housing development or construction will have more failures.

This is also consistent with service provider SLAs. In most cases, our providers offer 99.9% availability SLAs on non-redundant, non-protected circuits.
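
As a quick aside (my arithmetic, not part of the original analysis), a 99.9% SLA implies roughly 8.8 hours of allowed downtime per year, which is in the same neighborhood as the 8-hour rule of thumb above:

```python
# Allowed downtime implied by a 99.9% availability SLA (an added aside).
sla = 0.999
allowed_downtime_hours = (1 - sla) * 8760      # ~8.8 hours per year
print(round(allowed_downtime_hours, 1))
```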

A uniquely regional anomaly is the seasonal construction pattern in the area. Frost depth makes most underground construction cost-prohibitive for five months of the year, so construction-related outages tend to be seasonal.

The caveat, of course, is that some sites may experience much higher or lower availability than others.

Related posts: Estimating the Availability of Simple Systems

Comments

  1. BTW, Michael,

    I saw this today and thought of this post:

    http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

    Good numbers regarding failure per machine at Google. I didn't know if you'd seen it or not.

  2. Matt - I saw that, and was surprised at how high the failure rate was. They have the 'smart software, cheap hardware' mantra though, so it must work out.

    Our software isn't smart. ;)

  3. I can't afford the volume of hardware they must order at a time to get it as cheap as they do, so it wouldn't work for me anyway. :-)


