Tuesday, October 21, 2008

Wide Area Network Outage Analysis

The following is a brief analysis of unplanned network outages on a large statewide network with approximately 70 sites at bandwidths from DS3 to GigE. The data might be interesting to anyone who needs to estimate the expected availability of wide area networks.

The network is a standard core, hub, spoke and leaf design. The core is fully redundant. The hubs have redundant circuits connecting to multiple hubs or cores, redundant power and partially redundant hardware. The spoke and leaf sites are non-redundant.

The source of the data was a shared calendar where outages were recorded as calendar events. The data was gathered at analysis time and is subject to omissions and misinterpretation. Any errors are most likely undercounts.

Raw data, by approximate cause

  • 88 Total Outages
  • 290 Total Hours of Outage
  • 2 years calendar time

Failures by type and duration

Cause               # of Incidents   Percent   # of Hours   Percent
Circuit Failures                34       39%          168       58%
Equip Failures                  24       66%           60       79%
Power Failures                  22       91%           53       97%
Unknown                          5       97%            7       99%
Other                            3      100%            2      100%

Total                           88                    290

Column Definitions
# of Incidents = Raw count of outages affecting one or more sites
# of Hours = Sum of duration of outages affecting one or more sites
Percent = Cumulative Percentage of corresponding column
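
As a cross-check on the Percent columns, here is a minimal Python sketch that recomputes the cumulative percentages from the raw counts. The names (causes, incidents, hours, cumulative_percent) are mine, chosen for illustration, and rounding to whole percentages is an assumption about how the table was produced.

```python
from itertools import accumulate

# Counts copied from the table above, already sorted largest first.
causes = ["Circuit", "Equip", "Power", "Unknown", "Other"]
incidents = [34, 24, 22, 5, 3]
hours = [168, 60, 53, 7, 2]


def cumulative_percent(values):
    """Return the running total of values as whole-number percentages of the sum."""
    total = sum(values)
    return [round(100 * running / total) for running in accumulate(values)]


# Print one row per cause: cumulative % of incidents, cumulative % of hours.
for cause, inc_pct, hr_pct in zip(
    causes, cumulative_percent(incidents), cumulative_percent(hours)
):
    print(f"{cause:<8} {inc_pct:>4}% {hr_pct:>4}%")
```

The same running totals are what the right-hand axis of a Pareto chart plots.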

Cause Definitions
Circuit Failures = Failures determined to be circuit related, primarily fiber cuts
Equip Failures = Failures determined to be router, firewall or similar
Power Failures = Failures where site power was cause of outage
Unknown = Failure cause undetermined, missing information
Other = All other failures

Pareto Chart - Number of Incidents

A visual representation of the failures shows causes by number of outages. If I remember my statistical process control training from 20 years ago, a Pareto chart is the correct representation for this type of data. The chart shows outage cause on the X-axis, outage count on the left Y-axis and cumulative percent of outages on the right Y-axis.

[Chart: Outages-Incidents]

Using the Pareto 80/20 rule, solving circuit failures resolves about 39% of outages by count. Solving equipment failures resolves another 27%. Solving power failures resolves another 25% of outages.

Power failures are probably the least costly to resolve. Long-running UPSs are inexpensive. The individual sites supply power and UPS for network equipment at the leaf sites, and their power configurations and UPS run times vary. The area has frequent severe weather, so power outages are common.

Circuit failures are the most expensive to solve. Circuits have high ongoing costs compared to hardware. The sites are already configured with the lowest cost available carrier, so redundant or protected circuits tend to be more costly than the primary circuit. Circuit failures also appear to be more frequent in areas with rapid housing growth, construction and related activity. For fiber paths provisioned above ground, storm-related failures are common.

Pareto Chart - Hours of Outage

A representation of total outage duration in hours by cause is also interesting.

[Chart: Outages-Hours]

When considering the total number of hours without service, the causes occur in the same relative order. Solving circuit failures resolves about 58% of the total outage hours. Circuit outages have a disproportionate share of total outage duration (39% of incidents but 58% of hours), likely because circuit failures take longer to resolve (MTTR is higher).
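
The MTTR remark can be checked roughly from the table by dividing hours by incidents for each cause. This is only a sketch: the names (outages, mean_hours_per_incident) are illustrative, and treating hours per incident as MTTR assumes each calendar entry recorded the full time to restore service.

```python
# (incident count, outage hours) per cause, copied from the table above.
outages = {
    "Circuit": (34, 168),
    "Equip": (24, 60),
    "Power": (22, 53),
    "Unknown": (5, 7),
    "Other": (3, 2),
}


def mean_hours_per_incident(table):
    """Average outage duration per incident for each cause, in hours."""
    return {cause: hrs / count for cause, (count, hrs) in table.items()}


mttr = mean_hours_per_incident(outages)
for cause, value in mttr.items():
    print(f"{cause:<8} {value:.1f} hours per incident")
```

On these numbers the circuit failures come out at roughly twice the average duration of equipment or power failures.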

Availability Calculations

The network is composed of approximately 70 sites (the number varies over time). The time frame of the raw data is approximately two years. The numbers are approximations.

Outage Frequency:

  • 70 sites * 2 years = 140 site-years
  • 88 outages / 140 site-years = 0.63 outages per site per year
  • 140 site-years / 88 outages = 1.6 years MTBF per site

On average, over time, a site should expect roughly 0.6 unplanned outages per year, or about one every 19 months. Caution is advised: this is a network-wide average, and the nature of the calculation precludes using it to predict the availability of a specific site.
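
The same frequency arithmetic as a short Python sketch, carried to a couple more digits. The constant and variable names are illustrative, and the inputs are the approximate figures quoted above.

```python
# Approximate inputs from the post.
SITES = 70
YEARS = 2
OUTAGES = 88

site_years = SITES * YEARS                    # exposure, in site-years
outages_per_site_year = OUTAGES / site_years  # average outage rate per site
mtbf_years = site_years / OUTAGES             # mean time between failures per site

print(f"{site_years} site-years")
print(f"{outages_per_site_year:.2f} outages per site per year")
print(f"{mtbf_years:.2f} years MTBF ({mtbf_years * 12:.0f} months)")
```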

Outage Duration:

Availability is calculated simply as

(Hours Actually Available)/(Hours Possibly Available)

  • 70 sites * 2 years * 8760 hours/year = 1.23m hours possible
  • 1.23m hours - 290 hours = hours actually available
  • (1.23m hours - 290 hours) / (1.23m hours) = 99.98% availability

Availability on average should be three nines or better.
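
For anyone who wants to rerun the numbers, here is a minimal Python sketch of the availability formula. It uses the 290-hour total from the table and assumes each recorded outage hour took down exactly one site; since an outage could affect more than one site, that assumption would overstate availability. The function and variable names are illustrative.

```python
HOURS_PER_YEAR = 8760


def availability(sites, years, outage_hours):
    """Fraction of possible site-hours during which service was available."""
    possible_hours = sites * years * HOURS_PER_YEAR
    return (possible_hours - outage_hours) / possible_hours


possible_hours = 70 * 2 * HOURS_PER_YEAR
avail = availability(sites=70, years=2, outage_hours=290)

print(f"{possible_hours:,} site-hours possible")
print(f"{100 * avail:.3f}% availability")
```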

This syncs up fairly well with what we've intuitively observed for sites with non-redundant networks. Our seat-of-the-pants rule is that a non-redundant site should expect about 8 hours of unplanned outage per year. We assume that Murphy's Law will put the failure on the most critical day of the year, and we expect that areas with rapid housing development or construction will have more failures.

This is also consistent with service provider SLAs. In most cases, our providers offer 99.9% availability SLAs on non-redundant, non-protected circuits.
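
To compare the 8-hour rule of thumb with a 99.9% SLA, it helps to convert between availability and hours of downtime per year. A small sketch, assuming an 8760-hour year; the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def downtime_hours_per_year(availability_percent):
    """Hours of outage per year permitted by a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_hours):
    """Availability percentage implied by a given number of outage hours per year."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


sla_budget = downtime_hours_per_year(99.9)
rule_of_thumb = availability_percent(8)

print(f"99.9% SLA allows {sla_budget:.2f} hours of outage per year")
print(f"8 hours of outage per year is {rule_of_thumb:.2f}% availability")
```

A 99.9% SLA works out to a little under 9 hours per year, which is in the same range as the 8-hour rule.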

One regional anomaly is the seasonal construction pattern in the area. Frost depth makes most underground construction cost-prohibitive for 5 months of the year, so construction-related outages tend to be seasonal.

The caveat, of course, is that some sites may experience much higher or lower availability than others.

Related posts: Estimating the Availability of Simple Systems

4 comments:

  1. BTW, Michael,

    I saw this today and thought of this post:

    http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

    Good numbers regarding failure per machine at Google. I didn't know if you'd seen it or not.

  2. Matt - I saw that, and was surprised at how high the failure rate was. They have the 'smart software, cheap hardware' mantra though, so it must work out.

    Our software isn't smart. ;)

  3. I can't afford the volume of hardware they must order at a time to get it as cheap as they do, so it wouldn't work for me anyway. :-)
