$100 million per mile and no redundancy?

“Light-rail service throughout downtown Minneapolis was halted Thursday for about four hours because of a downed wire that powers the trains from overhead…”

Apparently there is no redundancy.

I’m not thinking about this because I care about the commuters who were stranded, but rather because of how it relates to network and server redundancy and availability.

My group delivers statewide networking, firewalling, ERP and eLearning applications to a couple hundred thousand students and tens of thousands of employees.

  • Availability is expensive
  • We hear about it when our systems suck
  • We have no data that can tell us how much an outage costs. We are a .edu. Our students don’t switch vendors if they can’t access our systems for a few hours.

In that environment, how do you make a cost vs. availability decision?


Years ago (circa 2001) we found a carrier that would offer us an OC-3 (150 Mbps) for what was essentially the same price as the incumbent telco (Qwest) would charge us for two T1s (3 Mbps). The new carrier had less experience with data. They were a cable TV provider.

  • We expected availability to be worse than the incumbent's telco-class T1s.
  • We knew full well that the carrier had a single point of failure on a fiber path through a small town, and if there was a fiber cut in that town, 50,000 students would get knocked off the network. Unlike most carriers, they told us where they were weak.

What decision would you make?

I hesitated, figuring that it’d be best to get feedback from the campuses that were currently starved for bandwidth, so I posed the question at our quarterly CIO meeting.

Me: “If I have to make a choice, would you trade availability for bandwidth?”

CIOs: “Yep.”

(At least that’s how I remember it…)

The carrier delivered. Service was awesome, and availability was at least as good as the expensive incumbent's. We gained confidence in the low-cost carrier's ability to deliver, and year by year we migrated more campuses to the low-cost carrier.

Life was good. We traded 3 Mbps for 150 Mbps without increasing the budget. Hail to the hero (that’d be me).

Fast forward four years. The small town in the path of the low-cost carrier’s non-redundant leg decided to build a road. The town cut the fiber, a dozen campuses disappeared from our network map for 8 hours, and to put it mildly, the campuses were not happy.

It gets worse, though. The carrier patched up the busted fiber late that day, then scheduled a plow crew to come out a week later, bury a new fiber parallel to the old fiber and move us to the new fiber.

Yep, the carrier's crew cut their own fiber. Another half-day outage.

It gets even worse. The carrier didn’t have facilities all the way to our core where we needed them, so they leased a big carrier’s circuit to the big city. A few weeks after the two fiber outages, the big carrier in the big city smoked a network interface.

Outage number three. Let’s all throw rocks at the goat (that’d be me).

The next quarterly meeting wasn’t fun.

Suffice to say that we now make a different calculation on the relative value of bandwidth vs. availability.
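
A rough sketch of the arithmetic behind that kind of calculation, in Python. The function names are hypothetical, the 0.999 figures are placeholder assumptions rather than measurements of any carrier, and the redundant case assumes the two paths fail independently, which a shared conduit or a shared road crew would violate. The only number taken from the story above is the 8 hour outage.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def availability_from_outage_hours(outage_hours):
    """Fraction of a year a service was up, given total outage hours."""
    return 1 - outage_hours / HOURS_PER_YEAR


def series_availability(components):
    """Availability of a path where every component must be up."""
    return prod(components)


def parallel_availability(paths):
    """Availability of redundant paths where any one path is enough.

    Assumes the paths fail independently of each other.
    """
    return 1 - prod(1 - a for a in paths)


def downtime_hours_per_year(availability):
    """Expected hours of outage per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Assumed, illustrative values: a carrier fiber leg chained to a leased circuit.
fiber_leg = 0.999
leased_circuit = 0.999

single_path = series_availability([fiber_leg, leased_circuit])
dual_path = parallel_availability([single_path, single_path])

if __name__ == "__main__":
    print(f"one 8 hour outage in a year: {availability_from_outage_hours(8):.5f}")
    print(f"single path: {single_path:.6f}, "
          f"{downtime_hours_per_year(single_path):.1f} h/yr down")
    print(f"dual path:   {dual_path:.6f}, "
          f"{downtime_hours_per_year(dual_path):.3f} h/yr down")
```

Chaining non-redundant pieces multiplies their availabilities, so every extra leg in the path makes the whole thing a little worse. What the sketch cannot tell you is what an hour of outage costs, and that is still the hard part.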

Back to the broken train. A four-hour outage, a construction cost of a hundred million dollars per mile, and no redundancy?



  1. I suppose sometimes redundancy is tough, if not impossible. The rails themselves are also not redundant (though probably a bit more robust!). I'm trying to think of a way that the overhead wires could be redundant... any ideas?

  2. Eric -

    I'm assuming this would be a case where redundancy would be difficult or expensive. My first thought was that one could have multiple overhead wires for each leg of the circuit, but I'm not sure that would have mattered. If there were redundant pairs of wires and one broke, the train would still have been held up until the broken wire was cleared from the track.

    My unscientific observation is that for some systems (electric power, for example), we accept outages as tolerable, while in other cases (enterprise networks) we treat outages as something intolerable.

    That's just my perception though.

  3. Even redundancy is not always that helpful, as would likely be the case for the train.

    I experienced an outage of redundant fiber circuits running to two different locations. A vandal cut a bundle with 100 circuits miles from our office. Couldn't take out both our circuits, could it? The cut was yards from the building where our circuits began their redundant paths.

    To make it worse, each fiber circuit had a contract specifying the order in which it would be repaired. Nothing much happened until the tech decided to fix circuits as he identified them rather than in contracted order.

