Tuesday, October 27, 2009

No, we are not running out of bandwidth.

The sky is falling! the sky is falling!

Actually, we’re running out of bandwidth (PDF) (again).

Supposedly all the workers who stay home during the pandemic will use up all the bandwidth in the neighborhood. Let me guess, instead of surfing p0rn and hanging out on Reddit all day at work, they’ll be surfing p0rn and hanging out on Reddit all day from home.

The meat of the study:
“Specifically, at the 40 percent absenteeism level, the study predicted that most users within residential neighborhoods would likely experience congestion when attempting to use the Internet”
So what’s the problem?
“..under a cable architecture, 200 to 500 individual cable modems may be connected to a provider’s CMTS, depending on average usage in an area. Although each of these individual modems may be capable of receiving up to 7 or 8 megabits per second (Mbps) of incoming information, the CMTS can transmit a  maximum of only about 38 Mbps.”
Ooops – someone is oversubscribed just a tad. At least now we know how much.

Wait – 40% of the working population is at home, working or sick, bored to death, surfing the web. And they’ll be transferring large documents! Isn’t that what we call the weekend? So is the Internet broke on weekends? If so, I never noticed.

What about evenings? We  have a secondary utilization peak on our 24/7 apps around 10pm local time. That peak is almost exclusively people at home, working. Presumably this new daytime peak will dwarf the late evening peak?

Here’s a reason to panic. If it gets bad enough, the clowns who threw away ten trillion dollars of other peoples money on math they didn’t understand will not be able to throw away other peoples money while telecommuting:
“If several of these large firms were unable or unwilling to operate, the markets might not have sufficient trading volume to function in an orderly or fair way.”
My thought? Slow them down. When they flew at Mach 2, they smacked into a wall and took us with them.

Got to love this:
“Providers identified one technically feasible alternative that has the potential to reduce Internet congestion during a pandemic, but raised concerns that it could violate customer service agreements and thus would require a directive from the government to implement.”
Provider: “Yah know? I’d be cool if we could get the government to make us throttle that bandwidth. Yep, that’d be cool.”

How about Plan B – shut off streaming video:
“Shutting down specific Internet sites would also reduce congestion, although many we spoke with expressed concerns about the feasibility of such an approach.”
Wait – isn’t that what CDN’s are for? The Akamai cache at the local ISP has the content, all that matters is the last mile, right? For reference, with a few hundred thousand students working hard a surfing the web all day, we slurp up about 1/3 of our Internet bandwidth from the local Akamai rack directly attached to the Internet POP, (settlement free) and another 1/3 by peering directly with big content providers (also settlement free).

I’m not worried about bandwidth. If any of this were serious, we’d have been able to detect the effect of 10% unemployment on home bandwidth. Or the Internet would have broke during the 2008 election. Or what-his-names sudden death.

A more interesting potential outcome of a significant pandemic would be the gradual degradation of services as the technical people get sick and/or stay home with their families. I’d expect a significantly longer MTTR on routine outages during a real pandemic.

Would the cable tech show up at my house today, with two people flu’d out? Not if she’s smart.

Note:
  • The report is oriented toward the financial sector. The trades must go on. There are quarterly bonuses to be had.
  • The DHS commented on the draft in the appendices. They’ve attempted to inject a bit of rationality into the report.

Wednesday, October 14, 2009

Maintenance, Downtime and Outages

Via Data Center Knowledge - Maintenance, Downtime and Outages a quote from Ken Brill of The Uptime Institute:

"The No. 1 reason for catastrophic facility failure is lack of electrical maintenance,” Brill writes. “Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.”

Of course the obvious solution is to have redundant power in the data center & perform the power maintenance on one leg of the power at a time. One of our leased spaces does that. The other of our leased spaces has cooling and power shutdowns often enough that we have a very well rehearsed shutdown & startup plan. The point is well taken though. If you don’t do routine maintenance, you can complain about certain types of failure.

In IBM's case, it was the routine electrical maintenance that caused the outage. Apparently IBM didn't build out sufficient power redundancy for Air New Zealand's mainframe. A routine generator test failed and Air NZ’s mainframe lost power.

Air New Zealand CEO Rob Fyfe wasn't happy:

"In my 30-year working career, I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologise[sic] to its client and its client's customers,"

"We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers."

I wonder if Air NZ contracted with IBM for a Tier 4 data center and/or a hot site with remote clustering? If so, Rob Fyfe has a point. If Air NZ went the cheap route he really can't complain. It's not like data centers don't get affected by storms, rats, power outages, floods & earthquakes. Especially power, power, and occasionally fire, fire or cooling. Oh yea, and don't forget storage failures.

In the Air NZ case, the one hour power outage seems to have resulted in a six hour application outage. If you spend any time at all thinking about MTTR (Mean Time To Repair), having an application suite that takes 5 hours to recover from a one hour power failure isn't a well thought out architecture for a service as critical as an airline check in/ticketing/reservation system.

Unfortunately the aftermath of a power failure can be brutal. Even in our relatively simple environment, we spend at least a couple hours cleaning up after a power related outage, generally for a couple of reasons:

  • It’s the 21st century and we still have applications that aren't smart enough to recover from simple network and database connectivity errors. This is beyond dumb. It should matter what order that you start servers and processes. I keep thinking that we need to make developers desktops & test servers less reliable, just so they'll build better error handling into their apps.
  • It’s the 21st century and we still have software that doesn’t crash gracefully. Dumb software is expensive (Google thinks so…).
  • The larger the server, the longer it takes to boot. In some cases, boot time is so bad that you can't have an outage less than an hour.
  • It’s the 21st century and we still have complex interdependent scheduled jobs that need to be restarted in a coordinated (choreographed) dance.

An amusing aside, as I was using an online note taking tool to rough in this post, the provider (Ubernote) went offline. They came back on line about 10 minutes later - with about 10 minutes of data loss.

Tuesday, October 13, 2009

Data Loss, Speculation & Rumors

My head wants to explode. Not because of Microsoft's catastrophic data loss. That's a major failure that should precipitate a significant numbers of RGE's at Microsoft.

My head wants to explode because of the speculation, rumors and mis-information surrounding the failure. An example: rumors reported by Daniel Eran Dilger in Microsoft’s Sidekick/Pink problems blamed on dogfooding and sabotage that point to intentional data loss or sabotage.

"Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure."

and

"the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a disgruntled employee."

and

"someone with access to the servers at the datacenter must have inserted a time bomb to wipe out not just all of the data, but also all of the backup tapes, and finally, I suspect, reformatting the server hard drives so that the service itself could not be restarted with a simple reboot (and to erase any traces of the time bomb itself)."

Intentional sabotage?

How about some rational speculation, without black helicopters & thermite charges? How about simple mis-management? How about failed backup jobs followed by a failed SAN upgrade? I've had both backup and SAN failures in the past & fully expect to continue to have both types of failures again before I retire. Failed backups are common. Failed SAN's and failed SAN upgrades are less common, but likely enough that they must be accounted for in planning.

Let's assume as Dan speculates, it’s an Oracle RAC cluster. If all the RAC data is on a single SAN, with no replication or backups to separate media on separate controllers, then a simple human error configuring the SAN can easily result in spectacular failure.  If there is no copy of the Oracle data on separate media & separate controllers on a separate server, you CAN loose data in a RAC cluster. RAC doesn't magically protect you from SAN failure. RAC doesn't magically protect you from logical database corruption or DBA fat fingering. All the bits are still stored on a SAN, and if the SAN fails or the DBA fails, the bits are gone. Anyone who doesn't think that's true hasn't driven a SAN or a database for a living.

We've owned two IBM DS-4800's for the last four years and have had three controller related failures that could have resulted in data loss had we not taken the right steps at the right time (with IBM advanced support looking over our shoulders). A simple thing like a mis-matched firmware somewhere between the individual drives and the controllers or the HBA's and the operating system has the potential to cause data loss. Heck - because SAN configs can be stored on the individual drives themselves, plugging a used drive with someone else's SAN config into your SAN can cause SAN failure - or at least catastrophic, unintentional SAN reconfiguration.

I've got an IBM doc (sg246363) that says:

"Prior to physically installing new hardware, refer to the instructions in IBM TotalStorage DS4000 hard drive and Storage Expansion Enclosure Installation and Migration Guide, GC26-7849, available at:  [...snip...] Failure to consult this documentation may result in data loss, corruption, or loss of availability to your storage."

Does that imply that plugging the wrong thing into the wrong place at the wrong time can 'f up an entire SAN? Yep. It does, and from what the service manager for one of our vendors told me, it did happen recently – to one of the local Fortune 500’s.

If you don't buy that speculation, how about a simple misunderstanding between the DBA and the SAN team?

DBA: "So these two LUN's are on separate controllers, right?

SAN: "Yep."

Does anyone work at a place where that doesn't happen?

As for the ‘insider’ quote: "I don't understand why they would be spending any time upgrading stuff", a simple explanation would be that somewhere on the SAN, out of the hundreds of attached servers, a high profile project needed the latest SAN controller firmware or feature. Let's say, for example, that you wanted to plug a shiny new Windows 2008 server into the fabric and present a LUN from and older SAN. You'd likely have to upgrade the SAN firmware. To have a supported configuration, there is a fair chance that you'd have to upgrade HBA firmware and HBA drivers on all SAN attached servers. The newest SAN controller firmware that's required for 'Project 2008' then forces an across the board upgrade of all SAN attached servers, whether they are related to ‘Project 2008’ or not. It's not like that doesn't happen once a year or so, and it’s the reason that SAN vendors publish ‘compatibility matrixes’.

And upgrades sometimes go bad.

We had an HP engineer end up hospitalized a couple days before a major SAN upgrade. A hard deadline prevented us from delaying the upgrade. The engineer's manager flew in at the last minute, reviewed the docs, did the upgrade - badly. Among other f'ups, he presented a VMS LUN to a Windows server. The Windows server touched the LUN & scrambled the VMS file system. A simple error, a catastrophic result. Had that been a RAC LUN, the database would have been scrambled. It happened to be a VMS LUN that was recoverable from backups, so we survived.

Many claim this is a cloud failure.  I don't. As far as I can see, it's a service failure, plain and simple, independent of how it happens to be hosted. If the data was stored on an Amazon, Google or Azure cloud, and if the Amazon, Google or Azure cloud operating system or storage software scrambled and/or lost the data, then it'd be a cloud failure. The data appears to have been on ordinary servers in an ordinary database on a ordinary SAN.

That makes this an ordinary failure.