Saturday, November 14, 2009

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:
“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures.  Very few do and I see related down-time in the news every month or so.....We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”

I've had high visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan something as simple as a manual intervention to disable features in a controlled manner rather than letting then fail in an uncontrolled manner.

On some applications we've been able to plan and execute graceful service degradation by disabling non-critical features. In one case, we disabled a scheduling widget in order to maintain sufficient headroom for more important functions like quizzing and exams, in other cases, we have the ability to limit the size of shopping carts or restrict financial aid and grade re-calcs during peak load.

Degraded operations isn't just an application layer concept. Network engineers routinely build forms of degraded operations into their designs. Networks have been congested since the day they were invented, and as you'd expect, the technology available for handling degraded operations is very mature. On a typical network, QOS (Quality of Service) policy and configuration is used to maintain critical network traffic and shed non-critical traffic.

As and example, on our shared state wide backbone, we assume that we'll periodically end up in some sort of degraded mode, either because a primary circuit has failed and the backup paths don't have adequate bandwidth, because we experience inbound DOS attacks, or perhaps because we simply don't have adequate bandwidth.  In our case, the backbone is shared by all state agencies, public colleges and universities, including state and local law enforcement, so inter-agency collaboration is necessary when determining what needs to get routed during a degraded state.

A simplified version of the traffic priority on the backbone is:

Highest Priority Router Traffic (BGP, OSPF, etc.)
  Law Enforcement
  Voice
  Interactive Video
  Intra-State Data
Lowest Priority Internet Data

When the network is degraded, we presume that law enforcement traffic should be near the head of the queue. We consider interactive video conferencing to be business critical (i.e. we have to cancel classes when interactive classroom video conferencing is broke), so we keep it higher in the priority order than ordinary data. We have also decided that commodity Internet should be the first traffic to discarded when the network is degraded.

Unfortunately on the part of the application stack that's hardest to scale, the database, there is no equivalent to network QOS or traffic engineering.  I as far as I know, I don't have the ability to tag a query or stored procedure with a few extra bits that tell the database engine to place the query at the head of the work queue, discarding other less important work if necessary. It's not hard to imagine a 'discard eligible' bit that could be set on certain types of database processes or on work submitted by certain clients. The database, if necessary, would discard that work, or place the work in a 'best effort' scheduling class and run if if & when it has free CPU cycles.

If the engineers at the major database vendors would Google 'Weighted Fair Queuing' or 'Weighted Random Early Detect' we might someday see interesting new ways of managing degraded databases.

No comments:

Post a Comment