Another 'gotta think about this' article, this time via Stephen Shankland's news.com summary of a Jeff Dean Google I/O presentation.
The first thing that piqued my interest is the raw failure rate of a new Google cluster. Jeff Dean claims that a new cluster sees roughly half as many failures as it has servers (1800 servers, 1000 failures). That seems high to me. Working with an order of magnitude fewer servers than a single Google cluster, we don't see anything like that kind of failure rate. Either there really is a difference in the reliability of cheap hardware, or I'm not comparing like numbers.
It's hard for me to imagine that failure rate working at a scale several orders of magnitude smaller. At the hundreds-of-servers scale, the incremental cost of 'normal' hardware over cheap hardware isn't that large a number. But Google has a cheap hardware, smart software mantra, so presumably the economics work out in their favor.
The second interesting bit is taking the Google-like server failure assumptions and comparing them to network protocol failure assumptions. The TCP world assumes that losses will occur and that the protocols need to handle the loss gracefully and unobtrusively. Wide area networks have always been like that: wide area network engineers assume there will be loss, and TCP assumes there will be loss. The protocols that route data and the endpoints of the session are smart. In network land, things work even with data loss and error rates as high as 1%, and in HA network designs we often assume that 50% of our hardware can fail and we'll still have a functional system.
In the case of routed wide area networks, it isn't necessarily cheap hardware that leads us to smart software (TCP, OSPF, BGP), it's unreliable circuits (the backhoe factor). In router and network land, smart software and redundancy compensate for unreliable WAN circuits. In theory smart software should let us run cheap hardware. In reality, we tend to have both smart software and expensive hardware. But a high hardware failure rate isn't really an option for network redundancy. If you've only got a single pair of HA routers 300 or 3000 miles away, and one dies, you've used up the +1 on your n+1 design. And it would be hard to build a remote routed node with more than a pair of redundant routers, mostly because the incoming telco circuit count would have to increase, and those circuits are expensive.
Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software.
There is probably a whole slew of places where we could take the engineering assumptions of one discipline and apply them to a marginally related or unrelated discipline, with game-changing results.
(via Data Center Knowledge)
Scalability at eBay
Sometimes we get a real treat. In an article posted a couple of days ago at infoq.com, Randy Shoup, an architect at eBay, outlines eBay's scalability best practices. What's interesting about this particular scalability article is that the practices can generally be applied at 'smaller than eBay' scale.
There are plenty of cases where scalability simply doesn't matter. Hardware is fast and keeps getting faster. If the problem can be solved with a handful of CPU's, then throw a handful of CPU's at the problem. That'll be cheaper than any other scalability solution. Unfortunately, as an application grows, throwing hardware at the problem costs more and returns less. (And somewhere around 32 CPU's per database, it starts getting really expensive.) Many of us are somewhere between 'it doesn't matter' on the low end and 'two billion page views a day'. In that 'someplace in the middle' spot, architecture matters.
Quick notes from the article:
Falling under the straightforward, implementable-anywhere category are Partition by Function, Split Horizontally, and Virtualize (Abstract) at All Levels. Abstracting networks, hosts, databases and services can be done anywhere. App servers are easy to split horizontally. Databases are not.
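As a rough illustration of what 'split horizontally' means at the data layer, here's a minimal sketch of hash-based shard routing. The shard names and the routing function are hypothetical, not anything eBay describes; real schemes also have to deal with re-sharding, hot keys and cross-shard queries, which is exactly why the databases are the hard part.

```python
import hashlib

# Hypothetical shard map; in practice this lives in configuration or a
# lookup service, not hard-coded in the application.
SHARDS = [
    "db-shard-0.example.internal",
    "db-shard-1.example.internal",
    "db-shard-2.example.internal",
    "db-shard-3.example.internal",
]

def shard_for(user_id: str) -> str:
    """Route all of a user's reads and writes to one shard by hashing the key."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))  # always the same shard for this user
```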
More interesting is the move toward asynchronous rather than synchronous applications in: Decouple Functions Asynchronously and Move Processing To Asynchronous Flows.
A key bit:
"More often than not, the two components have no business talking directly to one another in any event. At every level, decomposing the processing into stages or phases, and connecting them up asynchronously, is critical to scaling."
I can see that making a difference even in medium-scale applications. In one of our applications, an innocent user hitting the 'submit' button can trigger a seven-table, million-I/O monstrosity of a synchronous update transaction buried in the middle of a multi-page stored procedure. The poor user has no clue - and the hardware groans in pain.
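A minimal sketch of the decoupling idea, using an in-process queue as a stand-in for whatever messaging layer you actually run (the function and queue names here are made up): the request path records the work and returns immediately, and a worker drains the queue on its own schedule.

```python
import queue
import threading

work_queue = queue.Queue()

def handle_submit(order_id: int) -> str:
    """Called in the web request path: enqueue and return immediately."""
    work_queue.put(order_id)
    return "accepted"            # the user sees a fast response

def expensive_update(order_id: int) -> None:
    """Stand-in for the seven-table, million-I/O transaction."""
    pass                         # ...the heavy database work goes here...

def worker() -> None:
    """Runs outside the request path, paced by the queue, not by peak traffic."""
    while True:
        order_id = work_queue.get()
        expensive_update(order_id)
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
handle_submit(1001)              # returns right away
work_queue.join()                # only for this demo: wait for the background work
```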
And - where best practice meets bottom line:
"...asynchrony can substantially reduce infrastructure cost. Performing operations synchronously forces you to scale your infrastructure for the peak load - it needs to handle the worst second of the worst day at that exact second. Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak."
Some rational advice on a caching strategy:
"....make sure your caching strategy is appropriate for your situation."
And finally, where it gets really interesting, is the use of the 'CAP Theorem', which trades off Consistency, Availability and Partition tolerance:
Best Practice #3: Avoid distributed transactions.
...For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.
That sort of turns things upside down compared to the traditional ACID model.
Scaling eBay has got to be an interesting challenge. If we can pick up a few tips from the billion+ page view per day sites and use them on the million+ page view per day sites...
(via Musings of an Anonymous Geek)
The quarter-million dollar query
What does a query cost? In one recent case, about a quarter-million dollars.
The story (suitably anonymized): User requirements resulted in a new feature added to a busy application. Call it Widget 16. Widget 16 was complex enough that the database side of the widget put a measurable load on the server. That alone wouldn't have been a problem, but Widget 16 turned out to be a useful widget. So useful, in fact, that users clicked on Widget 16 far more often than anticipated. Complex query + lots of clicks = load problem. How big was the load problem? In this case, the cost of the query (in CPU seconds, logical reads, etc.) and the number of times the query was executed were both relevant. After measuring the per-widget load on the server and multiplying by the number of widget clicks per second, we figured the widget costs at least a quarter million dollars per year in hardware and database license costs. Especially database licensing costs.
That was an interesting number.
Obviously we want to use that number to help make the application better, faster and cheaper.
The developers - who have to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - will likely favor new features and deadlines over performance work. We expect that to happen. Unfortunately, if it happens too often, the hosting team is stuck with either poor performance or a large hardware and software bill. To properly prioritize the development work effort, some rational measurement must be made of the cost of re-working existing functionality to reduce load versus the value of using that same work effort to add user-requested features. Calculating the cost of running a feature or widget makes that prioritization possible. In this case, the cost of running the query, compared to the person-time required to design, code, test and deploy a solution, made the decision to optimize the widget (or cache it) pretty easy.
DBA's already have the tools (Profiler, AWR) to determine the utilization of a feature as measured in CPU, Memory and I/O. Hosting managers have the dollar cost of the CPU's and I/O figured out. What the DBA's and managers need to do is merge the data, format it into something readable and feed it back to the developers. Closing the loop via feedback to developers is essential.
The relevant data may vary depending on your application, but it will almost certainly include:
- Number of executions per time period (second, minute, hour)
- CPU cycles per execution
- Logical and Physical I/O's per execution.
The rough approximation of CPU load on the database server will be # of executions x CPU cycles per execution. The I/O's per execution x number of executions will give you a rough estimate of SAN or disk load. Obviously you only have a finite number of CPU cycles & I/O's available each second, and those CPU's and related database licenses have dollar costs associated with them. The actual application CPU and I/O data, measured against the total CPU and I/O available from your system and the annual hardware and software cost of the system will give you an estimate of the overall cost in dollars to run the query.
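To make that arithmetic concrete, here's a back-of-envelope sketch. Every number in it is made up for illustration; plug in your own measurements from Profiler or AWR and your own annual hardware and license costs.

```python
# Hypothetical measurements for one query/widget.
executions_per_sec = 50          # how often the widget fires
cpu_sec_per_execution = 0.04     # CPU seconds burned per execution
ios_per_execution = 1200         # logical + physical I/O's per execution

# Hypothetical capacity and annual cost of the database server.
total_cpus = 16                  # CPU seconds available per wall-clock second
total_ios_per_sec = 200_000      # what the SAN/disks can sustain
annual_cost_dollars = 800_000    # hardware + database licenses per year

cpu_load = executions_per_sec * cpu_sec_per_execution   # CPU-seconds per second
io_load = executions_per_sec * ios_per_execution        # I/O's per second

cpu_fraction = cpu_load / total_cpus
io_fraction = io_load / total_ios_per_sec

# Charge the query for whichever resource it consumes the largest share of.
annual_query_cost = max(cpu_fraction, io_fraction) * annual_cost_dollars

print(f"CPU: {cpu_fraction:.1%} of the box, I/O: {io_fraction:.1%}")
print(f"Approximate annual cost: ${annual_query_cost:,.0f}")
```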
Notice that I didn't mention query response time. From the point of view of the developer, response time is interesting. It's what the user will feel when they navigate the site, and it is easy to measure. From a capacity/load point of view, however, response time itself doesn't indicate the back-end CPU & I/O cost needed to achieve that response time. If the query or stored procedure returned in 200ms, but during that 200ms it parallelized across all your CPU's and used up all available CPU time, you'll obviously only be able to do a handful of those each second, and if you try to do more than a handful per second, your application isn't going to be happy. Or if in that 200ms it used 200ms of CPU time, you'll only be able to execute a handful of that query per CPU each second. In other words, focusing only on response time isn't adequate because it doesn't relate back to load-related bottlenecks on the application and database servers.
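A quick sanity check on that reasoning, again with illustrative numbers only: divide the CPU seconds available each second by the CPU seconds each execution consumes, and you get a hard throughput ceiling regardless of how fast the response felt.

```python
# Illustrative numbers, not measurements from any real system.
cpu_sec_per_execution = 0.2     # 200 ms of CPU time per call
num_cpus = 8                    # CPU seconds available per wall-clock second

max_executions_per_sec = num_cpus / cpu_sec_per_execution
print(max_executions_per_sec)   # 40.0 - the ceiling, even if each call "felt" fast
```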
For those who haven't seen an AWR report, Oracle has an example here. An AWR report lets your DBA's and dev team slice, dice, sort and analyze the heck out of an application. For SQL Server, we built a system that runs periodic Profiler traces, uploads the traces to a database, and dumps out reports similar to Oracle's AWR reports.
The bottom line: In order for application developers to successfully design and build efficient, scalable applications, they need comprehensive performance-related data. They need to be able to 'see' all the way through from the web interface to the database. If they do not have that data, they cannot build efficient, scalable applications.