Scalability at eBay

Sometimes we get a real treat. In an article article posted a couple days ago at infoq.com, Randy Shoup, architect at eBay, outlines eBays' scalability best practices. What's interesting about this particular scalability article is that it appears that the practices can generally be applied at 'smaller than eBay' scale.

There are plenty of cases where scalability simply doesn't matter. Hardware is fast and keeps getting faster. If the problem can be solved with a handful of CPU's, then throw a handful of CPU's at the problem. That'll be cheaper than any other scalability solution. Unfortunately, as an application grows, throwing hardware at the problem costs more and returns less. (And somewhere around the 32 CPU's per database, it starts getting really expensive). Many of us are somewhere between 'it doesn't matter' on the low end, and 'two billion page views a day'. In the 'someplace in the middle' spot we are in, architecture matters.

Quick notes from the article:

Falling under the straight forward and implemental anywhere category are Partition by function, Split Horizontally, and Virtualize (Abstract) at al levels. Abstracting networks, hosts, databases and services can be done anywhere. App servers are easy to split horizontally. Databases are not.
More interesting is the move toward asynchronous rather than synchronous applications in: Decouple Functions Asynchronously and Move Processing To Asynchronous Flows.

A key bit:

"More often than not, the two components have no business talking directly to one another in any event. At every level, decomposing the processing into stages or phases, and connecting them up asynchronously, is critical to scaling."

I can see that making a difference even on medium-scale applications. In one of our applications, an innocent user hitting the 'submit' button can trigger a seven table, million I/O monstrosity of a synchronous update transaction buried in the middle of a multi-page stored procedure. The poor user has no clue - and the hardware groans in pain.

And - where best practice meets bottom line:

"...asynchrony can substantially reduce infrastructure cost. Performing operations synchronously forces you to scale your infrastructure for the peak load - it needs to handle the worst second of the worst day at that exact second. Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak."

Some rational advice on a caching strategy:

"....make sure your caching strategy is appropriate for your situation."
And finally, where it gets really interesting, is the use of the 'CAP Theorem', which trades
Consistency, Availability and Partition tolerance:

Best Practice #3: Avoid distributed transactions.

...For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.

That sort of turns thing upside down compared to the traditional ACID model.

Scaling eBay has got to be an interesting challenge. If we can pick up a few tips from the billion+ page view per day sites and use them on the million+ page view per day sites.......

(via Musings of an Anonymous Geek)