Skip to main content

Scalability at eBay

Sometimes we get a real treat. In an article article posted a couple days ago at, Randy Shoup, architect at eBay, outlines eBays' scalability best practices. What's interesting about this particular scalability article is that it appears that the practices can generally be applied at 'smaller than eBay' scale.

There are plenty of cases where scalability simply doesn't matter. Hardware is fast and keeps getting faster. If the problem can be solved with a handful of CPU's, then throw a handful of CPU's at the problem. That'll be cheaper than any other scalability solution. Unfortunately, as an application grows, throwing hardware at the problem costs more and returns less. (And somewhere around the 32 CPU's per database, it starts getting really expensive). Many of us are somewhere between 'it doesn't matter' on the low end, and 'two billion page views a day'. In the 'someplace in the middle' spot we are in, architecture matters.

Quick notes from the article:

Falling under the straight forward and implemental anywhere category are Partition by function, Split Horizontally, and Virtualize (Abstract) at al levels. Abstracting networks, hosts, databases and services can be done anywhere. App servers are easy to split horizontally. Databases are not.
More interesting is the move toward asynchronous rather than synchronous applications in: Decouple Functions Asynchronously and Move Processing To Asynchronous Flows.

A key bit:

"More often than not, the two components have no business talking directly to one another in any event. At every level, decomposing the processing into stages or phases, and connecting them up asynchronously, is critical to scaling."

I can see that making a difference even on medium-scale applications. In one of our applications, an innocent user hitting the 'submit' button can trigger a seven table, million I/O monstrosity of a synchronous update transaction buried in the middle of a multi-page stored procedure. The poor user has no clue - and the hardware groans in pain.

And - where best practice meets bottom line:

"...asynchrony can substantially reduce infrastructure cost. Performing operations synchronously forces you to scale your infrastructure for the peak load - it needs to handle the worst second of the worst day at that exact second. Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak."

Some rational advice on a caching strategy:

"....make sure your caching strategy is appropriate for your situation."
And finally, where it gets really interesting, is the use of the 'CAP Theorem', which trades
Consistency, Availability and Partition tolerance:

Best Practice #3: Avoid distributed transactions.

...For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.

That sort of turns thing upside down compared to the traditional ACID model.

Scaling eBay has got to be an interesting challenge. If we can pick up a few tips from the billion+ page view per day sites and use them on the million+ page view per day sites.......

(via Musings of an Anonymous Geek)


  1. Thanks for the link! It's a very instructive and useful read, if you're trying to improve your infrastructure.


Post a Comment

Popular posts from this blog

Cargo Cult System Administration

Cargo Cult: …imitate the superficial exterior of a process or system without having any understanding of the underlying substance --Wikipedia During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.
The question is, how amusing are we?We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?Here’s some common system administration failures that might be ‘cargo cult’:
Failing to understand the difference between necessary and sufficient. A daily backup …

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to…

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These …