
Showing posts from May, 2008

Failure Tolerance - the Google way

Another 'gotta think about this' article, this time via Stephen Shankland's summary of a Jeff Dean Google I/O presentation.

The first thing that piqued my interest is the raw failure rate of a new Google cluster. Jeff Dean is claiming that on new clusters, they have half as many failures as servers. (1800 servers, 1000 failures.) That seems high to me. Working with an order of magnitude fewer servers than a single Google cluster, we don't see anything like that kind of failure rate. Either there really is a difference in the reliability of cheap hardware, or I'm not comparing like numbers.

It's hard for me to imagine that failure rate working at a scale several orders of magnitude smaller. The incremental cost of 'normal' hardware over cheap hardware at the hundreds-of-servers scale isn't that large a number. But Google has a cheap hardware, smart software mantra, so presumably the economics work out in their favor.
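As a rough sanity check, the claimed rate can be scaled down. The 1800/1000 figures come from the talk; the smaller shop and its server count below are my own hypothetical:

```python
# Back-of-envelope: scale the claimed failure rate down to a smaller shop.
# Figures from the talk: ~1000 failures on a new ~1800-server cluster.
GOOGLE_SERVERS = 1800
GOOGLE_FAILURES = 1000

failures_per_server = GOOGLE_FAILURES / GOOGLE_SERVERS  # ~0.56

# Hypothetical shop an order of magnitude smaller than one Google cluster.
our_servers = 180
expected_failures = failures_per_server * our_servers

print(f"{failures_per_server:.2f} failures per server")
print(f"~{expected_failures:.0f} expected failures at {our_servers} servers")
```

If the rate really held at small scale, a 180-server shop would see on the order of a hundred failures, which is nothing like what we observe; either the cheap hardware is genuinely that much flakier, or the talk counts 'failure' differently than we do.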

The second …

Scalability at eBay

Sometimes we get a real treat. In an article posted a couple of days ago, Randy Shoup, an architect at eBay, outlines eBay's scalability best practices. What's interesting about this particular scalability article is that the practices appear to be generally applicable at 'smaller than eBay' scale.

There are plenty of cases where scalability simply doesn't matter. Hardware is fast and keeps getting faster. If the problem can be solved with a handful of CPUs, then throw a handful of CPUs at the problem. That'll be cheaper than any other scalability solution. Unfortunately, as an application grows, throwing hardware at the problem costs more and returns less. (And somewhere around 32 CPUs per database, it starts getting really expensive.) Many of us are somewhere between 'it doesn't matter' on the low end, and 'two billion page views a day'. In the 'someplace in the middle' spot we are in, archi…

The quarter-million dollar query

What does a query cost? In one recent case, about a quarter-million dollars.

The story (suitably anonymized): User requirements resulted in a new feature added to a busy application. Call it Widget 16. Widget 16 was complex enough that the database side of the widget put a measurable load on the server. That alone wouldn't have been a problem, but Widget 16 turned out to be a useful widget. So useful, in fact, that users clicked on Widget 16 far more often than anticipated. Complex query + lots of clicks = load problem. How big was the load problem? In this case, the cost of the query (in CPU seconds, logical reads, etc.) and the number of times the query was executed were both relevant. After measuring the per-widget load on the server and multiplying by the number of widget clicks per second, we figured that the widget cost at least a quarter million dollars per year in hardware and database license costs. Especially database licensing costs.

That was an interesting number.
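The arithmetic behind a number like that is straightforward. A sketch with entirely hypothetical figures (the post doesn't publish the real per-query cost, click rate, or per-CPU price):

```python
# Back-of-envelope annual cost of one busy query.
# All figures below are hypothetical, not the post's actual measurements.
cpu_seconds_per_query = 0.5   # measured per-widget CPU cost
queries_per_second = 20       # widget clicks hitting the database

# A query burning 0.5 CPU-seconds, 20 times a second, permanently
# occupies 0.5 * 20 = 10 CPUs around the clock.
cpus_consumed = cpu_seconds_per_query * queries_per_second

cost_per_cpu = 25_000         # hardware + per-CPU database license, hypothetical
annual_cost = cpus_consumed * cost_per_cpu

print(f"~{cpus_consumed:.0f} CPUs, ${annual_cost:,.0f}/year")
```

The point is that the per-execution cost alone tells you nothing; it's the product of per-execution cost and execution rate, priced out in per-CPU licensing, that turns a widget into a quarter-million-dollar line item.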


Least Bit Server Installs - What NOT to do

Wow - I thought that this kind of crud had been eliminated years ago.

I did a live upgrade of my home office Solaris 10 server to the latest release (05/08), created a new zone, booted it and found:

online May_10 svc:/network/finger:default
online May_10 svc:/network/rpc/rusers:default
online May_10 svc:/network/login:rlogin

rlogin? finger? rusers? Is it 1995 all over again? I quit using them a decade ago.

So I did a little digging. The global zone still has the SUNWrcmds package, and that package got rolled into the new zone. The actual commands were disabled on the global zone years ago, back when this was a Solaris 8 server, but they were never uninstalled. The new zone inherited the package, but instead of leaving them disabled, they ended up enabled in the service framework on the new zone.
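Auditing for this kind of leftover is easy to script. A minimal sketch (my own, not from the post) that flags legacy r-services in `svcs -a` output:

```python
# Flag legacy remote-access services in `svcs -a` output.
# The service names come from the listing above; the script is hypothetical.
LEGACY = ("finger", "rlogin", "rusers", "rsh", "rexec", "telnet")

def flag_legacy(svcs_output: str) -> list[str]:
    """Return the FMRIs of online legacy services."""
    hits = []
    for line in svcs_output.splitlines():
        fields = line.split()
        # svcs columns: STATE, STIME, FMRI
        if len(fields) < 3 or fields[0] != "online":
            continue
        fmri = fields[-1]
        if any(name in fmri for name in LEGACY):
            hits.append(fmri)
    return hits

sample = """\
online May_10 svc:/network/finger:default
online May_10 svc:/network/rpc/rusers:default
online May_10 svc:/network/login:rlogin
online May_10 svc:/network/ssh:default"""

for fmri in flag_legacy(sample):
    print("disable and uninstall:", fmri)
```

Disabling each flagged service with `svcadm disable` stops the immediate exposure, but as the zone inheritance above shows, the durable fix is removing the package itself so new zones never see it.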

This is a classic example of why dead software needs to get uninstalled, not just disabled. I've seen ghosts from the past come back to haunt me more than once.


Service Deprovisioning as a Security Practice

Deprovisioning of things other than users is, from the point of view of a security person or system manager, both interesting and relatively unexplored. Applications, servers, databases and devices have a finite life, even if the bad ones seem to hang around forever. The end-of-life of these non-person things is an event that needs attention if future security and operational issues are to be avoided.

When one thinks of deprovisioning, one generally thinks of user deprovisioning. The search engines are full of interesting bits in the user deprovisioning space, most of which are more interesting than my thoughts. I'm interested in deprovisioning of things other than users.

To organize the concept of non-user deprovisioning, I'll invent something called 'Service Deprovisioning'. A temporary definition of Service Deprovisioning:

When an application, server, database or device is no longer required for system functionality, it and its dependencies are taken out of s…