Failure Tolerance - the Google way

Another 'gotta think about this' article, this time via Stephen Shankland's's summary of a Jeff Dean Google I/O presentation.

The first thing that piqued my interest is the raw failure rate of a new Google cluster. Jeff Dean is claiming that on new clusters, they have half as many failures as servers. (1800 servers, 1000 failures.) That seems high to me. Working with an order of magnitude fewer servers than a single Google cluster, we don't see anything like that kind of failure rate. Either there really is a difference in the reliability of cheap hardware, or I'm not comparing like numbers.

It's hard for me to imagine that failure rate working at a several orders of magnitude smaller scale. The incremental cost of 'normal' hardware over cheap hardware at the hundreds of servers scale isn't that large of a number. But Google has a cheap hardware, smart software mantra, so presumably the economics works out in their favor.

The second interesting bit is taking the Google-like server failure assumptions comparing them to network protocol failure assumptions. The TCP world assumes that losses will occur and the protocols need to gracefully and unobtrusively handle the loss. Wide area networks have always been like that. Wide area network engineers assume that there will be loss. The TCP protocol assumes that there will be loss. The protocols that route data and the end points of the session are smart. In network land, things work even with data loss and errors as high as 1%, and in HA network designs, we often assume that 50% of our hardware can fail and we'll still have a functional system.

In the case of routed wide area networks, it isn't necessarily cheap hardware that leads us to smart software (TCP, OSPF, BGP), it's unreliable circuits (the backhoe factor). In router and network land, smart software and redundancy compensates for unreliable WAN circuits. In theory smart software should let us run cheap hardware. In reality, we tend to have both smart software and expensive hardware. But a high hardware failure rate isn't really an option for network redundancy. If you've only got a single pair of HA routers 300 or 3000 miles away, and one dies, you've used up the +1 on your n+1 design. And it would be hard to build a remote routed node with more than a pair of redundant routers, mostly because the incoming telco circuit count would have to increase, and those things are expensive.

Compare that to SAN fabrics, file systems and relational databases, where a single missing bit is a catastrophe. There is simply no 'assumption of loss' in most application and database designs. The lack of loss tolerant designs in databases and file systems drives us to a model of both expensive hardware and expensive software.

There probably are a whole slew of places where we could take the engineering assumptions of one discipline apply them to an marginally related or unrelated discipline, with game-changing results.
(via Data Center Knowledge)

Scalability at eBay

Sometimes we get a real treat. In an article article posted a couple days ago at, Randy Shoup, architect at eBay, outlines eBays' scalability best practices. What's interesting about this particular scalability article is that it appears that the practices can generally be applied at 'smaller than eBay' scale.

There are plenty of cases where scalability simply doesn't matter. Hardware is fast and keeps getting faster. If the problem can be solved with a handful of CPU's, then throw a handful of CPU's at the problem. That'll be cheaper than any other scalability solution. Unfortunately, as an application grows, throwing hardware at the problem costs more and returns less. (And somewhere around the 32 CPU's per database, it starts getting really expensive). Many of us are somewhere between 'it doesn't matter' on the low end, and 'two billion page views a day'. In the 'someplace in the middle' spot we are in, architecture matters.

Quick notes from the article:

Falling under the straight forward and implemental anywhere category are Partition by function, Split Horizontally, and Virtualize (Abstract) at al levels. Abstracting networks, hosts, databases and services can be done anywhere. App servers are easy to split horizontally. Databases are not.
More interesting is the move toward asynchronous rather than synchronous applications in: Decouple Functions Asynchronously and Move Processing To Asynchronous Flows.

A key bit:

"More often than not, the two components have no business talking directly to one another in any event. At every level, decomposing the processing into stages or phases, and connecting them up asynchronously, is critical to scaling."

I can see that making a difference even on medium-scale applications. In one of our applications, an innocent user hitting the 'submit' button can trigger a seven table, million I/O monstrosity of a synchronous update transaction buried in the middle of a multi-page stored procedure. The poor user has no clue - and the hardware groans in pain.

And - where best practice meets bottom line:

"...asynchrony can substantially reduce infrastructure cost. Performing operations synchronously forces you to scale your infrastructure for the peak load - it needs to handle the worst second of the worst day at that exact second. Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak."

Some rational advice on a caching strategy:

"....make sure your caching strategy is appropriate for your situation."
And finally, where it gets really interesting, is the use of the 'CAP Theorem', which trades
Consistency, Availability and Partition tolerance:

Best Practice #3: Avoid distributed transactions.

...For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.

That sort of turns thing upside down compared to the traditional ACID model.

Scaling eBay has got to be an interesting challenge. If we can pick up a few tips from the billion+ page view per day sites and use them on the million+ page view per day sites.......

(via Musings of an Anonymous Geek)

The quarter-million dollar query

What does a query cost? In one recent case, about a quarter-million dollars.

The story (suitably anonymized): User requirements resulted in a new feature added to a busy application. Call it Widget 16. Widget 16 was complex enough that the database side of the widget put a measurable load on the server. That alone wouldn't have been a problem, but Widget 16 turned out to be a useful widget. So useful in fact, that users clicked on Widget 16 far more often have anticipated. Complex query + lots of clicks = load problem. How big was the load problem? In this case, the cost of the query (in CPU seconds, logical reads, etc.) and the number of times the query was executed were both relevant. After measuring the per-widget load on the server and multiplying by the number of widget clicks per second, we figured that widget costs at least a quarter million dollars per year in hardware and database license costs. Especially database licensing costs.

That was an interesting number.

Obviously we want to use that number to help make the application better, faster and cheaper.

The developers - who are faced with having to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - likely will favor the former over the latter. We expect that to happen. Unfortunately if that happens too often, the hosting team is stuck with either poor performance or a large hardware and software bill. To properly prioritize the development work effort, some rational measurement must be made of the cost of re-working existing functionality to reduce load verses the value of using that same work effort to add user requested features. Calculating the cost of running a feature or widget makes the prioritization determination possible. In this case, the cost of running the query compared to the person-time required to design, code, test and deploy a solution made the decision to optimize the widget (or cache it) pretty easy to make.

DBA's already have the tools (Profiler, AWR) to determine the utilization of a feature as measured in CPU, Memory and I/O. Hosting managers have the dollar cost of the CPU's and I/O figured out. What the DBA's and managers need to do is merge the data, format it into something readable and feed it back to the developers. Closing the loop via feedback to developers is essential.

The relevant data may vary depending on your application, but the data that almost certainly will be interesting will include:

  • Number of executions per time period (second, minute, hour)
  • CPU cycles per execution
  • Logical and Physical I/O's per execution.

The rough approximation of CPU load on the database server will be # of executions x CPU cycles per execution. The I/O's per execution x number of executions will give you a rough estimate of SAN or disk load. Obviously you only have a finite number of CPU cycles & I/O's available each second, and those CPU's and related database licenses have dollar costs associated with them. The actual application CPU and I/O data, measured against the total CPU and I/O available from your system and the annual hardware and software cost of the system will give you an estimate of the overall cost in dollars to run the query.

Notice that I didn't mention query response time. From the point of view of the developer, response time is interesting. It's what the user will feel when they navigate the site, and it is easy to measure. From a capacity/load point of view however, response time itself doesn't indicate the back-end CPU & I/O cost to achieve the response time. If the query or stored procedure returned in 200ms, but during that 200ms it paralleled out across all your CPU's and used up all available CPU time, you'll obviously only be able to do a handful of those each second, and if you try to do more than a handful per second, your application isn't going to be happy. Or if in that 200ms, it used 200ms of CPU time, you'll only be able to execute a handful of that query per CPU each second. In other words, focusing only on response time isn't adequate because it doesn't relate back to load related bottlenecks on the application and database servers.

For those who haven't seen an AWR report, Oracle has an example here. An AWR report allows your DBA's and dev team to slice, dice sort and analyze the heck out of an application. For SQL server we built a system that runs periodic profiler traces, uploads the trace to a database, and dumps out reports similar to Oracle AWR's.

The bottom line: In order for application developers to successfully design and build efficient, scalable applications, they need to have comprehensive performance related data. They need to be able to 'see' all the way through from the web interface to the database. If they do not have data, they cannot build efficient, scalable applications.

Least Bit Server Installs - What NOT to do

Wow - I thought that this kind of crud had been eliminated years ago.

I did a live upgrade of my home office Solaris 10 server to the latest release (05/08), created a new zone, booted it and found:

online         May_10   svc:/network/finger:default
online May_10 svc:/network/rpc/rusers:default
online May_10 svc:/network/login:rlogin

rlogin? finger? rusers? Is it 1995 all over again? I quit using them a decade ago.

So I did a little digging. The global zone still has the SUNWrcmds package, and that package got rolled into the new zone. The actual commands were disabled on the global zone years ago, back when this was a Solaris 8 server, but they were never uninstalled. The new zone inherited the package, but instead of leaving them disabled, they ended up enabled in the service framework on the new zone.

This is a classic example of why dead software needs to get uninstalled, not just disabled. I've seen ghosts from the past come back to haunt me more than once.

Read the Least Bit Principle. Then do as I say, not as ( ...cough.... ahem...) I do.

Unfortunately, in this case, it looks like Sun has those long-dead-obsolete-unsecure-replaced-by-SSH commands packaged up with enough dependencies that I'm not sure I'll be able to remove them cleanly.


Service Deprovisioning as a Security Practice

Deprovisioning of things other than users is, from the point of view of a security person or system manager, both interesting and relatively unexplored. Applications, servers, databases and devices have a finite life, even if the bad ones seem to hang around forever. The end-of-life of these non-person things is an event that needs the attention if future security and operational issues are to be avoided.

When one thinks of deprovisioning, one generally thinks of user deprovisioning. The search engines are full of interesting bits in the user deprovisioning space, most of which are more interesting than my thoughts. I'm interested in deprovisioning of things other than users.

To organize the concept of non-user deprovisioning, I'll invent something called 'Service Deprovisioning'. A temporary definition of Service Deprovisioning:

When an application, server, database or device is no longer required for system functionality, it and its dependencies are taken out of service using a structured methodology.
To ensure that happens consistently:

A cross boundary process must be established that assures that the service and its related dependencies, including application software, databases, firewall and load balancer configs, network, storage and server infrastructure are properly deprovisioned.
The concept assumes that system managers actually know when a service is no longer in use. In some cases, where the service is upgraded or re-deployed, the point at which the old service is no longer 'in production' is easy to determine. In other cases the service may gradually fall into disuse and not have a definitive date at which it may be deprovisioned. The concept also assumes that the various entities within an organization communicate to each other the services that are being deprovisioned, so that for example firewall administrators know that an application is no longer in use and DBAs know that a database can be dropped.

The high level tasks can be broken down into at least:

  • Firewall and Load Balancer rule deprovisioning
  • Software Deprovisioning
  • Database Deprovisioning
  • Physical Server Decommissioning
  • Virtual Sever Deprovisioning
  • Network Deprovisioning
  • Storage Deprovisioning
(If I think of more, I'll update this post.)

Firewall and Load Balancer rule deprovisioning is probably the most critical task associated with service deprovisioning. The outward facing interface to the service normally should be the first item on the deprovisioning check list. Unused or dead rules have a high probability of causing future security incidents as new services are deployed in the same network address space.

In the best case, all the rules associated with an application, server or database are well documented and well understood. Deprovisioning the rules then follows the service related documentation. A more likely scenario is that the rules are not fully understood and a discovery process must be used to determine which rules are associated with a service. Some discovery is straight forward. If a host is decommissioned, all rules associated with that host IP address should be candidates for removal. For organizations that use URL based load balancing rules, the load balancing rules must be unconfigured as elements of the web application are disabled or taken out of service.

Software Deprovisioning must occur whenever application software is no long required for production. Both the application software and its dependencies must be uninstalled. Uninstalling applications insures that the software cannot affect the availability or security of a system at some point in time in the future. Software deprovisioning is particularly important in the case of Internet facing applications. It isn't hard to imagine using as a search tool to find older, less secure versions of an application. If the old version is still hanging around on app servers, it would have a fair chance of being an unexpected conduit into the enterprise. (Of course that assumes that the old version of the application is more vulnerable to SQL injection, Cross Site Scripting, etc. The way things look today, that might not be a safe assumption.)

In some cases, such as with Perl CPAN or in the case Java run time proliferation, it is difficult to determine the application dependencies, hence it is difficult to decide if a particular JRE or CPAN module is no longer in service and can be deleted. Even if the original installation is properly documented, applications installed on the same server after the deprovisioned application may (or likely will) have undocumented dependencies on modules that were part of the original application. It tends to be difficult to sort application interdependencies unless some form of structured software installation is used and application interdependencies are clearly determined at installation time.

In any cases where the uninstalled software had a network interface (listening on a TCP or UDP port) there presumably is an associated firewall rule. In theory, every time a Windows service or Unix daemon is taken out of service, there should be corresponding firewall and/or load balancer change. Over the years I have had the opportunity to investigate many successful web server hacks. In one case, the hacker was kind enough to leave a message on the server. 'You only left one port open. One port was all I needed.'

Database Deprovisioning follows the deprovisioning of the related applications. The process is straightforward. Full (cold) backups are taken, the database is dropped, the listener is stopped, related data files, control files, scripts and utilities are removed or modified. Both the firewall rules and storage related to the database are dropped. There should be a one to one relationship between dropping a database and removing a firewall rule set.

Physical Server Decommissioning occurs when the last service that depends on the sever has been deprovisioned. Decommissioning of servers requires not only the obvious un-racking, cable removal, asset disposal and inventory updates, but also requires DNS updates, network deprovisioning, changes to monitoring systems and related documentation. Server decommissioning should automatically trigger the related firewall, network, DNS and monitoring changes required to deprovision the server related dependencies. If the server was SAN attached, the related storage must aslo be deprovisioned.

Virtual Sever Deprovisioning follows roughly the same rules as physical servers, minus the un-racking and un-cabling, but with the additional task of ensuring the archiving or disposal of old virtual machine clones, and the related task of unconfiguring any virtualized network interfaces or virtual security devices.

Network Deprovisioning is required any time that network connected hardware is decommissioned. Presumably your network managers are routinely disabling unused network ports, and hence need notification of network connections that are no longer in use. If servers are un-racked, then network cabling must be pulled out and network documentation must be updated. Network deprovisioning is an essential part of ensuring that future re-use of the network interface doesn't cause unanticipated security or configuration issues.

Storage Deprovisioning tends to be straight forward, and almost certainly the task that happens most reliably. Storage mangers, who are almost always are struggling to find space on SAN's, tend be pretty good at finding unused LUN's. Decommissioning SAN attached servers requires related switch fabric zone changes, switch port deprovisioning and documentation updates. Tagged on to storage deprovisioning is the final disposition of the last set of full backup and archive tapes, and if spindles are freed up, the wiping and/or disposal of the spindles is required.

Service Deprovisioning is the tail end of the Least Bit Principle, and assumes that you have some form of Structured System Management.