Wednesday, March 25, 2009

Continuous Deployment – the Debate

Apparently, IMVU is rolling out fifty deployments a day. Continuous deployment at its finest, perhaps.

Michael Bolton at Developsense decided to look at what they are deploying. He found a couple dozen bugs in about as many minutes and concluded:
...there's such a strong fetish for the technology - the automated deployment - that what is being deployed is incidental to the conversation. Yes, folks, you can deploy 50 times a day. If you don't care about the quality of what you're deploying, you can meet any other requirement, to paraphrase Jerry Weinberg. If you're willing to settle for a system that looks like this and accept the risk of the Black Swan that manifests as a privacy or security or database-clearing problem, you really don't need testers.
On the surface, it seems that the easier it is to deploy, the less time you'll spend on the quality of what you deploy. If a deploy is cheap, there is no reason to test. Just deploy. If the customers don't like it, pull it back. If it's a Mars lander? Send another one. This line of thinking is captured in a comment by ‘Sho’ on a recent Rail Spikes post:
Sho - "The users would rather have an error every now and again on a site which innovates and progresses quickly rather than never any error on a site which sits still for months at a time because the developers can’t do a thing without rewriting hundreds upon hundreds of mundane low-level tests."
Send it out the door, fix it later? I know my users would disagree, but perhaps there are user communities out there who are different.

On the other hand - if you dig into what Timothy Fritz wrote about IMVU's deployment process, you get the idea that it's not your grandfather's process:
"We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day."
Hmm... this just got interesting.
"code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts."
Ahh... they’ve wrapped a very extensive automated process around a deployment. Really cool.
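
To make the push-sample-rollback loop in that quote concrete, here is a minimal Python sketch. It is not IMVU's code (they describe a handful of shell scripts). The names `Sample`, `is_significant_regression` and `push`, the 5% canary fraction, the choice of error rate as the only metric, and the one-sided two-proportion z-test are all my assumptions; the quote says only "a small subset" and "statistically significant regression".

```python
import math
import time
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading from the cluster (hypothetical shape: request and error counts)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def is_significant_regression(baseline: Sample, current: Sample,
                              z_threshold: float = 2.58) -> bool:
    """One-sided two-proportion z-test: is the current error rate
    significantly higher than the baseline? (Assumed test, not IMVU's.)"""
    if baseline.requests == 0 or current.requests == 0:
        return False
    total = baseline.requests + current.requests
    pooled = (baseline.errors + current.errors) / total
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / baseline.requests + 1 / current.requests)
    )
    if std_err == 0:
        return False
    z = (current.error_rate - baseline.error_rate) / std_err
    return z > z_threshold


def push(revision, previous, sample, switch, wait=time.sleep,
         stages=((0.05, 60), (1.0, 300))):
    """Roll `revision` out in stages, rolling back on a regression.

    sample()             -> Sample taken across the cluster
    switch(rev, frac)    -> make `rev` live on `frac` of the machines
    stages               -> (fraction, seconds to wait before re-sampling);
                            one minute and five minutes come from the quote,
                            the 5% canary size is an assumption.
    """
    baseline = sample()                      # sampled before anything goes live
    for fraction, seconds in stages:
        switch(revision, fraction)           # e.g. flip the symlink
        wait(seconds)
        if is_significant_regression(baseline, sample()):
            switch(previous, 1.0)            # automatic rollback everywhere
            return "rolled_back"
    return "live"
```

Passing `sample`, `switch` and `wait` in as callables keeps the decision logic separate from whatever actually does the rsync and symlink flip, so the rollback rule can be exercised without a cluster. Note what the sketch cannot see: anything that does not move the sampled metrics within the wait window passes straight through.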

What’s the catch? I've always wondered what hooks instant deployment has that prevent code rollouts from breaking the database. Turns out that the database is an exception:
Schema changes are done out of band. Just deploying them can be a huge pain. ......  It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas
What about performance problems?
... like ‘this query you just wrote is a table scan’ or ‘this algorithm won’t scale to 1 million users’. Those kinds of issues tend to be the ones that won’t set off an automatic rollback because you don’t feel the pain until hours, days or weeks after you push them out to your cluster. Right now we use monitoring to pick up the slack, but it costs our operations team a lot of time.
How much time? Sounds like they save development time, but do the DBAs and operations staff have to make up the difference?

Where does QA fit into this?
....we do have a Quality Assurance staff. There are numerous quality engineering tasks that only a human can do, exploratory testing and complaining when something is too complex come to mind. They're just not in the "take code and put it into production" process; it can't scale to require QA to look at every change so we don't bother. When working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks etc), we'll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual roll-out and A/B testing.
So the 50 rollouts per day are only for bug fixes, refactoring, performance enhancements and scalability fixes, not for new features or schema changes. QA exists, but is not in the direct loop between the developers and production.

No word on security or compliance.

My old-school conservatism tells me that the more often you let new code onto your servers, the more often you have an opportunity to shoot yourself in the foot. Continuous deployment looks to me like ready-fire-aim with the gun pointed at your privates. There may be a case for it though: a bug-tolerant customer base that has a short attention span (the IMVU customers are probably ideal for this), no regulatory or compliance considerations, no or minimal security considerations, no database schema changes, no major upgrades, no new features, a small, highly motivated development team, extremely good fully automated regression testing, and fully automated deployment with automated A/B testing, automated baselining, automated rollback, etc.

Unfortunately, all that detail is lost in the conversation.

Related: Why Instant Deployment Matters and Deployment that Just Works, James @ Heroku