Wednesday, March 25, 2009

Continuous Deployment – the Debate

Apparently, IMVU is rolling out fifty deployments a day. Continuous deployment at its finest, perhaps.

Michael Bolton at Developsense decided to look at what they are deploying. He found a couple dozen bugs in about as many minutes and concluded:
...there's such a strong fetish for the technology—the automated deployment—that what is being deployed is incidental to the conversation. Yes, folks, you can deploy 50 times a day. If you don't care about the quality of what you're deploying, you can meet any other requirement, to paraphrase Jerry Weinberg. If you're willing to settle for a system that looks like this and accept the risk of the Black Swan that manifests as a privacy or security or database-clearing problem, you really don't need testers.
On the surface, it seems that the easier it is to deploy, the less time you'll spend on the quality of what you deploy. If a deploy is cheap, there is no reason to test. Just deploy. If the customers don't like it, pull it back. If it's a Mars lander? Send another one. This line of thinking is captured in a comment by ‘Sho’ on a recent Rail Spikes post:
Sho - "The users would rather have an error every now and again on a site which innovates and progresses quickly rather than never any error on a site which sits still for months at a time because the developers can’t do a thing without rewriting hundreds upon hundreds of mundane low-level tests."
Send it out the door, fix it later? I know my users would disagree, but perhaps there are user communities out there who are different.

On the other hand - if you dig into what Timothy Fritz wrote about IMVU's deployment process, you get the idea that it's not your grandfather's process:
"We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day."
Hmm... this just got interesting.
"code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts."
Ahh... they've wrapped a very extensive automated process around the deployment. Really cool.
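To make the mechanics concrete, here is a minimal sketch (in Python, for readability) of the push logic described in that quote: rsync the build, sample the cluster as a baseline, flip a canary, re-sample, and roll back automatically if the numbers regress. The host names, sample_metrics, switch_symlink and the z-score threshold are all hypothetical stand-ins; per Fritz, the real version is a handful of shell scripts wrapped around rsync, symlinks and IMVU's own monitoring.

    import random
    import statistics
    import time

    CANARY_HOSTS = ["web01", "web02"]
    ALL_HOSTS = CANARY_HOSTS + [f"web{n:02d}" for n in range(3, 11)]


    def sample_metrics(hosts):
        """Placeholder: return one error-rate sample per host. A real push
        script would pull load average, cpu usage and php error/die counts
        from the cluster's monitoring endpoints."""
        return [random.gauss(0.01, 0.002) for _ in hosts]


    def switch_symlink(hosts, release):
        """Placeholder: atomically point the live symlink at the new release."""
        print(f"switching {len(hosts)} hosts to {release}")


    def rollback(hosts, release):
        """Placeholder: point the symlink back at the previous release."""
        print(f"rolling back {len(hosts)} hosts to {release}")


    def regressed(baseline, current, threshold=3.0):
        """Crude stand-in for the 'statistically significant regression' test:
        flag the push if the new mean is more than `threshold` standard
        deviations above the baseline mean."""
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9
        return (statistics.mean(current) - mean) / stdev > threshold


    def push(new_release, old_release):
        baseline = sample_metrics(ALL_HOSTS)       # the "basis line" before the flip
        switch_symlink(CANARY_HOSTS, new_release)  # a small subset goes live first
        time.sleep(60)                             # "a minute later..."
        if regressed(baseline, sample_metrics(CANARY_HOSTS)):
            rollback(CANARY_HOSTS, old_release)    # automatic, no human in the loop
            return False
        switch_symlink(ALL_HOSTS, new_release)     # push to 100% of the cluster
        time.sleep(300)                            # monitor for another five minutes
        if regressed(baseline, sample_metrics(ALL_HOSTS)):
            rollback(ALL_HOSTS, old_release)
            return False
        return True                                # live and fully pushed


    if __name__ == "__main__":
        push("rev-1234", "rev-1233")

The notable design choice is that the rollback decision belongs to the script, not a person: the baseline is captured before the symlink flips, and the canary only graduates to the full cluster if the numbers stay within tolerance.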

What’s the catch? I've always wondered what hooks instant deployment has that prevent code rollouts from breaking the database. Turns out that the database is an exception:
Schema changes are done out of band. Just deploying them can be a huge pain. ......  It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas
What about performance problems?
... like ‘this query you just wrote is a table scan’ or ‘this algorithm won’t scale to 1 million users’. Those kinds of issues tend to be the ones that won’t set off an automatic rollback because you don’t feel the pain until hours, days or weeks after you push them out to your cluster. Right now we use monitoring to pick up the slack, but it costs our operations team a lot of time.
How much time? It sounds like they save development time, but the DBAs and operations staff are left to make up the difference?
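For the slow-burn problems that the automatic rollback can't catch, monitoring has to do the work after the fact. Here is a rough sketch of what that might look like: a periodic job compares a slow-moving metric against its recent baseline and, when it drifts, lists the deploys that landed in the window so a human can trace the regression back to a specific change. The metric store, deploy log, thresholds and names below are assumptions for illustration, not anyone's actual tooling.

    import datetime
    import statistics

    # Hypothetical inputs: one p95 page-time sample (ms) per hour, and a
    # deploy log of (timestamp, revision, author) entries such as a push
    # script could append on every rollout.
    HOURLY_P95_MS = [420, 430, 425, 440, 435, 610, 620, 615, 630, 640]
    DEPLOY_LOG = [
        (datetime.datetime(2009, 3, 24, 14, 5), "rev-1231", "alice"),
        (datetime.datetime(2009, 3, 24, 16, 40), "rev-1232", "bob"),
    ]


    def drifted(samples, baseline_hours=5, tolerance=1.25):
        """Flag a slow regression: the recent mean exceeds the baseline
        mean by more than `tolerance` (25% here)."""
        baseline = statistics.mean(samples[:baseline_hours])
        recent = statistics.mean(samples[baseline_hours:])
        return recent > baseline * tolerance, baseline, recent


    def suspects(deploy_log, since):
        """Return the deploys that landed after the baseline window started,
        i.e. the candidates someone has to dig through."""
        return [entry for entry in deploy_log if entry[0] >= since]


    if __name__ == "__main__":
        regressed, baseline, recent = drifted(HOURLY_P95_MS)
        if regressed:
            window_start = datetime.datetime(2009, 3, 24, 12, 0)
            print(f"p95 drifted from {baseline:.0f}ms to {recent:.0f}ms")
            for when, rev, who in suspects(DEPLOY_LOG, window_start):
                print(f"  candidate: {rev} pushed by {who} at {when}")

Even with something like this in place, somebody still has to read the candidate list and dig through the diffs, which is presumably where the operations time goes.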

Where does QA fit into this?
....we do have a Quality Assurance staff. There are numerous quality engineering tasks that only a human can do, exploratory testing and complaining when something is too complex come to mind. They're just not in the "take code and put it into production" process; it can't scale to require QA to look at every change so we don't bother. When working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks etc), we'll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual roll-out and A/B testing.
So the 50 rollouts per day are only for bug fixes, performance enhancements and scalability work, not for new features or schema changes. QA exists, but is not in the direct loop between the developers and production.

No word on security or compliance.

My old school conservatism tells me that the more often you let new code onto your servers, the more often you have an opportunity to shoot yourself in the foot. Continuous deployment looks to me like ready-fire-aim with the gun pointed at your privates. There may be a case for it, though, if all of the following hold:

- a bug tolerant customer base that has a short attention span (the IMVU customers are probably ideal for this)
- no regulatory or compliance considerations
- no or minimal security considerations
- no database schema changes
- no major upgrades
- no new features
- a small, highly motivated development team
- extremely good fully automated regression testing
- fully automated deployment with automated A/B testing, automated baselining, automated rollback, etc.

Unfortunately, all that detail is lost in the conversation.



Related: Why Instant Deployment Matters and Deployment that Just Works, James @ Heroku

7 comments:

  1. you might as well just do development on the public server, or link the public tools to your compile directory.

  2. I like that idea - Just point the Apache or IIS config at your dev directory and call it a day. Every Ctrl-S is a deploy.

    :-)

  3. I would suggest that the details emerge in the conversation, and I appreciate your analysis. I'll continue to suggest that the number of deployments a day is orthogonal to customer value—and isn't that what we're supposed to be providing?

    Cheers,

    ---Michael B.

  4. We deploy many times a day at Flickr. We've been pretty public about how we do this:

    see:
    http://blip.tv/file/2284377/ as well as:
    http://code.flickr.com/blog/2009/12/02/flipping-out/

    as to some of your remarks:

    - "a bug tolerant customer base that has a short attention span (the IMVU customers are probably ideal for this)"

    I will go out on a limb and say that Flickr indeed has an extremely passionate and long-memory userbase with low tolerance for bugs.

    - "no regulatory or compliance considerations, no or minimal security considerations"

    Nope. Both taken very seriously. Flickr sells pro subscriptions via PayPal, and we have the Yahoo security group's eyes on us all the time.

    - "no database schema changes"

    Nope, done all the time. However, database schema changes are *not* done in the same way that php code is deployed. They're done with a much more stringent change management process.

    - "no major upgrades"

    Nope. MySQL, php, apache, squid...all have been upgraded. Applications are upgraded quite differently than new php code deploys. See above about schema changes.

    - "no new features"

    Nope. In the past few years we've launched many large features including: 'people tagging', complete search UI and backend overhaul, galleries, API app garden, video, as well as others.

    - "a small, highly motivated development team"

    Small? No. Not huge, though. Highly motivated and very smart? Absolutely.

    - "extremely good fully automated regression testing"

    Not really. We only recently (2009) started integrating Hudson CI, but there's nothing more automated than sharp people watching resource usage immediately after a deploy.

    - "fully automated deployment with automated A/B testing, automated baselining, automated rollback, etc."

    I can count on two hands how many times in the past 5 years we've technically rolled 'back', as opposed to roll forward to fix bugs found in a recent deploy.

    Small and frequent changes (~3300 deploys in 2009 from Jan to Nov) have been a boon to my job as Operations Engineering manager, and to my responsibility for performance and reliability SLAs.

    It works for us. Will it work for everyone? No. But at this point I can't ever imagine working any other way, and neither can any member of the Flickr Engineering team.

  5. John -

    Apologies for the delay in responding to your comment....

    As a person whose team has to work day and night recovering from bad code deploys, as we did last week, I'm still failing to understand how 3000 deploys per year makes operations easier, rather than harder.

    - Question #1 - What percentage of the 3000 deploys introduced bugs that needed to be fixed in a future deploy? In my world, those are product defects, or what manufacturing calls 're-work'. If that percentage were small, say on the order of 1 out of 1000 deploys, I'd be less reserved about continuous deployment. In my world, about 1/4 of deploys require follow-on bug fixes.

    - Question #2 - What percentage of deploys require schema changes? It seems like the smaller the percentage, the more value in continuous deployment. In my world, somewhere around 25% of deploys have associated schema changes, so we'd have to use conventional deployment many times anyway.

    - Question #3 - When you have a performance problem, how do you associate the problem with a specific change (deploy)? We are cyclical, with peak loads twice per year at up to 10x normal daily load. For us, it's essential that we be able to tie a new performance event back to a specific deploy. With weekly deploys, that's not hard. With hourly deploys, it sounds much harder. Watching closely after a deploy doesn't help, because the problem likely will not surface until the next 'Black Friday' (semester startup) 1-5 months later.

    - Question #4 - 'Regulatory and Compliance' to me means PCI, FISMA, or something roughly equivalent. I take it that Flickr doesn't directly handle payment cards - it appears as though you hand that off to Yahoo or Paypal. If Flickr directly handled cards and was stuck with PCI, would you still use some form of continuous deployment, or would the change management, vulnerability scanning, pen testing and code review that typically comes along with compliance frameworks be too restrictive?

    I suspect that, above all, smart programmers and close dev/ops integration are the critical factors. If - when things go bad - you have dev that blames ops or vice versa, this isn't going to work.

  6. Michael - all reasonable questions, let me take a stab.

    First, though: my point in commenting here is not to pontificate that continuous deployment should be the way for anyone, and I'm not out to convince you that you should work this way whatsoever. Just as some of your comments point out, there are many application and environmental influences on whether or not this process works. At Flickr, I was lucky to be part of a group where development and operations got along amazingly well, and communicated and collaborated in a respectful and fruitful way every day. We deployed change this way because it worked for us, we found great advantages in it, and can't imagine working any other way. But that fact doesn't in any way dictate that anyone else should follow what we did.

    My comments are only meant to clarify why and how it worked for Flickr, and to say that IMVU isn't the only place that has had success with it.

    "- Question #1 - What percentage of the 3000 deploys introduced bugs that needed to be fixed on a future deploy?"

    Not many at all, and the reason for that is that each deploy only has a small handful of changes, made by a small handful of people who just committed the code an hour or less before they were deployed. Bugs will be bugs, no matter how and when you deploy new code; continuous deployment changes the frequency and size of the change, which I will posit does affect quality for the better. This, IMHO, is simply because each of those 3000 changes was tiny, so the developer didn't have a huge changeset to test and/or think about.

    "- Question #2 - What percentage of deploys require schema changes? It seems like the smaller the percentage, the more value in continuous deployment."

    Very few. Database alters aren't taken lightly, since they are largely irreversible and need more scrutiny than code changes. If I had to pick which change required more traditional "Change Management" rigor, it would be schema changes. Every other week we would make database alters if they were needed, and they were done in a careful window of time. Done in a dev environment first, and sanity-checked by more than one engineer.

  7. (continued because my comment was too long) :)

    "- Question #3 - When you have a performance problem, how do you associate the problem with a specific change(deploy)?"

    This is actually where the process shines for Flickr. We log every time a deploy happens, who deployed it, and what the deploy contained, including who made what changes to what files, and what those exact diffs are. In addition, we included the timestamp of the deploy at the top of all our monitoring and metrics collection tools, so correlation is extremely easy. Here's where your concerns will be different: while Flickr does have relatively higher peaks in different parts of the year (holidays, end of summer, etc.) those peaks aren't significantly larger than the weekly peak at any given time of the year, and the weekly peak isn't drastically different from any other day's peak.
    So we don't have much time between deploy time and increasing load to watch for regressions. You obviously do, and would have to treat those changes with more care.

    "- Question #4 - 'Regulatory and Compliance' to me means PCI, FISMA, or something roughly equivalent. I take it that Flickr doesn't directly handle payment cards - it appears as though you hand that off to Yahoo or Paypal. If Flickr directly handled cards and was stuck with PCI, would you still use some form of continuous deployment, or would the change management, vulnerability scanning, pen testing and code review that typically comes along with compliance frameworks be too restrictive?"

    Just as with the schema changes above, some changes carry different amounts of risk and therefore might require different amounts of careful inspection and change management (rollback procedures, etc.). When I say "code deploys", as in many times per day, I mean front-end php code. I don't mean apache modules, Java queueing daemons, MySQL version upgrades, etc.

    No part of continuous deployment should be understood as an insistence on 'cowboy culture'; what regulates the speed of change is our history. If our MTTR and MTTD are climbing, we're moving too fast with whatever change introduced those incidents. If those metrics are low, then we're doing ok. In the end, the user experience guides our way.

    "I suspect that above all, smart programmers and close dev/ops integration are the critical factors. If - when things go bad - you have dev that blames ops or vis versa, this isn't going to work."

    Your suspicion is exactly right-on. There's no way that our process would work if we lived in a fingerpointy culture, or if there wasn't a tight understanding that *everyone* (dev and ops) owns availability and performance. Not one single team.

    ...and yes, building that type of culture is hard. But we managed to do it with our group.

    I just recently talked about this topic: http://www.redmonk.com/cote/2010/02/11/agileexec008/

    - john
