How We Got Here: Deployment Strategies

Arc: Deployment · Eras covered: 6 · Timeline: ~2005-2025 · Read time: ~12 min


The Original Problem

In 2005, deploying a web application meant: put up a maintenance page, SSH into the production server, stop the application, copy the new files over the old ones, run database migrations, start the application, test it manually, take down the maintenance page. If something went wrong, your rollback plan was "restore from last night's backup." Deployments happened at 2 AM on Saturdays because that's when traffic was lowest and the damage from an outage was minimized.

Every deployment was a high-stakes event that required a change advisory board, a rollback plan, and someone's weekend. The result: teams deployed infrequently, which meant each deployment was larger, which made it riskier, which made teams deploy even less frequently. A vicious cycle.


Era 1: Big Bang Deployments (~2005-2008)

The Solution

There was no strategy — there was just "the deployment." Stop the old version, start the new version. The entire fleet was updated at once. Downtime was scheduled and communicated to users in advance. FTP or SCP was the deployment mechanism. The bravest teams used Capistrano (2006) to script the SSH commands.

What It Looked Like

# Capistrano deploy.rb (~2007)
set :application, "myapp"
set :repository,  "svn://svn.example.com/myapp/trunk"
set :deploy_to, "/var/www/myapp"
set :user, "deploy"

role :web, "web1.example.com", "web2.example.com"
role :app, "app1.example.com"
role :db,  "db1.example.com", :primary => true

# Deploy: cap deploy
# Rollback: cap deploy:rollback (symlinks to previous release)
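
The releases-plus-symlink layout behind `cap deploy:rollback` can be sketched in a few lines of Python (paths and timestamps are hypothetical); rollback is nothing more than repointing one symlink:

```python
import os
import tempfile

# Hypothetical layout: releases/<timestamp> directories plus a 'current' symlink
root = tempfile.mkdtemp()
for rel in ("20070101120000", "20070201120000"):
    os.makedirs(os.path.join(root, "releases", rel))

def activate(release):
    # Repoint 'current' atomically, the way cap deploy and cap deploy:rollback do
    target = os.path.join(root, "releases", release)
    link = os.path.join(root, "current")
    tmp = link + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename is atomic on POSIX, so readers never see a gap
    return os.readlink(link)

activate("20070201120000")  # deploy: current -> newest release
activate("20070101120000")  # rollback: current -> previous release
```

Because the web server serves from `current`, the swap is near-instant; the slow part was always everything around it.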

Why It Was Better

  • Simple to understand — everyone knows "stop old, start new"
  • Complete consistency — every server runs exactly the same version
  • Capistrano added structure: releases directory, symlinks, rollback

Why It Wasn't Enough

  • Required downtime — users saw a maintenance page
  • All-or-nothing — one bad server meant a failed deployment for all
  • Rollback was slow and sometimes incomplete (database migrations)
  • Risk increased linearly with fleet size
  • Manual verification after deployment was error-prone

Legacy You'll Still See

Big bang deployments persist in on-prem enterprise software, embedded systems, and applications with complex database migrations that can't run alongside the old version. "Maintenance window" is still a term you'll hear at many companies.


Era 2: Blue-Green Deployments (~2008-2012)

The Solution

Martin Fowler and the ThoughtWorks team popularized blue-green deployments. You maintain two identical production environments — "blue" (current) and "green" (new). Deploy to green, test it, then switch the load balancer to point at green. If something goes wrong, switch back to blue. Zero downtime. Instant rollback.

What It Looked Like

# Blue-green with a load balancer
# Before deployment:
#   Load Balancer → Blue (v1.2, serving traffic)
#   Green (idle or running v1.1)

# Deployment:
# 1. Deploy v1.3 to Green
# 2. Run smoke tests against Green (direct access, not through LB)
# 3. Switch LB to Green
#    aws elb register-instances-with-load-balancer \
#      --load-balancer-name prod-lb \
#      --instances i-green-01 i-green-02
#    aws elb deregister-instances-from-load-balancer \
#      --load-balancer-name prod-lb \
#      --instances i-blue-01 i-blue-02
# 4. Monitor for 15 minutes
# 5. If problems: switch LB back to Blue (instant rollback)
# 6. If stable: Blue becomes the next deployment target

Why It Was Better

  • Zero-downtime deployment
  • Instant rollback — switch the load balancer back
  • Full environment testing before traffic switch
  • Clean separation between current and next version

Why It Wasn't Enough

  • Double the infrastructure cost (two full environments)
  • Database schema changes were still dangerous (both versions need to work with the schema)
  • "Instant rollback" only worked if you hadn't migrated the database
  • Switching all traffic at once still risked a 100% user impact for bugs that only appeared under real load
  • Stateful applications (sessions, caches) lost state on switch

Legacy You'll Still See

Blue-green is still widely used, especially in organizations with simple architectures and predictable traffic patterns. AWS Elastic Beanstalk's "swap environment URLs" is a built-in blue-green implementation. Many database migration strategies (expand-contract) were invented to make blue-green work with schema changes.
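
The expand-contract pattern mentioned above keeps one schema compatible with both app versions. A minimal sketch with sqlite3, using a hypothetical rename of `fullname` to `display_name`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada Lovelace')")

# Expand: add the new column; the old app version keeps working untouched
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Backfill, while the new app version dual-writes both columns
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Contract (a later deploy, once nothing reads the old column):
#   ALTER TABLE users DROP COLUMN fullname

row = conn.execute("SELECT display_name FROM users WHERE id = 1").fetchone()
```

The key property: at every step, both blue and green can run against the same live schema, so a traffic switch in either direction is safe.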


Era 3: Rolling Deployments (~2010-2015)

The Solution

Instead of switching all traffic at once, rolling deployments updated servers one at a time (or in small batches). The load balancer drained connections from a server, it was updated, health-checked, and returned to the pool. This was natural for auto-scaling groups and was built into every orchestration platform.

What It Looked Like

# Kubernetes rolling update (the default strategy)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # never take more than 1 pod out of service
      maxSurge: 1           # add at most 1 extra pod during update
  template:
    spec:
      containers:
        - name: web
          image: myapp:v1.3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

# AWS Auto Scaling Group rolling update: update the launch template,
# then trigger an instance refresh to replace instances in batches
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name prod-asg \
  --launch-template "LaunchTemplateName=web,Version=\$Latest"

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name prod-asg \
  --preferences '{"MinHealthyPercentage": 90}'
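
Whatever the platform, the core loop is the same: drain, update, health-check, return to the pool. A toy simulation of that batching logic (the fleet structure and health check are hypothetical):

```python
def rolling_update(fleet, new_version, batch_size=1, health_check=lambda s: True):
    """Update servers in batches; a failed health check halts the rollout."""
    updated = []
    for i in range(0, len(fleet), batch_size):
        for server in fleet[i:i + batch_size]:
            server["in_pool"] = False          # drain: pull from the load balancer
            server["version"] = new_version    # update the binary
            if not health_check(server):       # gate: a bad server stops the rollout
                return updated, server
            server["in_pool"] = True           # return to the pool
            updated.append(server)
    return updated, None

fleet = [{"name": f"web{i}", "version": "v1.2", "in_pool": True} for i in range(10)]
updated, failed = rolling_update(fleet, "v1.3")
```

With `batch_size=1` this mirrors `maxUnavailable: 1` above: at most one server is out of service at any moment, and a failing health check leaves the rest of the fleet on the old version.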

Why It Was Better

  • No double infrastructure cost
  • Gradual — problems affect a fraction of users before you notice
  • Health checks gate progression — bad pods don't get traffic
  • Built into Kubernetes, ASGs, ECS, and every modern platform
  • Natural fit for auto-scaling architectures

Why It Wasn't Enough

  • Slow for large fleets (updating 1000 servers one at a time)
  • Two versions run simultaneously — API and schema compatibility required
  • Rollback means rolling forward to the previous version (slow)
  • Health checks only catch crashes, not business logic bugs
  • No ability to target specific user segments for testing

Legacy You'll Still See

Rolling updates are the default deployment strategy in Kubernetes. If you don't specify a strategy, this is what you get. It's the right choice for most workloads, and the strategy most teams should start with.


Era 4: Canary Deployments (~2013-2018)

The Solution

Canary deployments (named after the mining practice of using canaries to detect gas) route a small percentage of traffic to the new version first. If metrics look good (error rate, latency, business KPIs), gradually increase the percentage. If anything goes wrong, route all traffic back to the stable version. Netflix and Google pioneered this at scale.

What It Looked Like

# Istio canary with traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.example.com
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: v1.2
    - name: canary
      labels:
        version: v1.3
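
Under the hood, a weighted split like the 95/5 configuration above is usually implemented as deterministic bucketing, so a given user consistently lands in one subset. A minimal sketch (not Istio's actual algorithm; names are hypothetical):

```python
import hashlib

def route(user_id, canary_weight=5):
    # Hash the user id into one of 100 buckets; buckets below the weight go canary
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_weight else "stable"

# The split is deterministic: the same user always lands in the same subset,
# so a canary user does not flip back and forth between versions mid-session
assignments = [route(f"user-{i}") for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)  # close to 0.05
```

Stateless per-request splits are even simpler (a weighted random choice), but consistent hashing is what makes canaries tolerable for anything with sessions.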

Why It Was Better

  • Blast radius is controlled: 5% of traffic means 5% of users affected
  • Real production traffic validates the new version (not synthetic tests)
  • Data-driven decisions: compare canary metrics against stable baseline
  • Rollback is instant — set canary weight to 0
  • Works for bugs that only appear under production traffic patterns

Why It Wasn't Enough

  • Required a service mesh or sophisticated load balancer
  • Metric collection and comparison needed tooling (automated analysis)
  • 5% of a high-traffic service is still thousands of affected users
  • Stateful services were complex (session affinity during canary)
  • Manual canary analysis was slow and error-prone

Legacy You'll Still See

Canary deployments are standard at companies running service meshes (Istio, Linkerd). The pattern is built into Argo Rollouts and Flagger. Most organizations with mature Kubernetes deployments use some form of canary analysis.


Era 5: Progressive Delivery (~2018-2023)

The Solution

Progressive delivery (coined by James Governor, RedMonk, 2018) automated the canary analysis loop. Tools like Argo Rollouts and Flagger defined the rollout steps, metrics to watch, and automatic promotion/rollback criteria. The human stepped back from the deployment and let the system decide whether to proceed or abort based on data.

What It Looked Like

# Argo Rollouts — automated canary with analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: myapp
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] > 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
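
The promote-or-abort loop that the Rollout and AnalysisTemplate above encode can be sketched in a few lines (the success-rate source here is a stand-in for a real Prometheus query):

```python
def analyze(success_rate, threshold=0.99):
    # Mirrors the successCondition above: the step passes only above the threshold
    return success_rate > threshold

def progressive_rollout(weights, success_rate_at):
    """Walk the canary weights; a failed analysis sets the weight back to 0."""
    for weight in weights:
        if not analyze(success_rate_at(weight)):
            return 0, "rolled back"        # automatic rollback, no human involved
    return weights[-1], "promoted"

# Healthy service: every analysis step clears the 99% bar
result_ok = progressive_rollout([5, 25, 50, 100], lambda w: 0.995)

# Regression that only appears at 25% of traffic: the rollout aborts itself
result_bad = progressive_rollout([5, 25, 50, 100],
                                 lambda w: 0.97 if w >= 25 else 0.995)
```

The second case is the whole point of the pattern: a bug that only surfaces under meaningful load is caught at 25% of traffic and reverted without paging anyone.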

Why It Was Better

  • Fully automated: no human in the loop for routine deployments
  • Metric-driven: promotion based on actual error rates and latency
  • Automatic rollback: if analysis fails, traffic routes back immediately
  • Configurable: teams define their own success criteria
  • Integrates with Prometheus, Datadog, New Relic for analysis

Why It Wasn't Enough

  • Requires mature observability (you need good metrics to analyze)
  • Analysis templates need careful tuning (false positives/negatives)
  • Complex failure modes (what if the analysis itself is wrong?)
  • Only works for Kubernetes workloads (Argo Rollouts, Flagger)
  • The tooling has a learning curve on top of Kubernetes

Legacy You'll Still See

Progressive delivery is the current best practice for mature Kubernetes deployments. Argo Rollouts is widely adopted. The pattern of "automated canary analysis" is becoming the expected standard for production-grade services.


Era 6: Feature Flags and Runtime Control (~2020-2025)

The Solution

Feature flags decoupled deployment from release. You deploy code to production with new features hidden behind flags. Enabling a feature is a configuration change, not a deployment. LaunchDarkly (founded 2014, with mainstream adoption by ~2020), Split.io, Unleash, and Flipper provide flag management platforms with user targeting, gradual rollouts, and instant kill switches.

What It Looked Like

# Feature flag in application code (LaunchDarkly Python SDK)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-production"))
ld_client = ldclient.get()

def get_recommendations(user):
    # Check if this user should see the new recommendation engine
    context = Context.builder(user.key).name(user.name).build()
    if ld_client.variation("new-reco-engine", context, False):
        return new_recommendation_engine(user)
    else:
        return legacy_recommendation_engine(user)

# LaunchDarkly dashboard:
# - new-reco-engine: ON
#   - Target: 10% of users, all internal employees
#   - Ramp: increase by 10% every 2 hours if error rate < 0.1%
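
Flag platforms implement that targeting with an allowlist check plus stable per-flag bucketing, so each flag splits the user base independently. A minimal in-house sketch, not LaunchDarkly's actual algorithm:

```python
import hashlib

def flag_enabled(flag_key, user_key, rollout_percent, force_on=frozenset()):
    # Hypothetical minimal flag check: allowlist first, then percentage rollout
    if user_key in force_on:               # e.g. all internal employees
        return True
    # Hash flag and user together, so each flag buckets users independently
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

internal = frozenset({"alice@example.com"})
on_for_alice = flag_enabled("new-reco-engine", "alice@example.com", 0,
                            force_on=internal)
share = sum(flag_enabled("new-reco-engine", f"user-{i}", 10)
            for i in range(10_000)) / 10_000   # close to 0.10
```

Ramping from 10% to 20% with this scheme only ever adds users to the enabled group; nobody who already saw the feature gets it yanked away mid-ramp.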

Why It Was Better

  • Deployment risk is near zero — you're deploying inert code
  • Instant rollback: flip the flag off, no redeployment needed
  • Targeted rollout: enable for internal users, beta users, 1% of traffic
  • Business-driven releases: product managers control when features go live
  • A/B testing is built in — compare metrics between flag states

Why It Wasn't Enough

  • Flag debt: old flags accumulate and create code complexity
  • Testing combinatorial explosion: N flags = 2^N possible states
  • Flag management platforms are another dependency and cost
  • "Flag-driven development" can mask poor architecture
  • Performance overhead from flag evaluation at runtime (usually negligible, but not always)

Legacy You'll Still See

Feature flags are mainstream and growing. Most large organizations use some form of feature flag system. The debate has shifted from "should we use feature flags?" to "how do we manage flag lifecycle and avoid flag debt?"


Where We Are Now

Most organizations use a combination: rolling updates as the default, canary for high-risk changes, feature flags for business-sensitive features. Progressive delivery tools automate the canary analysis. The "deploy on Friday" fear has been replaced by "deploy anytime" confidence at mature organizations — but many teams are still at the rolling-update-only stage.

Where It's Going

The convergence of feature flags, progressive delivery, and AI-powered analysis is the likely next step. Systems that automatically choose the right deployment strategy based on the change's risk profile — "this is a CSS change, just roll it; this touches the payment path, full canary with extended analysis." The goal is deployments that require zero human attention for routine changes.

The Pattern

Every generation reduces the blast radius of a bad deployment. From "all users at once" to "one server at a time" to "5% of traffic" to "users with a flag." The winning strategy is always the one that catches problems before they affect most users while adding the least friction to the deployment process.

Key Takeaway for Practitioners

Start with rolling updates and good health checks. That alone eliminates most deployment risk. Add canary deployments when you have the observability to support them. Add feature flags when the business needs to control release timing. Don't adopt complexity you can't operate.

Cross-References