How We Got Here: Deployment Strategies¶
Arc: Deployment · Eras covered: 6 · Timeline: ~2005–2025 · Read time: ~12 min
The Original Problem¶
In 2005, deploying a web application meant: put up a maintenance page, SSH into the production server, stop the application, copy the new files over the old ones, run database migrations, start the application, test it manually, take down the maintenance page. If something went wrong, your rollback plan was "restore from last night's backup." Deployments happened at 2 AM on Saturdays because that's when traffic was lowest and the damage from an outage was minimized.
Every deployment was a high-stakes event that required a change advisory board, a rollback plan, and someone's weekend. The result: teams deployed infrequently, which meant each deployment was larger, which made it riskier, which made teams deploy even less frequently. A vicious cycle.
Era 1: Big Bang Deployments (~2005-2008)¶
The Solution¶
There was no strategy — there was just "the deployment." Stop the old version, start the new version. The entire fleet was updated at once. Downtime was scheduled and communicated to users in advance. FTP or SCP was the deployment mechanism. The bravest teams used Capistrano (2006) to script the SSH commands.
What It Looked Like¶
# Capistrano deploy.rb (~2007)
set :application, "myapp"
set :repository, "svn://svn.example.com/myapp/trunk"
set :deploy_to, "/var/www/myapp"
set :user, "deploy"
role :web, "web1.example.com", "web2.example.com"
role :app, "app1.example.com"
role :db, "db1.example.com", :primary => true
# Deploy: cap deploy
# Rollback: cap deploy:rollback (symlinks to previous release)
Why It Was Better¶
- Simple to understand — everyone knows "stop old, start new"
- Complete consistency — every server runs exactly the same version
- Capistrano added structure: releases directory, symlinks, rollback
Why It Wasn't Enough¶
- Required downtime — users saw a maintenance page
- All-or-nothing — one bad server meant a failed deployment for all
- Rollback was slow and sometimes incomplete (database migrations)
- Risk increased linearly with fleet size
- Manual verification after deployment was error-prone
Legacy You'll Still See¶
Big bang deployments persist in on-prem enterprise software, embedded systems, and applications with complex database migrations that can't run alongside the old version. "Maintenance window" is still a term you'll hear at many companies.
Era 2: Blue-Green Deployments (~2008-2012)¶
The Solution¶
Martin Fowler and the ThoughtWorks team popularized blue-green deployments. You maintain two identical production environments — "blue" (current) and "green" (new). Deploy to green, test it, then switch the load balancer to point at green. If something goes wrong, switch back to blue. Zero downtime. Instant rollback.
What It Looked Like¶
# Blue-green with a load balancer
# Before deployment:
# Load Balancer → Blue (v1.2, serving traffic)
# Green (idle or running v1.1)
# Deployment:
# 1. Deploy v1.3 to Green
# 2. Run smoke tests against Green (direct access, not through LB)
# 3. Switch LB to Green
# aws elb register-instances-with-load-balancer \
# --load-balancer-name prod-lb \
# --instances i-green-01 i-green-02
# aws elb deregister-instances-from-load-balancer \
# --load-balancer-name prod-lb \
# --instances i-blue-01 i-blue-02
# 4. Monitor for 15 minutes
# 5. If problems: switch LB back to Blue (instant rollback)
# 6. If stable: Blue becomes the next deployment target
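The cutover-and-rollback logic above is simple enough to model directly. A toy sketch (the `lb` dict and `smoke_test` hook are illustrative assumptions, not a real load-balancer API):

```python
def cut_over(lb: dict, smoke_test) -> str:
    """Switch traffic to the idle environment if smoke tests pass."""
    idle = "green" if lb["active"] == "blue" else "blue"
    if not smoke_test(idle):          # hit the idle env directly, not via the LB
        return lb["active"]           # refuse to switch; users never see it
    lb["previous"], lb["active"] = lb["active"], idle
    return lb["active"]

def roll_back(lb: dict) -> str:
    """Instant rollback: swap the pointer back to the previous environment."""
    lb["active"], lb["previous"] = lb["previous"], lb["active"]
    return lb["active"]
```

The key property is that rollback touches only the pointer, never the application, which is why it is "instant" as long as no irreversible change (like a schema migration) happened in between.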
Why It Was Better¶
- Zero-downtime deployment
- Instant rollback — switch the load balancer back
- Full environment testing before traffic switch
- Clean separation between current and next version
Why It Wasn't Enough¶
- Double the infrastructure cost (two full environments)
- Database schema changes were still dangerous (both versions need to work with the schema)
- "Instant rollback" only worked if you hadn't migrated the database
- Switching all traffic at once still risked a 100% user impact for bugs that only appeared under real load
- Stateful applications (sessions, caches) lost state on switch
Legacy You'll Still See¶
Blue-green is still widely used, especially in organizations with simple architectures and predictable traffic patterns. AWS Elastic Beanstalk's "swap environment URLs" is a built-in blue-green implementation. Many database migration strategies (expand-contract) were invented to make blue-green work with schema changes.
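The expand-contract pattern mentioned above is concrete enough to sketch. A runnable demo using SQLite and a hypothetical `users` table that splits `full_name` into two columns (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Expand: add the new columns. Old code ignores them; new code can use them.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Backfill while both versions run; full_name stays valid for the old version.
conn.execute("""
    UPDATE users SET
        first_name = substr(full_name, 1, instr(full_name, ' ') - 1),
        last_name  = substr(full_name, instr(full_name, ' ') + 1)
""")

# Contract happens in a LATER deployment, once nothing reads full_name:
#   ALTER TABLE users DROP COLUMN full_name
row = conn.execute("SELECT first_name, last_name FROM users").fetchone()
print(row)  # ('Ada', 'Lovelace')
```

Because every intermediate schema works for both blue and green, the load balancer can be flipped in either direction at any point during the migration.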
Era 3: Rolling Deployments (~2010-2015)¶
The Solution¶
Instead of switching all traffic at once, rolling deployments updated servers one at a time (or in small batches). The load balancer drained connections from a server, it was updated, health-checked, and returned to the pool. This was natural for auto-scaling groups and was built into every orchestration platform.
What It Looked Like¶
# Kubernetes rolling update (the default strategy)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # never take more than 1 pod out of service
      maxSurge: 1         # add at most 1 extra pod during update
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:v1.3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
# AWS Auto Scaling Group rolling update
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name prod-asg \
--launch-template "LaunchTemplateName=web,Version=\$Latest"
# Instance refresh
aws autoscaling start-instance-refresh \
--auto-scaling-group-name prod-asg \
--preferences '{"MinHealthyPercentage": 90}'
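The batch-drain-update-verify loop is the same regardless of platform. A sketch (the server dicts and health-check hook are illustrative stand-ins for real instances and probes):

```python
def rolling_update(servers, new_version, batch_size=1, health_check=lambda s: True):
    """Update servers in small batches; abort immediately if a health check fails."""
    for i in range(0, len(servers), batch_size):
        for server in servers[i:i + batch_size]:
            server["version"] = new_version       # drain, update, restart
            if not health_check(server):          # gate progression on health
                raise RuntimeError(f"aborting: {server['name']} failed health check")
    return [s["name"] for s in servers if s["version"] == new_version]
```

An aborted run leaves the fleet in a mixed-version state, which is exactly why the compatibility requirement in the next section exists: the old and new versions must coexist safely.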
Why It Was Better¶
- No double infrastructure cost
- Gradual — problems affect a fraction of users before you notice
- Health checks gate progression — bad pods don't get traffic
- Built into Kubernetes, ASGs, ECS, and every modern platform
- Natural fit for auto-scaling architectures
Why It Wasn't Enough¶
- Slow for large fleets (updating 1000 servers one at a time)
- Two versions run simultaneously — API and schema compatibility required
- Rollback means rolling forward to the previous version (slow)
- Health checks only catch crashes, not business logic bugs
- No ability to target specific user segments for testing
Legacy You'll Still See¶
Rolling updates are the default deployment strategy in Kubernetes. If you don't specify a strategy, this is what you get. It's the right choice for most workloads, and the strategy most teams should start with.
Era 4: Canary Deployments (~2013-2018)¶
The Solution¶
Canary deployments (named after the mining practice of using canaries to detect gas) route a small percentage of traffic to the new version first. If metrics look good (error rate, latency, business KPIs), gradually increase the percentage. If anything goes wrong, route all traffic back to the stable version. Netflix and Google pioneered this at scale.
What It Looked Like¶
# Istio canary with traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.example.com
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: v1.2
    - name: canary
      labels:
        version: v1.3
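Under the hood, weighted routing is a bucketing decision per request or per user. A sketch of deterministic (sticky) assignment by user ID, which also sidesteps the session-affinity problem mentioned below; the hashing scheme is an illustrative assumption, not how any particular mesh implements it:

```python
from zlib import crc32

def canary_bucket(user_id: str, canary_weight: int = 5) -> str:
    """Assign a user to 'canary' or 'stable' deterministically.

    Hashing the user ID (rather than rolling a die per request) keeps each
    user pinned to one version for the duration of the rollout.
    """
    return "canary" if crc32(user_id.encode()) % 100 < canary_weight else "stable"
```

Raising the canary percentage only moves users from stable to canary, never back and forth, so sessions and caches see a consistent version.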
Why It Was Better¶
- Blast radius is controlled: 5% of traffic means 5% of users affected
- Real production traffic validates the new version (not synthetic tests)
- Data-driven decisions: compare canary metrics against stable baseline
- Rollback is instant — set canary weight to 0
- Works for bugs that only appear under production traffic patterns
Why It Wasn't Enough¶
- Required a service mesh or sophisticated load balancer
- Metric collection and comparison needed tooling (automated analysis)
- 5% of a high-traffic service is still thousands of affected users
- Stateful services were complex (session affinity during canary)
- Manual canary analysis was slow and error-prone
Legacy You'll Still See¶
Canary deployments are standard at companies running service meshes (Istio, Linkerd). The pattern is built into Argo Rollouts and Flagger. Most organizations with mature Kubernetes deployments use some form of canary analysis.
Era 5: Progressive Delivery (~2018-2023)¶
The Solution¶
Progressive delivery (coined by James Governor, RedMonk, 2018) automated the canary analysis loop. Tools like Argo Rollouts and Flagger defined the rollout steps, metrics to watch, and automatic promotion/rollback criteria. The human stepped back from the deployment and let the system decide whether to proceed or abort based on data.
What It Looked Like¶
# Argo Rollouts — automated canary with analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: myapp
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: myapp
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] > 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
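Stripped of the CRDs, the promotion loop is: raise the weight, sample the metric, and abort on any failing sample. A sketch (the metric fetcher is a stand-in for the Prometheus query above):

```python
def progressive_rollout(steps, fetch_success_rate, threshold=0.99, samples=3):
    """Walk through canary weights, rolling back if any sample misses the bar."""
    for weight in steps:                                  # e.g. [5, 25, 50, 100]
        readings = [fetch_success_rate() for _ in range(samples)]
        if not all(r > threshold for r in readings):      # the successCondition
            return ("rolled-back", 0)                     # traffic back to stable
    return ("promoted", steps[-1])
```

The human's job moves from watching dashboards during the deployment to choosing the thresholds and step durations beforehand.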
Why It Was Better¶
- Fully automated: no human in the loop for routine deployments
- Metric-driven: promotion based on actual error rates and latency
- Automatic rollback: if analysis fails, traffic routes back immediately
- Configurable: teams define their own success criteria
- Integrates with Prometheus, Datadog, New Relic for analysis
Why It Wasn't Enough¶
- Requires mature observability (you need good metrics to analyze)
- Analysis templates need careful tuning (false positives/negatives)
- Complex failure modes (what if the analysis itself is wrong?)
- Only works for Kubernetes workloads (Argo Rollouts, Flagger)
- The tooling has a learning curve on top of Kubernetes
Legacy You'll Still See¶
Progressive delivery is the current best practice for mature Kubernetes deployments. Argo Rollouts is widely adopted. The pattern of "automated canary analysis" is becoming the expected standard for production-grade services.
Era 6: Feature Flags and Runtime Control (~2020-2025)¶
The Solution¶
Feature flags decoupled deployment from release. You deploy code to production with new features hidden behind flags. Enabling a feature is a configuration change, not a deployment. LaunchDarkly (founded 2014, with mainstream adoption by ~2020), Split.io, Unleash, and Flipper provide flag management platforms with user targeting, gradual rollouts, and instant kill switches.
What It Looked Like¶
# Feature flag in application code (LaunchDarkly Python SDK)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-production"))
client = ldclient.get()

def get_recommendations(user):
    # Check if this user should see the new recommendation engine
    context = Context.builder(user.key).kind("user").build()
    if client.variation("new-reco-engine", context, False):
        return new_recommendation_engine(user)
    else:
        return legacy_recommendation_engine(user)

# LaunchDarkly dashboard:
# - new-reco-engine: ON
# - Target: 10% of users, plus all internal employees
# - Ramp: increase by 10% every 2 hours if error rate < 0.1%
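The dashboard rules above boil down to an evaluation function run on every flag check. A hedged sketch (the domain list, field names, and hashing scheme are invented for illustration; real platforms evaluate far richer targeting rules server-side):

```python
from zlib import crc32

def flag_enabled(flag_key: str, user: dict, rollout_pct: int,
                 internal_domains=("example.com",)) -> bool:
    """Internal employees always get the flag; everyone else is bucketed by percentage."""
    if user["email"].rsplit("@", 1)[-1] in internal_domains:
        return True
    # Hash flag + user together so each flag buckets users independently:
    # being in the 10% for one flag says nothing about any other flag.
    return crc32(f"{flag_key}:{user['id']}".encode()) % 100 < rollout_pct
```

Ramping from 10% to 20% simply widens the bucket, so users already on the feature stay on it.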
Why It Was Better¶
- Deployment risk is near zero — you're deploying inert code
- Instant rollback: flip the flag off, no redeployment needed
- Targeted rollout: enable for internal users, beta users, 1% of traffic
- Business-driven releases: product managers control when features go live
- A/B testing is built in — compare metrics between flag states
Why It Wasn't Enough¶
- Flag debt: old flags accumulate and create code complexity
- Testing combinatorial explosion: N flags = 2^N possible states
- Flag management platforms are another dependency and cost
- "Flag-driven development" can mask poor architecture
- Performance overhead from flag evaluation at runtime (usually negligible, but not always)
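The combinatorial-explosion point is easy to make concrete: with N independent flags there are 2**N reachable configurations, so exhaustive testing stops scaling almost immediately (the flag names below are hypothetical):

```python
from itertools import product

flags = ["new-reco-engine", "dark-mode", "fast-checkout"]

# Every on/off combination a user could be in: 2**len(flags) states.
states = [dict(zip(flags, combo)) for combo in product([False, True], repeat=len(flags))]
print(len(states))  # 8 states for 3 flags; 1024 for 10; over a million for 20
```

This is why flag-lifecycle hygiene matters: every retired flag halves the state space you have to reason about.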
Legacy You'll Still See¶
Feature flags are mainstream and growing. Most large organizations use some form of feature flag system. The debate has shifted from "should we use feature flags?" to "how do we manage flag lifecycle and avoid flag debt?"
Where We Are Now¶
Most organizations use a combination: rolling updates as the default, canary for high-risk changes, feature flags for business-sensitive features. Progressive delivery tools automate the canary analysis. The "deploy on Friday" fear has been replaced by "deploy anytime" confidence at mature organizations — but many teams are still at the rolling-update-only stage.
Where It's Going¶
The convergence of feature flags, progressive delivery, and AI-powered analysis is the likely next step. Systems that automatically choose the right deployment strategy based on the change's risk profile — "this is a CSS change, just roll it; this touches the payment path, full canary with extended analysis." The goal is deployments that require zero human attention for routine changes.
The Pattern¶
Every generation reduces the blast radius of a bad deployment. From "all users at once" to "one server at a time" to "5% of traffic" to "users with a flag." The winning strategy is always the one that catches problems before they affect most users while adding the least friction to the deployment process.
Key Takeaway for Practitioners¶
Start with rolling updates and good health checks. That alone eliminates most deployment risk. Add canary deployments when you have the observability to support them. Add feature flags when the business needs to control release timing. Don't adopt complexity you can't operate.
Cross-References¶
- Topic Packs: Argo Rollouts, Istio, LaunchDarkly
- Tool Comparisons: Deployment Strategies Compared
- Evolution Guides: CI/CD Evolution, Monitoring Evolution