How We Got Here: Deployment Strategies¶
Arc: Deployment · Eras covered: 6 · Timeline: ~2005–2025 · Read time: ~12 min
The Original Problem¶
In 2005, deploying a web application meant: put up a maintenance page, SSH into the production server, stop the application, copy the new files over the old ones, run database migrations, start the application, test it manually, take down the maintenance page. If something went wrong, your rollback plan was "restore from last night's backup." Deployments happened at 2 AM on Saturdays because that's when traffic was lowest and the damage from an outage was minimized.
Every deployment was a high-stakes event that required a change advisory board, a rollback plan, and someone's weekend. The result: teams deployed infrequently, which meant each deployment was larger, which made it riskier, which made teams deploy even less frequently. A vicious cycle.
Era 1: Big Bang Deployments (~2005-2008)¶
The Solution¶
There was no strategy — there was just "the deployment." Stop the old version, start the new version. The entire fleet was updated at once. Downtime was scheduled and communicated to users in advance. FTP or SCP was the deployment mechanism. The bravest teams used Capistrano (2006) to script the SSH commands.
What It Looked Like¶
# Capistrano deploy.rb (~2007)
set :application, "myapp"
set :repository, "svn://svn.example.com/myapp/trunk"
set :deploy_to, "/var/www/myapp"
set :user, "deploy"
role :web, "web1.example.com", "web2.example.com"
role :app, "app1.example.com"
role :db, "db1.example.com", :primary => true
# Deploy: cap deploy
# Rollback: cap deploy:rollback (symlinks to previous release)
Why It Was Better¶
- Simple to understand — everyone knows "stop old, start new"
- Complete consistency — every server runs exactly the same version
- Capistrano added structure: releases directory, symlinks, rollback
Why It Wasn't Enough¶
- Required downtime — users saw a maintenance page
- All-or-nothing — one bad server meant a failed deployment for all
- Rollback was slow and sometimes incomplete (database migrations)
- Risk increased linearly with fleet size
- Manual verification after deployment was error-prone
Legacy You'll Still See¶
Big bang deployments persist in on-prem enterprise software, embedded systems, and applications with complex database migrations that can't run alongside the old version. "Maintenance window" is still a term you'll hear at many companies.
Era 2: Blue-Green Deployments (~2008-2012)¶
The Solution¶
Martin Fowler and the ThoughtWorks team popularized blue-green deployments. You maintain two identical production environments — "blue" (current) and "green" (new). Deploy to green, test it, then switch the load balancer to point at green. If something goes wrong, switch back to blue. Zero downtime. Instant rollback.
What It Looked Like¶
# Blue-green with a load balancer
# Before deployment:
# Load Balancer → Blue (v1.2, serving traffic)
# Green (idle or running v1.1)
# Deployment:
# 1. Deploy v1.3 to Green
# 2. Run smoke tests against Green (direct access, not through LB)
# 3. Switch LB to Green
# aws elb register-instances-with-load-balancer \
# --load-balancer-name prod-lb \
# --instances i-green-01 i-green-02
# aws elb deregister-instances-from-load-balancer \
# --load-balancer-name prod-lb \
# --instances i-blue-01 i-blue-02
# 4. Monitor for 15 minutes
# 5. If problems: switch LB back to Blue (instant rollback)
# 6. If stable: Blue becomes the next deployment target
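The cutover-and-rollback logic above is simple enough to model directly. A toy sketch (the `lb` dict and `smoke_test` hook are illustrative assumptions, not a real load-balancer API):

```python
def cut_over(lb: dict, smoke_test) -> str:
    """Switch traffic to the idle environment if smoke tests pass."""
    idle = "green" if lb["active"] == "blue" else "blue"
    if not smoke_test(idle):          # hit the idle env directly, not via the LB
        return lb["active"]           # refuse to switch; users never see it
    lb["previous"], lb["active"] = lb["active"], idle
    return lb["active"]

def roll_back(lb: dict) -> str:
    """Instant rollback: swap the pointer back to the previous environment."""
    lb["active"], lb["previous"] = lb["previous"], lb["active"]
    return lb["active"]
```

The key property is that rollback touches only the pointer, never the application, which is why it is "instant" as long as no irreversible change (like a schema migration) happened in between.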
Why It Was Better¶
- Zero-downtime deployment
- Instant rollback — switch the load balancer back
- Full environment testing before traffic switch
- Clean separation between current and next version
Why It Wasn't Enough¶
- Double the infrastructure cost (two full environments)
- Database schema changes were still dangerous (both versions need to work with the schema)
- "Instant rollback" only worked if you hadn't migrated the database
- Switching all traffic at once still risked a 100% user impact for bugs that only appeared under real load
- Stateful applications (sessions, caches) lost state on switch
Legacy You'll Still See¶
Blue-green is still widely used, especially in organizations with simple architectures and predictable traffic patterns. AWS Elastic Beanstalk's "swap environment URLs" is a built-in blue-green implementation. Many database migration strategies (expand-contract) were invented to make blue-green work with schema changes.
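The expand-contract pattern mentioned above is concrete enough to sketch. A runnable demo using SQLite and a hypothetical `users` table that splits `full_name` into two columns (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Expand: add the new columns. Old code ignores them; new code can use them.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Backfill while both versions run; full_name stays valid for the old version.
conn.execute("""
    UPDATE users SET
        first_name = substr(full_name, 1, instr(full_name, ' ') - 1),
        last_name  = substr(full_name, instr(full_name, ' ') + 1)
""")

# Contract happens in a LATER deployment, once nothing reads full_name:
#   ALTER TABLE users DROP COLUMN full_name
row = conn.execute("SELECT first_name, last_name FROM users").fetchone()
print(row)  # ('Ada', 'Lovelace')
```

Because every intermediate schema works for both blue and green, the load balancer can be flipped in either direction at any point during the migration.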
Era 3: Rolling Deployments (~2010-2015)¶
The Solution¶
Instead of switching all traffic at once, rolling deployments updated servers one at a time (or in small batches). The load balancer drained connections from a server, it was updated, health-checked, and returned to the pool. This was natural for auto-scaling groups and was built into every orchestration platform.
What It Looked Like¶
# Kubernetes rolling update (the default strategy)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # never take more than 1 pod out of service
      maxSurge: 1         # add at most 1 extra pod during update
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:v1.3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
# AWS Auto Scaling Group rolling update
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name prod-asg \
--launch-template "LaunchTemplateName=web,Version=\$Latest"
# Instance refresh
aws autoscaling start-instance-refresh \
--auto-scaling-group-name prod-asg \
--preferences '{"MinHealthyPercentage": 90}'
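The batch-drain-update-verify loop is the same regardless of platform. A sketch (the server dicts and health-check hook are illustrative stand-ins for real instances and probes):

```python
def rolling_update(servers, new_version, batch_size=1, health_check=lambda s: True):
    """Update servers in small batches; abort immediately if a health check fails."""
    for i in range(0, len(servers), batch_size):
        for server in servers[i:i + batch_size]:
            server["version"] = new_version       # drain, update, restart
            if not health_check(server):          # gate progression on health
                raise RuntimeError(f"aborting: {server['name']} failed health check")
    return [s["name"] for s in servers if s["version"] == new_version]
```

An aborted run leaves the fleet in a mixed-version state, which is exactly why the compatibility requirement in the next section exists: the old and new versions must coexist safely.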
Why It Was Better¶
- No double infrastructure cost
- Gradual — problems affect a fraction of users before you notice
- Health checks gate progression — bad pods don't get traffic
- Built into Kubernetes, ASGs, ECS, and every modern platform
- Natural fit for auto-scaling architectures
Why It Wasn't Enough¶
- Slow for large fleets (updating 1000 servers one at a time)
- Two versions run simultaneously — API and schema compatibility required
- Rollback means rolling forward to the previous version (slow)
- Health checks only catch crashes, not business logic bugs
- No ability to target specific user segments for testing
Legacy You'll Still See¶
Rolling updates are the default deployment strategy in Kubernetes. If you don't specify a strategy, this is what you get. It's the right choice for most workloads, and the strategy most teams should start with.
Era 4: Canary Deployments (~2013-2018)¶
The Solution¶
Canary deployments (named after the mining practice of using canaries to detect gas) route a small percentage of traffic to the new version first. If metrics look good (error rate, latency, business KPIs), gradually increase the percentage. If anything goes wrong, route all traffic back to the stable version. Netflix and Google pioneered this at scale.
What It Looked Like¶
# Istio canary with traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.example.com
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: v1.2
    - name: canary
      labels:
        version: v1.3
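Under the hood, weighted routing is a bucketing decision per request or per user. A sketch of deterministic (sticky) assignment by user ID, which also sidesteps the session-affinity problem mentioned below; the hashing scheme is an illustrative assumption, not how any particular mesh implements it:

```python
from zlib import crc32

def canary_bucket(user_id: str, canary_weight: int = 5) -> str:
    """Assign a user to 'canary' or 'stable' deterministically.

    Hashing the user ID (rather than rolling a die per request) keeps each
    user pinned to one version for the duration of the rollout.
    """
    return "canary" if crc32(user_id.encode()) % 100 < canary_weight else "stable"
```

Raising the canary percentage only moves users from stable to canary, never back and forth, so sessions and caches see a consistent version.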
Why It Was Better¶
- Blast radius is controlled: 5% of traffic means 5% of users affected
- Real production traffic validates the new version (not synthetic tests)
- Data-driven decisions: compare canary metrics against stable baseline
- Rollback is instant — set canary weight to 0
- Works for bugs that only appear under production traffic patterns
Why It Wasn't Enough¶
- Required a service mesh or sophisticated load balancer
- Metric collection and comparison needed tooling (automated analysis)
- 5% of a high-traffic service is still thousands of affected users
- Stateful services were complex (session affinity during canary)
- Manual canary analysis was slow and error-prone
Legacy You'll Still See¶
Canary deployments are standard at companies running service meshes (Istio, Linkerd). The pattern is built into Argo Rollouts and Flagger. Most organizations with mature Kubernetes deployments use some form of canary analysis.
Era 5: Progressive Delivery (~2018-2023)¶
The Solution¶
Progressive delivery (coined by James Governor, RedMonk, 2018) automated the canary analysis loop. Tools like Argo Rollouts and Flagger defined the rollout steps, metrics to watch, and automatic promotion/rollback criteria. The human stepped back from the deployment and let the system decide whether to proceed or abort based on data.
What It Looked Like¶
# Argo Rollouts — automated canary with analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: myapp
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: myapp
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] > 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
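Stripped of the CRDs, the promotion loop is: raise the weight, sample the metric, and abort on any failing sample. A sketch (the metric fetcher is a stand-in for the Prometheus query above):

```python
def progressive_rollout(steps, fetch_success_rate, threshold=0.99, samples=3):
    """Walk through canary weights, rolling back if any sample misses the bar."""
    for weight in steps:                                  # e.g. [5, 25, 50, 100]
        readings = [fetch_success_rate() for _ in range(samples)]
        if not all(r > threshold for r in readings):      # the successCondition
            return ("rolled-back", 0)                     # traffic back to stable
    return ("promoted", steps[-1])
```

The human's job moves from watching dashboards during the deployment to choosing the thresholds and step durations beforehand.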
Why It Was Better¶
- Fully automated: no human in the loop for routine deployments
- Metric-driven: promotion based on actual error rates and latency
- Automatic rollback: if analysis fails, traffic routes back immediately
- Configurable: teams define their own success criteria
- Integrates with Prometheus, Datadog, New Relic for analysis
Why It Wasn't Enough¶
- Requires mature observability (you need good metrics to analyze)
- Analysis templates need careful tuning (false positives/negatives)
- Complex failure modes (what if the analysis itself is wrong?)
- Only works for Kubernetes workloads (Argo Rollouts, Flagger)
- The tooling has a learning curve on top of Kubernetes
Legacy You'll Still See¶
Progressive delivery is the current best practice for mature Kubernetes deployments. Argo Rollouts is widely adopted. The pattern of "automated canary analysis" is becoming the expected standard for production-grade services.
Era 6: Feature Flags and Runtime Control (~2020-2025)¶
The Solution¶
Feature flags decoupled deployment from release. You deploy code to production with new features hidden behind flags. Enabling a feature is a configuration change, not a deployment. LaunchDarkly (founded 2014, with mainstream adoption by ~2020), Split.io, Unleash, and Flipper provide flag management platforms with user targeting, gradual rollouts, and instant kill switches.
What It Looked Like¶
# Feature flag in application code (LaunchDarkly Python SDK)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-production"))
client = ldclient.get()

def get_recommendations(user):
    # Check if this user should see the new recommendation engine
    context = Context.builder(user.key).kind("user").build()
    if client.variation("new-reco-engine", context, False):
        return new_recommendation_engine(user)
    else:
        return legacy_recommendation_engine(user)

# LaunchDarkly dashboard:
# - new-reco-engine: ON
# - Target: 10% of users, plus all internal employees
# - Ramp: increase by 10% every 2 hours if error rate < 0.1%
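The dashboard rules above boil down to an evaluation function run on every flag check. A hedged sketch (the domain list, field names, and hashing scheme are invented for illustration; real platforms evaluate far richer targeting rules server-side):

```python
from zlib import crc32

def flag_enabled(flag_key: str, user: dict, rollout_pct: int,
                 internal_domains=("example.com",)) -> bool:
    """Internal employees always get the flag; everyone else is bucketed by percentage."""
    if user["email"].rsplit("@", 1)[-1] in internal_domains:
        return True
    # Hash flag + user together so each flag buckets users independently:
    # being in the 10% for one flag says nothing about any other flag.
    return crc32(f"{flag_key}:{user['id']}".encode()) % 100 < rollout_pct
```

Ramping from 10% to 20% simply widens the bucket, so users already on the feature stay on it.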
Why It Was Better¶
- Deployment risk is near zero — you're deploying inert code
- Instant rollback: flip the flag off, no redeployment needed
- Targeted rollout: enable for internal users, beta users, 1% of traffic
- Business-driven releases: product managers control when features go live
- A/B testing is built in — compare metrics between flag states
Why It Wasn't Enough¶
- Flag debt: old flags accumulate and create code complexity
- Testing combinatorial explosion: N flags = 2^N possible states
- Flag management platforms are another dependency and cost
- "Flag-driven development" can mask poor architecture
- Performance overhead from flag evaluation at runtime (usually negligible, but not always)
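The combinatorial-explosion point is easy to make concrete: with N independent flags there are 2**N reachable configurations, so exhaustive testing stops scaling almost immediately (the flag names below are hypothetical):

```python
from itertools import product

flags = ["new-reco-engine", "dark-mode", "fast-checkout"]

# Every on/off combination a user could be in: 2**len(flags) states.
states = [dict(zip(flags, combo)) for combo in product([False, True], repeat=len(flags))]
print(len(states))  # 8 states for 3 flags; 1024 for 10; over a million for 20
```

This is why flag-lifecycle hygiene matters: every retired flag halves the state space you have to reason about.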
Legacy You'll Still See¶
Feature flags are mainstream and growing. Most large organizations use some form of feature flag system. The debate has shifted from "should we use feature flags?" to "how do we manage flag lifecycle and avoid flag debt?"
Where We Are Now¶
Most organizations use a combination: rolling updates as the default, canary for high-risk changes, feature flags for business-sensitive features. Progressive delivery tools automate the canary analysis. The "deploy on Friday" fear has been replaced by "deploy anytime" confidence at mature organizations — but many teams are still at the rolling-update-only stage.
Where It's Going¶
The convergence of feature flags, progressive delivery, and AI-powered analysis is the likely next step. Systems that automatically choose the right deployment strategy based on the change's risk profile — "this is a CSS change, just roll it; this touches the payment path, full canary with extended analysis." The goal is deployments that require zero human attention for routine changes.
The Pattern¶
Every generation reduces the blast radius of a bad deployment. From "all users at once" to "one server at a time" to "5% of traffic" to "users with a flag." The winning strategy is always the one that catches problems before they affect most users while adding the least friction to the deployment process.
Key Takeaway for Practitioners¶
Start with rolling updates and good health checks. That alone eliminates most deployment risk. Add canary deployments when you have the observability to support them. Add feature flags when the business needs to control release timing. Don't adopt complexity you can't operate.
Cross-References¶
- Topic Packs: Argo Rollouts, Istio, LaunchDarkly
- Tool Comparisons: Deployment Strategies Compared
- Evolution Guides: CI/CD Evolution, Monitoring Evolution