# The Rollback That Wasn't
Topics: deploy strategies, rollback, blue-green, canary, feature flags, database migrations

Level: L2 (Operations)

Time: 60–75 minutes

Prerequisites: Basic deploy experience helpful
## The Mission

The deploy went bad. Error rate is climbing. You confidently type `kubectl rollout undo`
and wait for the fix. The old version is running again. But the errors continue. The
rollback... didn't work.

This happens more often than anyone admits. Rollbacks fail because deploys change more than
code: database schemas, configuration, feature flags, API contracts, and cached state all
evolve with deploys, and none of them roll back with `kubectl rollout undo`.

This lesson teaches why rollbacks fail, what each deploy strategy actually gives you, and how to build deploys that can be reversed.
## Why Rollbacks Fail

### Cause 1: Database migrations don't undo

    v1 code:      SELECT name, email FROM users
    v2 code:      SELECT name, email, phone FROM users
    v2 migration: ALTER TABLE users ADD COLUMN phone VARCHAR(20)

You deploy v2 (the migration runs and adds the `phone` column), then roll back to v1. The code is v1
again, but the database still has the `phone` column. v1 doesn't know about it, but that's
fine: v1's `SELECT` names its columns explicitly, so the extra column is simply ignored.
Now consider the reverse:

    v1 code:      SELECT name, email, phone FROM users
    v2 code:      SELECT name, email FROM users
    v2 migration: ALTER TABLE users DROP COLUMN phone

Deploy v2 (the migration drops `phone`). Roll back to v1. v1 runs `SELECT phone` against a table
that no longer has it. Error. The rollback made things worse.

Mental Model: Code rollbacks are instant. Database rollbacks are often impossible. Treat every migration as a one-way door. Design migrations so that both the old and new code versions work against the current schema.
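The two cases above can be checked mechanically. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the production database. The helper `query_works` and the constant names are hypothetical. The schemas and queries mirror the examples in this section.

```python
import sqlite3


def query_works(schema_sql: str, query: str) -> bool:
    """Return True if `query` runs against a fresh table built from `schema_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema_sql)
        conn.execute(query)
        return True
    except sqlite3.OperationalError:
        # SQLite raises OperationalError for "no such column".
        return False
    finally:
        conn.close()


# The two schema states from the examples above.
WITHOUT_PHONE = "CREATE TABLE users (name TEXT, email TEXT)"
WITH_PHONE = "CREATE TABLE users (name TEXT, email TEXT, phone VARCHAR(20))"

# The two query shapes from the examples above.
READS_PHONE = "SELECT name, email, phone FROM users"
IGNORES_PHONE = "SELECT name, email FROM users"

# Case 1: v2 added the column, then the code was rolled back to v1.
add_then_rollback_ok = query_works(WITH_PHONE, IGNORES_PHONE)

# Case 2: v2 dropped the column, then the code was rolled back to v1.
drop_then_rollback_ok = query_works(WITHOUT_PHONE, READS_PHONE)
```

Here `add_then_rollback_ok` is `True` and `drop_then_rollback_ok` is `False`, which is the asymmetry the mental model describes.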
### The safe migration pattern

Expand-and-contract (also called "parallel change"):

    Step 1: Deploy v2 code that reads phone IF EXISTS but doesn't require it
            (both v1 and v2 work against current schema)
    Step 2: Run migration to add phone column
            (both v1 and v2 still work)
    Step 3: Deploy v3 code that uses phone column
    Step 4: Later, remove old code paths

At every step, rolling back to the previous version is safe because the schema supports both.
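Step 1 is the part that usually needs real code. Below is a minimal sketch of a reader that uses `phone` if the column exists but does not require it. It assumes SQLite through Python's `sqlite3` module; `user_columns` and `fetch_users` are hypothetical helpers. On PostgreSQL or MySQL the column check would typically query `information_schema.columns` instead of `PRAGMA table_info`.

```python
import sqlite3


def user_columns(conn):
    """Return the set of column names currently on the users table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute("PRAGMA table_info(users)")}


def fetch_users(conn):
    """Read users, including phone only when the column exists."""
    has_phone = "phone" in user_columns(conn)
    cols = "name, email, phone" if has_phone else "name, email"
    rows = conn.execute(f"SELECT {cols} FROM users").fetchall()
    return [
        {"name": r[0], "email": r[1], "phone": r[2] if has_phone else None}
        for r in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('Ada', 'ada@example.com')")

before = fetch_users(conn)  # schema before the migration
conn.execute("ALTER TABLE users ADD COLUMN phone VARCHAR(20)")
after = fetch_users(conn)   # same code, schema after the migration
```

The same code runs on both sides of the migration, so the code and the schema can each be rolled forward or back independently.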
### Cause 2: Config changes live outside the deploy

    # Config deployed separately from code
    # If code v2 depends on a new config key that v1 doesn't know about...
    # Rolling back code to v1 doesn't roll back the config
    API_VERSION: "v2"
    NEW_FEATURE_ENDPOINT: "/api/v2/payments"

Roll back to v1 code. The config still says `API_VERSION: "v2"`. v1 code reads `API_VERSION`
and gets a value it doesn't understand.

Fix: Config changes should be backward-compatible, or versioned and rolled out together with the code deploy rather than managed separately (for example, in a standalone ConfigMap).
### Cause 3: API contracts broke downstream

v2 changed a JSON response field from `user_name` to `username`. The mobile app cached the v2
response format. Rolling back to v1 (which returns `user_name`) breaks the mobile app's
cached expectations.

### Cause 4: Caches are stale

v2 wrote data to Redis in a new format. Roll back to v1, which can't parse the v2 format, and every read of a v2 entry fails until those entries expire or are evicted.
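One way to make this failure mode survivable is to tag each cache entry with a format version and treat any unknown version as a cache miss. The sketch below is an assumption about how that could look, not something this lesson prescribes. The names `CACHE_FORMAT_VERSION`, `encode_entry`, and `decode_entry` are hypothetical, and plain strings stand in for Redis values. It only helps if the older code already shipped with the version check.

```python
import json

# Hypothetical: bump this whenever the serialized layout changes.
CACHE_FORMAT_VERSION = 1


def encode_entry(payload, version=CACHE_FORMAT_VERSION):
    """Wrap a payload in an envelope that records its format version."""
    return json.dumps({"v": version, "data": payload})


def decode_entry(raw, supported=CACHE_FORMAT_VERSION):
    """Return the payload, or None (a cache miss) if the entry is unreadable."""
    if raw is None:
        return None
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Entries written by a newer or older format are treated as misses,
    # so the caller falls back to the source of truth instead of crashing.
    if not isinstance(entry, dict) or entry.get("v") != supported:
        return None
    return entry.get("data")


same_version = decode_entry(encode_entry({"id": 1}))
newer_version = decode_entry(encode_entry({"id": 1}, version=2))
```

With this in place, a rolled-back v1 sees v2 entries as misses and repopulates them, rather than failing on every read.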
## Deploy Strategies: What Each One Gives You

### Rolling update (Kubernetes default)

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1        # 1 extra pod during update
        maxUnavailable: 0  # No downtime

Old and new versions run simultaneously during the rollout.
| Pros | Cons |
|---|---|
| Zero downtime | Both versions serve traffic simultaneously |
| Built into Kubernetes | Rollback is slow (same gradual process) |
| Simple | Can't test new version before traffic hits it |
Rollback: `kubectl rollout undo` starts a new rolling update back to the previous revision.
It takes roughly as long as the original deploy.
### Blue-Green

Two identical environments. Blue is live. Deploy to green. Test green. Switch traffic.

    Before:   Traffic → Blue (v1)     Green (idle)
    Deploy:   Traffic → Blue (v1)     Green (v2)   ← deploy + test here
    Switch:   Traffic → Green (v2)    Blue (v1)    ← instant switch
    Rollback: Traffic → Blue (v1)     Green (v2)   ← instant switch back
| Pros | Cons |
|---|---|
| Instant rollback (switch traffic back) | Double the infrastructure cost |
| Test new version before traffic | Database migrations affect both environments |
| No mixed-version traffic | Requires sophisticated traffic routing |
Rollback: Instant — switch traffic back to blue. But database migrations still can't be undone.
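The gap between "instant switch" and "migrations still can't be undone" shows up even in a toy model. The sketch below is purely illustrative: `BlueGreenRouter`, `deploy_green_v2`, and the `shared_db` dictionary are hypothetical stand-ins for a load balancer, a deploy job, and the production database.

```python
class BlueGreenRouter:
    """Toy traffic switch: exactly one environment is live at a time."""

    def __init__(self):
        self.live = "blue"

    def switch(self):
        # The "instant" part of blue-green: flip a single pointer.
        self.live = "green" if self.live == "blue" else "blue"


# State shared by both environments (stand-in for the production database).
shared_db = {"schema_version": 1}


def deploy_green_v2(db):
    """Deploying v2 to green runs its migration against the shared database."""
    db["schema_version"] = 2


router = BlueGreenRouter()
deploy_green_v2(shared_db)

router.switch()                      # cut over to green (v2)
live_after_cutover = router.live

router.switch()                      # roll back to blue (v1)
live_after_rollback = router.live
schema_after_rollback = shared_db["schema_version"]
```

After the rollback, traffic is on blue again, but `schema_after_rollback` is still `2`: the router pointer moved back and the shared state did not.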
### Canary

Route a small percentage of traffic to the new version. Monitor. Gradually increase.

    Step 1:   5% → v2,  95% → v1   ← watch error rate
    Step 2:  25% → v2,  75% → v1   ← watch latency
    Step 3:  50% → v2,  50% → v1   ← watch everything
    Step 4: 100% → v2              ← full rollout
| Pros | Cons |
|---|---|
| Limits blast radius | Requires traffic splitting infrastructure |
| Real production validation | Mixed-version traffic (API compatibility) |
| Data-driven decisions | Slower rollout |
Rollback: Route 100% back to v1. Fast, but any data written by the canary v2 remains.
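Traffic splitting is normally done by a load balancer or service mesh, but the underlying idea can be sketched in a few lines. The example below is an assumption about one common approach, hashing a user ID into a stable bucket, and is not a description of any particular product. The helpers `bucket_for` and `route` are hypothetical.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Return "v2" for users whose bucket falls inside the canary share."""
    return "v2" if bucket_for(user_id) < canary_percent else "v1"


users = [f"user-{i}" for i in range(1000)]

# Users routed to v2 at each canary step.
at_5 = {u for u in users if route(u, 5) == "v2"}
at_25 = {u for u in users if route(u, 25) == "v2"}
```

Because the bucket is derived from the user ID, a user who saw v2 at 5% still sees v2 at 25%, and setting the percentage to 0 sends everyone back to v1. That is the rollback described above; it does nothing about data v2 already wrote.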
### Feature flags

The code is deployed, but new behavior is behind a flag that can be toggled without a deploy:

    if feature_flags.is_enabled("new-payment-flow", user=current_user):
        return new_payment_flow(order)
    else:
        return old_payment_flow(order)
| Pros | Cons |
|---|---|
| Instant enable/disable, no deploy | Code complexity (two paths to maintain) |
| Can target specific users or % | Flag debt accumulates if not cleaned up |
| Decouple deploy from release | Testing both paths is harder |
Rollback: Flip the flag. Instant. The code stays deployed but the new behavior is off. No database migration to undo, no traffic routing to change. As with a canary, any data the new path already wrote remains.
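To make the flag check concrete, here is a minimal in-memory sketch. The `FeatureFlags` class, its `set_flag` method, the allowlist behavior, and `process_payment` are all hypothetical. A real system would usually read flags from a flag service or config store, and its API may look quite different.

```python
class FeatureFlags:
    """In-memory flag store; a real one would read from a flag service."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, enabled, allowed_users=None):
        self._flags[name] = {
            "enabled": enabled,
            "allowed_users": set(allowed_users or []),
        }

    def is_enabled(self, name, user=None):
        flag = self._flags.get(name)
        if flag is None or not flag["enabled"]:
            return False  # unknown or disabled flags fall back to old behavior
        # An empty allowlist means "on for everyone".
        return not flag["allowed_users"] or user in flag["allowed_users"]


def new_payment_flow(order):
    return f"new:{order}"


def old_payment_flow(order):
    return f"old:{order}"


def process_payment(flags, order, current_user):
    if flags.is_enabled("new-payment-flow", user=current_user):
        return new_payment_flow(order)
    return old_payment_flow(order)


feature_flags = FeatureFlags()
feature_flags.set_flag("new-payment-flow", enabled=True, allowed_users=["alice"])
alice_before = process_payment(feature_flags, "order-1", "alice")
bob_before = process_payment(feature_flags, "order-1", "bob")

# The "rollback": flip the flag off. No deploy, no traffic change.
feature_flags.set_flag("new-payment-flow", enabled=False)
alice_after = process_payment(feature_flags, "order-1", "alice")
```

Defaulting unknown flags to `False` is a deliberate choice in this sketch: if the flag store is empty or unreachable, the old path keeps working.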
## Building Rollback-Safe Deploys

### Rule 1: Database migrations must be backward-compatible

Both the old and new code versions should work against the current schema. Use expand-and-contract:

    # Adding a column: safe (old code ignores it)
    # Removing a column: NOT safe (old code needs it)
    # Renaming a column: NOT safe (old code uses old name)
    # Changing a column type: depends on the change
    # Safe: add new column, have both old and new code work, then
    #       remove old column in a LATER deploy
### Rule 2: Config changes should be backward-compatible

New config keys should have defaults. Old code should ignore unknown keys. Never change the meaning of an existing key.
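A minimal sketch of a loader that follows the first two parts of this rule. The `DEFAULTS` table, the `REQUEST_TIMEOUT_SECONDS` key, and the `load_config` function are hypothetical; the `NEW_FEATURE_ENDPOINT` key comes from the earlier config example.

```python
# Keys this code version understands, each with a safe default.
DEFAULTS = {
    "API_VERSION": "v1",
    "REQUEST_TIMEOUT_SECONDS": "30",  # hypothetical key for illustration
}


def load_config(raw):
    """Merge raw settings over defaults, ignoring keys this version doesn't know."""
    config = dict(DEFAULTS)
    ignored = []
    for key, value in raw.items():
        if key in DEFAULTS:
            config[key] = value
        else:
            # Real code would log these rather than crash on them.
            ignored.append(key)
    return config, sorted(ignored)


# Old code started against config that was written for the newer version.
config, ignored = load_config({"NEW_FEATURE_ENDPOINT": "/api/v2/payments"})
```

The old code starts with its defaults intact and reports `NEW_FEATURE_ENDPOINT` as ignored. This does not protect against a changed meaning for an existing key such as `API_VERSION`, which is why the rule forbids that outright.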
### Rule 3: API changes should be additive

Add new fields; don't rename or remove existing ones. If you must make breaking changes,
version the API (`/v1/`, `/v2/`).
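Applied to the `user_name` / `username` rename from Cause 3, an additive change could look like the sketch below. The function names are hypothetical, and the payloads are reduced to the one field under discussion.

```python
def user_response_v1(user):
    """Original response shape."""
    return {"user_name": user["name"]}


def user_response_v2_additive(user):
    """Additive change: keep the old field and add the new one alongside it."""
    return {"user_name": user["name"], "username": user["name"]}


def old_client_parse(payload):
    """A client written against v1 that only knows about user_name."""
    return payload["user_name"]


def new_client_parse(payload):
    """A client written against v2; falls back to the old field if needed."""
    return payload.get("username", payload.get("user_name"))


user = {"name": "ada"}
old_on_v1 = old_client_parse(user_response_v1(user))
old_on_v2 = old_client_parse(user_response_v2_additive(user))
new_on_v1 = new_client_parse(user_response_v1(user))  # after a server rollback
```

Every client and server combination here yields the same value, so rolling the server back from v2 to v1 does not break either client.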
### Rule 4: Test the rollback

    # Deploy v2
    kubectl apply -f v2-deployment.yaml
    # Verify it works
    # Roll back to v1
    kubectl rollout undo deployment/myapp
    # Verify v1 still works against the current state:
    # - Can v1 read v2's database schema?
    # - Can v1 handle v2's cached data?
    # - Can v1 parse v2's config?

If you don't test the rollback, you don't have a rollback.
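The drill above can be scripted so it runs the same way every time. The following is a sketch under stated assumptions: `kubectl` is on the path and pointed at a non-production cluster, and `verify` is a hypothetical callable that runs your own smoke tests. The `runner` parameter exists so the sequence can be exercised without a cluster.

```python
import subprocess


def run_checked(argv):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(argv, check=True)


def rollback_drill(deployment, manifest, verify, runner=run_checked):
    """Deploy, verify, roll back, and verify again. Returns the commands run."""
    target = f"deployment/{deployment}"
    steps = [
        ["kubectl", "apply", "-f", manifest],
        ["kubectl", "rollout", "status", target],  # wait for v2 to finish rolling out
        ["kubectl", "rollout", "undo", target],
        ["kubectl", "rollout", "status", target],  # wait for the rollback to finish
    ]
    executed = []
    for argv in steps:
        runner(argv)
        executed.append(argv)
        if argv[2] == "status":
            # Run smoke tests once each rollout has completed.
            verify()
    return executed
```

A call such as `rollback_drill("myapp", "v2-deployment.yaml", verify=run_smoke_tests)` would then cover the checklist above, provided the hypothetical `run_smoke_tests` exercises the schema, cache, and config paths.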
## Flashcard Check

Q1: You roll back the code but errors continue. What's the most likely cause?
A database migration that ran during the deploy. Code is v1 but the schema is v2. If v1 can't work against v2's schema, the rollback is broken.
Q2: What is expand-and-contract for database migrations?
Add new schema elements while keeping old code compatible. Then update code. Then remove old schema elements in a later deploy. At every step, rollback is safe.
Q3: Blue-green vs canary — when to use which?
Blue-green: when you need instant rollback and can afford double infrastructure. Canary: when you want to validate with real traffic and limit blast radius.
Q4: Feature flags — what's the trade-off?
Instant rollback (flip the flag) without a deploy. But code complexity increases (two paths), and flags accumulate as technical debt if not cleaned up.
Q5: What makes a deploy "rollback-safe"?
Database migrations are backward-compatible, new config keys have defaults, API changes are additive, and the rollback is tested before you rely on it.
## Cheat Sheet

### Deploy Strategy Quick Reference
| Strategy | Rollback speed | Infrastructure cost | Blast radius |
|---|---|---|---|
| Rolling update | Minutes | 1x + surge | 100% during transition |
| Blue-green | Instant | 2x | 0% until the switch (tested first), then 100% |
| Canary | Fast | 1x + canary | 5-50% (controlled) |
| Feature flag | Instant | 1x | Per-user/percentage |
### Migration Safety Rules
| Migration type | Rollback-safe? | How to make safe |
|---|---|---|
| Add column | Yes, if nullable or defaulted | Old code ignores it |
| Remove column | No | Stop using it in code first, drop it in a later deploy |
| Rename column | No | Add new, copy data, update code, drop old |
| Add index | Yes (CONCURRENTLY) | Use `CREATE INDEX CONCURRENTLY` |
| Change column type | Depends | Add new column, migrate data, swap |
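The table can be turned into a rough pre-deploy check. The sketch below is a naive, regex-based linter, offered as an illustration rather than a complete tool: `UNSAFE_PATTERNS` and `lint_migration` are hypothetical, it only recognizes the statement shapes listed, and it will miss variants such as a rename written without the `COLUMN` keyword.

```python
import re

# Statement shapes the table above marks as risky, with a short reason each.
UNSAFE_PATTERNS = [
    (re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
     "drops a column that old code may still read"),
    (re.compile(r"\bRENAME\s+COLUMN\b", re.IGNORECASE),
     "renames a column that old code may still use"),
    (re.compile(r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY)", re.IGNORECASE),
     "creates an index without CONCURRENTLY"),
]


def lint_migration(sql: str) -> list:
    """Return a list of warnings for statements that are not rollback-safe."""
    return [reason for pattern, reason in UNSAFE_PATTERNS if pattern.search(sql)]


safe_add = lint_migration("ALTER TABLE users ADD COLUMN phone VARCHAR(20)")
unsafe_drop = lint_migration("ALTER TABLE users DROP COLUMN phone")
```

Running something like this in CI would not prove a migration is safe, but it can flag the obvious one-way doors before they reach production.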
## Takeaways

- Code rollbacks are easy. State rollbacks are hard. Database schemas, caches, config, and API contracts don't roll back with `kubectl rollout undo`.
- Every migration is a one-way door. Design migrations so both old and new code work against the current schema. Expand-and-contract is the pattern.
- Feature flags give the best rollback. Toggle behavior without deploys, without migrations, without downtime. But clean up old flags.
- If you don't test the rollback, you don't have one. Deploy v2, then roll back to v1, and verify everything works. Make this part of the deploy process.
- The rollback that fails at 3 AM is the one you never tested at 3 PM.
## Related Lessons
- The Cascading Timeout — when the bad deploy cascades through services
- The Database That Wouldn't Start — PostgreSQL and MySQL recovery
- How Incident Response Actually Works — when rollback is step 1 of the 3Rs