The Rollback That Wasn't


Topics: deploy strategies, rollback, blue-green, canary, feature flags, database migrations
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic deploy experience helpful

The Mission¶

The deploy went bad. Error rate is climbing. You confidently type kubectl rollout undo and wait for the fix. The old version is running again. But the errors continue. The rollback... didn't work.

This happens more often than anyone admits. Rollbacks fail because deploys change more than code: database schemas, configuration, feature flags, API contracts, and cached state all evolve with deploys — and none of them roll back with kubectl rollout undo.

This lesson teaches why rollbacks fail, what each deploy strategy actually gives you, and how to build deploys that can actually be reversed.

Why Rollbacks Fail¶

Cause 1: Database migrations don't undo¶

v1 code: SELECT name, email FROM users
v2 code: SELECT name, email, phone FROM users
v2 migration: ALTER TABLE users ADD COLUMN phone VARCHAR(20)

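You deploy v2 (the migration runs and adds the phone column), then roll back to v1. The code is v1 again, but the database still has the phone column. That is fine here: v1's query names its columns explicitly, so it never touches the extra one. (A SELECT * or an INSERT without a column list would not be so forgiving.)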

Now consider the reverse:

v1 code: SELECT name, email, phone FROM users
v2 code: SELECT name, email FROM users
v2 migration: ALTER TABLE users DROP COLUMN phone

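Deploy v2 (the migration drops phone). Roll back to v1. v1 selects phone from a table that no longer has that column, and every such query fails. The rollback made things worse, and the data in the dropped column is gone unless you restore it from a backup.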

Mental Model: Code rollbacks are instant. Database rollbacks are often impossible. Treat every migration as a one-way door. Design migrations so that both the old and new code versions work against the current schema.
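
The asymmetry between the two scenarios above can be reproduced in a few lines. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the real database; the sample row and the helper name run_v1_query are illustrative assumptions, not part of any real deploy tooling.

```python
import sqlite3

def run_v1_query(conn, sql):
    """Run an old-version query; return rows, or the error text if it fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"

# Scenario A: v2 ADDED a column, then the code was rolled back to v1.
added = sqlite3.connect(":memory:")
added.execute("CREATE TABLE users (name TEXT, email TEXT)")
added.execute("INSERT INTO users VALUES ('Ada', 'ada@example.com')")
added.execute("ALTER TABLE users ADD COLUMN phone VARCHAR(20)")  # v2 migration
result_after_add = run_v1_query(added, "SELECT name, email FROM users")

# Scenario B: v2 DROPPED a column, then the code was rolled back to v1.
# The post-migration schema is created directly (no phone column).
dropped = sqlite3.connect(":memory:")
dropped.execute("CREATE TABLE users (name TEXT, email TEXT)")
result_after_drop = run_v1_query(dropped, "SELECT name, email, phone FROM users")

print(result_after_add)   # v1 still gets its rows
print(result_after_drop)  # v1 gets an error mentioning the missing column
```

In scenario A the old query keeps working; in scenario B it fails on every call, which is what "the rollback made things worse" looks like from the application's side.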

The safe migration pattern¶

Expand-and-contract (also called "parallel change"):

Step 1: Deploy v2 code that reads phone IF EXISTS but doesn't require it
        (both v1 and v2 work against current schema)
Step 2: Run migration to add phone column
        (both v1 and v2 still work)
Step 3: Deploy v3 code that uses phone column
Step 4: Later, remove old code paths

At every step, rolling back to the previous version is safe because the schema supports both.
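
Step 1 above asks for code that uses phone when it exists but does not require it. One way to sketch such a tolerant reader, again with sqlite3 standing in for the real database, is shown below. The helper names column_exists and load_users are hypothetical, and PRAGMA table_info is SQLite-specific; other databases expose the schema through information_schema or similar.

```python
import sqlite3

def column_exists(conn, table, column):
    """Check the live schema; PRAGMA table_info returns one row per column."""
    # The table name comes from code, not user input, so interpolation is acceptable here.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name

def load_users(conn):
    """Step 1 reader: uses phone when present, works without it."""
    has_phone = column_exists(conn, "users", "phone")
    cols = "name, email, phone" if has_phone else "name, email"
    users = []
    for row in conn.execute(f"SELECT {cols} FROM users"):
        users.append({
            "name": row[0],
            "email": row[1],
            "phone": row[2] if has_phone else None,
        })
    return users

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('Ada', 'ada@example.com')")

before = load_users(conn)                                       # schema without phone
conn.execute("ALTER TABLE users ADD COLUMN phone VARCHAR(20)")  # Step 2 migration
after = load_users(conn)                                        # same code, new schema
```

The same build runs before and after the migration, so rolling the code back at either point leaves a working system.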

Cause 2: Config changes live outside the deploy¶

# Config deployed separately from code
# If code v2 depends on a new config key that v1 doesn't know about...
# Rolling back code to v1 doesn't roll back the config
API_VERSION: "v2"
NEW_FEATURE_ENDPOINT: "/api/v2/payments"

Roll back to v1 code. The config still says API_VERSION: "v2", so v1 reads a value it doesn't understand.

Fix: Make config changes backward-compatible, or version the config and ship it with the code deploy (for example, a ConfigMap named per release) instead of editing a shared ConfigMap in place.
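
As an illustration of backward-compatible config handling, here is a minimal sketch of a loader that an old build could use. The names DEFAULTS, SUPPORTED_API_VERSIONS, load_config, and the PAYMENTS_ENDPOINT key are assumptions made for the example; only API_VERSION and NEW_FEATURE_ENDPOINT come from the config above.

```python
SUPPORTED_API_VERSIONS = {"v1"}   # what this (old) build understands
DEFAULTS = {
    "API_VERSION": "v1",
    "PAYMENTS_ENDPOINT": "/api/v1/payments",   # hypothetical key
}

def load_config(raw):
    """Merge raw config over defaults, ignoring keys this build does not know."""
    config = dict(DEFAULTS)
    for key, value in raw.items():
        if key in DEFAULTS:       # unknown keys such as NEW_FEATURE_ENDPOINT are skipped
            config[key] = value
    # A value written for a newer build falls back to the default instead of crashing.
    if config["API_VERSION"] not in SUPPORTED_API_VERSIONS:
        config["API_VERSION"] = DEFAULTS["API_VERSION"]
    return config

# Config left behind by the v2 deploy, read by rolled-back v1 code.
v2_config = {"API_VERSION": "v2", "NEW_FEATURE_ENDPOINT": "/api/v2/payments"}
effective = load_config(v2_config)
print(effective)
```

Whether to fall back silently or fail fast with a clear message is a design choice; the point is that the old build decides deliberately instead of misreading a value meant for its successor.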

Cause 3: API contracts broke downstream¶

v2 changed a JSON response field from user_name to username. Mobile clients that cached v2 responses, or were updated to expect username, break when the rolled-back v1 starts returning user_name again.

Cause 4: Caches are stale¶

v2 wrote data to Redis in a new format. Roll back to v1, which can't parse that format, and every read of a v2 entry fails until those entries expire or are flushed.
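
One common mitigation is to tag each cache entry with a format version and treat anything unrecognized as a miss. The sketch below uses a plain dict as a stand-in for Redis; the envelope shape ({"v": ..., "data": ...}) and the function names are assumptions for illustration.

```python
import json

CACHE_FORMAT_VERSION = 1   # bumped whenever the serialized shape changes
cache = {}                 # stand-in for Redis: key -> JSON string

def cache_put(key, value, version=CACHE_FORMAT_VERSION):
    """Store the value wrapped in an envelope that records its format version."""
    cache[key] = json.dumps({"v": version, "data": value})

def cache_get(key):
    """Return cached data, or None (a miss) if absent or written in another format."""
    raw = cache.get(key)
    if raw is None:
        return None
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict) or entry.get("v") != CACHE_FORMAT_VERSION:
        return None        # unknown format: treat as a miss and reload from the source
    return entry.get("data")

cache_put("user:1", {"name": "Ada"})                     # written by this build
cache_put("user:2", {"full_name": "Grace"}, version=2)   # left behind by v2

print(cache_get("user:1"))   # usable entry
print(cache_get("user:2"))   # treated as a miss, not a parse failure
```

A rolled-back build then pays the cost of extra cache misses rather than erroring on every read.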

Deploy Strategies: What Each One Gives You¶

Rolling update (Kubernetes default)¶

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # 1 extra pod during update
    maxUnavailable: 0  # No downtime

Old and new versions run simultaneously during the rollout.

Pros	Cons
Zero downtime	Both versions serve traffic simultaneously
Built into Kubernetes	Rollback is slow (same gradual process)
Simple	Can't test new version before traffic hits it

Rollback: kubectl rollout undo — starts a new rolling update to the old version. Takes the same time as the original deploy.
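
The two rollingUpdate settings bound how many pods exist and how many stay available while the rollout (or the rollback) is in progress. The arithmetic can be sketched as below; the function name rollout_bounds is hypothetical. Kubernetes accepts either absolute numbers or percentages, rounding a percentage maxSurge up and a percentage maxUnavailable down.

```python
import math

def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (max total pods, min available pods) during a rolling update.

    Values may be ints or percentage strings such as "25%".
    """
    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas + surge, replicas - unavailable

# The settings shown above, for a 4-replica deployment.
print(rollout_bounds(4, 1, 0))
# Percentage settings for a 10-replica deployment.
print(rollout_bounds(10, "25%", "25%"))
```

With maxSurge: 1 and maxUnavailable: 0, pods are replaced one at a time, which is why both the deploy and the rollback take a while on large deployments.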

Blue-Green¶

Two identical environments. Blue is live. Deploy to green. Test green. Switch traffic.

Before:  Traffic → Blue (v1)     Green (idle)
Deploy:  Traffic → Blue (v1)     Green (v2) ← deploy + test here
Switch:  Traffic → Green (v2)    Blue (v1)  ← instant switch
Rollback: Traffic → Blue (v1)    Green (v2) ← instant switch back

Pros	Cons
Instant rollback (switch traffic back)	Double the infrastructure cost
Test new version before traffic	Database migrations affect both environments
No mixed-version traffic	Requires sophisticated traffic routing

Rollback: Instant — switch traffic back to blue. But database migrations still can't be undone.
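
The switch itself is just a pointer flip, which is why it is instant, and also why it does nothing for shared state. A toy sketch follows, with a hypothetical BlueGreenRouter class and a dict standing in for the database that both environments share.

```python
shared_db = {}   # both environments point at the same database

def blue_v1(request):
    return f"v1 handled {request}"

def green_v2(request):
    shared_db[request] = "written by v2"   # state that outlives the switch
    return f"v2 handled {request}"

class BlueGreenRouter:
    """Hypothetical router: one pointer decides which environment is live."""

    def __init__(self, blue, green):
        self.backends = {"blue": blue, "green": green}
        self.live = "blue"

    def switch(self):
        # Cutover and rollback are the same operation: flip the pointer.
        self.live = "green" if self.live == "blue" else "blue"

    def route(self, request):
        return self.backends[self.live](request)

router = BlueGreenRouter(blue_v1, green_v2)
first = router.route("order-1")    # served by blue
router.switch()                    # cut over to green
second = router.route("order-2")   # served by green, writes to shared_db
router.switch()                    # roll back to blue
third = router.route("order-3")    # served by blue again
print(first, second, third, shared_db)
```

After the second switch, traffic is back on v1, but the record written by v2 is still in the shared store.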

Canary¶

Route a small percentage of traffic to the new version. Monitor. Gradually increase.

Step 1:  5% → v2,  95% → v1    ← watch error rate
Step 2:  25% → v2, 75% → v1    ← watch latency
Step 3:  50% → v2, 50% → v1    ← watch everything
Step 4:  100% → v2              ← full rollout

Pros	Cons
Limits blast radius	Requires traffic splitting infrastructure
Real production validation	Mixed-version traffic (API compatibility)
Data-driven decisions	Slower rollout

Rollback: Route 100% back to v1. Fast, but any data written by the canary v2 remains.
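
Traffic splitting is normally done by a load balancer or service mesh, but the underlying idea can be sketched in a few lines: hash a stable identifier into a bucket and compare it with the current canary percentage. The function names and the use of a user ID as the routing key are assumptions for illustration.

```python
import hashlib

def bucket(user_id):
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def pick_version(user_id, canary_percent):
    """Send users whose bucket falls below canary_percent to v2."""
    return "v2" if bucket(user_id) < canary_percent else "v1"

users = [f"user-{i}" for i in range(1000)]

# Share of users on the canary at the first step (close to, not exactly, 5%).
share_at_5 = sum(pick_version(u, 5) == "v2" for u in users) / len(users)
print(share_at_5)

# Rollback: set the percentage to 0 and everyone is back on v1.
print({pick_version(u, 0) for u in users})
```

Because the bucket is derived from the user ID, a given user stays on the same version as the percentage grows, instead of bouncing between v1 and v2 on every request.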

Feature flags¶

The code is deployed, but new behavior is behind a flag that can be toggled without a deploy:

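def handle_payment(order, current_user):
    # feature_flags is the application's flag client; the check runs on every request
    if feature_flags.is_enabled("new-payment-flow", user=current_user):
        return new_payment_flow(order)
    return old_payment_flow(order)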

Pros	Cons
Instant enable/disable, no deploy	Code complexity (two paths to maintain)
Can target specific users or %	Flag debt accumulates if not cleaned up
Decouple deploy from release	Testing both paths is harder

Rollback: Flip the flag. Instant. The code stays deployed but the new behavior is off. No database migration to undo, no traffic routing to change.
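
To make the flag check concrete, here is a minimal in-memory sketch of a flag store with per-user targeting and a kill switch. The FeatureFlags class is hypothetical; real systems usually read flags from a flag service or config store, and the payment flows are reduced to strings.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store."""

    def __init__(self):
        self._flags = {}   # name -> {"enabled": bool, "allow": set of user IDs or None}

    def set(self, name, enabled, allow=None):
        self._flags[name] = {
            "enabled": enabled,
            "allow": set(allow) if allow else None,
        }

    def is_enabled(self, name, user=None):
        flag = self._flags.get(name)
        if flag is None or not flag["enabled"]:
            return False                 # unknown or switched-off flags take the old path
        if flag["allow"] is None:
            return True                  # enabled for everyone
        return user in flag["allow"]     # enabled only for targeted users

feature_flags = FeatureFlags()

def handle_payment(order, current_user):
    if feature_flags.is_enabled("new-payment-flow", user=current_user):
        return f"new flow: {order}"
    return f"old flow: {order}"

feature_flags.set("new-payment-flow", enabled=True, allow={"alice"})
targeted = handle_payment("order-1", "alice")
untargeted = handle_payment("order-1", "bob")

feature_flags.set("new-payment-flow", enabled=False)   # the "rollback"
after_kill = handle_payment("order-1", "alice")
print(targeted, untargeted, after_kill)
```

Defaulting unknown flags to off means a build that predates the flag, or a flag store that loses the entry, falls back to the old path.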

Building Rollback-Safe Deploys¶

Rule 1: Database migrations must be backward-compatible¶

Both the old and new code versions should work against the current schema. Use expand-and-contract:

# Adding a column: safe (old code ignores it)
# Removing a column: NOT safe (old code needs it)
# Renaming a column: NOT safe (old code uses old name)
# Changing a column type: depends on the change

# Safe: add new column, have both old and new code work, then
# remove old column in a LATER deploy

Rule 2: Config changes should be backward-compatible¶

New config keys should have defaults. Old code should ignore unknown keys. Never change the meaning of an existing key.

Rule 3: API changes should be additive¶

Add new fields; don't rename or remove existing ones. If you must make breaking changes, version the API (/v1/, /v2/).
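
Applied to the user_name/username example from earlier, an additive change keeps the old field and adds the new one alongside it. This is a sketch with hypothetical function names and a made-up email field; it is not taken from any particular framework.

```python
def user_response_v1(user):
    """Original response shape."""
    return {"user_name": user["name"], "email": user["email"]}

def user_response_v2_additive(user):
    """Add username but keep user_name so existing clients keep working."""
    body = user_response_v1(user)
    body["username"] = user["name"]
    return body

def old_client_parse(body):
    return body["user_name"]           # only knows the old field

def new_client_parse(body):
    # Tolerant reader: prefer the new field, fall back to the old one.
    return body.get("username", body.get("user_name"))

user = {"name": "ada", "email": "ada@example.com"}
v1_body = user_response_v1(user)
v2_body = user_response_v2_additive(user)

# Every client/server pairing works, so the server can roll back freely.
print(old_client_parse(v1_body), old_client_parse(v2_body))
print(new_client_parse(v1_body), new_client_parse(v2_body))
```

The old field can be removed later, in its own deploy, once no client reads it.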

Rule 4: Test the rollback¶

# Deploy v2
kubectl apply -f v2-deployment.yaml
# Wait for the rollout to finish, then verify v2 works
kubectl rollout status deployment/myapp

# Roll back to v1
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp
# Verify v1 still works against the current state:
# - Can v1 read v2's database schema?
# - Can v1 handle v2's cached data?
# - Can v1 parse v2's config?

If you don't test the rollback, you don't have a rollback.
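
The first question on that checklist can be automated before anything reaches production: apply the new migration to a scratch database, then run the old version's queries against it. Below is a minimal sketch using sqlite3; the function name failing_old_queries, the nickname column, and the idea of keeping each version's queries in a list are assumptions for the example.

```python
import sqlite3

def failing_old_queries(base_schema, migration, old_queries):
    """Apply a migration to a scratch DB, then run the OLD version's queries.

    Returns the queries that fail; an empty list means the old code can
    still run against the migrated schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(base_schema)
    conn.executescript(migration)
    failures = []
    for sql in old_queries:
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error:
            failures.append(sql)
    conn.close()
    return failures

BASE = "CREATE TABLE users (name TEXT, email TEXT, phone TEXT);"
V1_QUERIES = ["SELECT name, email, phone FROM users"]

ADD_COLUMN = "ALTER TABLE users ADD COLUMN nickname TEXT;"
# Portable way to drop a column in SQLite: rebuild the table without it.
DROP_PHONE = """
CREATE TABLE users_new (name TEXT, email TEXT);
INSERT INTO users_new SELECT name, email FROM users;
DROP TABLE users;
ALTER TABLE users_new RENAME TO users;
"""

safe = failing_old_queries(BASE, ADD_COLUMN, V1_QUERIES)
unsafe = failing_old_queries(BASE, DROP_PHONE, V1_QUERIES)
print(safe)     # no failures: rollback-safe
print(unsafe)   # the v1 query that would break after a rollback
```

A check like this could run in CI for every migration, failing the build when the previous release's queries no longer work.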

Flashcard Check¶

Q1: You roll back the code but errors continue. What's the most likely cause?

A database migration that ran during the deploy. Code is v1 but the schema is v2. If v1 can't work against v2's schema, the rollback is broken.

Q2: What is expand-and-contract for database migrations?

Add new schema elements while keeping old code compatible. Then update code. Then remove old schema elements in a later deploy. At every step, rollback is safe.

Q3: Blue-green vs canary — when to use which?

Blue-green: when you need instant rollback and can afford double infrastructure. Canary: when you want to validate with real traffic and limit blast radius.

Q4: Feature flags — what's the trade-off?

Instant rollback (flip the flag) without a deploy. But code complexity increases (two paths), and flags accumulate as technical debt if not cleaned up.

Q5: What makes a deploy "rollback-safe"?

Database migrations are backward-compatible, new config keys have defaults, API changes are additive, and the rollback has been tested before you rely on it.

Cheat Sheet¶

Deploy Strategy Quick Reference¶

Strategy	Rollback speed	Infrastructure cost	Blast radius
Rolling update	Minutes	1x + surge	Grows to 100% as pods are replaced
Blue-green	Instant	2x	0% while testing, 100% after the switch
Canary	Fast	1x + canary	5-50% (controlled)
Feature flag	Instant	1x	Per-user/percentage

Migration Safety Rules¶

Migration type	Rollback-safe?	How to make safe
Add column	Yes	Old code ignores it
Remove column	No	Stop using the column in code first, drop it in a later deploy
Rename column	No	Add new, copy data, update code, drop old
Add index	Yes	On PostgreSQL, use `CREATE INDEX CONCURRENTLY` to avoid blocking writes
Change column type	Depends	Add new column, migrate data, swap

Takeaways¶

Code rollbacks are easy. State rollbacks are hard. Database schemas, caches, config, and API contracts don't roll back with kubectl rollout undo.
Every migration is a one-way door. Design migrations so both old and new code work against the current schema. Expand-and-contract is the pattern.
Feature flags give the fastest rollback. Toggle behavior without deploys, without migrations, without downtime. But clean up old flags.
If you don't test the rollback, you don't have one. Deploy v2, then roll back to v1, and verify everything works. Make this part of the deploy process.
The rollback that fails at 3 AM is the one you never tested at 3 PM.

The Cascading Timeout — when the bad deploy cascades through services
The Database That Wouldn't Start — PostgreSQL and MySQL recovery
How Incident Response Actually Works — when rollback is step 1 of the 3Rs