
The Rollback That Wasn't

  • lesson
  • deploy-strategies
  • rollback
  • blue-green
  • canary
  • feature-flags
  • database-migrations
  • l2

Topics: deploy strategies, rollback, blue-green, canary, feature flags, database migrations
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic deploy experience helpful


The Mission

The deploy went bad. Error rate is climbing. You confidently type kubectl rollout undo and wait for the fix. The old version is running again. But the errors continue. The rollback... didn't work.

This happens more often than anyone admits. Rollbacks fail because deploys change more than code: database schemas, configuration, feature flags, API contracts, and cached state all evolve with deploys — and none of them roll back with kubectl rollout undo.

This lesson teaches why rollbacks fail, what each deploy strategy actually gives you, and how to build deploys that can actually be reversed.


Why Rollbacks Fail

Cause 1: Database migrations don't undo

v1 code: SELECT name, email FROM users
v2 code: SELECT name, email, phone FROM users
v2 migration: ALTER TABLE users ADD COLUMN phone VARCHAR(20)

You deploy v2 (the migration runs and adds the phone column), then roll back to v1. The code is v1 again, but the database still has the phone column. v1 doesn't know about it, and that's fine: v1's SELECT names its columns explicitly, so the extra column is ignored.

Now consider the reverse:

v1 code: SELECT name, email, phone FROM users
v2 code: SELECT name, email FROM users
v2 migration: ALTER TABLE users DROP COLUMN phone

Deploy v2 (the migration drops phone). Roll back to v1. v1 runs SELECT phone against a table that no longer has that column and gets an error. The rollback made things worse.

Mental Model: Code rollbacks are instant. Database rollbacks are often impossible. Treat every migration as a one-way door. Design migrations so that both the old and new code versions work against the current schema.
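
The failure in the second example can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module with in-memory databases; the V1_QUERY and v1_works names are illustrative, and the dropped column is modelled by creating the table without it rather than by running the real migration.

```python
import sqlite3

# Query used by the v1 code in the second example: it still expects `phone`.
V1_QUERY = "SELECT name, email, phone FROM users"


def v1_works(conn):
    """Return True if the v1 query runs against the current schema."""
    try:
        conn.execute(V1_QUERY).fetchall()
        return True
    except sqlite3.OperationalError:
        # SQLite raises OperationalError when a referenced column is missing.
        return False


# Schema before the v2 migration: phone exists, so v1 is happy.
before = sqlite3.connect(":memory:")
before.execute("CREATE TABLE users (name TEXT, email TEXT, phone TEXT)")

# Schema after the v2 migration dropped phone (modelled by omitting the column).
after = sqlite3.connect(":memory:")
after.execute("CREATE TABLE users (name TEXT, email TEXT)")

print("v1 before migration:", v1_works(before))
print("v1 after migration: ", v1_works(after))
```

The code version is identical in both cases; only the schema differs, which is exactly the state that kubectl rollout undo leaves untouched.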

The safe migration pattern

Expand-and-contract (also called "parallel change"):

Step 1: Deploy v2 code that reads phone IF EXISTS but doesn't require it
        (both v1 and v2 work against current schema)
Step 2: Run migration to add phone column
        (both v1 and v2 still work)
Step 3: Deploy v3 code that uses phone column
Step 4: Later, remove old code paths

At every step, rolling back to the previous version is safe because the schema supports both.
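
Step 1 above ("reads phone IF EXISTS") could look roughly like the following. This is a sketch under assumptions: it uses sqlite3 and its PRAGMA table_info statement to inspect the schema, and the user_columns and fetch_users helpers are hypothetical names. Other databases expose column metadata differently.

```python
import sqlite3


def user_columns(conn):
    """Return the set of column names currently on the users table."""
    rows = conn.execute("PRAGMA table_info(users)").fetchall()
    return {row[1] for row in rows}  # row[1] holds the column name


def fetch_users(conn):
    """Step 1 code: read phone only if the column exists."""
    has_phone = "phone" in user_columns(conn)
    columns = "name, email, phone" if has_phone else "name, email"
    rows = conn.execute(f"SELECT {columns} FROM users").fetchall()
    return [
        {"name": r[0], "email": r[1], "phone": r[2] if has_phone else None}
        for r in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('Ada', 'ada@example.com')")
before_migration = fetch_users(conn)

# Step 2: the migration adds the column; the same code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN phone VARCHAR(20)")
after_migration = fetch_users(conn)
```

Because the same function works on both sides of the migration, the code and the schema can each be rolled forward or back without breaking the other.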


Cause 2: Config changes live outside the deploy

# Config deployed separately from code
# If code v2 depends on a new config key that v1 doesn't know about...
# Rolling back code to v1 doesn't roll back the config
API_VERSION: "v2"
NEW_FEATURE_ENDPOINT: "/api/v2/payments"

Roll back to v1 code. The config still says API_VERSION: "v2". v1 code reads API_VERSION and gets a value it doesn't understand.

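Fix: Make config changes backward-compatible, or version the config and ship it with the code deploy instead of managing it separately. For example, if the Deployment references a ConfigMap by a versioned name, rolling back the Deployment also restores the reference to the old config.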
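
On the code side, backward-compatible config handling might look like the sketch below. The supported versions, the PAYMENTS_ENDPOINT key, and the load_config helper are assumptions made for illustration. Whether a silent fallback is acceptable depends on the key; for some settings, failing loudly at startup is the better choice.

```python
# Versions this build of the code knows how to speak (assumed values).
SUPPORTED_API_VERSIONS = ("v1",)
DEFAULT_API_VERSION = "v1"


def load_config(env):
    """Read config defensively: defaults for missing keys, no crash on unknown values."""
    requested = env.get("API_VERSION", DEFAULT_API_VERSION)
    if requested in SUPPORTED_API_VERSIONS:
        api_version = requested
    else:
        # A newer deploy left this value behind; fall back instead of failing.
        api_version = DEFAULT_API_VERSION
    return {
        "api_version": api_version,
        # Unknown keys such as NEW_FEATURE_ENDPOINT are simply never read.
        "payments_endpoint": env.get("PAYMENTS_ENDPOINT", "/api/v1/payments"),
    }


# Config left over from v2, read by rolled-back v1 code.
config = load_config({
    "API_VERSION": "v2",
    "NEW_FEATURE_ENDPOINT": "/api/v2/payments",
})
print(config)
```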

Cause 3: API contracts broke downstream

v2 renamed a JSON response field from user_name to username. Clients that have already adapted to the v2 format, such as a mobile app that cached v2 responses, now expect username. Rolling back to v1 (which returns user_name) breaks those clients.
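
One way to avoid this is to make the rename additive on the server and tolerant on the client. The functions below are a hypothetical sketch, not the API of any particular framework.

```python
def render_user_v1(user):
    """What the original (and rolled-back) v1 server returns."""
    return {"user_name": user["name"]}


def render_user_v2(user):
    """Additive change: emit the new field and keep serving the old one."""
    return {
        "user_name": user["name"],  # legacy field, still served
        "username": user["name"],   # new field introduced by v2
    }


def client_display_name(payload):
    """Tolerant client: accept either field name."""
    return payload.get("username") or payload.get("user_name")


user = {"name": "ada"}
print(client_display_name(render_user_v1(user)))
print(client_display_name(render_user_v2(user)))
```

With both fields present in v2, old clients keep working after the deploy, and a tolerant client keeps working after the rollback.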

Cause 4: Caches are stale

v2 wrote data to Redis in a new format. After a rollback, v1 can't parse the v2 entries, so every read of such an entry fails until it expires or is evicted.
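
Two common defenses are to namespace cache keys by format version and to treat unparseable entries as cache misses. The sketch below uses a plain dict as a stand-in for Redis; the key layout and the read_user helper are assumptions.

```python
import json

CACHE_SCHEMA_VERSION = 1  # bumped whenever the cached format changes


def cache_key(user_id, version=CACHE_SCHEMA_VERSION):
    """Namespace keys by format version so versions never read each other's entries."""
    return f"user:v{version}:{user_id}"


def read_user(cache, user_id, load_from_db):
    """Return the cached user, treating missing or unparseable entries as a miss."""
    key = cache_key(user_id)
    raw = cache.get(key)
    if raw is not None:
        try:
            entry = json.loads(raw)
            if isinstance(entry, dict) and "name" in entry:
                return entry
        except (ValueError, TypeError):
            pass  # fall through and rebuild the entry
    entry = load_from_db(user_id)
    cache[key] = json.dumps(entry)
    return entry


# v2 left an entry in its own namespace, plus there is one corrupt v1 entry.
cache = {"user:v2:42": '{"full_name": "Ada"}', "user:v1:7": "not json"}
loader = lambda user_id: {"name": f"user-{user_id}"}
print(read_user(cache, 42, loader))
print(read_user(cache, 7, loader))
```

The rolled-back v1 code never sees the v2 entry, and a bad entry costs one extra database read instead of an error.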


Deploy Strategies: What Each One Gives You

Rolling update (Kubernetes default)

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # 1 extra pod during update
    maxUnavailable: 0  # No downtime

Old and new versions run simultaneously during the rollout.

| Pros | Cons |
|---|---|
| Zero downtime | Both versions serve traffic simultaneously |
| Built into Kubernetes | Rollback is slow (same gradual process) |
| Simple | Can't test new version before traffic hits it |

Rollback: kubectl rollout undo starts a new rolling update back to the previous revision. Expect it to take roughly as long as the original deploy.
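
The effect of maxSurge and maxUnavailable on pod counts can be worked out with simple arithmetic. The helper below is an illustrative sketch, not part of any Kubernetes client library. It assumes the documented rounding behavior for percentage values: maxSurge rounds up and maxUnavailable rounds down.

```python
def resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-replicas * percent // 100)  # integer ceiling
        return replicas * percent // 100          # integer floor
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


# The settings shown above, with an assumed replica count of 4.
print(rollout_bounds(4, 1, 0))
# Percentage values with 10 replicas.
print(rollout_bounds(10, "25%", "25%"))
```

With maxSurge: 1 and maxUnavailable: 0, the rollout replaces pods one at a time, which is why both the deploy and the rollback are gradual.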

Blue-Green

Two identical environments. Blue is live. Deploy to green. Test green. Switch traffic.

Before:  Traffic → Blue (v1)     Green (idle)
Deploy:  Traffic → Blue (v1)     Green (v2) ← deploy + test here
Switch:  Traffic → Green (v2)    Blue (v1)  ← instant switch
Rollback: Traffic → Blue (v1)    Green (v2) ← instant switch back

| Pros | Cons |
|---|---|
| Instant rollback (switch traffic back) | Double the infrastructure cost |
| Test new version before traffic | Database migrations affect both environments |
| No mixed-version traffic | Requires sophisticated traffic routing |

Rollback: Instant — switch traffic back to blue. But database migrations still can't be undone.
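
The switch itself is just a pointer flip, which is why it is instant. The toy model below illustrates the sequence shown above; BlueGreenRouter is a hypothetical class, standing in for whatever load balancer or service selector does the real switching.

```python
class BlueGreenRouter:
    """Toy model of a traffic switch between two environments."""

    def __init__(self, blue_version, green_version=None):
        self.environments = {"blue": blue_version, "green": green_version}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        """Deploy and test here; live traffic is unaffected."""
        self.environments[self.idle] = version

    def switch(self):
        """Flip live traffic to the other environment. Calling it again rolls back."""
        if self.environments[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle

    def serving(self):
        return self.environments[self.live]


router = BlueGreenRouter(blue_version="v1")
router.deploy_to_idle("v2")
during_test = router.serving()     # still v1 while green is tested
router.switch()
after_switch = router.serving()    # v2 is live
router.switch()
after_rollback = router.serving()  # back to v1
```

Nothing in this model touches shared state such as the database, which is the part the switch cannot reverse.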

Canary

Route a small percentage of traffic to the new version. Monitor. Gradually increase.

Step 1:  5% → v2,  95% → v1    ← watch error rate
Step 2:  25% → v2, 75% → v1    ← watch latency
Step 3:  50% → v2, 50% → v1    ← watch everything
Step 4:  100% → v2              ← full rollout

| Pros | Cons |
|---|---|
| Limits blast radius | Requires traffic splitting infrastructure |
| Real production validation | Mixed-version traffic (API compatibility) |
| Data-driven decisions | Slower rollout |

Rollback: Route 100% back to v1. Fast, but any data written by the canary v2 remains.
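
Traffic splitting is often done per user rather than per request, so that one user sees a consistent version. A minimal sketch of that idea, assuming a hash of the user ID decides the bucket (the bucket and route functions are hypothetical; real setups usually do this in a proxy or service mesh):

```python
import hashlib


def bucket(user_id):
    """Map a user id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id, canary_percent):
    """Send the user to v2 if their bucket falls inside the canary share."""
    return "v2" if bucket(user_id) < canary_percent else "v1"


users = range(1000)
for percent in (5, 25, 50, 100):
    on_canary = sum(route(u, percent) == "v2" for u in users)
    print(f"{percent}% target: {on_canary} of {len(users)} users on v2")
```

Because buckets are stable, raising the percentage only adds users to v2, and setting it to 0 is the rollback.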

Feature flags

The code is deployed, but new behavior is behind a flag that can be toggled without a deploy:

if feature_flags.is_enabled("new-payment-flow", user=current_user):
    return new_payment_flow(order)
else:
    return old_payment_flow(order)

| Pros | Cons |
|---|---|
| Instant enable/disable, no deploy | Code complexity (two paths to maintain) |
| Can target specific users or % | Flag debt accumulates if not cleaned up |
| Decouple deploy from release | Testing both paths is harder |

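Rollback: Flip the flag. Instant. The code stays deployed but the new behavior is off. There is no traffic routing to change, and no migration to undo as long as the flagged path did not require one. Any data the new path wrote while the flag was on remains.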
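
The feature_flags object in the snippet above is not defined in this lesson. A minimal in-memory stand-in might look like this; the FeatureFlags class and its set_flag method are assumptions for illustration, and real systems back the store with a service or database so that a flip takes effect without a restart.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, enabled, allowed_users=None):
        self._flags[name] = {
            "enabled": enabled,
            "allowed_users": set(allowed_users or []),
        }

    def is_enabled(self, name, user=None):
        flag = self._flags.get(name)
        if flag is None or not flag["enabled"]:
            return False  # unknown or disabled flags fall back to the old path
        if flag["allowed_users"]:
            return user in flag["allowed_users"]
        return True


feature_flags = FeatureFlags()
feature_flags.set_flag("new-payment-flow", enabled=True, allowed_users=["alice"])


def checkout(user):
    if feature_flags.is_enabled("new-payment-flow", user=user):
        return "new_payment_flow"
    return "old_payment_flow"


during_release = checkout("alice")
# Rollback: flip the flag, no deploy involved.
feature_flags.set_flag("new-payment-flow", enabled=False)
after_rollback = checkout("alice")
```

Defaulting unknown flags to off means a missing or misspelled flag name selects the old path rather than the new one.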


Building Rollback-Safe Deploys

Rule 1: Database migrations must be backward-compatible

Both the old and new code versions should work against the current schema. Use expand-and-contract:

# Adding a column: safe if it is nullable or has a default (old code ignores it)
# Removing a column: NOT safe (old code needs it)
# Renaming a column: NOT safe (old code uses old name)
# Changing a column type: depends on the change

# Safe: add new column, have both old and new code work, then
# remove old column in a LATER deploy
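
These rules can be turned into a rough automated check on migration files. The sketch below is a deliberately simple pattern match, not a SQL parser; the pattern list and the review_migration helper are assumptions and far from exhaustive.

```python
import re

# Statements that old code usually cannot survive (assumed, not exhaustive).
UNSAFE_PATTERNS = [
    (re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
     "drops a column old code may read"),
    (re.compile(r"\bRENAME\s+(COLUMN|TO)\b", re.IGNORECASE),
     "renames something old code refers to"),
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
     "drops a table old code may read"),
]


def review_migration(sql):
    """Return a list of warnings for statements that are not rollback-safe."""
    warnings = []
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    for statement in statements:
        for pattern, reason in UNSAFE_PATTERNS:
            if pattern.search(statement):
                warnings.append(f"{statement}: {reason}")
    return warnings


safe = review_migration("ALTER TABLE users ADD COLUMN phone VARCHAR(20);")
unsafe = review_migration(
    "ALTER TABLE users DROP COLUMN phone;"
    "ALTER TABLE users RENAME COLUMN email TO mail;"
)
for warning in unsafe:
    print(warning)
```

A check like this only flags candidates for review; a flagged statement can still be fine if it is the contract step of an expand-and-contract sequence.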

Rule 2: Config changes should be backward-compatible

New config keys should have defaults. Old code should ignore unknown keys. Never change the meaning of an existing key.

Rule 3: API changes should be additive

Add new fields; don't rename or remove existing ones. If you must make breaking changes, version the API (/v1/, /v2/).

Rule 4: Test the rollback

# Deploy v2
kubectl apply -f v2-deployment.yaml
# Wait for the rollout to finish, then verify v2 works
kubectl rollout status deployment/myapp

# Roll back to v1
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp
# Verify v1 still works against the current state:
# - Can v1 read v2's database schema?
# - Can v1 handle v2's cached data?
# - Can v1 parse v2's config?

If you don't test the rollback, you don't have a rollback.
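
The first question in that checklist (can v1 read v2's schema?) can be answered before the deploy, in CI. The sketch below applies a migration to a scratch sqlite3 database and runs the old version's queries against the result. The old_code_survives helper and the sample schema are assumptions, and a real check would run against the same database engine as production.

```python
import sqlite3


def old_code_survives(base_schema, migration, old_queries):
    """Apply the migration to a scratch database, then run the old version's queries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(base_schema)
    conn.executescript(migration)
    failures = []
    for query in old_queries:
        try:
            conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            failures.append((query, str(exc)))
    conn.close()
    return failures


BASE_SCHEMA = "CREATE TABLE users (name TEXT, email TEXT);"
V1_QUERIES = ["SELECT name, email FROM users"]

# Additive migration: v1 keeps working, so a code rollback is safe.
additive = old_code_survives(
    BASE_SCHEMA, "ALTER TABLE users ADD COLUMN phone VARCHAR(20);", V1_QUERIES
)

# Breaking migration: v1 fails, so a code rollback would not help.
breaking = old_code_survives(
    BASE_SCHEMA, "ALTER TABLE users RENAME TO accounts;", V1_QUERIES
)
print("additive failures:", additive)
print("breaking failures:", breaking)
```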


Flashcard Check

Q1: You roll back the code but errors continue. What's the most likely cause?

A database migration that ran during the deploy. Code is v1 but the schema is v2. If v1 can't work against v2's schema, the rollback is broken.

Q2: What is expand-and-contract for database migrations?

Add new schema elements while keeping old code compatible. Then update code. Then remove old schema elements in a later deploy. At every step, rollback is safe.

Q3: Blue-green vs canary — when to use which?

Blue-green: when you need instant rollback and can afford double infrastructure. Canary: when you want to validate with real traffic and limit blast radius.

Q4: Feature flags — what's the trade-off?

Instant rollback (flip the flag) without a deploy. But code complexity increases (two paths), and flags accumulate as technical debt if not cleaned up.

Q5: What makes a deploy "rollback-safe"?

Database migrations are backward-compatible, config changes have defaults, API changes are additive, and the rollback is tested before anyone relies on it.


Cheat Sheet

Deploy Strategy Quick Reference

| Strategy | Rollback speed | Infrastructure cost | Blast radius |
|---|---|---|---|
| Rolling update | Minutes | 1x + surge | 100% during transition |
| Blue-green | Instant | 2x | 0% (test before switch) |
| Canary | Fast | 1x + canary | 5-50% (controlled) |
| Feature flag | Instant | 1x | Per-user/percentage |

Migration Safety Rules

| Migration type | Rollback-safe? | How to make safe |
|---|---|---|
| Add column | Yes | Old code ignores it (make it nullable or give it a default) |
| Remove column | No | Stop using it in code first, drop it in a later deploy |
| Rename column | No | Add new, copy data, update code, drop old |
| Add index | Yes (CONCURRENTLY) | On PostgreSQL, use CREATE INDEX CONCURRENTLY to avoid blocking writes |
| Change column type | Depends | Add new column, migrate data, swap |

Takeaways

  1. Code rollbacks are easy. State rollbacks are hard. Database schemas, caches, config, and API contracts don't roll back with kubectl rollout undo.

  2. Every migration is a one-way door. Design migrations so both old and new code work against the current schema. Expand-and-contract is the pattern.

  3. Feature flags give the best rollback. Toggle behavior without deploys, without migrations, without downtime. But clean up old flags.

  4. If you don't test the rollback, you don't have one. Deploy v2, then roll back to v1, and verify everything works. Make this part of the deploy process.

  5. The rollback that fails at 3 AM is the one you never tested at 3 PM.


  • The Cascading Timeout — when the bad deploy cascades through services
  • The Database That Wouldn't Start — PostgreSQL and MySQL recovery
  • How Incident Response Actually Works — when rollback is step 1 of the 3Rs