The Rollback That Wasn't
Category: The Incident | Domains: ci-cd, database-ops | Read time: ~5 min
Setting the Scene
I was the tech lead at a mid-size travel booking platform, about 150 engineers. We'd built a reasonable deployment pipeline: GitHub PR, CI tests, staging deploy, production deploy with a "rollback" button in our internal deploy tool that would redeploy the previous Docker image. We felt good about our rollback story. We were wrong.
What Happened
Tuesday 11:00 AM — We deploy version 4.12.0 of our booking service. It includes a database migration that adds three new columns to the reservations table and replaces booking_ref with confirmation_code, leaving no booking_ref column behind. The migration runs clean. The new code works with the new schema. Everything looks fine in staging. Ship it.
Tuesday 11:15 AM — Production deploy completes. The migration runs: new columns added, booking_ref replaced by confirmation_code, old column gone. The app starts up, tests pass, metrics look good.
Tuesday 11:45 AM — Customer support reports that partner API integrations are failing. Our partner-facing API still references booking_ref in a code path that wasn't covered by the migration test suite. The partners are getting 500 errors on every booking lookup.
Tuesday 11:50 AM — "Just hit the rollback button," someone says. We deploy version 4.11.0. The old code starts up and immediately crashes. ProgrammingError: column "booking_ref" does not exist. Version 4.11.0 expects booking_ref. The database has confirmation_code. The old code can't read the new schema.
Tuesday 11:55 AM — We try to "roll forward" by hotfixing the partner API code in version 4.12.1. But the developer who wrote the migration is in a meeting. We page her out.
Tuesday 12:10 PM — She writes a fix for the partner API endpoint, updating it to use confirmation_code. PR, CI, deploy. Fifteen minutes of partner API downtime.
Tuesday 12:30 PM — We realize the larger problem: if the 4.12.0 deploy had been catastrophically broken (not just one endpoint), we'd have had no rollback path at all. The database migration was a one-way door. Our "rollback button" was a lie for any deploy that included a schema change.
The Moment of Truth
We'd been treating rollback as "redeploy the old code." But code doesn't run in isolation — it runs against a database schema that may have changed irreversibly. Our rollback strategy was tested for code-only changes and never validated for deploys with migrations. The rollback button should have been grayed out with a warning: "this deploy includes irreversible migrations."
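A rollback button can only warn if something inspects the migrations in a deploy. The following is a minimal sketch of such a check, assuming migrations are available as plain SQL text. The function names, the pattern list, and the loyalty_tier column are hypothetical, and the scan is a text heuristic rather than a real SQL parser, so it would miss destructive changes hidden in procedural code or ORM calls.

```python
import re

# Hypothetical list of statement shapes that the previous release
# cannot survive. Extend it to match your own migration style.
DESTRUCTIVE_PATTERNS = [
    ("drops a column", re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE)),
    ("renames a column", re.compile(r"\bRENAME\s+COLUMN\b", re.IGNORECASE)),
    ("drops a table", re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)),
]


def find_irreversible_statements(migration_sql: str) -> list[str]:
    """Return one human-readable finding per destructive statement.

    Splitting on ';' is naive (it breaks on semicolons inside string
    literals), which is acceptable for a warning-only heuristic.
    """
    findings = []
    for statement in migration_sql.split(";"):
        statement = " ".join(statement.split())  # collapse whitespace
        if not statement:
            continue
        for reason, pattern in DESTRUCTIVE_PATTERNS:
            if pattern.search(statement):
                findings.append(f"{reason}: {statement}")
    return findings


def rollback_allowed(migration_sql: str) -> bool:
    """The deploy tool could gray out the rollback button when this is False."""
    return not find_irreversible_statements(migration_sql)


# Illustrative stand-in for the 4.12.0 migration (column names other than
# booking_ref and confirmation_code are made up).
MIGRATION_4_12_0 = """
ALTER TABLE reservations ADD COLUMN loyalty_tier TEXT;
ALTER TABLE reservations RENAME COLUMN booking_ref TO confirmation_code;
"""

for finding in find_irreversible_statements(MIGRATION_4_12_0):
    print("WARNING: this deploy includes irreversible migrations:", finding)
```

A check like this is deliberately conservative: it flags every rename and drop, including ones that are safe because an earlier deploy already stopped using the column. That is the point at which a human acknowledges the risk.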
The Aftermath
We adopted the "expand-contract" pattern for all schema migrations. Phase 1: add new columns, keep old columns, code works with both. Phase 2 (next deploy): migrate data, update code to use only new columns. Phase 3 (two deploys later): drop old columns. Every migration had to be backward-compatible with the previous code version. We also added a pre-deploy check that would flag irreversible migrations and require explicit acknowledgment that rollback would not be possible. The "rollback button" got a conditional warning based on migration analysis.
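The three phases can be sketched against an in-memory SQLite database. This is an illustration under assumptions, not our production migration: the table is reduced to two columns, the lookup functions are hypothetical stand-ins for the old and transitional releases, and SQLite replaces the real database engine.

```python
import sqlite3


def old_lookup(conn, ref):
    """Stand-in for the previous release: knows only booking_ref."""
    return conn.execute(
        "SELECT id FROM reservations WHERE booking_ref = ?", (ref,)
    ).fetchone()


def transitional_lookup(conn, ref):
    """Stand-in for the Phase 1 release: works with either column."""
    return conn.execute(
        "SELECT id FROM reservations "
        "WHERE COALESCE(confirmation_code, booking_ref) = ?",
        (ref,),
    ).fetchone()


def expand(conn):
    """Phase 1: add the new column; the old column stays untouched."""
    conn.execute("ALTER TABLE reservations ADD COLUMN confirmation_code TEXT")


def migrate_data(conn):
    """Phase 2: copy existing values into the new column."""
    conn.execute(
        "UPDATE reservations SET confirmation_code = booking_ref "
        "WHERE confirmation_code IS NULL"
    )


def contract(conn):
    """Phase 3: drop the old column once no deployed code reads it.

    Not called below, because this is the one-way door.
    ALTER TABLE ... DROP COLUMN needs SQLite 3.35 or newer.
    """
    conn.execute("ALTER TABLE reservations DROP COLUMN booking_ref")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations (id INTEGER PRIMARY KEY, booking_ref TEXT)"
)
conn.execute("INSERT INTO reservations (booking_ref) VALUES ('ABC123')")

expand(conn)
# After Phase 1 the old release still runs, so rollback remains possible.
assert old_lookup(conn, "ABC123") is not None
assert transitional_lookup(conn, "ABC123") is not None

migrate_data(conn)
# After Phase 2 both releases still find the same reservation.
assert old_lookup(conn, "ABC123") is not None
assert transitional_lookup(conn, "ABC123") is not None
```

The design choice that matters is that contract() ships in its own deploy, after the release that still reads booking_ref is no longer a rollback target.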
The Lessons
- Migrations must be backward-compatible: The previous code version must be able to run against the new schema. This means expand-contract: add columns before using them, drop columns only after no code references them.
- Blue-green deployments need schema compatibility: If your deployment strategy involves running two versions simultaneously (or being able to switch between them), the database schema must support both versions at all times.
- Test your rollbacks: A rollback procedure that has never been tested is not a procedure — it's a hope. Include rollback testing as part of your deploy validation, especially for deploys with migrations.
What I'd Do Differently
I'd implement automated rollback testing in the CI pipeline: apply the migration, run the new code's tests, then run the old code against the new schema and run the old code's tests too. If the old code can't work with the new schema, the pipeline should fail with a clear message. I'd also adopt a tool like Sqitch or Flyway with explicit support for reversible migrations and mandatory rollback scripts.
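The core of that CI check fits in a few lines. The sketch below is a simplification under stated assumptions: a short list of queries stands in for the old release's full test suite, SQLite stands in for the real database, and BASE_SCHEMA, OLD_RELEASE_QUERIES, and check_rollback_compatibility are hypothetical names. RENAME COLUMN needs SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical pre-migration schema, reduced to the column that matters here.
BASE_SCHEMA = (
    "CREATE TABLE reservations (id INTEGER PRIMARY KEY, booking_ref TEXT)"
)

# Hypothetical sample of queries the previous release issues. A real
# pipeline would run the previous release's test suite instead.
OLD_RELEASE_QUERIES = [
    "SELECT id, booking_ref FROM reservations WHERE booking_ref = 'ABC123'",
]


def check_rollback_compatibility(migration_sql, old_queries):
    """Apply the migration to a scratch database, then run the previous
    release's queries against the migrated schema.

    Returns a list of failure messages; an empty list means the old
    code can still run, so rollback is still an option.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(BASE_SCHEMA)
        conn.executescript(migration_sql)
        failures = []
        for query in old_queries:
            try:
                conn.execute(query)
            except sqlite3.OperationalError as exc:
                failures.append(f"{query!r} failed: {exc}")
        return failures
    finally:
        conn.close()


# A rename breaks the old release; an additive change does not.
BREAKING = (
    "ALTER TABLE reservations RENAME COLUMN booking_ref TO confirmation_code;"
)
ADDITIVE = "ALTER TABLE reservations ADD COLUMN confirmation_code TEXT;"

for name, migration in [("breaking", BREAKING), ("additive", ADDITIVE)]:
    failures = check_rollback_compatibility(migration, OLD_RELEASE_QUERIES)
    status = "FAIL" if failures else "OK"
    print(f"{name} migration: rollback check {status}")
    for failure in failures:
        print("  ", failure)
```

In a pipeline, a non-empty failure list would translate into a non-zero exit code, with the failing query in the message so the author knows which code path still depends on the old column.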
The Quote
"Our rollback button was a time machine that only worked when nothing important had changed."
Cross-References
- Topic Packs: CI/CD Pipelines & Patterns, Database Ops, Progressive Delivery
- Case Studies: Cross-Domain