
Pattern: Untested Rollback Procedure

ID: FP-052 | Family: Human Error Amplifier | Frequency: Common | Blast Radius: Single Service | Detection Difficulty: Obvious (during rollback)

The Shape

A deployment causes a production issue. The team attempts to roll back. The rollback procedure — never previously tested — fails. The old version can't run on the new schema (a forward-only migration ran). The previous container image was overwritten (mutable tag, FP-033). The Helm rollback command requires manual intervention that was never documented. The team is now fighting both the original issue and a broken rollback, with an extended outage and no tested path back to stability.

How You'll See It

In Kubernetes

$ kubectl rollout undo deployment/myapp
# deployment.apps/myapp rolled back

# But the previous image was tagged :latest (FP-033) and has been updated.
# The "rolled back" deployment now runs the same broken image.
# kubectl describe deployment shows imageID is the same as the broken version.

Or: a database migration ran as part of the deploy. The new code expects new schema. The old code doesn't understand the new schema. Rolling back the code causes the app to crash on the new schema it can't read.
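Tags can move underneath you; digests cannot. A minimal sketch of the check, with the imageID values hard-coded as stand-ins for the output of kubectl get pods -o jsonpath='{.items[0].status.containerStatuses[0].imageID}' (registry and digest values are hypothetical):

```shell
# Stand-in for the imageID recorded while the broken version was live:
broken_id="registry.example.com/myapp@sha256:aaaa1111"
# Stand-in for the imageID observed after `kubectl rollout undo`:
after_undo_id="registry.example.com/myapp@sha256:aaaa1111"

# Strip everything up to the '@' so we compare digests, not tags.
broken="${broken_id#*@}"
after="${after_undo_id#*@}"

if [ "$broken" = "$after" ]; then
  echo "rollback was a no-op: still running the broken image"
else
  echo "rollback changed the running image to $after"
fi
```

A real rollback then pins the previous digest explicitly (kubectl set image deployment/myapp myapp=registry.example.com/myapp@sha256:...) instead of trusting whatever the tag currently resolves to.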

In Linux/Infrastructure

A package upgrade breaks a service. apt-get install package=old-version fails: the repository no longer serves the old version, and the cached .deb in /var/cache/apt/archives was cleaned. dpkg -l shows only the currently installed version; the previous version's .deb was never archived anywhere. A manual source build is the only option left.
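One mitigation is to archive the installed .deb at install time, so a rollback never depends on the mirror still serving the old version. A sketch using a local directory, with a placeholder file standing in for the real copy out of /var/cache/apt/archives (the package name mysvc is hypothetical):

```shell
ARCHIVE=./deb-archive   # in production: somewhere durable, e.g. /var/backups/debs
mkdir -p "$ARCHIVE"

# After every successful install/upgrade, keep the .deb. The touch below is a
# placeholder for: cp /var/cache/apt/archives/mysvc_1.9*.deb "$ARCHIVE"/
touch "$ARCHIVE/mysvc_1.9_amd64.deb"

# At rollback time, verify the artifact exists BEFORE promising a rollback:
if ls "$ARCHIVE"/mysvc_1.9*.deb >/dev/null 2>&1; then
  echo "archived .deb found; rollback possible with dpkg -i"
else
  echo "no archived .deb; forward fix only" >&2
fi
```

The key property is that the check runs before anyone announces "we'll just roll back," not after.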

In CI/CD

A CD pipeline deploys v2.0. v2.0 breaks production. The CD pipeline's "rollback" button triggers a new deploy of the previous artifact. But the previous artifact was deleted from the artifact repository after 7 days (retention policy). Rollback fails with "artifact not found."
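The retention failure is cheap to catch at deploy time: record which artifact you would roll back to, and fail loudly if it is already missing. A sketch with a local directory standing in for the artifact repository (all names are hypothetical):

```shell
REPO=./artifact-repo
mkdir -p "$REPO"
touch "$REPO/myapp-2.0.tar.gz"   # the version being deployed
PREV="myapp-1.9.tar.gz"          # the designated rollback target

# Preflight: a deploy without a reachable rollback target is a one-way door.
if [ -f "$REPO/$PREV" ]; then
  echo "rollback target $PREV present"
else
  echo "WARNING: rollback target $PREV expired; extend retention or re-publish it" >&2
fi
```

Wiring this into the pipeline turns "artifact not found" from a mid-incident surprise into a pre-deploy warning.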

The Tell

Rollback was attempted and failed, and the rollback procedure had never been tested before the incident. The failure traces to one of: a missing artifact, a schema mismatch, a hardcoded mutable tag, or undocumented manual steps.

Common Misdiagnosis

Looks Like             But Actually                   How to Tell the Difference
Deployment bug         Rollback procedure untested    The deployment itself was fine; rolling it back is what broke things
Incompatible versions  Untested rollback assumption   The incompatibility is in the rollback path, not the deployment path
Infrastructure issue   Procedure gap                  Infrastructure is healthy; the missing artifact or schema conflict is the issue

The Fix (Generic)

  1. Immediate: Find an older artifact (S3 backup, Docker registry layers, git tag); manually deploy it; if schema conflict, consider a forward migration that makes new schema backward-compatible.
  2. Short-term: After each deploy, validate that rollback works: kubectl rollout undo --dry-run=client; ensure the previous image is pinned by digest (not reachable only through a mutable tag); exercise the full rollback in staging.
  3. Long-term: Include rollback testing in the deploy pipeline; keep previous N artifacts in the registry; use expand-contract schema migration pattern (never break backward compatibility in a single migration); document rollback prerequisites in every deploy runbook.
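Step 3's expand-contract pattern, sketched for the NOT NULL case (hypothetical Postgres table and column names): the expand migration ships with the new code but stays readable and writable by the old code, so rollback remains safe; the contract migration runs in a later deploy, once rolling back past it is no longer plausible.

```sql
-- Expand (deployed with v2; v1 keeps working because the column is nullable):
ALTER TABLE orders ADD COLUMN status text;
UPDATE orders SET status = 'pending' WHERE status IS NULL;   -- backfill

-- Contract (a separate, later deploy, after v2 has proven stable):
-- ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```

Collapsing both phases into one migration is exactly what produces the "old code crashes on insert" failure in Example 1 below.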

Real-World Examples

  • Example 1: Deploy added a NOT NULL column without a default. Old code didn't provide this column. Rollback: old code crashes on insert (column required but not provided). Rollback made things worse; required a forward migration to add a default value before the old code could run.
  • Example 2: Artifact retention was set to 7 days. A critical bug discovered on day 8 required rolling back to the pre-bug version. The artifact was gone. The team had to cherry-pick a revert of the bug and roll forward with a patched build instead (2 hours).

War Story

We deployed at 2pm. Bug in production at 2:15pm. "Just roll back." Ran kubectl rollout undo. Deploy reverted. Bug still there. Checked image digest: the rollback restored the :latest tag — which still pointed to the broken image (we had tagged :latest during the deploy). We had rolled back the deployment object but not the actual image. It took another 45 minutes to find the previous image's SHA256, pin the deployment to it, and actually roll back. We now: (1) never use :latest in production, (2) pin all production images to SHA256 digest before deploy, (3) test rollback in staging as part of every release candidate.

Cross-References