Pattern: Untested Rollback Procedure¶
ID: FP-052 Family: Human Error Amplifier Frequency: Common Blast Radius: Single Service Detection Difficulty: Obvious (during rollback)
The Shape¶
A deployment causes a production issue. The team attempts to roll back. The rollback procedure — never previously tested — fails. The old version can't run on the new schema (a forward-only migration ran). The previous container image was overwritten (mutable tag, FP-033). The Helm rollback command requires manual intervention that was never documented. The team is now fighting both the original issue and a broken rollback, with an extended outage and no tested path back to stability.
How You'll See It¶
In Kubernetes¶
$ kubectl rollout undo deployment/myapp
# deployment.apps/myapp rolled back
# But the previous image was tagged :latest (FP-033) and has been updated.
# The "rolled back" deployment now runs the same broken image.
# kubectl describe deployment shows imageID is the same as the broken version.
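To catch this silent failure, compare the running image digest before and after the rollback; if they match, nothing actually changed. A minimal sketch, with hypothetical hard-coded digests standing in for the real kubectl query (shown in the comment):

```shell
# In a real cluster, each digest would come from something like:
#   kubectl get pods -l app=myapp \
#     -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'
# Here they are hard-coded to illustrate the check.
before="registry.example.com/myapp@sha256:0123456789abcdef"
# ... kubectl rollout undo deployment/myapp runs here ...
after="registry.example.com/myapp@sha256:0123456789abcdef"

if [ "$before" = "$after" ]; then
  echo "WARNING: rollback ineffective, image digest unchanged: $after"
fi
```

If the warning fires, the rollback only reverted the Deployment object, not the image it resolves to.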
Or: a database migration ran as part of the deploy. The new code expects the new schema; the old code doesn't understand it. Rolling back the code makes the app crash against a schema it can't read.
In Linux/Infrastructure¶
A package upgrade breaks a service. apt-get install package=old-version fails because the old version is no longer available from the configured repositories. dpkg -l shows only the currently installed version; the previous version's .deb file was never archived locally. A manual build from source is the only option.
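The cheap guard here is to archive the exact installed .deb before every upgrade, so a downgrade never depends on the mirror still carrying the old version. A sketch, in which the package name and version are hypothetical and touch/cp stand in for the real apt-get download call:

```shell
BACKUP=/tmp/deb-archive             # in production: somewhere durable
mkdir -p "$BACKUP"

# In a real system: ver=$(dpkg-query -W -f='${Version}' mypkg)
ver="1.2.3-1"
deb="mypkg_${ver}_amd64.deb"

# In a real system: apt-get download "mypkg=$ver" fetches the exact .deb.
# touch + cp simulate that here so the sketch runs anywhere.
touch "/tmp/$deb"
cp "/tmp/$deb" "$BACKUP/"

# Rollback later becomes: dpkg -i "$BACKUP/$deb"
ls "$BACKUP/$deb"
```

The archive directory then becomes part of the rollback runbook, not an afterthought.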
In CI/CD¶
A CD pipeline deploys v2.0. v2.0 breaks production. The CD pipeline's "rollback" button triggers a new deploy of the previous artifact. But the previous artifact was deleted from the artifact repository after 7 days (retention policy). Rollback fails with "artifact not found."
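A pre-deploy gate that refuses to ship vN unless the vN-1 artifact still exists would have surfaced this before the incident rather than during it. A sketch against a local directory standing in for the artifact repository (names and versions are hypothetical):

```shell
ARTIFACTS=/tmp/artifact-store            # stand-in for the artifact repository
mkdir -p "$ARTIFACTS"
touch "$ARTIFACTS/myapp-1.9.0.tar.gz"    # simulate the previous release existing

PREV="myapp-1.9.0.tar.gz"
if [ -f "$ARTIFACTS/$PREV" ]; then
  echo "OK: rollback artifact present: $PREV"
else
  echo "ABORT deploy: rollback artifact missing: $PREV" >&2
  exit 1
fi
```

In a real pipeline the check would query the artifact repository's API instead of a filesystem path, but the gate itself is the point: retention policy and rollback policy must agree.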
The Tell¶
Rollback was attempted but failed, and the procedure had never been tested before the incident. The failure traces back to one of: a missing artifact, a schema mismatch, a hardcoded mutable tag, or undocumented manual steps.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Deployment bug | Rollback procedure untested | Deployment itself was fine; the rollback of that deployment is what broke things |
| Incompatible versions | Untested rollback assumption | The incompatibility is in the rollback path, not the deployment path |
| Infrastructure issue | Procedure gap | Infrastructure is healthy; the missing artifact or schema conflict is the issue |
The Fix (Generic)¶
- Immediate: Find an older artifact (S3 backup, Docker registry layers, git tag) and manually deploy it; if there's a schema conflict, consider a forward migration that makes the new schema backward-compatible.
- Short-term: After each deploy, validate that rollback works: run kubectl rollout undo --dry-run=client against the deployment; ensure the previous image is pinned by digest (not via a mutable tag); test the rollback in staging.
- Long-term: Include rollback testing in the deploy pipeline; keep the previous N artifacts in the registry; use the expand-contract schema migration pattern (never break backward compatibility in a single migration); document rollback prerequisites in every deploy runbook.
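Pinning by digest, the key short-term fix, amounts to resolving whatever the tag currently points to and writing that immutable reference into the deployment. A sketch; the imageID string is hypothetical, and in a real cluster it would come from kubectl or the registry API:

```shell
# Hypothetical imageID; in reality from, e.g.:
#   kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].imageID}'
imageID="registry.example.com/myapp@sha256:0123456789abcdef"

digest="${imageID#*@}"      # strip everything before '@', leaving sha256:...
echo "pin with: kubectl set image deployment/myapp myapp=registry.example.com/myapp@${digest}"
```

Once the deployment references the digest directly, rolling back restores an exact image, no matter what :latest points to by then.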
Real-World Examples¶
- Example 1: A deploy added a NOT NULL column without a default. The old code didn't provide this column, so on rollback it crashed on every insert (column required but not supplied). Rollback made things worse; a forward migration adding a default value was required before the old code could run.
- Example 2: Artifact retention was set to 7 days. A critical bug discovered on day 8 required rollback to the pre-bug version, but the artifact was gone. The team had to cherry-pick a revert of the bug and deploy a patched version forward instead (2 hours).
War Story¶
We deployed at 2pm. Bug in production at 2:15pm. "Just roll back." Ran kubectl rollout undo. Deploy reverted. Bug still there. Checked the image digest: the rollback restored the :latest tag, which still pointed to the broken image (we had retagged :latest during the deploy). We had rolled back the deployment object but not the actual image. It took another 45 minutes to find the previous image's SHA256 digest, pin the deployment to it, and actually roll back. We now: (1) never use :latest in production, (2) pin all production images to a SHA256 digest before deploy, (3) test rollback in staging as part of every release candidate.
Cross-References¶
- Topic Packs: k8s-ops, cicd
- Footguns: k8s-ops/footguns.md
- Case Studies: ops-archaeology/11-dr-failover-broken/
- Related Patterns: FP-025 (untested backup — same "tested too late" pattern), FP-033 (latest tag in prod — causes rollback to fail)