
Runbook: Deploy Rollback

Domain: CI/CD
Alert: Elevated error rate after deployment; health check failing after deploy; manual decision to roll back
Severity: P1
Est. Resolution Time: 10-20 minutes
Escalation Timeout: 20 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: kubectl access, Helm access, deployment permissions, ability to push to git

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
If output shows only 1 revision → there is no previous revision to roll back to; you must fix forward — see build-failure-triage.md
If output shows 2 or more revisions → you can roll back; proceed to Step 1
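The revision count can be checked mechanically rather than by eye. A minimal sketch, run here against a hard-coded sample of the rollout-history output (swap the sample for the live command's output during an incident):

```shell
# Sample `kubectl rollout history` output; in a real incident capture it with:
#   history_output=$(kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>)
history_output='deployment.apps/myapp
REVISION  CHANGE-CAUSE
1         <none>
2         Update image to v1.2.4'

# Revision rows are the lines that start with a number.
revisions=$(printf '%s\n' "$history_output" | grep -c '^[0-9]')

if [ "$revisions" -lt 2 ]; then
  echo "only $revisions revision: no rollback target, fix forward"
else
  echo "$revisions revisions: rollback is possible"
fi
```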

Step 1: Confirm the Deployment Is the Cause

Why: Rolling back when the deployment is not the cause wastes time and may make things worse — confirm the error timeline matches the deployment time before acting.

# Check when the current deployment was rolled out
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Check when errors started (Prometheus/Grafana — look at error rate graph)
# Or check application logs around the deploy time:
kubectl logs deployment/<DEPLOY_NAME> -n <NAMESPACE> --since=30m | grep -i "error\|panic\|fatal" | tail -30

# Check recent events for the deployment:
kubectl describe deployment <DEPLOY_NAME> -n <NAMESPACE> | grep -A10 "Events"
Expected output:
Events show a recent "Scaled up replica set" or "Updated image" matching the time
when errors began. Log errors start at approximately the same time as the deploy.
If this fails: If errors predate the deployment, the deployment is not the cause. Check for a database issue, upstream service degradation, or traffic spike instead.
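The timeline comparison boils down to two epoch timestamps: when the new replica set scaled up (from the deployment events) and when the first error line appeared (from the logs). A sketch with hypothetical values:

```shell
# Hypothetical epoch timestamps; pull the real ones from the deployment
# events and the first ERROR line in the logs.
deploy_time=1742307720   # when "Scaled up replica set" was recorded
first_error=1742307790   # when the first ERROR/panic line appeared

if [ "$first_error" -ge "$deploy_time" ]; then
  echo "errors started after the deploy: rollback is justified"
else
  echo "errors predate the deploy: investigate other causes first"
fi
```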

Step 2: Identify the Previous Good Revision

Why: You must know which revision to roll back to — rolling back to the wrong one can deploy a previously broken version.

# If using Helm (most common in Kubernetes environments):
helm history <RELEASE_NAME> -n <NAMESPACE>

# If using kubectl directly:
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>

# To see what image/config a specific revision used:
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE> --revision=<REVISION_NUMBER>
Expected output:
Helm output:
REVISION  UPDATED                   STATUS     CHART            APP VERSION  DESCRIPTION
1         Thu Jan 01 00:00:00 2026  superseded myapp-1.2.3      1.2.3        Install complete
2         Wed Mar 18 14:22:00 2026  deployed   myapp-1.2.4      1.2.4        Upgrade complete

kubectl output:
REVISION  CHANGE-CAUSE
1         <none>
2         Update image to v1.2.4
If this fails: If history is empty or shows only one revision, check if revisionHistoryLimit is set to 0 in the deployment spec — if so, you cannot roll back with kubectl and must redeploy the previous image tag manually.
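Picking the previous revision out of `helm history` can be automated. A sketch against a hard-coded sample of the output above: the second-to-last revision row is the rollback target, and with only one row `prev_revision` comes back empty, which matches the fix-forward case:

```shell
# Sample `helm history` output; capture the real thing with:
#   helm_history=$(helm history <RELEASE_NAME> -n <NAMESPACE>)
helm_history='REVISION  UPDATED                   STATUS     CHART        APP VERSION  DESCRIPTION
1         Thu Jan 01 00:00:00 2026  superseded myapp-1.2.3  1.2.3        Install complete
2         Wed Mar 18 14:22:00 2026  deployed   myapp-1.2.4  1.2.4        Upgrade complete'

# Collect the REVISION column of every data row, then take the
# second-to-last entry (the revision before the current one).
prev_revision=$(printf '%s\n' "$helm_history" |
  awk '/^[0-9]/ {revs[n++] = $1} END {if (n >= 2) print revs[n-2]}')

echo "roll back to revision ${prev_revision:-NONE}"
```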

Step 3: Execute the Rollback

Why: Getting back to a known-good state stops the bleeding — this is the most time-critical step.

# Option A — Roll back with Helm (use this if Helm manages the deployment):
helm rollback <RELEASE_NAME> <REVISION_NUMBER> -n <NAMESPACE>

# Option B — Roll back with kubectl (use this if deployed with plain kubectl):
kubectl rollout undo deployment/<DEPLOY_NAME> -n <NAMESPACE>

# To roll back to a specific revision (not just the previous one):
kubectl rollout undo deployment/<DEPLOY_NAME> -n <NAMESPACE> --to-revision=<REVISION_NUMBER>
Expected output:
Helm: "Rollback was a success! Happy Helming!"
kubectl: "deployment.apps/<DEPLOY_NAME> rolled back"
If this fails: If the rollback command errors with "no previous revision," you must redeploy the previous Docker image tag manually: kubectl set image deployment/<DEPLOY_NAME> <CONTAINER_NAME>=<IMAGE>:<PREVIOUS_TAG> -n <NAMESPACE>
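To avoid typing the wrong command under pressure, the choice between Option A and Option B can be wrapped in a tiny helper that prints the command for review instead of executing it. A sketch; `build_rollback_cmd` and its argument values are hypothetical names:

```shell
# Print (not execute) the rollback command so a human can review it first.
# Args: release, deployment, namespace, revision, use_helm (yes/no).
build_rollback_cmd() {
  release="$1"; deploy="$2"; namespace="$3"; revision="$4"; use_helm="$5"
  if [ "$use_helm" = "yes" ]; then
    echo "helm rollback $release $revision -n $namespace"
  else
    echo "kubectl rollout undo deployment/$deploy -n $namespace --to-revision=$revision"
  fi
}

helm_cmd=$(build_rollback_cmd myapp myapp production 1 yes)
kubectl_cmd=$(build_rollback_cmd myapp myapp production 1 no)
echo "$helm_cmd"
echo "$kubectl_cmd"
```

Printing first and pasting the command back keeps a human in the loop during a P1.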

Step 4: Monitor Rollout Status

Why: The rollback command initiates the rollout but does not wait for it to complete — you must confirm the new pods are healthy before declaring success.

# Watch the rollout progress in real time
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>

# In a second terminal, watch pods come up:
kubectl get pods -n <NAMESPACE> -w -l app=<APP_LABEL>
Expected output:
kubectl rollout status output:
"Waiting for deployment "<DEPLOY_NAME>" rollout to finish: 1 old replicas are pending termination..."
"deployment "<DEPLOY_NAME>" successfully rolled out"

kubectl get pods output should show all pods in Running state with READY showing all containers ready.
If this fails: If pods are stuck in CrashLoopBackOff or Pending after rollback, the previous revision may also have an issue. Check pod logs: kubectl logs <POD_NAME> -n <NAMESPACE> --previous
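The pod health check can be scripted instead of eyeballed. A sketch against a hard-coded sample of the `kubectl get pods` output (replace with the live command); a pod counts as healthy only when STATUS is Running and READY shows all containers up:

```shell
# Sample `kubectl get pods` output; capture the real thing with:
#   pods=$(kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>)
pods='NAME          READY  STATUS   RESTARTS  AGE
myapp-abc-1   2/2    Running  0         2m
myapp-abc-2   2/2    Running  0         2m'

# List any pod that is not Running or not fully ready (e.g. 1/2).
unhealthy=$(printf '%s\n' "$pods" | awk 'NR > 1 {
  split($2, r, "/")
  if ($3 != "Running" || r[1] != r[2]) print $1
}')

if [ -z "$unhealthy" ]; then
  echo "all pods healthy"
else
  echo "unhealthy pods: $unhealthy"
fi
```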

Step 5: Verify Error Rate Has Dropped

Why: The pods being healthy does not automatically mean the application is serving traffic correctly — verify end-to-end.

# Check application logs for continued errors:
kubectl logs deployment/<DEPLOY_NAME> -n <NAMESPACE> --since=5m | grep -i "error\|panic\|fatal" | tail -20

# If you have Prometheus/Grafana: check the error rate dashboard
# Typical PromQL query for the overall 5xx error ratio:
# sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Quick HTTP health check if service has a health endpoint:
kubectl port-forward service/<SERVICE_NAME> 8080:80 -n <NAMESPACE> &
curl -s http://localhost:8080/health
Expected output:
No new ERROR lines in logs after the rollback completed.
Health endpoint returns: {"status": "ok"} or HTTP 200.
Grafana error rate graph shows the rate dropping back to baseline.
If this fails: If errors continue after rollback, the problem is not the deployment. Escalate immediately — this may be a database corruption, external dependency failure, or data-plane issue.
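The baseline comparison is simple arithmetic: 5xx count divided by total requests over the window, compared against your normal error rate. A sketch with hypothetical counts (pull the real numbers from Prometheus or your logs); the 0.5% baseline is an assumed threshold:

```shell
# Hypothetical 5-minute window counts; substitute real metrics.
errors_5m=12
requests_5m=10000
baseline=0.005   # assumed 0.5% baseline error rate

# Compute the ratio, then compare it numerically against the baseline.
rate=$(awk -v e="$errors_5m" -v t="$requests_5m" 'BEGIN { printf "%.4f", e / t }')
within=$(awk -v r="$rate" -v b="$baseline" \
  'BEGIN { if (r + 0 <= b + 0) print "yes"; else print "no" }')

if [ "$within" = "yes" ]; then
  echo "error rate $rate is at or below baseline $baseline"
else
  echo "error rate $rate is still elevated: escalate"
fi
```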

Step 6: Notify Team and Create Postmortem Ticket

Why: Stakeholders need to know about rollbacks; the team needs to investigate before re-attempting the deployment; P1 incidents require a postmortem.

# Post in the team incident channel (Slack/Teams):
# Template:
# "ROLLBACK COMPLETE: Rolled back <SERVICE_NAME> from <BAD_VERSION> to <GOOD_VERSION> in <NAMESPACE>.
#  Error rate has returned to baseline.
#  Root cause under investigation — DO NOT redeploy <BAD_VERSION> until fixed.
#  Postmortem ticket: <TICKET_LINK>"

# Create a postmortem ticket in your issue tracker (Jira/Linear/GitHub Issues)
# Title: "Postmortem: <SERVICE_NAME> rollback on <DATE>"
echo "Notify team in incident channel and create postmortem ticket"
Expected output:
Acknowledgment from team in incident channel.
Postmortem ticket created and assigned to the engineer who made the bad deploy.
If this fails: If you cannot reach the team via normal channels during an active incident, use the emergency escalation path (PagerDuty / phone).
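The notification template can be filled from shell variables so the message stays consistent across incidents. A sketch; every value below is a placeholder:

```shell
# Placeholder values; substitute the real service, versions, and ticket.
SERVICE_NAME=myapp
BAD_VERSION=v1.2.4
GOOD_VERSION=v1.2.3
NAMESPACE=production
TICKET_LINK="https://issues.example.invalid/POSTMORTEM-123"

# Build the incident-channel message from the template.
message=$(cat <<EOF
ROLLBACK COMPLETE: Rolled back $SERVICE_NAME from $BAD_VERSION to $GOOD_VERSION in $NAMESPACE.
Error rate has returned to baseline.
Root cause under investigation. DO NOT redeploy $BAD_VERSION until fixed.
Postmortem ticket: $TICKET_LINK
EOF
)
echo "$message"
```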

Verification

# Confirm the issue is resolved
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
Success looks like:
  • All pods show Running with all containers READY.
  • rollout status reports "successfully rolled out".
  • Error rate in monitoring has returned to pre-incident baseline.
If still broken: Escalate — see below.

Escalation

  • Not resolved in 20 min → page Platform/Infra on-call: "P1: Rollback of <SERVICE_NAME> is not working, error rate still elevated, need immediate help"
  • Rollback fails (no previous revision) → page Platform/Infra on-call: "P1: Cannot roll back — no previous revision available, need emergency redeploy"
  • Security incident → page Security on-call: "Security incident: suspected malicious code deployed to <SERVICE_NAME>, rolling back now"
  • Scope expanding (multiple services affected) → page Platform/Infra on-call: "Multiple services degraded after deploy, possible shared dependency issue"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Block the bad version from being redeployed until the fix is confirmed
  • Identify why the issue wasn't caught in staging — improve pre-deploy testing
  • Review whether canary deployment or blue/green would have limited the blast radius

Common Mistakes

  1. Rolling back to the wrong revision: Always check helm history or kubectl rollout history before executing — confirm which revision was the last known-good one.
  2. Not verifying error rate drops after rollback: The rollout completing successfully does not mean the error is gone — always check metrics and logs.
  3. Forgetting to notify the team: Rollbacks affect ongoing work (the bad commit is now stuck); engineers need to know not to redeploy the broken version.
  4. Rolling forward when rollback is the right call: If errors start immediately after deploy, roll back first — investigate later. Do not spend 20 minutes debugging while users are experiencing errors.
  5. Not creating a postmortem ticket: The cause of the bad deploy will be forgotten without a ticket. Create it immediately while context is fresh.

Cross-References

  • Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on deployment strategies)
  • Related Runbook: build-failure-triage.md — if you need to fix the code before redeploying
  • Related Runbook: registry-pull-failure.md — if the rollback itself fails due to image pull issues
  • Related Runbook: ../kubernetes/crashloopbackoff.md — if pods are crashing after rollback
