- devops
- l1
- runbook
- cicd
- gitops

---

Portal | Level: L1: Foundations | Topics: CI/CD, GitOps | Domain: DevOps & Tooling
Runbook: Deploy Rollback¶
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | Elevated error rate after deployment, health check failing after deploy, manual decision to roll back |
| Severity | P1 |
| Est. Resolution Time | 10-20 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, Helm access, deployment permissions, ability to push to git |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
Step 1: Confirm the Deployment Is the Cause¶
Why: Rolling back when the deployment is not the cause wastes time and may make things worse — confirm the error timeline matches the deployment time before acting.
# Check when the current deployment was rolled out
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
# Check when errors started (Prometheus/Grafana — look at error rate graph)
# Or check application logs around the deploy time:
kubectl logs deployment/<DEPLOY_NAME> -n <NAMESPACE> --since=30m | grep -i "error\|panic\|fatal" | tail -30
# Check recent events for the deployment:
kubectl describe deployment <DEPLOY_NAME> -n <NAMESPACE> | grep -A10 "Events"
Events show a recent "Scaled up replica set" or "Updated image" matching the time
when errors began. Log errors start at approximately the same time as the deploy.
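The timeline check above can be sketched as a small script. This is a minimal sketch with sample data inlined for illustration (the timestamps and log lines are hypothetical); in practice you would feed it the real deploy-event time and the real log stream:

```shell
deploy_time="2026-03-18T14:22:00Z"   # e.g. from the "Scaled up replica set" event

# Timestamp of the first error line in the (sample) logs:
first_error=$(printf '%s\n' \
  '2026-03-18T14:20:11Z INFO request ok' \
  '2026-03-18T14:22:31Z ERROR upstream timeout' \
  '2026-03-18T14:22:45Z ERROR upstream timeout' \
  | grep -i ' error ' | head -1 | awk '{print $1}')

# ISO-8601 timestamps compare correctly as strings, so the earlier one sorts first.
if [ "$(printf '%s\n%s\n' "$deploy_time" "$first_error" | sort | head -1)" = "$deploy_time" ]; then
  echo "first error at $first_error, after the deploy: rollback is justified"
else
  echo "errors predate the deploy: investigate other causes before rolling back"
fi
```

If the first error predates the deploy, stop here and treat this as a different incident: rolling back will not fix it.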
Step 2: Identify the Previous Good Revision¶
Why: You must know which revision to roll back to — rolling back to the wrong one can deploy a previously broken version.
# If using Helm (most common in Kubernetes environments):
helm history <RELEASE_NAME> -n <NAMESPACE>
# If using kubectl directly:
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
# To see what image/config a specific revision used:
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE> --revision=<REVISION_NUMBER>
Helm output:

REVISION  UPDATED                   STATUS      CHART        APP VERSION  DESCRIPTION
1         Thu Jan 01 00:00:00 2026  superseded  myapp-1.2.3  1.2.3        Install complete
2         Mon Mar 18 14:22:00 2026  deployed    myapp-1.2.4  1.2.4        Upgrade complete

kubectl output:

REVISION  CHANGE-CAUSE
1         <none>
2         Update image to v1.2.4
Gotcha: if revisionHistoryLimit is set to 0 in the deployment spec, Kubernetes keeps no old ReplicaSets; you cannot roll back with kubectl and must redeploy the previous image tag manually.
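Picking the previous revision can be scripted against the rollout history. A minimal sketch, with sample `kubectl rollout history` output inlined for illustration; in practice, pipe the real command output into the same awk:

```shell
# Sample output of: kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
history_output='REVISION  CHANGE-CAUSE
1         <none>
2         Update image to v1.2.4'

# Skip the header, remember each revision number, emit the second-newest one.
prev_revision=$(printf '%s\n' "$history_output" \
  | awk 'NR > 1 {revs[NR-1] = $1} END {if (NR > 2) print revs[NR-2]}')

echo "candidate --to-revision value: $prev_revision"
```

An empty result means only one revision exists, which is exactly the "cannot roll back, redeploy manually" case noted above.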
Step 3: Execute the Rollback¶
Why: Getting back to a known-good state stops the bleeding — this is the most time-critical step.
# Option A — Roll back with Helm (use this if Helm manages the deployment):
helm rollback <RELEASE_NAME> <REVISION_NUMBER> -n <NAMESPACE>
# Option B — Roll back with kubectl (use this if deployed with plain kubectl):
kubectl rollout undo deployment/<DEPLOY_NAME> -n <NAMESPACE>
# To roll back to a specific revision (not just the previous one):
kubectl rollout undo deployment/<DEPLOY_NAME> -n <NAMESPACE> --to-revision=<REVISION_NUMBER>
# Option C — Manually pin the previous image (if no rollout history exists):
kubectl set image deployment/<DEPLOY_NAME> <CONTAINER_NAME>=<IMAGE>:<PREVIOUS_TAG> -n <NAMESPACE>
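The choice between the options can be wrapped in a small script. This is a sketch, not a definitive implementation: `HELM_MANAGED` is an assumed flag (set it according to whether Helm owns the release), and `DRY_RUN=1` only prints the command so the script can be sanity-checked before use:

```shell
DRY_RUN=1
RELEASE_NAME=myapp; DEPLOY_NAME=myapp; NAMESPACE=prod; REVISION=1   # illustrative values

# Record the chosen command; execute it only when not a dry run.
run() {
  LAST_CMD="$*"
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

HELM_MANAGED=1   # assumption: Helm manages this deployment
if [ "$HELM_MANAGED" = "1" ]; then
  run helm rollback "$RELEASE_NAME" "$REVISION" -n "$NAMESPACE"
else
  run kubectl rollout undo "deployment/$DEPLOY_NAME" -n "$NAMESPACE" --to-revision="$REVISION"
fi
```

Mixing Helm and kubectl rollbacks on the same release causes drift in Helm's release history, which is why the script commits to one path.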
Step 4: Monitor Rollout Status¶
Why: The rollback command initiates the rollout but does not wait for it to complete — you must confirm the new pods are healthy before declaring success.
# Watch the rollout progress in real time
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>
# In a second terminal, watch pods come up:
kubectl get pods -n <NAMESPACE> -w -l app=<APP_LABEL>
kubectl rollout status output:
"Waiting for deployment "<DEPLOY_NAME>" rollout to finish: 1 old replicas are pending termination..."
"deployment "<DEPLOY_NAME>" successfully rolled out"
kubectl get pods should show every pod in Running state, with the READY column reporting all containers ready (e.g. 2/2).
If pods enter CrashLoopBackOff or Pending after the rollback, the previous revision may also have an issue. Check the logs of the last crashed container: kubectl logs <POD_NAME> -n <NAMESPACE> --previous
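The "all pods healthy" check can be automated by parsing `kubectl get pods` output. A minimal sketch with sample output inlined for illustration (pod names are hypothetical); pipe the real command into the same awk:

```shell
pods_output='NAME           READY   STATUS    RESTARTS   AGE
myapp-7d4f-a   2/2     Running   0          1m
myapp-7d4f-b   2/2     Running   0          1m'

# A pod is unhealthy if its STATUS is not Running or its READY count is short.
not_ready=$(printf '%s\n' "$pods_output" | awk '
  NR > 1 {
    split($2, r, "/")
    if ($3 != "Running" || r[1] != r[2]) print $1
  }')

if [ -z "$not_ready" ]; then
  echo "all pods healthy"
else
  echo "unhealthy pods: $not_ready"
fi
```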
Step 5: Verify Error Rate Has Dropped¶
Why: The pods being healthy does not automatically mean the application is serving traffic correctly — verify end-to-end.
# Check application logs for continued errors:
kubectl logs deployment/<DEPLOY_NAME> -n <NAMESPACE> --since=5m | grep -i "error\|panic\|fatal" | tail -20
# If you have Prometheus/Grafana: check the error rate dashboard
# Typical PromQL query to check error rate:
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Quick HTTP health check if service has a health endpoint:
kubectl port-forward service/<SERVICE_NAME> 8080:80 -n <NAMESPACE> &
curl -s http://localhost:8080/health
No new ERROR lines in logs after the rollback completed.
Health endpoint returns: {"status": "ok"} or HTTP 200.
Grafana error rate graph shows the rate dropping back to baseline.
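The PromQL ratio above (5xx over total) can also be computed by hand from a sample of recent response codes, e.g. pulled from access logs, when a metrics dashboard is unavailable. A minimal sketch with an illustrative sample:

```shell
# Ten recent response status codes (sample data for illustration):
statuses='200 200 503 200 200 200 200 200 500 200'

# Same ratio as the PromQL query: count of 5xx responses / total responses.
error_rate=$(printf '%s\n' $statuses | awk '
  { total++; if ($1 ~ /^5/) errors++ }
  END { printf "%.2f", errors / total }')

echo "error rate: $error_rate"   # 2 of 10 requests were 5xx -> 0.20
```

Compare the result against your pre-incident baseline rather than against zero; most services carry a small background error rate.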
Step 6: Notify Team and Create Postmortem Ticket¶
Why: Stakeholders need to know about rollbacks; the team needs to investigate before re-attempting the deployment; P1 incidents require a postmortem.
# Post in the team incident channel (Slack/Teams):
# Template:
# "ROLLBACK COMPLETE: Rolled back <SERVICE_NAME> from <BAD_VERSION> to <GOOD_VERSION> in <NAMESPACE>.
# Error rate has returned to baseline.
# Root cause under investigation — DO NOT redeploy <BAD_VERSION> until fixed.
# Postmortem ticket: <TICKET_LINK>"
# Create a postmortem ticket in your issue tracker (Jira/Linear/GitHub Issues)
# Title: "Postmortem: <SERVICE_NAME> rollback on <DATE>"
echo "Notify team in incident channel and create postmortem ticket"
Acknowledgment from team in incident channel.
Postmortem ticket created and assigned to the engineer who made the bad deploy.
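Filling the announcement template can be scripted so every rollback notice is consistent. A sketch using illustrative placeholder values (the service, versions, and ticket link here are examples, not real):

```shell
SERVICE_NAME=myapp; BAD_VERSION=1.2.4; GOOD_VERSION=1.2.3
NAMESPACE=prod; TICKET_LINK="JIRA-1234"   # illustrative values

message="ROLLBACK COMPLETE: Rolled back $SERVICE_NAME from $BAD_VERSION to $GOOD_VERSION in $NAMESPACE.
Error rate has returned to baseline.
Root cause under investigation, DO NOT redeploy $BAD_VERSION until fixed.
Postmortem ticket: $TICKET_LINK"

# Print the message for pasting into the incident channel (or pipe it to a
# chat webhook if your team has one configured).
echo "$message"
```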
Verification¶
# Confirm the issue is resolved
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
All pods Running with all containers READY. rollout status reports "successfully rolled out". Error rate in monitoring has returned to the pre-incident baseline.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Infra on-call | "P1: Rollback of <SERVICE_NAME> not resolved after 20 minutes, need assistance" |
| Rollback fails (no previous revision) | Platform/Infra on-call | "P1: Cannot roll back <SERVICE_NAME>, no previous revision available" |
| Security incident | Security on-call | "Security incident: suspected malicious code deployed to <SERVICE_NAME> in <NAMESPACE>" |
| Scope expanding (multiple services affected) | Platform/Infra on-call | "Multiple services degraded after deploy, possible shared dependency issue, need broader investigation" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Block the bad version from being redeployed until the fix is confirmed
- Identify why the issue wasn't caught in staging — improve pre-deploy testing
- Review whether canary deployment or blue/green would have limited the blast radius
Common Mistakes¶
- Rolling back to the wrong revision: Always check helm history or kubectl rollout history before executing — confirm which revision was the last known-good one.
- Not verifying error rate drops after rollback: The rollout completing successfully does not mean the error is gone — always check metrics and logs.
- Forgetting to notify the team: Rollbacks affect ongoing work (the bad commit is now stuck); engineers need to know not to redeploy the broken version.
- Rolling forward when rollback is the right call: If errors start immediately after deploy, roll back first — investigate later. Do not spend 20 minutes debugging while users are experiencing errors.
- Not creating a postmortem ticket: The cause of the bad deploy will be forgotten without a ticket. Create it immediately while context is fresh.
Cross-References¶
- Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on deployment strategies)
- Related Runbook: build-failure-triage.md — if you need to fix the code before redeploying
- Related Runbook: registry-pull-failure.md — if the rollback itself fails due to image pull issues
- Related Runbook: ../kubernetes/crashloopbackoff.md — if pods are crashing after rollback