- devops
- l1
- runbook
- cicd
- gitops

---

Portal | Level: L1: Foundations | Topics: CI/CD, GitOps | Domain: DevOps & Tooling
Runbook: Deploy Rollback¶
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | Elevated error rate after deployment, health check failing after deploy, manual decision to roll back |
| Severity | P1 |
| Est. Resolution Time | 10-20 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, Helm access, deployment permissions, ability to push to git |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
Step 1: Confirm the Deployment Is the Cause¶
Why: Rolling back when the deployment is not the cause wastes time and may make things worse — confirm the error timeline matches the deployment time before acting.
# Check when the current deployment was rolled out
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
# Check when errors started (Prometheus/Grafana — look at error rate graph)
# Or check application logs around the deploy time:
kubectl logs deployment/<DEPLOY_NAME> -n <NAMESPACE> --since=30m | grep -i "error\|panic\|fatal" | tail -30
# Check recent events for the deployment:
kubectl describe deployment <DEPLOY_NAME> -n <NAMESPACE> | grep -A10 "Events"
Events show a recent "Scaled up replica set" or "Updated image" matching the time
when errors began. Log errors start at approximately the same time as the deploy.
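The timeline check above can be sketched as a small script. This is a minimal sketch with sample data inlined for illustration (the timestamps and log lines are hypothetical); in practice you would feed it the real deploy-event time and the real log stream:

```shell
deploy_time="2026-03-18T14:22:00Z"   # e.g. from the "Scaled up replica set" event

# Timestamp of the first error line in the (sample) logs:
first_error=$(printf '%s\n' \
  '2026-03-18T14:20:11Z INFO request ok' \
  '2026-03-18T14:22:31Z ERROR upstream timeout' \
  '2026-03-18T14:22:45Z ERROR upstream timeout' \
  | grep -i ' error ' | head -1 | awk '{print $1}')

# ISO-8601 timestamps compare correctly as strings, so the earlier one sorts first.
if [ "$(printf '%s\n%s\n' "$deploy_time" "$first_error" | sort | head -1)" = "$deploy_time" ]; then
  echo "first error at $first_error, after the deploy: rollback is justified"
else
  echo "errors predate the deploy: investigate other causes before rolling back"
fi
```

If the first error predates the deploy, stop here and treat this as a different incident: rolling back will not fix it.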
Step 2: Identify the Previous Good Revision¶
Why: You must know which revision to roll back to — rolling back to the wrong one can deploy a previously broken version.
# If using Helm (most common in Kubernetes environments):
helm history <RELEASE_NAME> -n <NAMESPACE>
# If using kubectl directly:
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
# To see what image/config a specific revision used:
kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE> --revision=<REVISION_NUMBER>
Helm output:

REVISION  UPDATED                   STATUS      CHART        APP VERSION  DESCRIPTION
1         Thu Jan 01 00:00:00 2026  superseded  myapp-1.2.3  1.2.3        Install complete
2         Mon Mar 18 14:22:00 2026  deployed    myapp-1.2.4  1.2.4        Upgrade complete

kubectl output:

REVISION  CHANGE-CAUSE
1         <none>
2         Update image to v1.2.4
Gotcha: if revisionHistoryLimit is set to 0 in the deployment spec, Kubernetes keeps no old ReplicaSets; you cannot roll back with kubectl and must redeploy the previous image tag manually.
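Picking the previous revision can be scripted against the rollout history. A minimal sketch, with sample `kubectl rollout history` output inlined for illustration; in practice, pipe the real command output into the same awk:

```shell
# Sample output of: kubectl rollout history deployment/<DEPLOY_NAME> -n <NAMESPACE>
history_output='REVISION  CHANGE-CAUSE
1         <none>
2         Update image to v1.2.4'

# Skip the header, remember each revision number, emit the second-newest one.
prev_revision=$(printf '%s\n' "$history_output" \
  | awk 'NR > 1 {revs[NR-1] = $1} END {if (NR > 2) print revs[NR-2]}')

echo "candidate --to-revision value: $prev_revision"
```

An empty result means only one revision exists, which is exactly the "cannot roll back, redeploy manually" case noted above.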
Step 3: Execute the Rollback¶
Why: Getting back to a known-good state stops the bleeding — this is the most time-critical step.
# Option A — Roll back with Helm (use this if Helm manages the deployment):
helm rollback <RELEASE_NAME> <REVISION_NUMBER> -n <NAMESPACE>
# Option B — Roll back with kubectl (use this if deployed with plain kubectl):
kubectl rollout undo deployment/<DEPLOY_NAME> -n <NAMESPACE>
# To roll back to a specific revision (not just the previous one):
kubectl rollout undo deployment/<DEPLOY_NAME> -n <NAMESPACE> --to-revision=<REVISION_NUMBER>
# Option C — Manually pin the previous image (if no rollout history exists):
kubectl set image deployment/<DEPLOY_NAME> <CONTAINER_NAME>=<IMAGE>:<PREVIOUS_TAG> -n <NAMESPACE>
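The choice between the options can be wrapped in a small script. This is a sketch, not a definitive implementation: `HELM_MANAGED` is an assumed flag (set it according to whether Helm owns the release), and `DRY_RUN=1` only prints the command so the script can be sanity-checked before use:

```shell
DRY_RUN=1
RELEASE_NAME=myapp; DEPLOY_NAME=myapp; NAMESPACE=prod; REVISION=1   # illustrative values

# Record the chosen command; execute it only when not a dry run.
run() {
  LAST_CMD="$*"
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

HELM_MANAGED=1   # assumption: Helm manages this deployment
if [ "$HELM_MANAGED" = "1" ]; then
  run helm rollback "$RELEASE_NAME" "$REVISION" -n "$NAMESPACE"
else
  run kubectl rollout undo "deployment/$DEPLOY_NAME" -n "$NAMESPACE" --to-revision="$REVISION"
fi
```

Mixing Helm and kubectl rollbacks on the same release causes drift in Helm's release history, which is why the script commits to one path.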
Step 4: Monitor Rollout Status¶
Why: The rollback command initiates the rollout but does not wait for it to complete — you must confirm the new pods are healthy before declaring success.
# Watch the rollout progress in real time
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>
# In a second terminal, watch pods come up:
kubectl get pods -n <NAMESPACE> -w -l app=<APP_LABEL>
kubectl rollout status output:
"Waiting for deployment "<DEPLOY_NAME>" rollout to finish: 1 old replicas are pending termination..."
"deployment "<DEPLOY_NAME>" successfully rolled out"
kubectl get pods should show every pod in Running state, with the READY column reporting all containers ready (e.g. 2/2).
If pods enter CrashLoopBackOff or Pending after the rollback, the previous revision may also have an issue. Check the logs of the last crashed container: kubectl logs <POD_NAME> -n <NAMESPACE> --previous
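The "all pods healthy" check can be automated by parsing `kubectl get pods` output. A minimal sketch with sample output inlined for illustration (pod names are hypothetical); pipe the real command into the same awk:

```shell
pods_output='NAME           READY   STATUS    RESTARTS   AGE
myapp-7d4f-a   2/2     Running   0          1m
myapp-7d4f-b   2/2     Running   0          1m'

# A pod is unhealthy if its STATUS is not Running or its READY count is short.
not_ready=$(printf '%s\n' "$pods_output" | awk '
  NR > 1 {
    split($2, r, "/")
    if ($3 != "Running" || r[1] != r[2]) print $1
  }')

if [ -z "$not_ready" ]; then
  echo "all pods healthy"
else
  echo "unhealthy pods: $not_ready"
fi
```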
Step 5: Verify Error Rate Has Dropped¶
Why: The pods being healthy does not automatically mean the application is serving traffic correctly — verify end-to-end.
# Check application logs for continued errors:
kubectl logs deployment/<DEPLOY_NAME> -n <NAMESPACE> --since=5m | grep -i "error\|panic\|fatal" | tail -20
# If you have Prometheus/Grafana: check the error rate dashboard
# Typical PromQL query to check error rate:
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Quick HTTP health check if service has a health endpoint:
kubectl port-forward service/<SERVICE_NAME> 8080:80 -n <NAMESPACE> &
curl -s http://localhost:8080/health
No new ERROR lines in logs after the rollback completed.
Health endpoint returns: {"status": "ok"} or HTTP 200.
Grafana error rate graph shows the rate dropping back to baseline.
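The PromQL ratio above (5xx over total) can also be computed by hand from a sample of recent response codes, e.g. pulled from access logs, when a metrics dashboard is unavailable. A minimal sketch with an illustrative sample:

```shell
# Ten recent response status codes (sample data for illustration):
statuses='200 200 503 200 200 200 200 200 500 200'

# Same ratio as the PromQL query: count of 5xx responses / total responses.
error_rate=$(printf '%s\n' $statuses | awk '
  { total++; if ($1 ~ /^5/) errors++ }
  END { printf "%.2f", errors / total }')

echo "error rate: $error_rate"   # 2 of 10 requests were 5xx -> 0.20
```

Compare the result against your pre-incident baseline rather than against zero; most services carry a small background error rate.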
Step 6: Notify Team and Create Postmortem Ticket¶
Why: Stakeholders need to know about rollbacks; the team needs to investigate before re-attempting the deployment; P1 incidents require a postmortem.
# Post in the team incident channel (Slack/Teams):
# Template:
# "ROLLBACK COMPLETE: Rolled back <SERVICE_NAME> from <BAD_VERSION> to <GOOD_VERSION> in <NAMESPACE>.
# Error rate has returned to baseline.
# Root cause under investigation — DO NOT redeploy <BAD_VERSION> until fixed.
# Postmortem ticket: <TICKET_LINK>"
# Create a postmortem ticket in your issue tracker (Jira/Linear/GitHub Issues)
# Title: "Postmortem: <SERVICE_NAME> rollback on <DATE>"
echo "Notify team in incident channel and create postmortem ticket"
Acknowledgment from team in incident channel.
Postmortem ticket created and assigned to the engineer who made the bad deploy.
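Filling the announcement template can be scripted so every rollback notice is consistent. A sketch using illustrative placeholder values (the service, versions, and ticket link here are examples, not real):

```shell
SERVICE_NAME=myapp; BAD_VERSION=1.2.4; GOOD_VERSION=1.2.3
NAMESPACE=prod; TICKET_LINK="JIRA-1234"   # illustrative values

message="ROLLBACK COMPLETE: Rolled back $SERVICE_NAME from $BAD_VERSION to $GOOD_VERSION in $NAMESPACE.
Error rate has returned to baseline.
Root cause under investigation, DO NOT redeploy $BAD_VERSION until fixed.
Postmortem ticket: $TICKET_LINK"

# Print the message for pasting into the incident channel (or pipe it to a
# chat webhook if your team has one configured).
echo "$message"
```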
Verification¶
# Confirm the issue is resolved
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
All pods Running with all containers READY. rollout status reports "successfully rolled out". Error rate in monitoring has returned to the pre-incident baseline.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Infra on-call | "P1: Rollback of <SERVICE_NAME> not resolved after 20 minutes, need assistance" |
| Rollback fails (no previous revision) | Platform/Infra on-call | "P1: Cannot roll back <SERVICE_NAME>, no previous revision available" |
| Security incident | Security on-call | "Security incident: suspected malicious code deployed to <SERVICE_NAME> in <NAMESPACE>" |
| Scope expanding (multiple services affected) | Platform/Infra on-call | "Multiple services degraded after deploy, possible shared dependency issue, need broader investigation" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Block the bad version from being redeployed until the fix is confirmed
- Identify why the issue wasn't caught in staging — improve pre-deploy testing
- Review whether canary deployment or blue/green would have limited the blast radius
Common Mistakes¶
- Rolling back to the wrong revision: Always check helm history or kubectl rollout history before executing — confirm which revision was the last known-good one.
- Not verifying error rate drops after rollback: The rollout completing successfully does not mean the error is gone — always check metrics and logs.
- Forgetting to notify the team: Rollbacks affect ongoing work (the bad commit is now stuck); engineers need to know not to redeploy the broken version.
- Rolling forward when rollback is the right call: If errors start immediately after deploy, roll back first — investigate later. Do not spend 20 minutes debugging while users are experiencing errors.
- Not creating a postmortem ticket: The cause of the bad deploy will be forgotten without a ticket. Create it immediately while context is fresh.
Cross-References¶
- Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on deployment strategies)
- Related Runbook: build-failure-triage.md — if you need to fix the code before redeploying
- Related Runbook: registry-pull-failure.md — if the rollback itself fails due to image pull issues
- Related Runbook: ../kubernetes/crashloopbackoff.md — if pods are crashing after rollback