Portal | Level: L2: Operations | Topics: Helm | Domain: DevOps & Tooling
Scenario: Helm Upgrade Broke Prod — Recover Fast¶
The Prompt¶
"A colleague ran helm upgrade in production with a new values file. The app is now down — pods are in CrashLoopBackOff. The team is panicking. Walk me through your recovery."
Initial Report¶
Incident channel: "Production is DOWN. @channel someone ran helm upgrade with the wrong values file 5 minutes ago. Pods are crash-looping. Customers are getting 502s. We need this fixed NOW."
Constraints¶
- Time pressure: You have 15 minutes before the next escalation. Production is fully down — every second counts.
- Limited access: You have Helm write access to the grokdevops namespace. You do not have access to the CI/CD pipeline to re-trigger a deploy. You must use the CLI.
Observable Evidence¶
- Dashboard: HTTP 5xx rate spiked to 100%. Uptime monitor shows the site as DOWN.
- Pod status: All pods show CrashLoopBackOff with 3-5 restarts. helm history shows a new revision deployed 5 minutes ago.
- Logs: kubectl logs --previous shows an application crash on startup, e.g., KeyError: 'DATABASE_URL' — a missing environment variable from the new values file.
Expected Investigation Path¶
# 1. Assess damage — what's broken right now?
kubectl get pods -n grokdevops
helm status grokdevops -n grokdevops
helm history grokdevops -n grokdevops
# 2. Quick check on what changed
helm get values grokdevops -n grokdevops
kubectl logs -n grokdevops deploy/grokdevops --previous
# 3. Rollback immediately
helm rollback grokdevops 0 -n grokdevops --wait
# 4. Verify recovery
kubectl get pods -n grokdevops
kubectl rollout status deployment/grokdevops -n grokdevops
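If the immediately previous revision is also suspect, roll back to an explicit known-good revision rather than relying on 0. A sketch, where revision 2 standing in as the last healthy deploy is an assumption for illustration:

```shell
# List recent revisions, then roll back to an explicit known-good one.
# Revision number 2 below is an assumption, not from the scenario.
helm history grokdevops -n grokdevops --max 10
helm rollback grokdevops 2 -n grokdevops --wait --timeout 5m
kubectl rollout status deployment/grokdevops -n grokdevops --timeout=5m
```

--timeout caps how long --wait blocks, so a rollback that itself fails to converge surfaces quickly instead of hanging the incident response.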
Strong Answer¶
"In a production incident, recovery comes before root cause analysis. First: helm history to see the current and previous revisions. Then immediately helm rollback grokdevops 0 — revision 0 means 'previous revision' — with --wait to confirm pods become healthy. Once the service is restored, I'd investigate what went wrong: compare the values (helm get values) between the broken and working revisions, check the --previous logs for the crash reason. For prevention, I'd ensure we have a pre-deploy checklist: template rendering test (helm template --debug), staging deployment first, and ideally a GitOps pipeline where all upgrades go through code review."
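The post-incident values comparison described above can be done straight from Helm's release history. A sketch, assuming revision 2 was working and revision 3 was the broken upgrade (both numbers are illustrative):

```shell
# Dump the values recorded for each revision and diff them.
# Revision numbers 2 and 3 are assumptions for illustration.
helm get values grokdevops -n grokdevops --revision 2 > /tmp/values-good.yaml
helm get values grokdevops -n grokdevops --revision 3 > /tmp/values-bad.yaml
diff -u /tmp/values-good.yaml /tmp/values-bad.yaml

# Pre-deploy gate for next time: lint and render locally before upgrading.
# The chart path and values file name are assumptions.
helm lint ./chart -f values-prod.yaml
helm template grokdevops ./chart -f values-prod.yaml --debug > /dev/null
```

The diff usually points directly at the dropped or renamed key, and the template render catches chart-level errors before they ever reach the cluster.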
Common Traps¶
- Investigating before recovering — in production, restore service first
- Not knowing helm rollback syntax — revision 0 = previous, not "delete everything"
- Forgetting --wait — without it, rollback returns before pods are ready
- Not mentioning prevention — a senior answer includes "how do we prevent this"
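The missing-environment-variable trap is cheap to catch before deploying. A minimal sketch of a required-keys check; the key names and file path are hypothetical, not the scenario's real schema:

```shell
# Crude pre-deploy sanity check: fail fast if a values file lacks
# required top-level keys. Key names here are illustrative assumptions.
check_values() {
  file="$1"; shift
  missing=""
  for key in "$@"; do
    grep -q "^${key}:" "$file" || missing="$missing $key"
  done
  if [ -n "$missing" ]; then
    echo "ERROR: $file missing required keys:$missing" >&2
    return 1
  fi
  echo "OK: $file has all required keys"
}

# Demo against a sample values file that omits databaseUrl
cat > /tmp/values-prod.yaml <<'EOF'
replicaCount: 3
image: registry.example.com/app:1.2.0
EOF
check_values /tmp/values-prod.yaml replicaCount image databaseUrl \
  || echo "deploy blocked"
```

Wiring a check like this into CI (or a pre-upgrade hook) turns the KeyError from a production outage into a failed pipeline step.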
Practice and Links¶
- Lab: training/interactive/runtime-labs/lab-runtime-05-helm-upgrade-rollback/
- Runbook: training/library/runbooks/cicd/helm_upgrade_failed.md
Wiki Navigation¶
Related Content¶
- Case Study: Pod OOMKilled — Memory Leak in Sidecar, Fix Is Helm Values (Case Study, L2) — Helm
- Helm (Topic Pack, L1) — Helm
- Helm Drills (Drill, L1) — Helm
- Helm Flashcards (CLI) (flashcard_deck, L1) — Helm
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Helm
- Lab: Helm Upgrade Rollback (CLI) (Lab, L1) — Helm
- Runbook: Helm Upgrade Failed (Runbook, L1) — Helm
- Skillcheck: Helm & Release Ops (Assessment, L1) — Helm
- Track: Helm & Release Ops (Reference, L1) — Helm
Pages that link here¶
- Helm - Primer
- Helm Debugging Decision Flow
- Helm Drills
- Helm Skill Check
- Interview Gauntlet: CI/CD for a Monorepo
- Interview Gauntlet: GitOps or Traditional CI/CD?
- Interview Gauntlet: Handling a Production Incident
- Interview Scenarios
- Level 5: SRE & Incident Response
- Runbook: Helm Upgrade Failed
- Solution: Lab Runtime 05 -- Helm Upgrade Rollback
- Symptoms: Pod OOMKilled, Memory Leak Is in Sidecar, Fix Is Helm Values
- Track: Helm & Release Operations
- Track: Incident Response