Portal | Level: L2: Operations | Topics: Helm | Domain: DevOps & Tooling

Scenario: Helm Upgrade Broke Prod — Recover Fast

The Prompt

"A colleague ran helm upgrade in production with a new values file. The app is now down — pods are in CrashLoopBackOff. The team is panicking. Walk me through your recovery."

Initial Report

Incident channel: "Production is DOWN. @channel someone ran helm upgrade with the wrong values file 5 minutes ago. Pods are crash-looping. Customers are getting 502s. We need this fixed NOW."

Constraints

  • Time pressure: You have 15 minutes before the next escalation. Production is fully down — every second counts.
  • Limited access: You have Helm write access to the grokdevops namespace. You do not have access to the CI/CD pipeline to re-trigger a deploy. You must use the CLI.

Observable Evidence

  • Dashboard: HTTP 5xx rate spiked to 100%. Uptime monitor shows the site as DOWN.
  • Pod status: All pods show CrashLoopBackOff with 3-5 restarts. helm history shows a new revision deployed 5 minutes ago.
  • Logs: kubectl logs --previous shows application crash on startup, e.g., KeyError: 'DATABASE_URL' — a missing environment variable from the new values file.
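The log line points at a values regression rather than a code bug, so the fastest confirmation is to check whether the key the app crashed on exists in the new values at all. A minimal sketch, using a stand-in file (`/tmp/broken-values.yaml` is hypothetical; in the real incident you would check the values file passed to `helm upgrade`, or `helm get values` output):

```shell
# Stand-in for the broken values file (hypothetical content matching the
# scenario: the DATABASE_URL entry was dropped).
cat > /tmp/broken-values.yaml <<'EOF'
env:
  LOG_LEVEL: debug
EOF

# Is the key the app crashed on present at all?
if grep -q 'DATABASE_URL' /tmp/broken-values.yaml; then
  echo "DATABASE_URL present"
else
  echo "DATABASE_URL missing"   # matches the KeyError in the pod logs
fi
```

A plain `grep` is crude (it ignores YAML nesting), but under time pressure it answers the only question that matters: is the crash explained by the values change?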

Expected Investigation Path

# 1. Assess damage — what's broken right now?
kubectl get pods -n grokdevops
helm status grokdevops -n grokdevops
helm history grokdevops -n grokdevops

# 2. Quick check on what changed
helm get values grokdevops -n grokdevops
kubectl logs -n grokdevops deploy/grokdevops --previous

# 3. Roll back immediately (revision 0 = the previous revision)
helm rollback grokdevops 0 -n grokdevops --wait

# 4. Verify recovery
kubectl get pods -n grokdevops
kubectl rollout status deployment/grokdevops -n grokdevops
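Steps 3 and 4 above can be wrapped into one rollback-then-verify function. This is a sketch, not the runbook script: the release and namespace names come from the scenario, and the dry-run call at the end only echoes the commands so the sequence can be inspected without touching a cluster:

```shell
# Roll back to the previous revision, then block until pods are ready.
# $1/$2 let helm/kubectl be swapped for `echo ...` in a dry run.
recover() {
  local helm_cmd="$1" kubectl_cmd="$2"
  # Revision 0 means "the previous revision"; --wait blocks until the
  # rolled-back pods pass readiness (or the timeout fires).
  $helm_cmd rollback grokdevops 0 -n grokdevops --wait --timeout 5m &&
  $kubectl_cmd rollout status deploy/grokdevops -n grokdevops
}

# Dry run: prints the two commands. Real run: recover helm kubectl
recover "echo helm" "echo kubectl"
```

Chaining with `&&` matters: if the rollback itself fails, there is no point waiting on a rollout that never started.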

Strong Answer

"In a production incident, recovery comes before root-cause analysis. First: helm history to identify the current and previous revisions. Then immediately helm rollback grokdevops 0 (revision 0 means 'the previous revision'), with --wait so the command blocks until the pods are healthy again. Once service is restored, I'd investigate what went wrong: compare the values between the broken and working revisions (helm get values, optionally with --revision), and check the previous container's logs (kubectl logs --previous) for the crash reason. For prevention, I'd add a pre-deploy checklist: a template rendering test (helm template --debug), a staging deployment first, and ideally a GitOps pipeline where every upgrade goes through code review."
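The prevention half of the answer can go beyond a checklist. Helm validates values against a chart's values.schema.json when one exists; where the chart has no schema, even a crude required-keys gate in the deploy wrapper would have caught this incident. A sketch (the key list and file path are assumptions for illustration, not the team's real config):

```shell
# Refuse to proceed if any required key is absent from the values file.
# The key list here is hypothetical; derive the real one from the app's
# configuration (or, better, from the chart's values.schema.json).
check_values() {
  local file="$1" missing=0 key
  for key in DATABASE_URL REDIS_URL; do
    grep -q "$key" "$file" || { echo "missing required key: $key"; missing=1; }
  done
  return "$missing"
}

# Stand-in for the broken values file from the incident:
cat > /tmp/new-values.yaml <<'EOF'
env:
  REDIS_URL: redis://cache:6379
EOF

check_values /tmp/new-values.yaml || echo "values check failed - aborting upgrade"
```

Because `check_values` returns nonzero on a missing key, it can sit directly in a `set -e` deploy script in front of the `helm upgrade` call.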

Common Traps

  • Investigating before recovering — in production, restore service first
  • Not knowing helm rollback syntax — revision 0 = previous, not "delete everything"
  • Forgetting --wait — without it, rollback returns before pods are ready
  • Not mentioning prevention — a senior answer includes "how do we prevent this"

Related Resources

  • Lab: training/interactive/runtime-labs/lab-runtime-05-helm-upgrade-rollback/
  • Runbook: training/library/runbooks/cicd/helm_upgrade_failed.md
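The rollback-syntax trap above can be made concrete by computing which revision `helm rollback <release> 0` will land on. The parser below runs against canned `helm history` table output, so it needs no cluster:

```shell
# Print the second-to-last REVISION from `helm history` table output,
# i.e. the revision `helm rollback <release> 0` returns to.
previous_revision() {
  awk 'NR > 1 { prev = cur; cur = $1 } END { print prev }'
}

# Canned sample of the table; in a live incident, pipe the real output in:
#   helm history grokdevops -n grokdevops | previous_revision
printf 'REVISION\tSTATUS\n5\tsuperseded\n6\tsuperseded\n7\tfailed\n' | previous_revision
# prints 6
```

In practice `helm history -o json` plus a JSON parser is more robust than column scraping, but the awk version works when you only have a terminal and a table.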
