Runbook: Helm Upgrade Failed
Symptoms
- `helm upgrade` returns error or times out
- Release stuck in `pending-upgrade` or `failed` state
- New pods not rolling out
Fast Triage
helm list -n grokdevops
helm history grokdevops -n grokdevops
helm status grokdevops -n grokdevops
kubectl get pods -n grokdevops
kubectl get events -n grokdevops --sort-by='.lastTimestamp' | tail -20
Likely Causes (ranked)
- Bad values — invalid YAML, wrong image tag, missing required field
- Template rendering error — Helm template syntax issue
- Resource conflict — CRD not installed (e.g., ServiceMonitor without prometheus-operator)
- Timeout — pods didn't become ready in time (probe, resource, image issue)
- Stuck release — previous failed upgrade left release in bad state
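One way to check which of these causes applies is sketched below. The chart path `./chart` and the values file `values.yaml` are assumptions, not names taken from this runbook; the CRD name is the one prometheus-operator registers for ServiceMonitor.

```shell
# Hypothetical chart location and values file; substitute your own.
CHART=./chart
VALUES=values.yaml

check_causes() {
  # Bad values / template errors: lint and render locally, no cluster changes.
  helm lint "$CHART" -f "$VALUES"
  helm template grokdevops "$CHART" -n grokdevops -f "$VALUES" > /dev/null

  # Missing CRD: ServiceMonitor is provided by prometheus-operator.
  kubectl get crd servicemonitors.monitoring.coreos.com

  # Timeout: look for pods that are not Running or not Ready.
  kubectl get pods -n grokdevops

  # Stuck release: --all also lists releases in pending states.
  helm list -n grokdevops --all
}
```

Call `check_causes` from a shell with access to the cluster. A failure in the first two commands points at values or templates; a "not found" from the CRD lookup points at a missing operator.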
Evidence Interpretation
What bad looks like:
$ helm history grokdevops -n grokdevops
REVISION STATUS DESCRIPTION
1 deployed Install complete
2 failed Upgrade "grokdevops" failed: timed out waiting for the condition
- `failed` status means the upgrade ran but resources did not become healthy in time (or template rendering errored).
- `pending-upgrade` means Helm started the upgrade but never finished. The release is locked, and further upgrades are rejected until you roll back.
- Check `helm status` for the error message and `kubectl get events` for what went wrong at the Kubernetes level.
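The interpretation above can be captured as a small lookup, shown here as a sketch. The function name `next_step_for_status` is hypothetical, and it covers only the statuses this runbook discusses.

```shell
# Map a release status (as shown by `helm history` or `helm status`)
# to the next step suggested in this runbook.
next_step_for_status() {
  case "$1" in
    deployed)
      echo "healthy: no action needed" ;;
    failed)
      echo "check helm status and events, then roll back or fix values and retry" ;;
    pending-upgrade)
      echo "release is locked: roll back to the last deployed revision" ;;
    *)
      echo "not covered by this runbook: check helm status output" ;;
  esac
}
```

For example, `next_step_for_status failed` prints the roll-back-or-retry hint. Assuming `jq` is installed, the current status could be fed in with `next_step_for_status "$(helm status grokdevops -n grokdevops -o json | jq -r .info.status)"`.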
Fix Steps
- If bad values, test template rendering:
- Roll back to last working revision:
- If release is stuck in `pending-upgrade`:
- Fix values and retry:
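The steps above might look like the following sketch. The chart path `./chart`, the values file `values.yaml`, and the helper function names are assumptions for illustration. Revision `1` in the usage note is the last `deployed` revision from the history example in this runbook; use whatever `helm history` shows for your release.

```shell
# Hypothetical chart location and values file; substitute your own.
CHART=./chart
VALUES=values.yaml

# Bad values: render templates locally to surface YAML and template errors.
render_check() {
  helm template grokdevops "$CHART" -n grokdevops -f "$VALUES"
}

# Roll back to a known-good revision. This also applies to a release
# stuck in pending-upgrade.
rollback_to() {
  helm rollback grokdevops "$1" -n grokdevops
}

# Retry the upgrade after fixing values; --wait blocks until resources are
# ready or the timeout expires.
retry_upgrade() {
  helm upgrade grokdevops "$CHART" -n grokdevops -f "$VALUES" --wait --timeout 5m
}
```

A typical sequence would be `render_check`, then `rollback_to 1`, then `retry_upgrade` once the values are corrected.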
Verification
helm status grokdevops -n grokdevops # STATUS: deployed
kubectl rollout status deployment/grokdevops -n grokdevops
Cleanup
Clean up failed revisions:
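A minimal sketch of one way to do this, assuming Helm's default Secret storage backend and the `owner`, `name`, and `status` labels Helm sets on release Secrets. The function names are hypothetical. Run the delete only after the release is back in `deployed` state, so that the latest revision is never among the Secrets removed.

```shell
# List the Secrets that hold failed revisions of the release.
list_failed_revisions() {
  kubectl get secrets -n grokdevops -l owner=helm,name=grokdevops,status=failed
}

# Delete those Secrets. Review the list above before running this.
delete_failed_revisions() {
  kubectl delete secrets -n grokdevops -l owner=helm,name=grokdevops,status=failed
}
```

As an alternative to manual deletion, `helm upgrade` accepts `--history-max` to cap how many revisions are kept per release.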
Unknown Unknowns
- Helm stores each release revision as a Secret in the namespace (type `helm.sh/release.v1`). You can inspect them with `kubectl get secrets -l owner=helm`.
- `helm rollback` does not delete the failed revision; it creates a new revision with the old config. Revision numbers only go up.
- The `--atomic` flag on `helm upgrade` auto-rolls back on failure, preventing stuck `pending-upgrade` states.
- Helm uses a 3-way merge (old manifest, new manifest, live state). If someone edited a resource with `kubectl edit`, the merge can produce surprises.
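To look inside the stored revisions, something like the sketch below may help. It assumes the default Secret storage backend, where each Secret is named `sh.helm.release.v1.<release>.v<revision>` and the payload is gzipped JSON that is base64-encoded by Helm and again by Kubernetes. The function names are hypothetical.

```shell
# List release Secrets with their status and revision shown as extra columns.
list_release_secrets() {
  kubectl get secrets -n grokdevops -l owner=helm,name=grokdevops -L status,version
}

# Decode the stored release payload for one revision number.
# Two base64 layers (Kubernetes, then Helm) wrap the gzipped JSON.
decode_release() {
  kubectl get secret "sh.helm.release.v1.grokdevops.v$1" -n grokdevops \
    -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
}
```

For example, `decode_release 2` would print the JSON for revision 2, including the rendered manifest and the values that were used, which is handy for comparing against live state after a surprising merge.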
[!WARNING] Never run `helm uninstall` to "fix" a failed upgrade. Uninstalling deletes all managed resources (Deployments, Services, PVCs if unprotected). Use `helm rollback` instead: it creates a new revision with the last known-good config without destroying anything.
Pitfalls
- Running upgrade again without fixing values: the same bad config will fail again and add another failed revision.
- Deleting the release instead of rolling back: `helm uninstall` removes all managed resources (including PVCs if not protected). Use `helm rollback` instead.
- Not using `--dry-run` first: `helm upgrade --dry-run` catches template errors and bad values before touching the cluster.
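A sketch of an upgrade wrapper that avoids these pitfalls: it does a dry run first and only then applies the change with `--atomic`. The chart path, values file, and the function name `safe_upgrade` are assumptions.

```shell
# Hypothetical chart location and values file; substitute your own.
CHART=./chart
VALUES=values.yaml

safe_upgrade() {
  # Dry run: renders and validates without changing the cluster. Stop on error.
  helm upgrade grokdevops "$CHART" -n grokdevops -f "$VALUES" --dry-run || return 1

  # Real upgrade: --atomic rolls back automatically if the upgrade fails.
  helm upgrade grokdevops "$CHART" -n grokdevops -f "$VALUES" --atomic --timeout 5m
}
```

If the dry run fails, the function returns non-zero and the real upgrade is never attempted, so no extra failed revision is recorded.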
See Also
- `training/library/guides/troubleshooting.md` (Helm section)
- `training/interactive/runtime-labs/lab-runtime-05-helm-upgrade-rollback/`
- `training/interview-scenarios/05-helm-upgrade-broke-prod.md`
- `training/interactive/incidents/scenarios/helm-upgrade-bad-values.sh`