Thinking Out Loud: Helm¶
A senior SRE's internal monologue while working through a real Helm issue. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
A Helm upgrade for the api-gateway service failed midway, leaving the release in a FAILED state. The previous version is still running (3 pods), but Helm refuses to do another upgrade because of the failed state. The team needs to deploy a hotfix urgently.
The Monologue¶
Helm release in FAILED state blocking deploys. This is one of the most stressful Helm scenarios because you're stuck — you can't go forward (upgrade fails) and rolling back might make things worse if the failed release partially applied resources.
Let me assess the damage first.
api-gateway — revision 14, status FAILED, chart api-gateway-2.5.0. The previous successful release was revision 13. Let me see what happened during the failed upgrade.
helm history api-gateway -n gateway
Revision 13: DEPLOYED (2.4.3). Revision 14: FAILED (2.5.0). The upgrade from 2.4.3 to 2.5.0 failed. Let me check what actually changed.
helm get manifest api-gateway -n gateway --revision 13 > /tmp/rev13.yaml
helm get manifest api-gateway -n gateway --revision 14 > /tmp/rev14.yaml
diff /tmp/rev13.yaml /tmp/rev14.yaml | head -60
The failed release added a new CRD and modified the deployment spec. Let me check what resources were actually applied.
3 pods running (from revision 13), the service is up, the ingress is intact. The failed revision didn't take down the running pods. Good — the Deployment's rolling-update strategy keeps the old pods running until the new ones are ready; that's Kubernetes behavior, not something Helm adds.
Mental Model: Helm Failure States¶
A FAILED Helm release doesn't necessarily mean resources are broken. Helm marks a release as FAILED if any post-install/post-upgrade hook fails, if a timeout occurs waiting for resources to be ready, or if the Kubernetes API returns errors during apply. The existing resources from the previous successful release usually continue running. Always check the actual cluster state, not just the Helm status.
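A minimal sketch of that triage check. The status and pod count are hardcoded here for illustration; in a real incident they would come from helm status and kubectl get pods:

```shell
# Triage sketch: a FAILED release with healthy pods is an investigation,
# not an outage. Values are hardcoded; in practice they would come from
# `helm status api-gateway -n gateway` and `kubectl get pods -n gateway`.
HELM_STATUS="failed"
READY_PODS=3

if [ "$HELM_STATUS" = "failed" ] && [ "$READY_PODS" -gt 0 ]; then
  echo "release FAILED but workload healthy: investigate before touching anything"
else
  echo "workload impacted: treat as an incident"
fi
```

The point is the ordering: read the cluster before acting on the Helm status.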
Let me check what caused the failure.
helm get notes api-gateway -n gateway --revision 14 2>/dev/null
kubectl get events -n gateway --sort-by=.lastTimestamp | tail -20
Events show: "Job api-gateway-migrate failed." It's a database migration job that runs as a post-upgrade hook. The migration failed, which caused Helm to mark the entire release as FAILED.
"ERROR: relation 'api_routes_v2' already exists." The migration was trying to create a table that already exists — probably from a previous failed attempt. The migration isn't idempotent. Classic.
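The real fix belongs in the SQL (CREATE TABLE IF NOT EXISTS), but the shape of an idempotent migration is worth sketching. Everything below is hypothetical: table_exists and create_table are stand-ins for psql queries against the real database.

```shell
# Hypothetical sketch of the missing idempotency guard; the check-then-create
# shape mirrors SQL's CREATE TABLE IF NOT EXISTS. table_exists/create_table
# are stand-ins for psql queries, not real tools.
STATE_DIR=$(mktemp -d)
table_exists() { [ -e "$STATE_DIR/$1" ]; }
create_table() { : > "$STATE_DIR/$1"; }

migrate() {
  if table_exists api_routes_v2; then
    echo "api_routes_v2 already exists, skipping"   # re-run is a no-op, not an error
  else
    create_table api_routes_v2
    echo "created api_routes_v2"
  fi
}

migrate   # first run: creates the table
migrate   # second run: skips instead of failing
```

A migration written this way can be retried by a Helm hook any number of times without manual cleanup.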
Now I need to: (1) fix the release state so I can do another upgrade, and (2) handle the migration issue.
Option 1: helm rollback to revision 13. This reverts the Helm release to the previous state, but it won't undo the partially-created database table. And it creates revision 15, which is a copy of revision 13.
Option 2: Fix the migration, then helm upgrade again. Helm will create revision 15 with the same chart version (2.5.0).
Option 3: helm upgrade with --force, which resolves conflicts by deleting and recreating resources rather than patching them. That risks downtime and is a last resort, not a fix.
I'm going with Option 2. Let me fix the migration first, then retry the upgrade.
Now I need to make the migration idempotent. The chart's migration hook should use IF NOT EXISTS in the CREATE TABLE statements. But I can't change the chart right now for the hotfix. Let me work around it.
Actually, the hotfix is a code change, not a chart change. Let me check if the hotfix version (let's say 2.5.1) has the migration fix.
Not available. The team is building 2.5.1 now with the fix. In the meantime, let me unblock the deploy pipeline by handling the FAILED release state.
Mental Model: Helm Rollback as State Reset¶
When a Helm release is in FAILED state, helm rollback to the last successful revision resets the state machine. It doesn't necessarily change what's running in the cluster (if the failed release didn't modify resources), but it clears the FAILED state so you can attempt another upgrade. Think of rollback as "resetting the Helm state to match the last known good state."
helm rollback api-gateway 13 -n gateway
"Rollback was a success!" Status is now DEPLOYED again at revision 15 (which is a copy of revision 13's manifest). The cluster state hasn't changed — the same 3 pods are running.
Now let me handle the migration table that was partially created. I'll drop it so the next migration can create it cleanly.
kubectl exec -it postgresql-0 -n gateway -- psql -U apigateway -c "DROP TABLE IF EXISTS api_routes_v2;"
Table dropped. When the team has the 2.5.1 chart ready with the idempotent migration, I can do a clean upgrade. But first, let me add some guardrails.
For the upgrade, I want to do a dry run first, then apply with a longer timeout since the migration might take a while.
# When 2.5.1 is ready:
helm upgrade api-gateway ./api-gateway --version 2.5.1 -n gateway \
-f values-prod.yaml \
--dry-run --debug
Dry run looks good. Now the actual upgrade with extra timeout for the migration hook.
helm upgrade api-gateway ./api-gateway --version 2.5.1 -n gateway \
-f values-prod.yaml \
--timeout 10m \
--wait \
--atomic
The --atomic flag is crucial here — if the upgrade fails again, it automatically rolls back instead of leaving another FAILED release. This is what we should have been using all along.
Let me verify the upgrade succeeded.
helm list -n gateway
kubectl get pods -n gateway -l app.kubernetes.io/instance=api-gateway
kubectl logs job/api-gateway-migrate -n gateway --tail=5
Revision 16, DEPLOYED, 2.5.1. Migration ran successfully. New pods are running. Good.
Let me make a note: all future Helm upgrades for services with migration hooks should use --atomic --timeout 10m. And the migration scripts need to be idempotent. I'll add both to the team's deployment runbook.
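That runbook entry could be a tiny wrapper so nobody retypes the flags. This is a sketch, not the team's actual tooling: helm is stubbed with a shell function so the snippet is self-contained, and values-prod.yaml is taken from the commands above.

```shell
# Runbook sketch: one wrapper that always passes the guardrail flags.
# `helm` is stubbed here so the sketch runs standalone; drop the stub in real use.
helm() { echo "helm $*"; }

safe_upgrade() {
  local release=$1 chart=$2 version=$3 ns=$4
  helm upgrade "$release" "$chart" --version "$version" -n "$ns" \
    -f values-prod.yaml \
    --atomic --wait --timeout 10m
}

safe_upgrade api-gateway ./api-gateway 2.5.1 gateway
```

(In Helm 3, --atomic already implies --wait; passing both is harmless and makes the intent explicit.)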
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Panic when Helm shows FAILED status | Check the actual cluster state — FAILED release doesn't mean broken infrastructure | Helm status and cluster state are different things; pods from the last successful release keep running |
| Use helm upgrade --force to blast through the failure | Roll back to the last successful revision to reset state cleanly | --force deletes and recreates resources, causing downtime; rollback just resets state |
| Not investigate why the migration failed before retrying | Check the migration logs, fix the root cause (non-idempotent migration), then retry | Retrying the same failing migration produces the same failure |
| Not use --atomic on subsequent upgrades | Always use --atomic for services with hooks | --atomic auto-rolls back on failure instead of leaving another FAILED release |
Key Heuristics Used¶
- Check Cluster State, Not Just Helm State: A FAILED Helm release doesn't mean the service is down. Always verify what's actually running in the cluster.
- Rollback as State Reset: Use helm rollback to clear FAILED state and reset to the last known good revision. It's a state machine operation, not necessarily a resource change.
- Always Use --atomic for Upgrades: --atomic automatically rolls back failed upgrades. This prevents the "FAILED release blocking deploys" scenario entirely.
Cross-References¶
- Primer — Helm release lifecycle, revision model, and hooks
- Street Ops — Helm debugging commands, rollback procedures, and diff tools
- Footguns — Failed releases blocking deploys, non-idempotent migration hooks, and using --force in production