Thinking Out Loud: CrashLoopBackOff¶
A senior SRE's internal monologue while working through a real CrashLoopBackOff incident. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
After a deploy of the user-auth service (version 3.14.0), all pods are in CrashLoopBackOff. The previous version (3.13.2) was running fine. The deploy was automated through ArgoCD and the team is asking me to help while they investigate the code change.
The Monologue¶
CrashLoopBackOff after a deploy. Step one: don't roll back yet. Let me understand what's happening so we fix it properly. But also — I'm going to set a timer. If I can't figure this out in 5 minutes, I'm rolling back regardless. User auth being down is a showstopper.
All 4 pods in CrashLoopBackOff, restarts ranging from 3 to 7. The backoff is increasing. Let me grab the logs from the most recent crash.
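A sketch of how that log-grab might look (pod name from this incident; add `-c <container>` if there are sidecars):

```shell
# Logs from the container's previous (crashed) run, not the current restart.
kubectl logs user-auth-6f8d9b-q2m4x -n auth --previous --tail=50

# Exit code of the last termination, for extra context on how it died.
kubectl get pod user-auth-6f8d9b-q2m4x -n auth \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

The `--previous` flag matters here: without it, you may only catch logs from a restart that hasn't failed yet.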
Error: ECONNREFUSED 10.96.22.18:6379. It can't connect to Redis. But wait — the previous version was connecting to Redis just fine. Did someone change the Redis address in the new config?
REDIS_HOST: user-auth-redis.auth.svc.cluster.local. That looks correct. Let me verify Redis is actually up.
Redis pod is Running. Let me test connectivity.
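A quick connectivity check, sketched two ways (service name from the incident; the `redis:7` image and the availability of `redis-cli` in the Redis pod are assumptions):

```shell
# If redis-cli is available in the Redis container, exec straight into it.
kubectl exec -n auth deploy/user-auth-redis -- redis-cli ping

# Or test from a throwaway pod, which also exercises DNS and the Service path
# the app actually uses.
kubectl run redis-test -n auth --rm -it --restart=Never --image=redis:7 -- \
  redis-cli -h user-auth-redis.auth.svc.cluster.local ping
```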
PONG. Redis is fine. So the connection string is right, Redis is up, but the app can't connect. Let me look more carefully at the logs.
Full log: "Initializing user-auth v3.14.0... Loading config... REDIS_URL=redis://user-auth-redis:6379/0... Connecting to Redis... Error: ECONNREFUSED". Wait — REDIS_URL not REDIS_HOST. The new version changed the environment variable name from REDIS_HOST to REDIS_URL and expects a full Redis URL format.
Let me check what env vars the pod is actually getting.
kubectl exec user-auth-6f8d9b-q2m4x -n auth -- env | grep -i redis 2>/dev/null || \
```shell
  kubectl get deployment user-auth -n auth -o json | \
  jq '.spec.template.spec.containers[0].env'
```
Mental Model: The Config-Code Version Mismatch¶
When a deploy introduces CrashLoopBackOff, the most common cause isn't a code bug — it's a mismatch between the new code's expectations and the existing configuration. New version expects a new env var, a different secret key, a renamed config field, or a new dependency. Always diff the config expectations between the old and new version.
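One cheap way to do that diff, assuming the app repo tags releases and keeps its config surface in tracked files (the file paths here are assumptions):

```shell
# Show every config-related change between the two versions, including
# renamed or newly required environment variables.
git diff v3.13.2..v3.14.0 -- .env.example config/ helm/values.yaml

# Or grep the code diff itself for env var reads that changed.
git diff v3.13.2..v3.14.0 | grep -E '^[+-].*(process\.env|REDIS_)'
```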
The pod has REDIS_HOST=user-auth-redis.auth.svc.cluster.local and REDIS_PORT=6379 from the ConfigMap. But v3.14.0 expects REDIS_URL=redis://host:port/db. The ConfigMap wasn't updated to match the new code.
Two options: update the ConfigMap to provide REDIS_URL, or roll back to v3.13.2. Since the deploy was automated by ArgoCD, the ConfigMap should have been updated in the same commit. Let me check the ArgoCD sync status.
Synced. ArgoCD thinks everything is in sync. Let me check the git repo to see if the ConfigMap was supposed to be updated.
Revision a3f8b21. Let me check if that commit includes the ConfigMap change... actually, I don't have the git repo handy. Let me just look at what ArgoCD applied.
```shell
kubectl get application user-auth -n argocd -o json | \
  jq '.status.resources[] | select(.kind=="ConfigMap")'
```
The ConfigMap resource shows Synced and Healthy. So either the git repo has the wrong ConfigMap, or the developer forgot to update it in the same PR.
Okay, I've spent 3 of my 5 minutes. I know the problem — the ConfigMap doesn't match the new code. Let me fix it now and talk to the dev later about why the config change wasn't in their PR.
```shell
kubectl patch configmap user-auth-config -n auth --type merge \
  -p '{"data":{"REDIS_URL":"redis://user-auth-redis.auth.svc.cluster.local:6379/0"}}'
```
Now I need the pods to pick up the new config. Since they're reading env vars at startup and they're in CrashLoopBackOff, they'll restart automatically. But the backoff delay might be long by now. Let me force a restart.
```shell
kubectl rollout restart deployment/user-auth -n auth
kubectl rollout status deployment/user-auth -n auth --timeout=60s
```
New pods coming up... Running... Ready. Let me verify they're actually working.
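Verification, sketched (the `/healthz` path and the presence of `wget` in the image are assumptions):

```shell
# Fresh logs from the new ReplicaSet's pods.
kubectl logs -n auth deploy/user-auth --tail=5

# Hit the service from inside the cluster.
kubectl exec -n auth deploy/user-auth -- wget -qO- http://localhost:8080/healthz
```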
"Connected to Redis. Listening on port 8080." We're good.
But wait — I need to make sure the old env vars are also still there. If v3.14.0 only reads REDIS_URL but some other part of the system or a sidecar reads REDIS_HOST, removing it would break something else. I'll leave both in the ConfigMap for now.
Mental Model: Additive Config Changes¶
When updating configuration for a new version, add the new keys but don't remove the old ones until you've verified nothing else reads them. Config keys are like API endpoints — they might have consumers you don't know about. Deprecate, don't delete.
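The same additive principle can apply in the code itself. A minimal entrypoint sketch (variable names from this incident; the fallback logic is an assumption about how one might bridge versions):

```shell
#!/bin/sh
# Prefer the new REDIS_URL; fall back to the legacy REDIS_HOST/REDIS_PORT
# so environments that only define the old keys keep working during migration.
REDIS_URL="${REDIS_URL:-redis://${REDIS_HOST}:${REDIS_PORT:-6379}/0}"
export REDIS_URL
echo "Using $REDIS_URL"
```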
Last thing — I need to make sure ArgoCD doesn't revert my ConfigMap change on the next sync. Since I changed it out-of-band, ArgoCD will see drift and might revert it.
OutOfSync. Yeah, ArgoCD sees the difference. I need to update the git repo too, or ArgoCD will blow away my fix. Let me ping the dev team to add the ConfigMap change to their repo. In the meantime, I'll temporarily disable auto-sync so it can't revert the live fix before git catches up.
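Pausing automation might look like this with the argocd CLI (values per its --sync-policy option; verify against your CLI version, and assumes an authenticated session):

```shell
# Stop ArgoCD from auto-syncing (and reverting the live ConfigMap).
argocd app set user-auth --sync-policy none

# Re-enable once the ConfigMap change is merged to the git source of truth.
argocd app set user-auth --sync-policy automated
```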
Actually, the safer move is to just go update the git repo directly. The ConfigMap is in the Helm values file.
Users are unblocked. Auth is working. I'll follow up with the dev team about including config changes in the same PR as code changes that depend on them.
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Immediately roll back the deploy | Set a time box (5 minutes) to diagnose, then roll back if no progress | If the fix is simple (config change), rolling back delays the release and doesn't address the root cause |
| Read the error message at face value ("can't connect to Redis") and debug Redis | Notice the connection string format difference between versions | The error says "can't connect" but the real issue is "reading the wrong env var" |
| Fix the ConfigMap and call it done | Also address the ArgoCD sync drift to prevent the fix from being reverted | Out-of-band changes get overwritten by GitOps tools — the fix needs to be in git |
| Remove the old config keys when adding new ones | Keep both old and new keys during the transition | Other components might depend on the old keys — deprecate, don't delete |
Key Heuristics Used¶
- Config-Code Version Mismatch: When a deploy crashes, diff the new code's config expectations against the existing config before debugging the code itself.
- Time-Box Diagnosis: Set a hard timer (5 minutes for critical services) — if you can't diagnose in time, roll back and debug offline.
- GitOps Awareness: Manual fixes to resources managed by GitOps tools will be reverted on the next sync. Always update the git source of truth alongside the live fix.
Cross-References¶
- Primer — CrashLoopBackOff lifecycle, backoff timing, and exit code interpretation
- Street Ops — The --previous logs trick and container state inspection
- Footguns — Config-code mismatches and GitOps revert traps