# Postmortem: Missing Runbook Extends CrashLoopBackOff Recovery by 45 Minutes
| Field | Value |
|---|---|
| ID | PM-010 |
| Date | 2025-07-08 |
| Severity | SEV-2 |
| Duration | 58m (detection to resolution) |
| Time to Detect | 4m |
| Time to Mitigate | 58m |
| Customer Impact | New login attempts failed for ~53 minutes; ~1,900 users unable to authenticate; existing sessions unaffected for first 30 minutes |
| Revenue Impact | ~$6,200 estimated (failed new-session e-commerce flows; 3 enterprise SSO integrations reported login failures) |
| Teams Involved | Identity Platform, Platform Engineering, Incident Command |
| Postmortem Author | Chidinma Okafor |
| Postmortem Date | 2025-07-11 |
## Executive Summary
On 2025-07-08, a ConfigMap update to the auth-service introduced invalid YAML (an unquoted string value containing a colon), causing all auth-service pods to fail on startup and enter CrashLoopBackOff. Elliot Bauer, a junior engineer on his first solo on-call rotation, detected the issue quickly and correctly diagnosed the malformed ConfigMap, but had neither a runbook nor the prior experience to know that a `kubectl edit` on a Helm-managed ConfigMap would be reverted by ArgoCD and that the correct remediation was a `helm rollback`. After roughly 30 minutes of unsuccessful attempts and growing customer impact, Elliot's help request reached Callum Frost, the deploying engineer, who identified the ArgoCD drift revert and, with Incident Commander authorization, executed a Helm rollback in under 5 minutes. The roughly 45-minute recovery delay was attributable almost entirely to the absence of a "service fails to start after deploy" runbook and insufficient on-call onboarding for junior engineers.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 09:41 | Identity Platform engineer Callum Frost deploys auth-service v2.11.3 via CI/CD pipeline; change includes a ConfigMap update adding a new feature flag: `rate_limit_strategy: adaptive: true` |
| 09:41 | The YAML value `adaptive: true` is unquoted and contains a colon; the Kubernetes YAML parser treats this as a nested map key, not a string value; ConfigMap is created successfully (Kubernetes does not validate app-specific YAML semantics) |
| 09:42 | auth-service pods begin rolling restart; each pod reads the ConfigMap on startup and attempts to parse the config |
| 09:43 | auth-service config parser throws `yaml.scanner.ScannerError: mapping values are not allowed here`; all pods exit with code 1 |
| 09:44 | PagerDuty alert: auth-service pod restart count exceeds threshold; Elliot Bauer (on-call, first solo rotation) acknowledges |
| 09:45 | Elliot runs `kubectl get pods -n identity`; sees all 4 auth-service pods in CrashLoopBackOff |
| 09:46 | Elliot runs `kubectl logs auth-service-7c9f4b-xkp -n identity --previous`; sees `ScannerError` on the YAML config |
| 09:47 | Elliot identifies the malformed ConfigMap key as the likely cause; runs `kubectl describe configmap auth-config -n identity` |
| 09:48 | Elliot attempts `kubectl edit configmap auth-config -n identity`; corrects the YAML, saves |
| 09:48 | Edit is accepted by the API server; ConfigMap is updated in etcd |
| 09:49 | Pods do not restart; Elliot is unsure why (Kubernetes does not automatically restart pods on ConfigMap changes; pods need to be deleted or the Deployment rolled) |
| 09:50 | Elliot runs `kubectl rollout restart deployment/auth-service -n identity`; pods restart |
| 09:51 | Pods read the updated ConfigMap, start successfully — but only for 90 seconds |
| 09:52 | ArgoCD detects drift between the live ConfigMap (Elliot's manual edit) and the Helm chart definition; ArgoCD auto-sync reverts the ConfigMap to the Helm-managed version (the broken one) |
| 09:53 | All 4 pods re-enter CrashLoopBackOff after ArgoCD sync |
| 09:54 | Elliot sees pods crashing again; confusion about why the fix "undid itself" |
| 09:55 | Elliot begins reading Kubernetes documentation on ConfigMap management |
| 10:04 | Elliot posts in #platform-help: "auth-service is crashlooping, I fixed the configmap but it keeps reverting, not sure why" |
| 10:06 | No immediate response in #platform-help (senior engineers not actively monitoring this channel) |
| 10:07 | Existing authenticated user sessions begin expiring as their tokens come up for renewal; renewal attempts fail while the service is down, and the 60-second renewal grace period can no longer mask the outage |
| 10:08 | Customer support begins receiving login failure reports; volume spikes to 40 tickets in 10 minutes |
| 10:14 | Callum Frost sees Elliot's message in #platform-help; joins the incident |
| 10:15 | Callum identifies that ArgoCD is managing the deployment and that `kubectl edit` changes are being reverted; explains this to Elliot |
| 10:16 | Callum realizes a `helm rollback` is needed but wants confirmation from Incident Commander before modifying a production Helm release |
| 10:17 | Callum posts in #incidents to find an Incident Commander; Chidinma Okafor (IC) is paged via PagerDuty after no response for 2 minutes |
| 10:19 | Chidinma joins; immediately authorizes rollback |
| 10:20 | Callum runs `helm rollback auth-service 1 -n identity` |
| 10:21 | ArgoCD is temporarily set to manual sync to prevent interference during rollback |
| 10:22 | Pods restart with the v2.11.2 ConfigMap; all 4 pods reach Running state |
| 10:24 | New login attempts succeed; error rate drops to zero |
| 10:28 | Auth service health confirmed via synthetic login test; ArgoCD sync re-enabled (pointing to v2.11.2) |
| 10:42 | All-clear declared; customer support ticket backlog assigned for follow-up |
## Impact
### Customer Impact
New authentication attempts failed for approximately 53 minutes (09:44–10:24 UTC). Approximately 1,900 unique users attempted to log in during this window and received authentication errors. Existing sessions with active tokens were not affected for the first 30 minutes (09:44–10:14 UTC) due to the auth service's 60-second session renewal grace period; after that window, users whose tokens expired also began experiencing failures. Three enterprise customers using SAML SSO integrations reported login failures to their account managers, triggering SLA review. Customer support received 187 tickets during the incident window, of which 142 were directly related to login failures.
### Internal Impact
Incident response consumed approximately 9 engineer-hours: Elliot (2h active response + 1h post-incident debrief), Callum (1.5h), Chidinma (2h IC duties + 1.5h postmortem coordination), Preethi Sundaram (30m escalation response + 30m mentoring call with Elliot). Identity Platform team's afternoon sprint was disrupted, deferring two feature reviews. Customer support spent an estimated 4 additional hours on ticket follow-up.
### Data Impact
No data loss or corruption. The CrashLoopBackOff was a pure availability failure; no writes were in-flight and no database state was modified during the crash cycles. Session tokens issued before the incident remained valid and were honored once the service recovered.
## Root Cause
### What Happened (Technical)
The auth-service ConfigMap is rendered by the Helm chart from a template at `templates/configmap.yaml`. The values file update in v2.11.3 added a new key whose value was intended to be a single string, but the value was committed without quotes in `values.yaml`.
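Reconstructed from the incident description (the key name comes from the timeline; the exact file layout is an assumption), the intended and committed forms were:

```yaml
# Intended: quote the value so the embedded colon stays part of a string
rate_limit_strategy: "adaptive: true"

# Committed: unquoted, so YAML tooling sees a second mapping separator
rate_limit_strategy: adaptive: true
```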
This is syntactically ambiguous in YAML 1.1 (which Kubernetes tooling uses). The Helm template engine rendered this as a nested mapping during chart rendering, producing a ConfigMap with a malformed value. The Python application's PyYAML-based config parser, which parses the mounted config file at startup, threw `yaml.scanner.ScannerError` on line 14 of the mounted config file and exited with a non-zero return code before serving any requests.
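The failure mode can be reproduced locally with PyYAML, the parser named in the crash logs; the key name is taken from the timeline, and the snippet is an illustrative sketch rather than the service's actual config loader:

```python
import yaml

broken = "rate_limit_strategy: adaptive: true"   # unquoted value with a colon
fixed = 'rate_limit_strategy: "adaptive: true"'  # quoted: the colon is literal

try:
    yaml.safe_load(broken)
except yaml.scanner.ScannerError as exc:
    # Same error class the auth-service config parser raised at startup
    print(type(exc).__name__)  # ScannerError

print(yaml.safe_load(fixed))  # {'rate_limit_strategy': 'adaptive: true'}
```

The quoted form parses to a plain string, which is what the feature flag was intended to be.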
Because the Helm chart is managed by ArgoCD with automatic synchronization enabled (`syncPolicy.automated` with `prune: true` and `selfHeal: true`), any manual `kubectl edit` on a Helm-managed resource is detected as drift and reverted within 60–90 seconds. Elliot's fix temporarily resolved the issue but was undone by ArgoCD's self-healing within 90 seconds of the corrected pods starting. Elliot did not know that ArgoCD was managing the deployment or that `kubectl edit` on Helm-managed resources is ineffective in an auto-sync environment. This information was not part of on-call onboarding.
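For reference, the sync behavior described above corresponds to an ArgoCD Application spec along these lines; the `syncPolicy` fields are standard ArgoCD, while the application name and repo URL are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: auth-service            # hypothetical Application name
  namespace: argocd
spec:
  destination:
    namespace: identity
    server: https://kubernetes.default.svc
  source:
    repoURL: https://example.com/infra/charts.git   # placeholder repo
    path: auth-service
  syncPolicy:
    automated:
      prune: true      # delete resources removed from the chart
      selfHeal: true   # revert manual drift (e.g. kubectl edit) automatically
```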
The escalation delay was compounded by the organizational pattern of posting help requests in #platform-help rather than #incidents for anything that felt uncertain. The #platform-help channel has no on-call rotation monitor; senior engineers check it opportunistically. The #incidents channel, by contrast, triggers PagerDuty escalation after 2 minutes of no IC acknowledgement.
### Contributing Factors
- Runbook for "service fails to start after deploy" did not exist: The on-call runbook library covered application-level errors (database connection failures, downstream timeouts) and infrastructure failures (node OOM, disk full) but had no entry for the most common deploy-related failure mode: a service that exits immediately on startup due to a config error. Had a runbook existed, it would have directed Elliot to `helm rollback` as the first remediation step for a Helm-managed service in CrashLoopBackOff following a deploy.
- On-call onboarding did not include a simulated incident exercise: Elliot had completed the written on-call onboarding checklist (reading runbooks, shadowing two on-call rotations) but had never worked through a hands-on incident simulation. The company had a documented "chaos training" program for on-call engineers, but participation was optional and Elliot had not yet attended. The distinction between `kubectl`-managed resources and Helm/ArgoCD-managed resources — a critical operational concept — was not covered in the written material.
- The escalation channel (#platform-help) was buried in the Slack channel list: Elliot's instinct to seek help was correct, but the channel chosen (#platform-help) was not monitored by on-call engineers and had no PagerDuty integration. The correct channel (#incidents) was not prominently linked in the on-call handbook. The 10-minute gap between Elliot's help request (10:04) and Callum joining (10:14) was due entirely to the wrong channel choice.
### What We Got Lucky About
- The `auth-service` had a 60-second session token renewal grace period implemented specifically for rolling deploy scenarios. This meant that existing authenticated users did not experience failures for the first 30 minutes of the incident. Had this grace period not existed, the customer impact would have been felt within seconds of the CrashLoopBackOff beginning rather than gradually after session expiry.
- Callum Frost — the engineer who deployed the change — was actively monitoring Slack and saw Elliot's message in #platform-help within 10 minutes. A deployment author who was in meetings or off-shift would have extended the diagnosis phase considerably, as Elliot had not yet connected the CrashLoopBackOff to the recent deploy.
## Detection
### How We Detected
PagerDuty alert on auth-service pod restart count exceeding threshold (4 restarts in 5 minutes, 09:44 UTC). The alert fired 3 minutes after the deployment completed and pods began crash-looping. Detection was fast and automated.
### Why We Didn't Detect Sooner
Detection was appropriate for the failure mode. The 4-minute window (09:41 deploy → 09:44 alert) reflects the time for pods to enter CrashLoopBackOff and accumulate restart counts past the threshold. A faster detection path would be a deploy hook that runs a smoke test immediately after rollout and pages if the smoke test fails — this would have detected the startup failure within 60 seconds of the deploy completing. No post-deploy smoke test exists for auth-service.
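The proposed faster detection path can be sketched as a small post-deploy check; the `/healthz` and `/auth/token` endpoints come from action item AI-010-04, while the function name and the injected HTTP client are illustrative assumptions, not an existing tool:

```python
import time

def post_deploy_smoke_test(http_get, base_url, delay_s=60):
    """Run a minimal login-path check after a rollout settles.

    http_get(url) -> (status_code, body) is injected so the check can be
    exercised without a live cluster; in production this would be a real
    HTTP client. Returns a list of failing endpoints (empty == healthy).
    """
    time.sleep(delay_s)  # let pods pass readiness before probing
    failures = []
    for path in ("/healthz", "/auth/token"):
        status, _ = http_get(base_url + path)
        if status != 200:
            failures.append(path)
    return failures  # caller pages on-call if non-empty

# Exercising the check with stub clients:
ok = lambda url: (200, "ok")
broken = lambda url: (500, "crash") if url.endswith("/auth/token") else (200, "ok")
print(post_deploy_smoke_test(ok, "http://auth.identity.svc", delay_s=0))      # []
print(post_deploy_smoke_test(broken, "http://auth.identity.svc", delay_s=0))  # ['/auth/token']
```

Wired into the deploy pipeline, a non-empty result would page on-call within about 60 seconds of rollout instead of waiting for restart-count thresholds.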
## Response
### What Went Well
- Elliot's initial diagnosis was correct: he identified the malformed ConfigMap within 3 minutes of acknowledging the alert by reading pod logs. The diagnosis skills were there; the remediation knowledge was not.
- Callum's decision to seek IC authorization before running `helm rollback` on a production release showed good discipline about blast radius. The 2-minute IC page delay was a worthwhile tradeoff for the assurance that the rollback was sanctioned.
- The 60-second grace period in the auth service materially limited customer impact, demonstrating the value of designing auth systems with deploy-safe session renewal windows.
### What Went Poorly
- Thirty-five minutes elapsed between detection and escalation to a senior engineer. On-call engineers — especially those on their first solo rotation — should have a defined "escalate if not resolved in X minutes" rule embedded in the runbook. The current on-call handbook says "escalate when needed" without a time bound.
- The ArgoCD self-healing behavior was completely opaque to Elliot. When his fix "undid itself," he had no framework for understanding why. This class of confusion — "I fixed it and it broke again" — should have its own runbook entry.
- The #platform-help vs. #incidents distinction was not clearly communicated in on-call materials. Elliot used the wrong channel for 10 minutes. The on-call handbook should prominently state: "For active incidents, use #incidents only. #platform-help is not monitored during incidents."
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-010-01 | Write runbook: "Service CrashLoopBackOff After Deploy" covering: read logs, identify config vs code error, determine if Helm-managed (check `helm list`), run `helm rollback`, verify ArgoCD sync state | P0 | Identity Platform / Platform Eng | Open | 2025-07-15 |
| AI-010-02 | Add mandatory chaos training simulation to on-call onboarding; engineers must complete a simulated CrashLoopBackOff scenario before first solo rotation | P0 | Incident Command | Open | 2025-07-25 |
| AI-010-03 | Add to on-call handbook (prominent callout box): escalate to senior engineer after 15 minutes without a clear mitigation path; do not wait until the situation feels "bad enough" | P1 | Incident Command | Open | 2025-07-15 |
| AI-010-04 | Add post-deploy smoke test for auth-service: hit `/healthz` and `/auth/token` with test credentials 60 seconds after rollout; page on-call if either fails | P1 | Identity Platform | Open | 2025-07-22 |
| AI-010-05 | Add YAML linting step to auth-service CI pipeline; validate that all values in `values.yaml` parse correctly as their declared types before Helm template rendering | P1 | Identity Platform | Open | 2025-07-22 |
| AI-010-06 | Update on-call handbook: replace "escalate when needed" with explicit time box; add prominent #incidents channel link with note that #platform-help is not monitored during incidents | P2 | Incident Command | Open | 2025-07-15 |
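As a sketch of what the AI-010-05 lint step might look like, assuming PyYAML and a hand-maintained map of declared types (both assumptions, not an existing tool):

```python
import sys
import yaml

# Hypothetical declared types for keys we want to guard (per AI-010-05).
EXPECTED_TYPES = {"rate_limit_strategy": str}

def lint_values(text):
    """Return a list of problems; an empty list means the file passes."""
    try:
        values = yaml.safe_load(text) or {}
    except yaml.YAMLError as exc:
        return [f"values.yaml does not parse: {exc}"]
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key in values and not isinstance(values[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got "
                f"{type(values[key]).__name__} (missing quotes?)"
            )
    return problems

if __name__ == "__main__" and len(sys.argv) > 1:
    # CI usage: python lint_values.py values.yaml
    issues = lint_values(open(sys.argv[1]).read())
    for issue in issues:
        print(issue, file=sys.stderr)
    sys.exit(1 if issues else 0)
```

A check like this would have caught the incident's trigger either as a parse error (inline `key: value: true`) or as a type mismatch (a nested mapping where a string was declared).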
## Lessons Learned
- Runbooks are not documentation overhead — they are risk mitigation for your worst moments: A 5-minute Helm rollback took 53 minutes because there was no runbook. The knowledge existed in the organization; it was simply not written down in a place accessible during a high-stress incident. Every Helm-managed production service should have a "fails to start after deploy" runbook entry as a baseline.
- ArgoCD self-healing is invisible until you know to look for it: Engineers unfamiliar with GitOps workflows will interpret ArgoCD reverting their changes as "the fix didn't work" and waste time repeating the same broken approach. The operational model of ArgoCD — especially the interaction between `kubectl edit` and auto-sync — must be part of on-call onboarding, not optional documentation.
- Junior on-call engineers need time-boxed escalation norms, not open-ended judgment calls: Asking a first-time solo on-call engineer to self-assess "is this bad enough to escalate?" while simultaneously diagnosing an unfamiliar failure and watching customer impact grow is an unfair cognitive burden. A mandatory escalation after 15 minutes without a mitigation path removes the social friction and protects both the engineer and the customers.
## Cross-References
- Failure Pattern: Process Gap — Missing Runbook; Operational Knowledge Silo; GitOps Drift Revert Confusion
- Topic Packs: Kubernetes CrashLoopBackOff Triage; Helm and ArgoCD Operations; On-Call Readiness; Incident Escalation
- Runbook: `runbooks/deployments/crashloop-after-deploy.md` (to be created per AI-010-01); `runbooks/deployments/helm-rollback.md`; `runbooks/incidents/escalation-policy.md`
- Decision Tree: Pod CrashLoopBackOff → Read `--previous` logs → Config parse error? → Check if Helm-managed (`helm list -n <namespace>`) → Yes: `helm rollback <release> <revision>` → Verify ArgoCD sync is not set to auto-revert → No: Edit Deployment/ConfigMap directly
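For illustration, the decision tree above can be encoded as a small triage helper; the function and its parameters are hypothetical, not an existing tool:

```python
def triage_crashloop(config_parse_error, helm_managed):
    """Mirrors the decision tree: choose a remediation for a pod in
    CrashLoopBackOff after reading its --previous logs."""
    if not config_parse_error:
        return "investigate application/code error in --previous logs"
    if helm_managed:
        # Rolling back the release fixes the rendered ConfigMap and keeps
        # the cluster consistent with what ArgoCD will sync back to.
        return "helm rollback <release> <revision>; verify ArgoCD sync state"
    return "edit the Deployment/ConfigMap directly, then restart the rollout"

print(triage_crashloop(config_parse_error=True, helm_managed=True))
# helm rollback <release> <revision>; verify ArgoCD sync state
```

Encoding a runbook's branches this way (or as a checklist) makes the Helm-managed check explicit, which is exactly the step Elliot had no way to know about during the incident.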