# Postmortem: Missing Runbook Extends CrashLoopBackOff Recovery by 45 Minutes
| Field | Value |
|---|---|
| ID | PM-010 |
| Date | 2025-07-08 |
| Severity | SEV-2 |
| Duration | 58m (detection to resolution) |
| Time to Detect | 4m |
| Time to Mitigate | 58m |
| Customer Impact | New login attempts failed for ~53 minutes; ~1,900 users unable to authenticate; existing sessions unaffected for first 30 minutes |
| Revenue Impact | ~$6,200 estimated (failed new-session e-commerce flows; 3 enterprise SSO integrations reported login failures) |
| Teams Involved | Identity Platform, Platform Engineering, Incident Command |
| Postmortem Author | Chidinma Okafor |
| Postmortem Date | 2025-07-11 |
## Executive Summary
On 2025-07-08, a ConfigMap update to the auth-service introduced invalid YAML (an unquoted string value containing a colon), causing all auth-service pods to fail on startup and enter CrashLoopBackOff. Elliot Bauer, a junior engineer on his first solo on-call rotation, detected the issue quickly and correctly diagnosed the malformed ConfigMap, but had neither a runbook nor the prior experience to know that a `kubectl edit` on a Helm-managed ConfigMap would be reverted by ArgoCD and that the correct remediation was a `helm rollback`. After roughly 30 minutes of unsuccessful attempts and growing customer impact, Elliot's help request reached Callum Frost, the deploying engineer, who identified the ArgoCD drift revert and, with Incident Commander authorization, executed a Helm rollback in under 5 minutes. The roughly 45-minute recovery delay was attributable almost entirely to the absence of a "service fails to start after deploy" runbook and insufficient on-call onboarding for junior engineers.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 09:41 | Identity Platform engineer Callum Frost deploys auth-service v2.11.3 via CI/CD pipeline; change includes a ConfigMap update adding a new feature flag: `rate_limit_strategy: adaptive: true` |
| 09:41 | The YAML value `adaptive: true` is unquoted and contains a colon; the Kubernetes YAML parser treats this as a nested map key, not a string value; ConfigMap is created successfully (Kubernetes does not validate app-specific YAML semantics) |
| 09:42 | auth-service pods begin rolling restart; each pod reads the ConfigMap on startup and attempts to parse the config |
| 09:43 | auth-service config parser throws `yaml.scanner.ScannerError: mapping values are not allowed here`; all pods exit with code 1 |
| 09:44 | PagerDuty alert: auth-service pod restart count exceeds threshold; Elliot Bauer (on-call, first solo rotation) acknowledges |
| 09:45 | Elliot runs `kubectl get pods -n identity`; sees all 4 auth-service pods in CrashLoopBackOff |
| 09:46 | Elliot runs `kubectl logs auth-service-7c9f4b-xkp -n identity --previous`; sees `ScannerError` on the YAML config |
| 09:47 | Elliot identifies the malformed ConfigMap key as the likely cause; runs `kubectl describe configmap auth-config -n identity` |
| 09:48 | Elliot attempts `kubectl edit configmap auth-config -n identity`; corrects the YAML, saves |
| 09:48 | Edit is accepted by the API server; ConfigMap is updated in etcd |
| 09:49 | Pods do not restart; Elliot is unsure why (Kubernetes does not automatically restart pods on ConfigMap changes; pods need to be deleted or the Deployment rolled) |
| 09:50 | Elliot runs `kubectl rollout restart deployment/auth-service -n identity`; pods restart |
| 09:51 | Pods read the updated ConfigMap, start successfully — but only for 90 seconds |
| 09:52 | ArgoCD detects drift between the live ConfigMap (Elliot's manual edit) and the Helm chart definition; ArgoCD auto-sync reverts the ConfigMap to the Helm-managed version (the broken one) |
| 09:53 | All 4 pods re-enter CrashLoopBackOff after ArgoCD sync |
| 09:54 | Elliot sees pods crashing again; confusion about why the fix "undid itself" |
| 09:55 | Elliot begins reading Kubernetes documentation on ConfigMap management |
| 10:04 | Elliot posts in #platform-help: "auth-service is crashlooping, I fixed the configmap but it keeps reverting, not sure why" |
| 10:06 | No immediate response in #platform-help (senior engineers not actively monitoring this channel) |
| 10:07 | Existing authenticated user sessions begin expiring as their tokens come up for renewal; renewal attempts fail while the service is down, and the 60-second renewal grace period can no longer mask the outage |
| 10:08 | Customer support begins receiving login failure reports; volume spikes to 40 tickets in 10 minutes |
| 10:14 | Callum Frost sees Elliot's message in #platform-help; joins the incident |
| 10:15 | Callum identifies that ArgoCD is managing the deployment and that `kubectl edit` changes are being reverted; explains this to Elliot |
| 10:16 | Callum realizes a `helm rollback` is needed but wants confirmation from Incident Commander before modifying a production Helm release |
| 10:17 | Callum posts in #incidents to find an Incident Commander; Chidinma Okafor (IC) is paged via PagerDuty after no response for 2 minutes |
| 10:19 | Chidinma joins; immediately authorizes rollback |
| 10:20 | Callum runs `helm rollback auth-service 1 -n identity` |
| 10:21 | ArgoCD is temporarily set to manual sync to prevent interference during rollback |
| 10:22 | Pods restart with the v2.11.2 ConfigMap; all 4 pods reach Running state |
| 10:24 | New login attempts succeed; error rate drops to zero |
| 10:28 | Auth service health confirmed via synthetic login test; ArgoCD sync re-enabled (pointing to v2.11.2) |
| 10:42 | All-clear declared; customer support ticket backlog assigned for follow-up |
## Impact
### Customer Impact
New authentication attempts failed for approximately 53 minutes (09:44–10:24 UTC). Approximately 1,900 unique users attempted to log in during this window and received authentication errors. Existing sessions with active tokens were not affected for the first 30 minutes (09:44–10:14 UTC) due to the auth service's 60-second session renewal grace period; after that window, users whose tokens expired also began experiencing failures. Three enterprise customers using SAML SSO integrations reported login failures to their account managers, triggering SLA review. Customer support received 187 tickets during the incident window, of which 142 were directly related to login failures.
### Internal Impact
Incident response consumed approximately 9 engineer-hours: Elliot (2h active response + 1h post-incident debrief), Callum (1.5h), Chidinma (2h IC duties + 1.5h postmortem coordination), Preethi Sundaram (30m escalation response + 30m mentoring call with Elliot). Identity Platform team's afternoon sprint was disrupted, deferring two feature reviews. Customer support spent an estimated 4 additional hours on ticket follow-up.
### Data Impact
No data loss or corruption. The CrashLoopBackOff was a pure availability failure; no writes were in-flight and no database state was modified during the crash cycles. Session tokens issued before the incident remained valid and were honored once the service recovered.
## Root Cause
### What Happened (Technical)
The auth-service ConfigMap is rendered by the Helm chart from a template at `templates/configmap.yaml`. The values file update in v2.11.3 added a new key whose value was intended to be a single string, but the value was committed without quotes in `values.yaml`.
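Reconstructed from the incident description (the key name comes from the timeline; the exact file layout is an assumption), the intended and committed forms were:

```yaml
# Intended: quote the value so the embedded colon stays part of a string
rate_limit_strategy: "adaptive: true"

# Committed: unquoted, so YAML tooling sees a second mapping separator
rate_limit_strategy: adaptive: true
```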
This is syntactically ambiguous in YAML 1.1 (which Kubernetes tooling uses). The Helm template engine rendered this as a nested mapping during chart rendering, producing a ConfigMap with a malformed value. The Python application's PyYAML-based config parser, which parses the mounted config file at startup, threw `yaml.scanner.ScannerError` on line 14 of the mounted config file and exited with a non-zero return code before serving any requests.
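The failure mode can be reproduced locally with PyYAML, the parser named in the crash logs; the key name is taken from the timeline, and the snippet is an illustrative sketch rather than the service's actual config loader:

```python
import yaml

broken = "rate_limit_strategy: adaptive: true"   # unquoted value with a colon
fixed = 'rate_limit_strategy: "adaptive: true"'  # quoted: the colon is literal

try:
    yaml.safe_load(broken)
except yaml.scanner.ScannerError as exc:
    # Same error class the auth-service config parser raised at startup
    print(type(exc).__name__)  # ScannerError

print(yaml.safe_load(fixed))  # {'rate_limit_strategy': 'adaptive: true'}
```

The quoted form parses to a plain string, which is what the feature flag was intended to be.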
Because the Helm chart is managed by ArgoCD with automatic synchronization enabled (`syncPolicy.automated` with `prune: true` and `selfHeal: true`), any manual `kubectl edit` on a Helm-managed resource is detected as drift and reverted within 60–90 seconds. Elliot's fix temporarily resolved the issue but was undone by ArgoCD's self-healing within 90 seconds of the corrected pods starting. Elliot did not know that ArgoCD was managing the deployment or that `kubectl edit` on Helm-managed resources is ineffective in an auto-sync environment. This information was not part of on-call onboarding.
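For reference, the sync behavior described above corresponds to an ArgoCD Application spec along these lines; the `syncPolicy` fields are standard ArgoCD, while the application name and repo URL are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: auth-service            # hypothetical Application name
  namespace: argocd
spec:
  destination:
    namespace: identity
    server: https://kubernetes.default.svc
  source:
    repoURL: https://example.com/infra/charts.git   # placeholder repo
    path: auth-service
  syncPolicy:
    automated:
      prune: true      # delete resources removed from the chart
      selfHeal: true   # revert manual drift (e.g. kubectl edit) automatically
```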
The escalation delay was compounded by the organizational pattern of posting help requests in #platform-help rather than #incidents for anything that felt uncertain. The #platform-help channel has no on-call rotation monitor; senior engineers check it opportunistically. The #incidents channel, by contrast, triggers PagerDuty escalation after 2 minutes of no IC acknowledgement.
### Contributing Factors
- Runbook for "service fails to start after deploy" did not exist: The on-call runbook library covered application-level errors (database connection failures, downstream timeouts) and infrastructure failures (node OOM, disk full) but had no entry for the most common deploy-related failure mode: a service that exits immediately on startup due to a config error. Had a runbook existed, it would have directed Elliot to `helm rollback` as the first remediation step for a Helm-managed service in CrashLoopBackOff following a deploy.
- On-call onboarding did not include a simulated incident exercise: Elliot had completed the written on-call onboarding checklist (reading runbooks, shadowing two on-call rotations) but had never worked through a hands-on incident simulation. The company had a documented "chaos training" program for on-call engineers, but participation was optional and Elliot had not yet attended. The distinction between `kubectl`-managed resources and Helm/ArgoCD-managed resources — a critical operational concept — was not covered in the written material.
- The escalation channel (#platform-help) was buried in the Slack channel list: Elliot's instinct to seek help was correct, but the channel chosen (#platform-help) was not monitored by on-call engineers and had no PagerDuty integration. The correct channel (#incidents) was not prominently linked in the on-call handbook. The 10-minute gap between Elliot's help request (10:04) and Callum joining (10:14) was due entirely to the wrong channel choice.
### What We Got Lucky About
- The `auth-service` had a 60-second session token renewal grace period implemented specifically for rolling deploy scenarios. This meant that existing authenticated users did not experience failures for the first 30 minutes of the incident. Had this grace period not existed, the customer impact would have been felt within seconds of the CrashLoopBackOff beginning rather than gradually after session expiry.
- Callum Frost — the engineer who deployed the change — was actively monitoring Slack and saw Elliot's message in #platform-help within 10 minutes. A deployment author who was in meetings or off-shift would have extended the diagnosis phase considerably, as Elliot had not yet connected the CrashLoopBackOff to the recent deploy.
## Detection
### How We Detected
PagerDuty alert on auth-service pod restart count exceeding threshold (4 restarts in 5 minutes, 09:44 UTC). The alert fired 3 minutes after the deployment completed and pods began crash-looping. Detection was fast and automated.
### Why We Didn't Detect Sooner
Detection was appropriate for the failure mode. The 4-minute window (09:41 deploy → 09:44 alert) reflects the time for pods to enter CrashLoopBackOff and accumulate restart counts past the threshold. A faster detection path would be a deploy hook that runs a smoke test immediately after rollout and pages if the smoke test fails — this would have detected the startup failure within 60 seconds of the deploy completing. No post-deploy smoke test exists for auth-service.
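The proposed faster detection path can be sketched as a small post-deploy check; the `/healthz` and `/auth/token` endpoints come from action item AI-010-04, while the function name and the injected HTTP client are illustrative assumptions, not an existing tool:

```python
import time

def post_deploy_smoke_test(http_get, base_url, delay_s=60):
    """Run a minimal login-path check after a rollout settles.

    http_get(url) -> (status_code, body) is injected so the check can be
    exercised without a live cluster; in production this would be a real
    HTTP client. Returns a list of failing endpoints (empty == healthy).
    """
    time.sleep(delay_s)  # let pods pass readiness before probing
    failures = []
    for path in ("/healthz", "/auth/token"):
        status, _ = http_get(base_url + path)
        if status != 200:
            failures.append(path)
    return failures  # caller pages on-call if non-empty

# Exercising the check with stub clients:
ok = lambda url: (200, "ok")
broken = lambda url: (500, "crash") if url.endswith("/auth/token") else (200, "ok")
print(post_deploy_smoke_test(ok, "http://auth.identity.svc", delay_s=0))      # []
print(post_deploy_smoke_test(broken, "http://auth.identity.svc", delay_s=0))  # ['/auth/token']
```

Wired into the deploy pipeline, a non-empty result would page on-call within about 60 seconds of rollout instead of waiting for restart-count thresholds.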
## Response
### What Went Well
- Elliot's initial diagnosis was correct: he identified the malformed ConfigMap within 3 minutes of acknowledging the alert by reading pod logs. The diagnosis skills were there; the remediation knowledge was not.
- Callum's decision to seek IC authorization before running `helm rollback` on a production release showed good discipline about blast radius. The 2-minute IC page delay was a worthwhile tradeoff for the assurance that the rollback was sanctioned.
- The 60-second grace period in the auth service materially limited customer impact, demonstrating the value of designing auth systems with deploy-safe session renewal windows.
### What Went Poorly
- Thirty-five minutes elapsed between detection and escalation to a senior engineer. On-call engineers — especially those on their first solo rotation — should have a defined "escalate if not resolved in X minutes" rule embedded in the runbook. The current on-call handbook says "escalate when needed" without a time bound.
- The ArgoCD self-healing behavior was completely opaque to Elliot. When his fix "undid itself," he had no framework for understanding why. This class of confusion — "I fixed it and it broke again" — should have its own runbook entry.
- The #platform-help vs. #incidents distinction was not clearly communicated in on-call materials. Elliot used the wrong channel for 10 minutes. The on-call handbook should prominently state: "For active incidents, use #incidents only. #platform-help is not monitored during incidents."
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-010-01 | Write runbook: "Service CrashLoopBackOff After Deploy" covering: read logs, identify config vs code error, determine if Helm-managed (check `helm list`), run `helm rollback`, verify ArgoCD sync state | P0 | Identity Platform / Platform Eng | Open | 2025-07-15 |
| AI-010-02 | Add mandatory chaos training simulation to on-call onboarding; engineers must complete a simulated CrashLoopBackOff scenario before first solo rotation | P0 | Incident Command | Open | 2025-07-25 |
| AI-010-03 | Add to on-call handbook (prominent callout box): escalate to senior engineer after 15 minutes without a clear mitigation path; do not wait until the situation feels "bad enough" | P1 | Incident Command | Open | 2025-07-15 |
| AI-010-04 | Add post-deploy smoke test for auth-service: hit `/healthz` and `/auth/token` with test credentials 60 seconds after rollout; page on-call if either fails | P1 | Identity Platform | Open | 2025-07-22 |
| AI-010-05 | Add YAML linting step to auth-service CI pipeline; validate that all values in `values.yaml` parse correctly as their declared types before Helm template rendering | P1 | Identity Platform | Open | 2025-07-22 |
| AI-010-06 | Update on-call handbook: replace "escalate when needed" with explicit time box; add prominent #incidents channel link with note that #platform-help is not monitored during incidents | P2 | Incident Command | Open | 2025-07-15 |
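As a sketch of what the AI-010-05 lint step might look like, assuming PyYAML and a hand-maintained map of declared types (both assumptions, not an existing tool):

```python
import sys
import yaml

# Hypothetical declared types for keys we want to guard (per AI-010-05).
EXPECTED_TYPES = {"rate_limit_strategy": str}

def lint_values(text):
    """Return a list of problems; an empty list means the file passes."""
    try:
        values = yaml.safe_load(text) or {}
    except yaml.YAMLError as exc:
        return [f"values.yaml does not parse: {exc}"]
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key in values and not isinstance(values[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got "
                f"{type(values[key]).__name__} (missing quotes?)"
            )
    return problems

if __name__ == "__main__" and len(sys.argv) > 1:
    # CI usage: python lint_values.py values.yaml
    issues = lint_values(open(sys.argv[1]).read())
    for issue in issues:
        print(issue, file=sys.stderr)
    sys.exit(1 if issues else 0)
```

A check like this would have caught the incident's trigger either as a parse error (inline `key: value: true`) or as a type mismatch (a nested mapping where a string was declared).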
## Lessons Learned
- Runbooks are not documentation overhead — they are risk mitigation for your worst moments: A 5-minute Helm rollback took 53 minutes because there was no runbook. The knowledge existed in the organization; it was simply not written down in a place accessible during a high-stress incident. Every Helm-managed production service should have a "fails to start after deploy" runbook entry as a baseline.
- ArgoCD self-healing is invisible until you know to look for it: Engineers unfamiliar with GitOps workflows will interpret ArgoCD reverting their changes as "the fix didn't work" and waste time repeating the same broken approach. The operational model of ArgoCD — especially the interaction between `kubectl edit` and auto-sync — must be part of on-call onboarding, not optional documentation.
- Junior on-call engineers need time-boxed escalation norms, not open-ended judgment calls: Asking a first-time solo on-call engineer to self-assess "is this bad enough to escalate?" while simultaneously diagnosing an unfamiliar failure and watching customer impact grow is an unfair cognitive burden. A mandatory escalation after 15 minutes without a mitigation path removes the social friction and protects both the engineer and the customers.
## Cross-References
- Failure Pattern: Process Gap — Missing Runbook; Operational Knowledge Silo; GitOps Drift Revert Confusion
- Topic Packs: Kubernetes CrashLoopBackOff Triage; Helm and ArgoCD Operations; On-Call Readiness; Incident Escalation
- Runbook: `runbooks/deployments/crashloop-after-deploy.md` (to be created per AI-010-01); `runbooks/deployments/helm-rollback.md`; `runbooks/incidents/escalation-policy.md`
- Decision Tree: Pod CrashLoopBackOff → Read `--previous` logs → Config parse error? → Check if Helm-managed (`helm list -n <namespace>`) → Yes: `helm rollback <release> <revision>` → Verify ArgoCD sync is not set to auto-revert → No: Edit Deployment/ConfigMap directly
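For illustration, the decision tree above can be encoded as a small triage helper; the function and its parameters are hypothetical, not an existing tool:

```python
def triage_crashloop(config_parse_error, helm_managed):
    """Mirrors the decision tree: choose a remediation for a pod in
    CrashLoopBackOff after reading its --previous logs."""
    if not config_parse_error:
        return "investigate application/code error in --previous logs"
    if helm_managed:
        # Rolling back the release fixes the rendered ConfigMap and keeps
        # the cluster consistent with what ArgoCD will sync back to.
        return "helm rollback <release> <revision>; verify ArgoCD sync state"
    return "edit the Deployment/ConfigMap directly, then restart the rollout"

print(triage_crashloop(config_parse_error=True, helm_managed=True))
# helm rollback <release> <revision>; verify ArgoCD sync state
```

Encoding a runbook's branches this way (or as a checklist) makes the Helm-managed check explicit, which is exactly the step Elliot had no way to know about during the incident.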