# Postmortem: Resource Quota Misconfiguration Blocks All Deployments
| Field | Value |
|---|---|
| ID | PM-017 |
| Date | 2025-06-03 |
| Severity | SEV-3 |
| Duration | 2h 10m (impact start to resolution) |
| Time to Detect | 25m |
| Time to Mitigate | 1h 45m |
| Customer Impact | None — existing Pods were unaffected; only new deployments and rollouts failed |
| Revenue Impact | None |
| Teams Involved | Platform Engineering, Backend Engineering, Data Engineering, SRE On-Call |
| Postmortem Author | Marcus Osei |
| Postmortem Date | 2025-06-06 |
## Executive Summary
On 2025-06-03 starting at 09:41 UTC, the Platform Engineering team applied a batch of 15 namespace resource quota changes as part of a cost optimization initiative. One quota — a `requests.cpu` hard limit of 500m — was misconfigured: it was intended to be a per-container default request (a LimitRange `defaultRequest` entry), not a ResourceQuota hard cap on the namespace's aggregate CPU requests. Because existing workloads in the affected namespaces already requested far more than 500m of CPU in total, all new Pod creations and Deployment rollouts immediately began failing with `exceeded quota: requests.cpu` errors. Existing running Pods were unaffected. Over the following 45 minutes, 9 development teams filed support tickets as their CI/CD pipelines began failing on deploy steps. The root cause was traced to the batch quota change, the misconfigured ResourceQuota object was patched, and all pending deployments were re-triggered successfully by 11:51 UTC.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 09:41 | Platform Engineering applies batch of 15 ResourceQuota and LimitRange changes across 22 namespaces via kubectl apply -f quotas/ as part of Project Frugal (cost optimization initiative) |
| 09:42 | Change completes with no errors reported by kubectl; platform engineer Saoirse Flanagan marks the change as applied in the Jira ticket |
| 09:47 | First CI/CD pipeline failure observed: payments-api deploy job fails in prod-payments namespace with Error creating: pods "payments-api-7d9f8b-xxx" is forbidden: exceeded quota: requests.cpu |
| 09:55 | data-pipeline team's nightly backfill deploy fails in prod-data namespace; engineer assumes it is a flaky cluster issue and re-triggers |
| 10:06 | Re-triggered data-pipeline deploy also fails; engineer Yusuf Adebisi opens Slack thread in #platform-help: "deploys broken, quota errors, anyone else?" |
| 10:09 | Three more teams report in the thread within 3 minutes; SRE on-call Marcus Osei joins the thread |
| 10:12 | Marcus runs kubectl get resourcequota -A and sees the new quotas applied across all namespaces; cross-references timestamps with Jira |
| 10:15 | Marcus identifies the misconfigured quota: a `requests.cpu` hard limit of 500m in the ResourceQuota spec; notes that this is far below the 1000m+ requests of nearly all individual services, let alone each namespace's aggregate |
| 10:17 | Marcus pages Saoirse; Saoirse joins and confirms the intent was a LimitRange default, not a ResourceQuota hard cap |
| 10:22 | Saoirse prepares patch: removes the `requests.cpu` hard entry from the ResourceQuota spec; adds equivalent LimitRange `defaultRequest` entries |
| 10:35 | Patch is peer-reviewed by Platform Engineering lead Raj Thirumurthy; approved |
| 10:38 | Saoirse applies patch to all 22 affected namespaces via corrected manifest |
| 10:41 | Test deploy of payments-api succeeds in prod-payments; quota no longer blocking |
| 10:44 | Platform team posts in #platform-help and #incidents: quotas patched, teams should re-trigger failed deploys |
| 11:15 | Majority of failed deployments re-triggered and completed; Data Engineering manually re-triggers their backfill jobs |
| 11:51 | Last confirmed failed deployment re-triggered successfully; incident declared resolved |
| 12:10 | Postmortem scheduled for 2025-06-06 |
## Impact
### Customer Impact
None. Kubernetes ResourceQuota enforcement applies only to new Pod admission requests. Pods that were already running at the time the quota was applied continued running without interruption. All customer-facing services had already completed their most recent rollout before 09:41, so no customer traffic was affected. The impact was entirely confined to the CD pipeline layer.
### Internal Impact
- 9 development teams experienced failed deployments, totaling an estimated 15–20 failed pipeline runs across CI/CD systems
- Approximately 4 hours of aggregate engineering time lost across affected teams (debugging, re-triggering, waiting)
- Saoirse Flanagan (Platform Engineering): ~2 hours unplanned incident response
- Marcus Osei (SRE On-Call): ~2.5 hours including coordination and postmortem scheduling
- Raj Thirumurthy (Platform Engineering Lead): ~45 minutes for patch review and incident coordination
- Data Engineering's nightly backfill job ran approximately 2 hours late, shifting a dependent report delivery by one business day
- Project Frugal milestone delayed: remaining quota changes deprioritized until a pre-flight validation process is in place
### Data Impact
None. No data was lost or corrupted. The Data Engineering backfill delay resulted in a stale report but no data loss; the backfill ran to completion once unblocked.
## Root Cause
### What Happened (Technical)
The Project Frugal cost optimization initiative aimed to establish baseline resource governance across all production and staging namespaces. The plan called for two types of Kubernetes objects: ResourceQuota (namespace-level aggregate limits) and LimitRange (per-container defaults and bounds). The confusion arose between these two object types.
The intent for CPU defaults was to set a LimitRange `defaultRequest` of 500m — meaning containers that do not specify a CPU request would inherit 500m as their request. This is a sensible default that prevents unbounded resource consumption by services that forget to set requests. However, the manifest was written as a ResourceQuota `spec.hard` entry with key `requests.cpu`, which Kubernetes interprets very differently: the sum of CPU requests across all Pods in the namespace may not exceed 500m.
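The contrast between the two object types can be sketched as follows. The actual Project Frugal manifests are not reproduced in this postmortem; the names, namespace, and values below are illustrative.

```yaml
# What was applied (misconfigured): a ResourceQuota hard cap.
# Kubernetes treats spec.hard["requests.cpu"] as a namespace-wide
# aggregate limit, enforced at Pod admission time.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # illustrative name
  namespace: prod-payments
spec:
  hard:
    requests.cpu: 500m       # caps the SUM of all Pods' CPU requests
---
# What was intended: a LimitRange default. Containers that omit a CPU
# request inherit 500m; containers with explicit requests are untouched.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults         # illustrative name
  namespace: prod-payments
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 500m
```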
Because nearly every production service specifies explicit CPU requests between 1000m and 4000m, each namespace's aggregate usage already far exceeded the new 500m cap, so every new Pod admission was immediately rejected by the Kubernetes API server with a `forbidden: exceeded quota` error. The `kubectl apply` of the batch manifest returned exit code 0 because the objects were syntactically valid and were accepted by the API server — quota enforcement happens only at Pod admission time, not at quota-creation time, and a ResourceQuota whose hard limit is already below current usage is accepted silently.
The batch nature of the change (15 manifests applied together via kubectl apply -f quotas/) meant that when failures were first reported 6 minutes later, there was no single obvious "what changed" event visible to the affected teams. The error message (exceeded quota: requests.cpu) pointed to the quota system but did not specify which quota object was responsible or when it had been created.
The root cause is therefore twofold: a manifest authoring error (wrong object type for the intended policy) combined with a missing pre-flight validation step that would have checked whether existing workloads satisfied the proposed new quotas before applying them.
### Contributing Factors
- **Batch change application obscures the causal chain:** Applying 15 quota changes simultaneously made it harder to isolate which specific change caused the failures. If changes had been applied one namespace at a time with a validation pause between each, the error would have been caught at the first namespace within minutes.
- **No pre-flight workload compatibility check:** Before applying new `ResourceQuota` objects, there is no tooling or process step that simulates the quota against currently running workloads and pending deployments to verify compatibility. Such a check would have immediately flagged that `payments-api` (requesting 2000m CPU) would be blocked by a 500m cap.
- **Change not communicated to development teams:** The Project Frugal quota changes were not announced in `#engineering` or team-specific channels before being applied. Development teams had no context when their deploys began failing, leading to wasted debugging time before the root cause was identified centrally.
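The missing pre-flight check amounts to comparing existing CPU requests against the proposed cap. A minimal sketch, assuming the workloads' CPU requests have already been collected (for example, via the Kubernetes API); all function and workload names here are illustrative, not part of any existing tool:

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def quota_is_compatible(workload_requests: dict[str, str],
                        proposed_cap: str) -> tuple[bool, int, int]:
    """Check whether existing workloads fit under a proposed namespace-wide
    requests.cpu hard quota. Returns (compatible, total_mcpu, cap_mcpu)."""
    cap = parse_cpu_millicores(proposed_cap)
    total = sum(parse_cpu_millicores(r) for r in workload_requests.values())
    return total <= cap, total, cap


# Example: a namespace resembling prod-payments (values illustrative)
ok, total, cap = quota_is_compatible(
    {"payments-api": "2", "payments-worker": "1500m"}, "500m"
)
print(ok, total, cap)  # False 3500 500 -> the proposed quota should be rejected
```

A real version of this check would also need to account for headroom required by in-flight Deployment rollouts (surge replicas), not just steady-state usage.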
### What We Got Lucky About
- Existing running Pods were completely unaffected. Kubernetes quota enforcement is admission-time only, meaning the customer-facing services that were already running never experienced the constraint. If this had been a `LimitRange` that retroactively applied to running containers (which `LimitRange` changes do not — but a misconfigured `PodDisruptionBudget` or `NetworkPolicy` change could), the impact would have been customer-visible.
- The quota error message, while initially confusing to the affected teams, was deterministic and reproducible. Every failed deploy produced the same error, which made it easy for Marcus to recognize the pattern across 9 separate support tickets within 3 minutes of joining the `#platform-help` thread. A non-deterministic or intermittent failure would have taken far longer to aggregate into a single root cause.
## Detection
### How We Detected
Detection was driven by developer reports in #platform-help, not by automated alerting. The first automated signal was a CI/CD pipeline failure notification in the payments-api Slack channel at 09:47, but it was treated as a transient error initially. The accumulation of multiple teams reporting the same error in #platform-help starting at 10:06 prompted Marcus to investigate at the infrastructure level.
### Why We Didn't Detect Sooner
No automated alerting exists for failed Pod admission events at the namespace or cluster level. A metric like kube_pod_failed_admission_total labeled by rejection reason would have allowed an alert to fire within seconds of the first rejection at 09:47. Additionally, because the platform change was not announced, affected teams had no immediate context to connect their pipeline failures to a recent infrastructure change — they assumed a flaky cluster and re-tried before escalating.
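The alert proposed in AI-017-02 amounts to a per-namespace sliding-window counter over rejection events. A minimal sketch — the class and method names are illustrative, not a real alerting API:

```python
from collections import deque


class QuotaRejectionAlert:
    """Fire when more than `threshold` quota-rejected Pod creations are
    seen in one namespace within `window_seconds` (sketch of AI-017-02)."""

    def __init__(self, threshold: int = 5, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window = window_seconds
        self._events: dict[str, deque] = {}

    def record(self, namespace: str, timestamp: float) -> bool:
        """Record one rejection event; return True if the alert should fire."""
        events = self._events.setdefault(namespace, deque())
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.threshold


# Example: six rejections in one namespace within seconds trips the alert
alert = QuotaRejectionAlert()
results = [alert.record("prod-data", float(t)) for t in range(6)]
print(results)  # [False, False, False, False, False, True]
```

In practice this would more likely be expressed as a Prometheus alerting rule over an admission-rejection metric, but the windowing logic is the same.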
## Response
### What Went Well
- Once Marcus identified the quota change as the likely cause (10:12), the path from hypothesis to confirmation was fast — a single `kubectl describe resourcequota` call confirmed the problematic entry.
- The peer-review step before applying the patch (Raj reviewing Saoirse's corrected manifest) caught a secondary issue in the patch draft where one namespace's `LimitRange` had an incorrect `max` value; this was fixed before application.
- The `#platform-help` thread served as an effective aggregation point — 9 teams self-reporting in the same channel gave Marcus the signal needed to escalate quickly.
### What Went Poorly
- The batch application of 15 changes with no per-change validation or rollout strategy was the proximate process failure. There was no concept of "apply one, verify, apply next."
- No automated alert fired. For nearly 30 minutes, detection depended entirely on developers noticing their own broken deploys and choosing to report them rather than silently retry.
- Affected teams were not proactively notified when the patch was being prepared. Engineers continued debugging locally for 20+ minutes after Marcus had already identified the root cause.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-017-01 | Build pre-flight quota validation script: simulate proposed `ResourceQuota` changes against all running workloads in the target namespace and fail if any running Pod or active Deployment would be blocked | High | Saoirse Flanagan | Open | 2025-06-20 |
| AI-017-02 | Add cluster-level alert for Pod admission failures due to quota — alert if more than 5 quota-rejected Pod creations occur in any namespace within a 5-minute window | High | Marcus Osei | Open | 2025-06-17 |
| AI-017-03 | Update Project Frugal rollout plan: apply quota changes one namespace at a time with a 10-minute observation window between each; require explicit sign-off from a namespace owner | High | Raj Thirumurthy | Open | 2025-06-10 |
| AI-017-04 | Add mandatory change announcement step to Platform Engineering change process: infrastructure changes affecting dev team workflows must be posted in `#platform-changes` at least 1 hour before application | Medium | Marcus Osei | Open | 2025-06-13 |
| AI-017-05 | Create internal wiki page distinguishing `ResourceQuota` vs. `LimitRange` semantics with worked examples; add link to Platform Engineering onboarding docs | Medium | Saoirse Flanagan | Open | 2025-06-27 |
| AI-017-06 | Evaluate adopting `kubectl diff` as a mandatory step in all quota change runbooks to make the diff between current and proposed state explicit before applying | Low | Raj Thirumurthy | Open | 2025-06-30 |
## Lessons Learned
- **Kubernetes quota semantics are subtle and consequential:** `ResourceQuota` and `LimitRange` serve different purposes and have different enforcement timing. Engineers applying resource governance changes must understand that `ResourceQuota` hard limits apply at Pod admission time and will immediately block all new Pods, including Deployment rollouts, Job runs, and CI/CD deploy steps — not just future scheduling.
- **Batch infrastructure changes need incremental rollout discipline:** Applying 15 changes simultaneously provides no opportunity to catch a misconfiguration before it has propagated everywhere. The marginal speed gain of a batch apply does not outweigh the cost of a blast radius that spans 22 namespaces simultaneously.
- **Admission-time failures need their own alerting category:** Most cluster health alerts focus on running workload health (Pod restarts, OOMKills, node pressure). Admission-time failures are invisible to those alerts because no Pod ever enters a running state. Explicitly monitoring for admission rejections closes a blind spot that affects all teams using the cluster for deployments.
## Cross-References
- Failure Pattern: Configuration error / semantic misunderstanding; batch change without incremental validation
- Topic Packs: Kubernetes resource management, LimitRange vs. ResourceQuota, admission control, cost optimization
- Runbook: `runbooks/kubernetes/resource-quota-changes.md`
- Decision Tree: Triage → Deployment failures with `exceeded quota` → `kubectl get resourcequota -n <ns>` → check recent quota changes in git/Jira → patch or revert → re-trigger affected deployments