# Postmortem: Resource Quota Misconfiguration Blocks All Deployments
| Field | Value |
|---|---|
| ID | PM-017 |
| Date | 2025-06-03 |
| Severity | SEV-3 |
| Duration | 2h 10m (impact start to resolution) |
| Time to Detect | 25m |
| Time to Mitigate | 1h 45m |
| Customer Impact | None — existing Pods were unaffected; only new deployments and rollouts failed |
| Revenue Impact | None |
| Teams Involved | Platform Engineering, Backend Engineering, Data Engineering, SRE On-Call |
| Postmortem Author | Marcus Osei |
| Postmortem Date | 2025-06-06 |
## Executive Summary
On 2025-06-03 starting at 09:41 UTC, the Platform Engineering team applied a batch of 15 namespace resource quota changes as part of a cost optimization initiative. One quota — a `requests.cpu` hard limit of 500m — was misconfigured: it was intended to be a per-container default request (a LimitRange `defaultRequest` entry), not a ResourceQuota hard cap on the namespace's aggregate CPU requests. Because existing workloads in the affected namespaces already requested far more than 500m of CPU in total, all new Pod creations and Deployment rollouts immediately began failing with `exceeded quota: requests.cpu` errors. Existing running Pods were unaffected. Over the following 45 minutes, 9 development teams filed support tickets as their CI/CD pipelines began failing on deploy steps. The root cause was traced to the batch quota change, the misconfigured ResourceQuota object was patched, and all pending deployments were re-triggered successfully by 11:51 UTC.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 09:41 | Platform Engineering applies batch of 15 ResourceQuota and LimitRange changes across 22 namespaces via kubectl apply -f quotas/ as part of Project Frugal (cost optimization initiative) |
| 09:42 | Change completes with no errors reported by kubectl; platform engineer Saoirse Flanagan marks the change as applied in the Jira ticket |
| 09:47 | First CI/CD pipeline failure observed: payments-api deploy job fails in prod-payments namespace with Error creating: pods "payments-api-7d9f8b-xxx" is forbidden: exceeded quota: requests.cpu |
| 09:55 | data-pipeline team's nightly backfill deploy fails in prod-data namespace; engineer assumes it is a flaky cluster issue and re-triggers |
| 10:06 | Re-triggered data-pipeline deploy also fails; engineer Yusuf Adebisi opens Slack thread in #platform-help: "deploys broken, quota errors, anyone else?" |
| 10:09 | Three more teams report in the thread within 3 minutes; SRE on-call Marcus Osei joins the thread |
| 10:12 | Marcus runs kubectl get resourcequota -A and sees the new quotas applied across all namespaces; cross-references timestamps with Jira |
| 10:15 | Marcus identifies the misconfigured quota: a `requests.cpu` hard limit of 500m in the ResourceQuota spec; notes that this is far below the 1000m+ requests of nearly all individual services, let alone each namespace's aggregate |
| 10:17 | Marcus pages Saoirse; Saoirse joins and confirms the intent was a LimitRange default, not a ResourceQuota hard cap |
| 10:22 | Saoirse prepares patch: removes the `requests.cpu` hard entry from the ResourceQuota spec; adds equivalent LimitRange `defaultRequest` entries |
| 10:35 | Patch is peer-reviewed by Platform Engineering lead Raj Thirumurthy; approved |
| 10:38 | Saoirse applies patch to all 22 affected namespaces via corrected manifest |
| 10:41 | Test deploy of payments-api succeeds in prod-payments; quota no longer blocking |
| 10:44 | Platform team posts in #platform-help and #incidents: quotas patched, teams should re-trigger failed deploys |
| 11:15 | Majority of failed deployments re-triggered and completed; Data Engineering manually re-triggers their backfill jobs |
| 11:51 | Last confirmed failed deployment re-triggered successfully; incident declared resolved |
| 12:10 | Postmortem scheduled for 2025-06-06 |
## Impact
### Customer Impact
None. Kubernetes ResourceQuota enforcement applies only to new Pod admission requests. Pods that were already running at the time the quota was applied continued running without interruption. All customer-facing services had already completed their most recent rollout before 09:41, so no customer traffic was affected. The impact was entirely confined to the CD pipeline layer.
### Internal Impact
- 9 development teams experienced failed deployments, totaling an estimated 15–20 failed pipeline runs across CI/CD systems
- Approximately 4 hours of aggregate engineering time lost across affected teams (debugging, re-triggering, waiting)
- Saoirse Flanagan (Platform Engineering): ~2 hours unplanned incident response
- Marcus Osei (SRE On-Call): ~2.5 hours including coordination and postmortem scheduling
- Raj Thirumurthy (Platform Engineering Lead): ~45 minutes for patch review and incident coordination
- Data Engineering's nightly backfill job ran approximately 2 hours late, shifting a dependent report delivery by one business day
- Project Frugal milestone delayed: remaining quota changes deprioritized until a pre-flight validation process is in place
### Data Impact
None. No data was lost or corrupted. The Data Engineering backfill delay resulted in a stale report but no data loss; the backfill ran to completion once unblocked.
## Root Cause
### What Happened (Technical)
The Project Frugal cost optimization initiative aimed to establish baseline resource governance across all production and staging namespaces. The plan called for two types of Kubernetes objects: ResourceQuota (namespace-level aggregate limits) and LimitRange (per-container defaults and bounds). The confusion arose between these two object types.
The intent for CPU defaults was to set a LimitRange `defaultRequest` of 500m — meaning containers that do not specify a CPU request would inherit 500m as their request. This is a sensible default that prevents unbounded resource consumption by services that forget to set requests. However, the manifest was written as a ResourceQuota `spec.hard` entry with key `requests.cpu`, which Kubernetes interprets very differently: the sum of CPU requests across all Pods in the namespace may not exceed 500m.
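The contrast between the two object types can be sketched as follows. The actual Project Frugal manifests are not reproduced in this postmortem; the names, namespace, and values below are illustrative.

```yaml
# What was applied (misconfigured): a ResourceQuota hard cap.
# Kubernetes treats spec.hard["requests.cpu"] as a namespace-wide
# aggregate limit, enforced at Pod admission time.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # illustrative name
  namespace: prod-payments
spec:
  hard:
    requests.cpu: 500m       # caps the SUM of all Pods' CPU requests
---
# What was intended: a LimitRange default. Containers that omit a CPU
# request inherit 500m; containers with explicit requests are untouched.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults         # illustrative name
  namespace: prod-payments
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 500m
```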
Because nearly every production service specifies explicit CPU requests between 1000m and 4000m, each namespace's aggregate usage already far exceeded the new 500m cap, so every new Pod admission was immediately rejected by the Kubernetes API server with a `forbidden: exceeded quota` error. The `kubectl apply` of the batch manifest returned exit code 0 because the objects were syntactically valid and were accepted by the API server — quota enforcement happens only at Pod admission time, not at quota-creation time, and a ResourceQuota whose hard limit is already below current usage is accepted silently.
The batch nature of the change (15 manifests applied together via kubectl apply -f quotas/) meant that when failures were first reported 6 minutes later, there was no single obvious "what changed" event visible to the affected teams. The error message (exceeded quota: requests.cpu) pointed to the quota system but did not specify which quota object was responsible or when it had been created.
The root cause is therefore twofold: a manifest authoring error (wrong object type for the intended policy) combined with a missing pre-flight validation step that would have checked whether existing workloads satisfied the proposed new quotas before applying them.
### Contributing Factors
- **Batch change application obscures the causal chain:** Applying 15 quota changes simultaneously made it harder to isolate which specific change caused the failures. If changes had been applied one namespace at a time with a validation pause between each, the error would have been caught at the first namespace within minutes.
- **No pre-flight workload compatibility check:** Before applying new `ResourceQuota` objects, there is no tooling or process step that simulates the quota against currently running workloads and pending deployments to verify compatibility. Such a check would have immediately flagged that `payments-api` (requesting 2000m CPU) would be blocked by a 500m cap.
- **Change not communicated to development teams:** The Project Frugal quota changes were not announced in `#engineering` or team-specific channels before being applied. Development teams had no context when their deploys began failing, leading to wasted debugging time before the root cause was identified centrally.
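The missing pre-flight check amounts to comparing existing CPU requests against the proposed cap. A minimal sketch, assuming the workloads' CPU requests have already been collected (for example, via the Kubernetes API); all function and workload names here are illustrative, not part of any existing tool:

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def quota_is_compatible(workload_requests: dict[str, str],
                        proposed_cap: str) -> tuple[bool, int, int]:
    """Check whether existing workloads fit under a proposed namespace-wide
    requests.cpu hard quota. Returns (compatible, total_mcpu, cap_mcpu)."""
    cap = parse_cpu_millicores(proposed_cap)
    total = sum(parse_cpu_millicores(r) for r in workload_requests.values())
    return total <= cap, total, cap


# Example: a namespace resembling prod-payments (values illustrative)
ok, total, cap = quota_is_compatible(
    {"payments-api": "2", "payments-worker": "1500m"}, "500m"
)
print(ok, total, cap)  # False 3500 500 -> the proposed quota should be rejected
```

A real version of this check would also need to account for headroom required by in-flight Deployment rollouts (surge replicas), not just steady-state usage.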
### What We Got Lucky About
- Existing running Pods were completely unaffected. Kubernetes quota enforcement is admission-time only, meaning the customer-facing services that were already running never experienced the constraint. If this had been a `LimitRange` that retroactively applied to running containers (which `LimitRange` changes do not — but a misconfigured `PodDisruptionBudget` or `NetworkPolicy` change could), the impact would have been customer-visible.
- The quota error message, while initially confusing to the affected teams, was deterministic and reproducible. Every failed deploy produced the same error, which made it easy for Marcus to recognize the pattern across 9 separate support tickets within 3 minutes of joining the `#platform-help` thread. A non-deterministic or intermittent failure would have taken far longer to aggregate into a single root cause.
## Detection
### How We Detected
Detection was driven by developer reports in #platform-help, not by automated alerting. The first automated signal was a CI/CD pipeline failure notification in the payments-api Slack channel at 09:47, but it was treated as a transient error initially. The accumulation of multiple teams reporting the same error in #platform-help starting at 10:06 prompted Marcus to investigate at the infrastructure level.
### Why We Didn't Detect Sooner
No automated alerting exists for failed Pod admission events at the namespace or cluster level. A metric like kube_pod_failed_admission_total labeled by rejection reason would have allowed an alert to fire within seconds of the first rejection at 09:47. Additionally, because the platform change was not announced, affected teams had no immediate context to connect their pipeline failures to a recent infrastructure change — they assumed a flaky cluster and re-tried before escalating.
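The alert proposed in AI-017-02 amounts to a per-namespace sliding-window counter over rejection events. A minimal sketch — the class and method names are illustrative, not a real alerting API:

```python
from collections import deque


class QuotaRejectionAlert:
    """Fire when more than `threshold` quota-rejected Pod creations are
    seen in one namespace within `window_seconds` (sketch of AI-017-02)."""

    def __init__(self, threshold: int = 5, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window = window_seconds
        self._events: dict[str, deque] = {}

    def record(self, namespace: str, timestamp: float) -> bool:
        """Record one rejection event; return True if the alert should fire."""
        events = self._events.setdefault(namespace, deque())
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.threshold


# Example: six rejections in one namespace within seconds trips the alert
alert = QuotaRejectionAlert()
results = [alert.record("prod-data", float(t)) for t in range(6)]
print(results)  # [False, False, False, False, False, True]
```

In practice this would more likely be expressed as a Prometheus alerting rule over an admission-rejection metric, but the windowing logic is the same.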
## Response
### What Went Well
- Once Marcus identified the quota change as the likely cause (10:12), the path from hypothesis to confirmation was fast — a single `kubectl describe resourcequota` call confirmed the problematic entry.
- The peer-review step before applying the patch (Raj reviewing Saoirse's corrected manifest) caught a secondary issue in the patch draft where one namespace's `LimitRange` had an incorrect `max` value; this was fixed before application.
- The `#platform-help` thread served as an effective aggregation point — 9 teams self-reporting in the same channel gave Marcus the signal needed to escalate quickly.
### What Went Poorly
- The batch application of 15 changes with no per-change validation or rollout strategy was the proximate process failure. There was no concept of "apply one, verify, apply next."
- No automated alert fired. For nearly 30 minutes, detection depended entirely on developers noticing their own broken deploys and choosing to report them rather than silently retry.
- Affected teams were not proactively notified when the patch was being prepared. Engineers continued debugging locally for 20+ minutes after Marcus had already identified the root cause.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-017-01 | Build pre-flight quota validation script: simulate proposed `ResourceQuota` changes against all running workloads in the target namespace and fail if any running Pod or active Deployment would be blocked | High | Saoirse Flanagan | Open | 2025-06-20 |
| AI-017-02 | Add cluster-level alert for Pod admission failures due to quota — alert if more than 5 quota-rejected Pod creations occur in any namespace within a 5-minute window | High | Marcus Osei | Open | 2025-06-17 |
| AI-017-03 | Update Project Frugal rollout plan: apply quota changes one namespace at a time with a 10-minute observation window between each; require explicit sign-off from a namespace owner | High | Raj Thirumurthy | Open | 2025-06-10 |
| AI-017-04 | Add mandatory change announcement step to Platform Engineering change process: infrastructure changes affecting dev team workflows must be posted in `#platform-changes` at least 1 hour before application | Medium | Marcus Osei | Open | 2025-06-13 |
| AI-017-05 | Create internal wiki page distinguishing `ResourceQuota` vs. `LimitRange` semantics with worked examples; add link to Platform Engineering onboarding docs | Medium | Saoirse Flanagan | Open | 2025-06-27 |
| AI-017-06 | Evaluate adopting `kubectl diff` as a mandatory step in all quota change runbooks to make the diff between current and proposed state explicit before applying | Low | Raj Thirumurthy | Open | 2025-06-30 |
## Lessons Learned
- **Kubernetes quota semantics are subtle and consequential:** `ResourceQuota` and `LimitRange` serve different purposes and have different enforcement timing. Engineers applying resource governance changes must understand that `ResourceQuota` hard limits apply at Pod admission time and will immediately block all new Pods, including Deployment rollouts, Job runs, and CI/CD deploy steps — not just future scheduling.
- **Batch infrastructure changes need incremental rollout discipline:** Applying 15 changes simultaneously provides no opportunity to catch a misconfiguration before it has propagated everywhere. The marginal speed gain of a batch apply does not outweigh the cost of a blast radius that spans 22 namespaces simultaneously.
- **Admission-time failures need their own alerting category:** Most cluster health alerts focus on running workload health (Pod restarts, OOMKills, node pressure). Admission-time failures are invisible to those alerts because no Pod ever enters a running state. Explicitly monitoring for admission rejections closes a blind spot that affects all teams using the cluster for deployments.
## Cross-References
- Failure Pattern: Configuration error / semantic misunderstanding; batch change without incremental validation
- Topic Packs: Kubernetes resource management, LimitRange vs. ResourceQuota, admission control, cost optimization
- Runbook: `runbooks/kubernetes/resource-quota-changes.md`
- Decision Tree: Triage → Deployment failures with `exceeded quota` → `kubectl get resourcequota -n <ns>` → check recent quota changes in git/Jira → patch or revert → re-trigger affected deployments