Postmortem: Custom Controller Missing Backoff Overwhelms API Server¶
| Field | Value |
|---|---|
| ID | PM-015 |
| Date | 2025-11-03 |
| Severity | SEV-2 |
| Duration | 22m (detection to resolution) |
| Time to Detect | ~1m (loop start 16:22:00 to alert 16:23:15) |
| Time to Mitigate | ~4m to stop the flood (16:25:55); 22m to full resolution |
| Customer Impact | All cluster operations degraded for 19 minutes; automated scaling (HPA) stalled for 19 minutes during a traffic peak; 3 deployments queued and unable to progress |
| Revenue Impact | ~$8,500 estimated (HPA unable to scale during traffic peak; ~12% of requests served at degraded latency) |
| Teams Involved | Database Platform, SRE, Kubernetes Infrastructure |
| Postmortem Author | Lindiwe Dlamini |
| Postmortem Date | 2025-11-07 |
Executive Summary¶
On November 3, 2025, the db-provisioner Kubernetes operator — an in-house controller that manages database provisioning resources — entered a tight retry loop after an RBAC role binding expired, causing a permissions error on every reconcile attempt. Because the controller was written without using controller-runtime's built-in rate-limiting work queue, each failed reconcile immediately re-queued the same object and retried, generating approximately 2,000 API server requests per second from a single controller pod. Kubernetes API server latency climbed from a baseline of 50ms to over 8 seconds within 90 seconds. All cluster control-plane operations were affected: kubectl commands timed out, deployments could not progress, and the Horizontal Pod Autoscaler was unable to fetch metrics or scale workloads. The API server did not crash — it degraded gracefully, prioritizing high-priority requests (node heartbeats) over lower-priority ones. The incident was resolved by scaling the db-provisioner deployment to zero replicas to stop the flood, followed by renewing the expired RBAC binding and redeploying the controller.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 16:22:00 | RBAC role binding db-provisioner-crb reaches its expirationSeconds TTL (set to 90 days at creation; binding was created 2025-08-05); binding is automatically deleted by the Kubernetes token controller |
| 16:22:05 | db-provisioner controller attempts reconcile of DatabaseCluster/prod-analytics-db; API call to list DatabaseCluster resources returns 403 Forbidden (binding deleted) |
| 16:22:06 | Controller reconcile function returns error; work queue re-enqueues object immediately (no RateLimitingInterface configured; plain workqueue.NewFIFO() used) |
| 16:22:06 | Controller begins tight loop: reconcile → 403 → enqueue → reconcile at approximately 600 req/s |
| 16:22:30 | API server request rate from db-provisioner service account climbs to 1,800 req/s; API server CPU at 68% (normal: 15%) |
| 16:22:50 | API server p99 latency crosses 2 seconds; kubectl get pods commands begin timing out for engineers |
| 16:23:00 | API server p99 latency reaches 4.1 seconds; HPA controller fails to fetch metrics from metrics-server via API server |
| 16:23:15 | PagerDuty alert fires: "kube-apiserver p99 latency > 3s for 60 seconds" |
| 16:23:20 | SRE on-call Rashid Al-Farsi acknowledges; kubectl unusable from laptop |
| 16:24:00 | API server p99 latency peaks at 8.3 seconds; API Priority and Fairness (APF) flow control begins shedding low-priority requests |
| 16:24:10 | Rashid identifies db-provisioner as top API request source via kube-apiserver audit log (apiserver_request_total metric by user) |
| 16:25:00 | Rashid pages Database Platform on-call Mei-Ling Chow; simultaneously attempts kubectl scale deployment db-provisioner --replicas=0 |
| 16:25:20 | First kubectl scale attempt times out (API server latency); Rashid retries |
| 16:25:55 | Second kubectl scale attempt succeeds via a lucky low-latency window; db-provisioner deployment scaled to 0 |
| 16:26:00 | API server request rate drops to baseline; latency begins recovering |
| 16:27:30 | API server p99 latency returns to 55ms; all kubectl operations responsive |
| 16:28:00 | HPA resumes metric collection; begins scaling workloads to meet traffic demand |
| 16:29:00 | Mei-Ling joins war room; confirms RBAC binding expiration as root cause from controller logs |
| 16:33:00 | New RBAC role binding db-provisioner-crb-v2 created without expirationSeconds TTL (pending permanent fix) |
| 16:41:00 | db-provisioner redeployed at 1 replica with temporary fix: exponential backoff added via workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()) |
| 16:44:00 | Controller reconciles successfully with renewed binding; all DatabaseCluster resources reconciled; incident declared resolved |
| 17:15:00 | Kubernetes Infrastructure team confirms no node NotReady events occurred during the incident (node heartbeats maintained priority throughout) |
Impact¶
Customer Impact¶
No direct customer-facing service outage occurred. The impact was mediated through the Kubernetes control plane: the Horizontal Pod Autoscaler was unable to scale for 19 minutes during a traffic peak that began at approximately 16:20. During this window, the affected services handled traffic at their pre-scale replica count, resulting in approximately 12% of requests being served at elevated latency (p99 climbed from 280ms to ~1.4 seconds). No requests were dropped. Three deployments that were initiated by other teams during the incident were unable to progress and remained in a Progressing state until the API server recovered.
Internal Impact¶
- SRE: 2 engineers × 1.5 hours = 3 engineering-hours
- Database Platform: 2 engineers × 2 hours = 4 engineering-hours (root cause analysis, RBAC fix, controller fix)
- Kubernetes Infrastructure: 1 engineer × 1 hour = 1 engineering-hour (APF analysis, node health verification)
- 3 engineering teams had deployments blocked for 19–22 minutes; one of the blocked deployments was a rollback of a production bug and had to wait out the incident
- Post-incident controller audit: ~8 additional engineering-hours
Data Impact¶
No data was lost or corrupted. The db-provisioner controller manages provisioning metadata (Kubernetes custom resources), not database data. The 403 error meant no provisioning operations were attempted, so no database state changes were made during the incident.
Root Cause¶
What Happened (Technical)¶
The db-provisioner operator was written approximately 18 months ago by the Database Platform team as an internal tool to manage DatabaseCluster and DatabaseUser custom resources. The controller was implemented with a raw client-go FIFO work queue (`workqueue.NewFIFO()`) rather than the rate-limiting work queue (client-go's `workqueue.RateLimitingInterface`) that the controller-runtime framework configures by default. The default rate limiter, `workqueue.DefaultControllerRateLimiter()`, implements per-item exponential backoff with a base delay of 5ms and a maximum delay of 1000 seconds, combined with an overall token-bucket limiter. The raw FIFO queue has no rate limiting at all.
The db-provisioner service account uses a ClusterRoleBinding that was created with expirationSeconds: 7776000 (90 days) as a security measure recommended during an access review. The binding was created on August 5, 2025, and expired at exactly 16:22:00 UTC on November 3, 2025. The expiration was not tracked in any monitoring system or calendar alert.
When the binding expired, the Kubernetes token controller automatically deleted it. The controller's reconcile function immediately received a 403 on its first API call (listing DatabaseCluster resources). The error handler returned the error to the work queue, which re-queued the item immediately. With no rate limiting, the controller cycled through reconcile-fail-requeue at approximately 600 times per second per reconcile goroutine. The controller runs with a MaxConcurrentReconciles of 3, so the effective request rate was approximately 1,800 req/s, climbing to ~2,000 req/s as the queue backed up with all managed objects.
The Kubernetes API server's API Priority and Fairness (APF) configuration did not have a dedicated flow schema or priority level for custom controllers. The db-provisioner service account's requests were classified under the default workload-high priority level, which shares a queue with HPA, node controller, and deployment controller traffic. The flood of 403 responses from db-provisioner consumed APF concurrency slots that should have been available for other controllers, causing HPA and deployment controller operations to queue and time out.
Contributing Factors¶
- **Controller implemented without `controller-runtime` rate limiting:** The `controller-runtime` library's rate-limiting work queue is specifically designed to prevent runaway retry loops. Using a raw `workqueue.NewFIFO()` is a direct violation of Kubernetes controller best practices documented in the upstream controller-runtime README and the Kubernetes contributor guide. No code review checklist item existed to catch this pattern during the controller's original development or subsequent PRs.
- **RBAC binding TTL not tracked or alerted on:** The `expirationSeconds` field on the `ClusterRoleBinding` was set intentionally as a security control, but no mechanism was put in place to alert the team before expiration. An RBAC binding that expires silently causes an immediate permissions failure in every controller that depends on it. The TTL should have been tracked as a recurring calendar item, covered by a Prometheus alert on binding age, or replaced by an automated rotation process.
- **No APF configuration for custom controllers:** The cluster's API Priority and Fairness configuration uses Kubernetes defaults, which do not isolate custom controller traffic from core controller traffic. A dedicated `FlowSchema` and `PriorityLevelConfiguration` for custom controllers (e.g., at a level below `workload-high`) would have contained the flood of 403 requests in a separate priority bucket, allowing HPA, node controller, and deployment controller operations to continue unaffected.
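The isolation described above can be expressed as a pair of manifests. A sketch using the `flowcontrol.apiserver.k8s.io/v1` API; the object names, the concurrency share, and the `db-platform` namespace are illustrative assumptions, not the values the team shipped:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: custom-controllers           # illustrative name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20     # below workload-high's share; value is an assumption
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: custom-controllers           # illustrative name
spec:
  priorityLevelConfiguration:
    name: custom-controllers
  matchingPrecedence: 800            # must match before the catch-all schemas
  distinguisherMethod:
    type: ByUser                     # fair-queue per service account
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: db-provisioner
            namespace: db-platform   # namespace is an assumption
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]
```

With this shape, a flooding custom controller exhausts only the `custom-controllers` concurrency slots; `workload-high` traffic (HPA, node controller, deployment controller) keeps its own budget.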
What We Got Lucky About¶
- The API server did not crash. The Kubernetes API server's internal rate limiting and APF flow control allowed it to degrade gracefully under the 2,000 req/s flood. Node heartbeats (which carry system-level priority in APF) continued to succeed throughout the incident, so no nodes transitioned to `NotReady`. Heartbeat failures would have triggered node eviction and could have turned a control-plane degradation into a data-plane outage with pod evictions.
- Rashid was able to issue the `kubectl scale --replicas=0` command successfully on the second attempt during a transient low-latency window. If the API server had been completely unresponsive, the mitigation path would have required either direct etcd manipulation to delete the deployment (dangerous and complex) or a node-level `crictl` command to kill the controller container, both significantly harder and riskier operations.
Detection¶
How We Detected¶
The incident was detected by a PagerDuty alert on kube-apiserver p99 request latency exceeding 3 seconds for 60 consecutive seconds. This alert fired at 16:23:15, approximately 70 seconds after the controller began its retry loop. The metric is computed from apiserver_request_duration_seconds in Prometheus and is one of the cluster's top-5 golden signals.
Why We Didn't Detect Sooner¶
Detection at T+70s is relatively fast for this class of incident. The 3-second threshold with a 60-second evaluation window is the right balance for this metric; shorter windows produce too many false positives from transient spikes. A faster detection path would be a rate alert on any single service account's API request volume (e.g., >100 req/s from one service account), but no such alert existed: the per-service-account request rate was visible on a dashboard panel but had no alert attached to it.
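A sketch of what such an alert could look like as a Prometheus rule. The metric name `apiserver_audit_requests_by_user_total` is hypothetical, standing in for whatever per-identity aggregation PM-015-04 ends up exporting; the thresholds come from the action item:

```yaml
groups:
  - name: apiserver-client-floods
    rules:
      - alert: ServiceAccountAPIRequestFlood
        # apiserver_audit_requests_by_user_total is a HYPOTHETICAL metric
        # name: a counter assumed to be labeled by requesting user or
        # service account. Substitute whatever series the cluster exports.
        expr: sum by (user) (rate(apiserver_audit_requests_by_user_total[1m])) > 100
        for: 30s
        labels:
          severity: page
        annotations:
          summary: 'Service account {{ $labels.user }} is issuing >100 API req/s'
          description: 'Sustained API request flood from a single identity; see runbook-controller-runaway.md'
```

In this incident, such a rule would have fired around 16:22:35, roughly 40 seconds before the latency alert.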
Response¶
What Went Well¶
- Rashid identified `db-provisioner` as the source of the API flood within 55 seconds of acknowledging the alert, using the `apiserver_request_total` metric broken down by the `user` label; this pre-built panel on the cluster health dashboard proved its worth in exactly this scenario.
- The mitigation action (scaling the controller to zero) was the correct minimal intervention: it stopped the flood immediately without requiring a code change, a restart of the API server, or any action on the data plane.
- Node heartbeats continued throughout, meaning no secondary incidents (pod evictions, node NotReady alerts) were triggered. The team could focus entirely on the API server flood without managing cascading node failures.
What Went Poorly¶
- The first `kubectl scale` command timed out because the API server was itself under load. The team had no documented backup path for stopping a misbehaving controller when the API server is too degraded to accept `kubectl` commands. Direct etcd manipulation or a `crictl` container kill should be documented in the runbook as escalation options.
- The RBAC binding expiration was completely invisible to the team until it failed. The `expirationSeconds` field is not surfaced in Kubernetes resource listings by default; you have to inspect the full YAML. A weekly audit or a Prometheus alert on RBAC binding age would have given a one-week warning.
- The post-incident audit of other custom controllers revealed that 2 additional controllers in the cluster also use raw FIFO queues without rate limiting. These were not discovered until after the postmortem was written, which means the cluster remained at risk for 4 days after the incident.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-015-01 | Replace `workqueue.NewFIFO()` with `workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())` in db-provisioner and all other in-house controllers; add to controller code review checklist | P0 | Database Platform (Mei-Ling Chow) | Done (db-provisioner); In Progress (others) | 2025-11-10 |
| PM-015-02 | Audit all in-house Kubernetes controllers for raw work queue usage; fix any identified before 2025-11-10 | P0 | Kubernetes Infrastructure (Lindiwe Dlamini) | In Progress | 2025-11-10 |
| PM-015-03 | Create `FlowSchema` and `PriorityLevelConfiguration` for custom controllers at a priority below `workload-high`; assign all in-house controller service accounts to this flow schema | P1 | Kubernetes Infrastructure (Lindiwe Dlamini) | In Progress | 2025-11-17 |
| PM-015-04 | Add Prometheus alert on per-service-account API request rate: fire if any single service account exceeds 100 req/s for 30 seconds | P1 | SRE (Rashid Al-Farsi) | Planned | 2025-11-14 |
| PM-015-05 | Implement RBAC binding expiration monitoring: Prometheus alert firing 7 days before any `ClusterRoleBinding` with `expirationSeconds` expires; evaluated cluster-wide | P1 | Kubernetes Infrastructure (Lindiwe Dlamini) | Planned | 2025-11-17 |
| PM-015-06 | Add runbook section: "API server too degraded to accept kubectl — emergency controller stop via crictl"; document node SSH path and container name lookup | P2 | SRE (Rashid Al-Farsi) | Planned | 2025-11-14 |
Lessons Learned¶
- **Rate limiting in a Kubernetes controller is not optional; it is the mechanism that prevents the controller from becoming a denial-of-service source.** The `controller-runtime` library's rate-limiting work queue exists precisely because retry-on-error without backoff is a known failure mode in distributed systems. Writing a controller without it is equivalent to writing an HTTP client that retries on 5xx with no backoff. The pattern should be enforced at code review time via checklist, linting, or static analysis.
- **Security controls with expiration dates must be paired with pre-expiration alerts.** A TTL on an RBAC binding is a good security practice; a TTL with no monitoring is a time bomb. Every time-bounded security control (RBAC bindings, TLS certificates, API keys, service account tokens) must have an associated alert that fires far enough in advance, days to weeks, to allow planned renewal without an incident.
- **API Priority and Fairness is a cluster-level blast radius control that requires active configuration.** Kubernetes APF with default settings does not protect core controllers from misbehaving custom controllers. Dedicating a lower-priority APF level to custom and third-party controllers is a cheap, effective way to ensure that a runaway custom controller cannot impair HPA, node management, or deployment rollouts.
Cross-References¶
- Failure Patterns: Retry Storm Without Backoff; Missing Security Control Lifecycle Management; Shared Resource Exhaustion
- Topic Packs: `kubernetes-controllers` (controller-runtime, work queues, rate limiting, reconcile loops), `kubernetes-rbac` (ClusterRoleBinding, expiration, service accounts), `kubernetes-apiserver` (API Priority and Fairness, flow schemas, audit logs)
- Runbooks: `runbook-apiserver-high-latency.md`, `runbook-controller-runaway.md`
- Decision Tree: API server latency spike → identify the top API request source by service account (`apiserver_request_total` by `user`) → if a single service account dominates, scale that controller to zero → identify the root cause from controller logs → fix and redeploy with rate limiting confirmed