Postmortem: Custom Controller Missing Backoff Overwhelms API Server¶
| Field | Value |
|---|---|
| ID | PM-015 |
| Date | 2025-11-03 |
| Severity | SEV-2 |
| Duration | 22m (detection to resolution) |
| Time to Detect | ~1m (loop start 16:22:00 to alert 16:23:15) |
| Time to Mitigate | ~4m to stop the flood (16:25:55); 22m to full resolution |
| Customer Impact | All cluster operations degraded for 19 minutes; automated scaling (HPA) stalled for 19 minutes during a traffic peak; 3 deployments queued and unable to progress |
| Revenue Impact | ~$8,500 estimated (HPA unable to scale during traffic peak; ~12% of requests served at degraded latency) |
| Teams Involved | Database Platform, SRE, Kubernetes Infrastructure |
| Postmortem Author | Lindiwe Dlamini |
| Postmortem Date | 2025-11-07 |
Executive Summary¶
On November 3, 2025, the db-provisioner Kubernetes operator — an in-house controller that manages database provisioning resources — entered a tight retry loop after an RBAC role binding expired, causing a permissions error on every reconcile attempt. Because the controller was written without using controller-runtime's built-in rate-limiting work queue, each failed reconcile immediately re-queued the same object and retried, generating approximately 2,000 API server requests per second from a single controller pod. Kubernetes API server latency climbed from a baseline of 50ms to over 8 seconds within 90 seconds. All cluster control-plane operations were affected: kubectl commands timed out, deployments could not progress, and the Horizontal Pod Autoscaler was unable to fetch metrics or scale workloads. The API server did not crash — it degraded gracefully, prioritizing high-priority requests (node heartbeats) over lower-priority ones. The incident was resolved by scaling the db-provisioner deployment to zero replicas to stop the flood, followed by renewing the expired RBAC binding and redeploying the controller.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 16:22:00 | RBAC role binding db-provisioner-crb reaches its expirationSeconds TTL (set to 90 days at creation; binding was created 2025-08-05); binding is automatically deleted by the Kubernetes token controller |
| 16:22:05 | db-provisioner controller attempts reconcile of DatabaseCluster/prod-analytics-db; API call to list DatabaseCluster resources returns 403 Forbidden (binding deleted) |
| 16:22:06 | Controller reconcile function returns error; work queue re-enqueues object immediately (no RateLimitingInterface configured; plain workqueue.NewFIFO() used) |
| 16:22:06 | Controller begins tight loop: reconcile → 403 → enqueue → reconcile at approximately 600 req/s |
| 16:22:30 | API server request rate from db-provisioner service account climbs to 1,800 req/s; API server CPU at 68% (normal: 15%) |
| 16:22:50 | API server p99 latency crosses 2 seconds; kubectl get pods commands begin timing out for engineers |
| 16:23:00 | API server p99 latency reaches 4.1 seconds; HPA controller fails to fetch metrics from metrics-server via API server |
| 16:23:15 | PagerDuty alert fires: "kube-apiserver p99 latency > 3s for 60 seconds" |
| 16:23:20 | SRE on-call Rashid Al-Farsi acknowledges; kubectl unusable from laptop |
| 16:24:00 | API server p99 latency peaks at 8.3 seconds; API Priority and Fairness (APF) flow control begins shedding low-priority requests |
| 16:24:10 | Rashid identifies db-provisioner as top API request source via kube-apiserver audit log (apiserver_request_total metric by user) |
| 16:25:00 | Rashid pages Database Platform on-call Mei-Ling Chow; simultaneously attempts kubectl scale deployment db-provisioner --replicas=0 |
| 16:25:20 | First kubectl scale attempt times out (API server latency); Rashid retries |
| 16:25:55 | Second kubectl scale attempt succeeds via a lucky low-latency window; db-provisioner deployment scaled to 0 |
| 16:26:00 | API server request rate drops to baseline; latency begins recovering |
| 16:27:30 | API server p99 latency returns to 55ms; all kubectl operations responsive |
| 16:28:00 | HPA resumes metric collection; begins scaling workloads to meet traffic demand |
| 16:29:00 | Mei-Ling joins war room; confirms RBAC binding expiration as root cause from controller logs |
| 16:33:00 | New RBAC role binding db-provisioner-crb-v2 created without expirationSeconds TTL (pending permanent fix) |
| 16:41:00 | db-provisioner redeployed at 1 replica with temporary fix: exponential backoff added via workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()) |
| 16:44:00 | Controller reconciles successfully with renewed binding; all DatabaseCluster resources reconciled; incident declared resolved |
| 17:15:00 | Kubernetes Infrastructure team confirms no node NotReady events occurred during the incident (node heartbeats maintained priority throughout) |
Impact¶
Customer Impact¶
No direct customer-facing service outage occurred. The impact was mediated through the Kubernetes control plane: the Horizontal Pod Autoscaler was unable to scale for 19 minutes during a traffic peak that began at approximately 16:20. During this window, the affected services handled traffic at their pre-scale replica count, resulting in approximately 12% of requests being served at elevated latency (p99 climbed from 280ms to ~1.4 seconds). No requests were dropped. Three deployments that were initiated by other teams during the incident were unable to progress and remained in a Progressing state until the API server recovered.
Internal Impact¶
- SRE: 2 engineers × 1.5 hours = 3 engineering-hours
- Database Platform: 2 engineers × 2 hours = 4 engineering-hours (root cause analysis, RBAC fix, controller fix)
- Kubernetes Infrastructure: 1 engineer × 1 hour = 1 engineering-hour (APF analysis, node health verification)
- 3 engineering teams had deployments blocked for 19–22 minutes; one of the blocked deployments was a rollback of a production bug and had to wait out the incident
- Post-incident controller audit: ~8 additional engineering-hours
Data Impact¶
No data was lost or corrupted. The db-provisioner controller manages provisioning metadata (Kubernetes custom resources), not database data. The 403 error meant no provisioning operations were attempted, so no database state changes were made during the incident.
Root Cause¶
What Happened (Technical)¶
The db-provisioner operator was written approximately 18 months ago by the Database Platform team as an internal tool to manage DatabaseCluster and DatabaseUser custom resources. The controller was implemented with a raw client-go FIFO work queue (`workqueue.NewFIFO()`) rather than the rate-limiting work queue (client-go's `workqueue.RateLimitingInterface`) that the controller-runtime framework configures by default. The default rate limiter, `workqueue.DefaultControllerRateLimiter()`, implements per-item exponential backoff with a base delay of 5ms and a maximum delay of 1000 seconds, combined with an overall token-bucket limiter. The raw FIFO queue has no rate limiting at all.
The db-provisioner service account uses a ClusterRoleBinding that was created with expirationSeconds: 7776000 (90 days) as a security measure recommended during an access review. The binding was created on August 5, 2025, and expired at exactly 16:22:00 UTC on November 3, 2025. The expiration was not tracked in any monitoring system or calendar alert.
When the binding expired, the Kubernetes token controller automatically deleted it. The controller's reconcile function immediately received a 403 on its first API call (listing DatabaseCluster resources). The error handler returned the error to the work queue, which re-queued the item immediately. With no rate limiting, the controller cycled through reconcile-fail-requeue at approximately 600 times per second per reconcile goroutine. The controller runs with a MaxConcurrentReconciles of 3, so the effective request rate was approximately 1,800 req/s, climbing to ~2,000 req/s as the queue backed up with all managed objects.
The Kubernetes API server's API Priority and Fairness (APF) configuration did not have a dedicated flow schema or priority level for custom controllers. The db-provisioner service account's requests were classified under the default workload-high priority level, which shares a queue with HPA, node controller, and deployment controller traffic. The flood of 403 responses from db-provisioner consumed APF concurrency slots that should have been available for other controllers, causing HPA and deployment controller operations to queue and time out.
Contributing Factors¶
- **Controller implemented without `controller-runtime` rate limiting:** The `controller-runtime` library's rate-limiting work queue is specifically designed to prevent runaway retry loops. Using a raw `workqueue.NewFIFO()` is a direct violation of Kubernetes controller best practices documented in the upstream controller-runtime README and the Kubernetes contributor guide. No code review checklist item existed to catch this pattern during the controller's original development or subsequent PRs.
- **RBAC binding TTL not tracked or alerted on:** The `expirationSeconds` field on the `ClusterRoleBinding` was set intentionally as a security control, but no mechanism was put in place to alert the team before expiration. An RBAC binding that expires silently causes an immediate permissions failure in every controller that depends on it. The TTL should have been tracked as a recurring calendar item, covered by a Prometheus alert on binding age, or replaced by an automated rotation process.
- **No APF configuration for custom controllers:** The cluster's API Priority and Fairness configuration uses Kubernetes defaults, which do not isolate custom controller traffic from core controller traffic. A dedicated `FlowSchema` and `PriorityLevelConfiguration` for custom controllers (e.g., at a level below `workload-high`) would have contained the flood of 403 requests in a separate priority bucket, allowing HPA, node controller, and deployment controller operations to continue unaffected.
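The isolation described above can be expressed as a pair of manifests. A sketch using the `flowcontrol.apiserver.k8s.io/v1` API; the object names, the concurrency share, and the `db-platform` namespace are illustrative assumptions, not the values the team shipped:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: custom-controllers           # illustrative name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20     # below workload-high's share; value is an assumption
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: custom-controllers           # illustrative name
spec:
  priorityLevelConfiguration:
    name: custom-controllers
  matchingPrecedence: 800            # must match before the catch-all schemas
  distinguisherMethod:
    type: ByUser                     # fair-queue per service account
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: db-provisioner
            namespace: db-platform   # namespace is an assumption
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]
```

With this shape, a flooding custom controller exhausts only the `custom-controllers` concurrency slots; `workload-high` traffic (HPA, node controller, deployment controller) keeps its own budget.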
What We Got Lucky About¶
- The API server did not crash. The Kubernetes API server's internal rate limiting and APF flow control allowed it to degrade gracefully under the 2,000 req/s flood. Node heartbeats (which carry system-level priority in APF) continued to succeed throughout the incident, so no nodes transitioned to `NotReady`. Heartbeat failures would have triggered node eviction and could have turned a control-plane degradation into a data-plane outage with pod evictions.
- Rashid was able to issue the `kubectl scale --replicas=0` command successfully on the second attempt during a transient low-latency window. If the API server had been completely unresponsive, the mitigation path would have required either direct etcd manipulation to delete the deployment (dangerous and complex) or a node-level `crictl` command to kill the controller container, both significantly harder and riskier operations.
Detection¶
How We Detected¶
The incident was detected by a PagerDuty alert on kube-apiserver p99 request latency exceeding 3 seconds for 60 consecutive seconds. This alert fired at 16:23:15, approximately 70 seconds after the controller began its retry loop. The metric is computed from apiserver_request_duration_seconds in Prometheus and is one of the cluster's top-5 golden signals.
Why We Didn't Detect Sooner¶
Detection at T+70s is relatively fast for this class of incident. The 3-second threshold with a 60-second evaluation window is the right balance for this metric; shorter windows produce too many false positives from transient spikes. A faster detection path would be a rate alert on any single service account's API request volume (e.g., >100 req/s from one service account), but no such alert existed: the per-service-account request rate was visible on a dashboard panel but had no alert attached to it.
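A sketch of what such an alert could look like as a Prometheus rule. The metric name `apiserver_audit_requests_by_user_total` is hypothetical, standing in for whatever per-identity aggregation PM-015-04 ends up exporting; the thresholds come from the action item:

```yaml
groups:
  - name: apiserver-client-floods
    rules:
      - alert: ServiceAccountAPIRequestFlood
        # apiserver_audit_requests_by_user_total is a HYPOTHETICAL metric
        # name: a counter assumed to be labeled by requesting user or
        # service account. Substitute whatever series the cluster exports.
        expr: sum by (user) (rate(apiserver_audit_requests_by_user_total[1m])) > 100
        for: 30s
        labels:
          severity: page
        annotations:
          summary: 'Service account {{ $labels.user }} is issuing >100 API req/s'
          description: 'Sustained API request flood from a single identity; see runbook-controller-runaway.md'
```

In this incident, such a rule would have fired around 16:22:35, roughly 40 seconds before the latency alert.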
Response¶
What Went Well¶
- Rashid identified `db-provisioner` as the source of the API flood within 55 seconds of acknowledging the alert, using the `apiserver_request_total` metric broken down by the `user` label; this pre-built panel on the cluster health dashboard proved its worth in exactly this scenario.
- The mitigation action (scaling the controller to zero) was the correct minimal intervention: it stopped the flood immediately without requiring a code change, a restart of the API server, or any action on the data plane.
- Node heartbeats continued throughout, meaning no secondary incidents (pod evictions, node NotReady alerts) were triggered. The team could focus entirely on the API server flood without managing cascading node failures.
What Went Poorly¶
- The first `kubectl scale` command timed out because the API server was itself under load. The team had no documented backup path for stopping a misbehaving controller when the API server is too degraded to accept `kubectl` commands. Direct etcd manipulation or a `crictl` container kill should be documented in the runbook as escalation options.
- The RBAC binding expiration was completely invisible to the team until it failed. The `expirationSeconds` field is not surfaced in Kubernetes resource listings by default; you have to inspect the full YAML. A weekly audit or a Prometheus alert on RBAC binding age would have given a one-week warning.
- The post-incident audit of other custom controllers revealed that 2 additional controllers in the cluster also use raw FIFO queues without rate limiting. These were not discovered until after the postmortem was written, which means the cluster remained at risk for 4 days after the incident.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-015-01 | Replace `workqueue.NewFIFO()` with `workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())` in db-provisioner and all other in-house controllers; add to controller code review checklist | P0 | Database Platform (Mei-Ling Chow) | Done (db-provisioner); In Progress (others) | 2025-11-10 |
| PM-015-02 | Audit all in-house Kubernetes controllers for raw work queue usage; fix any identified before 2025-11-10 | P0 | Kubernetes Infrastructure (Lindiwe Dlamini) | In Progress | 2025-11-10 |
| PM-015-03 | Create `FlowSchema` and `PriorityLevelConfiguration` for custom controllers at a priority below `workload-high`; assign all in-house controller service accounts to this flow schema | P1 | Kubernetes Infrastructure (Lindiwe Dlamini) | In Progress | 2025-11-17 |
| PM-015-04 | Add Prometheus alert on per-service-account API request rate: fire if any single service account exceeds 100 req/s for 30 seconds | P1 | SRE (Rashid Al-Farsi) | Planned | 2025-11-14 |
| PM-015-05 | Implement RBAC binding expiration monitoring: Prometheus alert firing 7 days before any `ClusterRoleBinding` with `expirationSeconds` expires; evaluated cluster-wide | P1 | Kubernetes Infrastructure (Lindiwe Dlamini) | Planned | 2025-11-17 |
| PM-015-06 | Add runbook section: "API server too degraded to accept kubectl — emergency controller stop via crictl"; document node SSH path and container name lookup | P2 | SRE (Rashid Al-Farsi) | Planned | 2025-11-14 |
Lessons Learned¶
- **Rate limiting in a Kubernetes controller is not optional; it is the mechanism that prevents the controller from becoming a denial-of-service source.** The `controller-runtime` library's rate-limiting work queue exists precisely because retry-on-error without backoff is a known failure mode in distributed systems. Writing a controller without it is equivalent to writing an HTTP client that retries on 5xx with no backoff. The pattern should be enforced at code review time via checklist, linting, or static analysis.
- **Security controls with expiration dates must be paired with pre-expiration alerts.** A TTL on an RBAC binding is a good security practice; a TTL with no monitoring is a time bomb. Every time-bounded security control (RBAC bindings, TLS certificates, API keys, service account tokens) must have an associated alert that fires far enough in advance, days to weeks, to allow planned renewal without an incident.
- **API Priority and Fairness is a cluster-level blast radius control that requires active configuration.** Kubernetes APF with default settings does not protect core controllers from misbehaving custom controllers. Dedicating a lower-priority APF level to custom and third-party controllers is a cheap, effective way to ensure that a runaway custom controller cannot impair HPA, node management, or deployment rollouts.
Cross-References¶
- Failure Patterns: Retry Storm Without Backoff; Missing Security Control Lifecycle Management; Shared Resource Exhaustion
- Topic Packs: `kubernetes-controllers` (controller-runtime, work queues, rate limiting, reconcile loops), `kubernetes-rbac` (ClusterRoleBinding, expiration, service accounts), `kubernetes-apiserver` (API Priority and Fairness, flow schemas, audit logs)
- Runbooks: `runbook-apiserver-high-latency.md`, `runbook-controller-runaway.md`
- Decision Tree: API server latency spike → identify the top API request source by service account (`apiserver_request_total` by `user`) → if a single service account dominates, scale that controller to zero → identify the root cause from controller logs → fix and redeploy with rate limiting confirmed