
Postmortem: Single etcd Member Disk Full Degrades Control Plane

ID: PM-021
Date: 2025-03-11
Severity: SEV-3
Duration: 0h 45m (detection to resolution)
Time to Detect: 12m
Time to Mitigate: 33m
Customer Impact: None — control plane degradation was internal only; all running workloads remained healthy
Revenue Impact: None
Teams Involved: Platform Engineering, Site Reliability, Kubernetes Ops
Postmortem Author: Priya Subramanian
Postmortem Date: 2025-03-14

Executive Summary

On 2025-03-11 at 14:22 UTC, one member of a production 3-node etcd cluster (etcd-2) ran out of disk space: etcd auto-compaction had never been configured, and the nightly compact-and-defrag CronJob that had compensated for that gap was silently deleted during a routine cleanup ticket. The remaining two members maintained quorum, but write latency across the cluster roughly tripled as the healthy members waited out timeouts from the disk-full member during Raft consensus rounds. Kubernetes API server p99 latency climbed from ~200ms to 1.2s, HPA scaling decisions stalled, and new pod scheduling took 30+ seconds. The issue was resolved by compacting and defragmenting etcd-2 to reclaim disk space, clearing the alarm, and restoring the CronJob. No workloads were lost and no customer traffic was affected.

Timeline (All times UTC)

2025-03-11 14:10 etcd-2 disk reaches 95% capacity; etcd begins logging mvcc: database space exceeded warnings at WARN level
2025-03-11 14:18 etcd-2 disk hits 100%; etcd-2 transitions to alarm: NOSPACE; write requests to etcd-2 begin failing
2025-03-11 14:22 API server p99 latency alert fires: kube_apiserver_request_duration_seconds{p99} > 800ms for 3 consecutive minutes
2025-03-11 14:24 On-call SRE (Tomás Reyes) acknowledges alert; begins investigating API server logs
2025-03-11 14:27 Tomás rules out API server pod issues; shifts focus to etcd; runs etcdctl endpoint status
2025-03-11 14:29 etcdctl endpoint status reveals etcd-2 DB size is 8.0 GiB (at limit); etcd-2 flagged as NOSPACE alarm active
2025-03-11 14:31 Tomás confirms etcd compaction is not configured (--auto-compaction-retention not set); cluster has never been compacted
2025-03-11 14:33 Tomás searches for defrag CronJob; finds it absent from the kube-system namespace
2025-03-11 14:35 Git log review finds CronJob deleted 18 days earlier in PR #4401 ("cleanup unused CronJobs") by engineer Dmitri Volkov
2025-03-11 14:37 Tomás manually triggers etcd compaction: etcdctl compact $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
2025-03-11 14:39 Compaction completes; DB size drops from 8.0 GiB to 2.1 GiB on etcd-2
2025-03-11 14:41 Tomás runs etcdctl defrag --endpoints=etcd-2:2379; physical disk space recovered; alarm cleared
2025-03-11 14:45 etcd-2 rejoins cluster normally; Raft write latency returns to baseline
2025-03-11 14:50 API server p99 latency drops back to ~190ms; HPA scaling resumes; alert resolves
2025-03-11 15:07 Defrag CronJob re-applied from backup manifest; compaction retention set to --auto-compaction-retention=1h on all etcd members

Impact

Customer Impact

None. All running pods continued to operate normally. External API traffic was served without interruption. The Kubernetes control plane experienced elevated latency internally, but no user-visible endpoints went down.

Internal Impact

  • HPA scaling was stalled for approximately 33 minutes. Two deployments that were under load did not scale out as expected. Engineering teams investigating those deployments lost approximately 4 person-hours to false-lead debugging before the root cause was confirmed.
  • New pod scheduling was delayed 30-60 seconds per request during the incident window. CI/CD pipelines that spawned new pods (integration test runners) saw pipeline stage timeouts, requiring manual reruns — estimated 6 pipeline-hours lost.
  • On-call SRE (Tomás Reyes) spent 45 minutes on the incident. Platform Engineering lead (Priya Subramanian) spent 90 minutes on postmortem and remediation planning.

Data Impact

No data was lost. etcd WAL and snapshot data were fully intact on all three members. Compaction removed only obsolete MVCC revisions as designed.

Root Cause

What Happened (Technical)

The production etcd cluster was deployed without --auto-compaction-retention configured, so etcd retained every MVCC revision written since cluster initialization. In a production cluster with active workloads, revisions accumulate rapidly: every ConfigMap update, Secret rotation, EndpointSlice change, and pod status transition appends a new revision. Without compaction, the bbolt backend database grows monotonically (the WAL is bounded separately by snapshotting, but the keyspace history is not).
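The unbounded-history behavior is easy to demonstrate with etcdctl against a disposable test cluster. A sketch (key names are illustrative, and jq is assumed to be available):

```shell
# Do NOT run against production; use a throwaway etcd.
etcdctl put /demo/config v1
etcdctl put /demo/config v2

# The cluster-wide revision counter increments on every write to any key.
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# Older revisions stay readable (and stay on disk) until a compaction runs.
etcdctl get /demo/config --rev=$((REV - 1))   # on a quiet cluster this is the v1 revision

# Compacting to the current revision discards the history below it;
# the same historical read now fails with "required revision has been compacted".
etcdctl compact "$REV"
etcdctl get /demo/config --rev=$((REV - 1)) || true
```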

To compensate, the Platform Engineering team had deployed a CronJob in kube-system that ran etcdctl defrag and etcdctl compact nightly. This CronJob was not labeled or annotated to indicate it was load-bearing infrastructure. During a cluster hygiene ticket (#4401, "Remove unused CronJobs from kube-system"), engineer Dmitri Volkov identified the CronJob as having no recent invocations visible in the job history (the history limit had been set to 0, so completed jobs were not retained) and deleted it assuming it was stale.

Over the following 18 days, etcd accumulated approximately 5.9 GiB of additional revision history. When etcd-2's disk reached capacity, etcd set an internal NOSPACE alarm and refused further writes from that member. In a Raft-based distributed log, the leader must wait for a majority of members to acknowledge each write before committing. With etcd-2 unable to accept writes, the two healthy members (etcd-0 and etcd-1) continued to form quorum and commit writes — but the consensus round-trip now included timeouts waiting for etcd-2's failed acknowledgements before proceeding, inflating per-operation latency by approximately 3x.

The Kubernetes API server, which issues etcd writes for nearly every API operation, passed this latency through to callers. Watches and list operations, which normally complete in under 100ms, began taking 400-900ms. HPA, which depends on timely API server responses to issue scale events, effectively stalled. The scheduler, similarly dependent on etcd writes to create binding objects, exhibited 30+ second scheduling delays.

Mitigation was straightforward: compact the revision history to reclaim logical space, defragment to reclaim physical disk space, and clear the etcd NOSPACE alarm. Recovery was fast once the root cause was identified. Re-adding auto-compaction configuration prevents recurrence.
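The recovery sequence above can be sketched as a command sequence (ETCDCTL_API=3 plus whatever endpoint and TLS flags your cluster requires are assumed to be set):

```shell
# 1. Compact to the current revision: marks obsolete MVCC revisions reclaimable.
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$REV"

# 2. Defragment to return the reclaimed pages to the filesystem.
#    Defrag blocks the member while it runs, so do one endpoint at a time.
etcdctl defrag --endpoints=etcd-2:2379

# 3. Clear the NOSPACE alarm so the member accepts writes again,
#    then confirm no alarms remain.
etcdctl alarm disarm
etcdctl alarm list
```

The ordering matters: defragmenting before compacting reclaims nothing logically, because the obsolete revisions are still live in the keyspace until compaction marks them reclaimable.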

Contributing Factors

  1. No etcd-specific disk alerting: Node-level disk usage alerts were configured at 80% and 90% of total disk, but etcd's data directory was on a dedicated volume that was not separately monitored. The 80% threshold alert had fired 3 days earlier but was routed to a storage-team queue and treated as a capacity planning notice, not an operational emergency.
  2. CronJob lacked infrastructure classification: The defrag CronJob had no labels, annotations, or documentation connecting it to etcd operations. It was indistinguishable from disposable jobs in automated scans. No change review policy required verification that a CronJob was truly unused before deletion.
  3. Auto-compaction was never configured at cluster bootstrap: The cluster's etcd configuration predated the team's current runbook, which now mandates --auto-compaction-retention. The gap was never caught in a configuration audit because there was no automated check for required etcd flags.
  4. successfulJobsHistoryLimit: 0 hid the CronJob's activity: The CronJob was configured to retain no completed job records. An engineer inspecting the namespace saw no recent jobs associated with it and incorrectly concluded it was inactive.
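Contributing factors 2 and 4 can both be addressed in the CronJob spec itself. A minimal sketch, assuming the label key proposed in the action items; the job name, image, and command are placeholders, not the team's actual manifest:

```shell
# Hypothetical manifest; the real job would run the compact + defrag sequence.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-maintenance
  namespace: kube-system
  labels:
    infra-critical: "true"        # signals "load-bearing, do not delete casually"
spec:
  schedule: "0 3 * * *"
  successfulJobsHistoryLimit: 3   # keep evidence that the job actually runs
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etcd-defrag
              image: registry.example.com/etcd-maintenance:latest  # placeholder
              command: ["/bin/sh", "-c", "echo 'compact + defrag here'"]
EOF
```

With a nonzero history limit, an engineer inspecting the namespace sees recent completed Jobs and cannot mistake the CronJob for a stale one.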

What We Got Lucky About

  1. The cluster maintained quorum throughout. etcd-0 and etcd-1 remained healthy, so the control plane continued to function in a degraded (high-latency) state rather than becoming fully unavailable. Had a second member experienced any issue — disk, network, or process — during the same window, the entire control plane would have lost quorum and become read-only.
  2. The incident occurred mid-afternoon on a Tuesday with full staffing. The on-call engineer was immediately available and experienced with etcd. A weekend occurrence with a less experienced responder could have extended the incident significantly.

Detection

How We Detected

A Prometheus alert on Kubernetes API server request duration fired after p99 latency exceeded 800ms for 3 consecutive minutes. The alert was configured by the Platform Engineering team during a previous incident (PM-017) and was the primary detection mechanism.

Why We Didn't Detect Sooner

The etcd data volume's disk utilization crossed 80% three days before the incident, but that alert routed to a storage team queue as a capacity planning item and was not treated as time-sensitive. There were no etcd-specific alerts on WAL entry count, DB size relative to etcd's configured quota, or NOSPACE alarm state. The NOSPACE alarm itself is exposed via etcdctl endpoint status and the etcd metrics endpoint (etcd_server_quota_backend_bytes vs etcd_mvcc_db_total_size_in_bytes) but no alert was configured on the ratio.
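The missing quota alert can be written directly against the two metrics named above. A sketch of a Prometheus rule file — the 0.70 threshold mirrors action item PM-021-01, while the `for` duration and severity label are assumptions:

```shell
# Hypothetical rule file; wire it into your Prometheus rule_files config.
cat <<'EOF' > etcd-quota-alert.yaml
groups:
  - name: etcd-quota
    rules:
      - alert: EtcdDbNearQuota
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.70
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "etcd DB on {{ $labels.instance }} is above 70% of its backend quota"
          runbook: "runbooks/kubernetes/etcd-disk-full.md"
EOF
```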

Response

What Went Well

  1. The API server latency alert fired promptly and gave the on-call engineer a clear starting point.
  2. Tomás correctly pivoted from API server logs to etcd within 3 minutes, using etcdctl endpoint status as a first diagnostic step — a habit established during team runbook training.
  3. The remediation steps (compact, defrag, clear alarm) were executed in the correct sequence without error, avoiding the risk of defragmenting before compaction (which would have had no effect on logical space).
  4. The deleted CronJob manifest was recoverable from git history within 2 minutes of identifying the gap.

What Went Poorly

  1. The storage alert that fired 3 days earlier was not escalated or cross-referenced against etcd's specific data volume. A human review of that alert at the time could have prevented the incident.
  2. There was no change review step that would have caught the CronJob deletion before it was merged. PR #4401 was approved and merged in under 10 minutes with no infrastructure-impact assessment.
  3. Auto-compaction was not set at cluster bootstrap, and no configuration audit existed to catch missing required etcd flags. The gap persisted for the entire lifetime of the cluster.

Action Items

PM-021-01 (P1, Tomás Reyes, In Progress, due 2025-03-18): Add Prometheus alerts on etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.70 and on etcd_server_is_leader changes
PM-021-02 (P1, Priya Subramanian, In Progress, due 2025-03-18): Enable --auto-compaction-retention=1h on all etcd clusters (prod, staging, dev) via config management; add to bootstrap runbook
PM-021-03 (P2, Dmitri Volkov, Open, due 2025-03-25): Add label infra-critical: "true" to all load-bearing CronJobs; update hygiene ticket template to require validation that any deleted CronJob carries no such label
PM-021-04 (P2, Dmitri Volkov, Open, due 2025-03-25): Set successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 3 on all infrastructure CronJobs so job history is visible
PM-021-05 (P3, Platform Engineering, Open, due 2025-04-08): Write OPA/Kyverno policy to block deletion of CronJobs labeled infra-critical: "true" without team-lead approval annotation
PM-021-06 (P2, Priya Subramanian, Open, due 2025-03-31): Add etcd configuration audit to quarterly cluster health checklist: verify compaction, defrag schedule, quota settings, and member disk headroom
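For the configuration audit (PM-021-06), a simple check might grep each member's static-pod manifest for the required flags. A sketch, using a sample file in place of the real manifest (the /etc/kubernetes/manifests/etcd.yaml path is a kubeadm-style assumption; adjust for your deployment):

```shell
# Sample manifest fragment standing in for /etc/kubernetes/manifests/etcd.yaml.
cat > /tmp/etcd-sample.yaml <<'EOF'
    - --auto-compaction-retention=1h
    - --quota-backend-bytes=8589934592
EOF

# Report presence or absence of each required etcd flag in a manifest.
check_flags() {
  manifest=$1
  for flag in auto-compaction-retention quota-backend-bytes; do
    if grep -q -- "--$flag" "$manifest"; then
      echo "OK: --$flag present"
    else
      echo "MISSING: --$flag"
    fi
  done
}

check_flags /tmp/etcd-sample.yaml
```

Run against each member's real manifest, any MISSING line fails the quarterly checklist.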

Lessons Learned

  1. etcd requires its own observability layer, separate from node monitoring. etcd has domain-specific failure modes (NOSPACE alarm, leader flapping, slow follower) that generic node metrics cannot capture. etcd's own Prometheus metrics endpoint must be scraped and alerted on independently.
  2. "Cosmetic" infrastructure changes need impact assessment. Deleting a CronJob, renaming a resource, or changing a label may appear cosmetic but can silently remove a load-bearing operational dependency. Change templates should require a stated answer to "what breaks if this is gone?"
  3. A degraded quorum is not a safe quorum. Two-of-three etcd quorum with one member in NOSPACE is fragile: any additional failure tips the cluster into full unavailability. Degraded states in distributed consensus systems should be treated as emergencies even when the system appears functional.

Cross-References

  • Failure Pattern: Configuration drift — required operational settings (auto-compaction) absent from cluster bootstrap and undetected by audit
  • Topic Packs: etcd operations, Kubernetes control plane internals, Raft consensus, distributed storage quotas
  • Runbook: runbooks/kubernetes/etcd-disk-full.md
  • Decision Tree: Kubernetes API server high latency → etcd health check → member status → disk/quota → compact/defrag sequence