
Postmortem: Single etcd Member Disk Full Degrades Control Plane

ID: PM-021
Date: 2025-03-11
Severity: SEV-3
Duration: 0h 45m (detection to resolution)
Time to Detect: 12m
Time to Mitigate: 33m
Customer Impact: None — control plane degradation was internal only; all running workloads remained healthy
Revenue Impact: None
Teams Involved: Platform Engineering, Site Reliability, Kubernetes Ops
Postmortem Author: Priya Subramanian
Postmortem Date: 2025-03-14

Executive Summary

On 2025-03-11 at 14:22 UTC, one member of a production 3-node etcd cluster (etcd-2) ran out of disk space: etcd auto-compaction had never been configured, and the nightly compact-and-defrag CronJob that had compensated for that gap was silently deleted during a routine cleanup ticket. The remaining two members maintained quorum, but write latency across the cluster roughly tripled as the healthy members waited out timeouts from the disk-full member during Raft consensus rounds. Kubernetes API server p99 latency climbed from ~200ms to 1.2s, HPA scaling decisions stalled, and new pod scheduling took 30+ seconds. The issue was resolved by compacting and defragmenting etcd-2 to reclaim disk space, clearing the alarm, and restoring the CronJob. No workloads were lost and no customer traffic was affected.

Timeline (All times UTC)

2025-03-11 14:10 etcd-2 disk reaches 95% capacity; etcd begins logging mvcc: database space exceeded warnings at WARN level
2025-03-11 14:18 etcd-2 disk hits 100%; etcd-2 transitions to alarm: NOSPACE; write requests to etcd-2 begin failing
2025-03-11 14:22 API server p99 latency alert fires: kube_apiserver_request_duration_seconds{p99} > 800ms for 3 consecutive minutes
2025-03-11 14:24 On-call SRE (Tomás Reyes) acknowledges alert; begins investigating API server logs
2025-03-11 14:27 Tomás rules out API server pod issues; shifts focus to etcd; runs etcdctl endpoint status
2025-03-11 14:29 etcdctl endpoint status reveals etcd-2 DB size is 8.0 GiB (at limit); etcd-2 flagged as NOSPACE alarm active
2025-03-11 14:31 Tomás confirms etcd compaction is not configured (--auto-compaction-retention not set); cluster has never been compacted
2025-03-11 14:33 Tomás searches for defrag CronJob; finds it absent from the kube-system namespace
2025-03-11 14:35 Git log review finds CronJob deleted 18 days earlier in PR #4401 ("cleanup unused CronJobs") by engineer Dmitri Volkov
2025-03-11 14:37 Tomás manually triggers etcd compaction: etcdctl compact $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
2025-03-11 14:39 Compaction completes; DB size drops from 8.0 GiB to 2.1 GiB on etcd-2
2025-03-11 14:41 Tomás runs etcdctl defrag --endpoints=etcd-2:2379; physical disk space recovered; alarm cleared
2025-03-11 14:45 etcd-2 rejoins cluster normally; Raft write latency returns to baseline
2025-03-11 14:50 API server p99 latency drops back to ~190ms; HPA scaling resumes; alert resolves
2025-03-11 15:07 Defrag CronJob re-applied from backup manifest; compaction retention set to --auto-compaction-retention=1h on all etcd members

Impact

Customer Impact

None. All running pods continued to operate normally. External API traffic was served without interruption. The Kubernetes control plane experienced elevated latency internally, but no user-visible endpoints went down.

Internal Impact

  • HPA scaling was stalled for approximately 33 minutes. Two deployments that were under load did not scale out as expected. Engineering teams investigating those deployments lost approximately 4 person-hours to false-lead debugging before the root cause was confirmed.
  • New pod scheduling was delayed 30-60 seconds per request during the incident window. CI/CD pipelines that spawned new pods (integration test runners) saw pipeline stage timeouts, requiring manual reruns — estimated 6 pipeline-hours lost.
  • On-call SRE (Tomás Reyes) spent 45 minutes on the incident. Platform Engineering lead (Priya Subramanian) spent 90 minutes on postmortem and remediation planning.

Data Impact

No data was lost. etcd WAL and snapshot data were fully intact on all three members. Compaction removed only obsolete MVCC revisions as designed.

Root Cause

What Happened (Technical)

The production etcd cluster was deployed without --auto-compaction-retention configured, so etcd retained every MVCC revision written since cluster initialization. In a production cluster with active workloads, revisions accumulate rapidly: every ConfigMap update, Secret rotation, EndpointSlice change, and pod status transition appends a new revision. Without compaction, the bbolt backend database grows monotonically (the WAL is bounded separately by snapshotting, but the keyspace history is not).
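The unbounded-history behavior is easy to demonstrate with etcdctl against a disposable test cluster. A sketch (key names are illustrative, and jq is assumed to be available):

```shell
# Do NOT run against production; use a throwaway etcd.
etcdctl put /demo/config v1
etcdctl put /demo/config v2

# The cluster-wide revision counter increments on every write to any key.
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# Older revisions stay readable (and stay on disk) until a compaction runs.
etcdctl get /demo/config --rev=$((REV - 1))   # on a quiet cluster this is the v1 revision

# Compacting to the current revision discards the history below it;
# the same historical read now fails with "required revision has been compacted".
etcdctl compact "$REV"
etcdctl get /demo/config --rev=$((REV - 1)) || true
```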

To compensate, the Platform Engineering team had deployed a CronJob in kube-system that ran etcdctl defrag and etcdctl compact nightly. This CronJob was not labeled or annotated to indicate it was load-bearing infrastructure. During a cluster hygiene ticket (#4401, "Remove unused CronJobs from kube-system"), engineer Dmitri Volkov identified the CronJob as having no recent invocations visible in the job history (the history limit had been set to 0, so completed jobs were not retained) and deleted it assuming it was stale.

Over the following 18 days, etcd accumulated approximately 5.9 GiB of additional revision history. When etcd-2's disk reached capacity, etcd set an internal NOSPACE alarm and refused further writes from that member. In a Raft-based distributed log, the leader must wait for a majority of members to acknowledge each write before committing. With etcd-2 unable to accept writes, the two healthy members (etcd-0 and etcd-1) continued to form quorum and commit writes — but the consensus round-trip now included timeouts waiting for etcd-2's failed acknowledgements before proceeding, inflating per-operation latency by approximately 3x.

The Kubernetes API server, which issues etcd writes for nearly every API operation, passed this latency through to callers. Watches and list operations, which normally complete in under 100ms, began taking 400-900ms. HPA, which depends on timely API server responses to issue scale events, effectively stalled. The scheduler, similarly dependent on etcd writes to create binding objects, exhibited 30+ second scheduling delays.

Mitigation was straightforward: compact the revision history to reclaim logical space, defragment to reclaim physical disk space, and clear the etcd NOSPACE alarm. Recovery was fast once the root cause was identified. Re-adding auto-compaction configuration prevents recurrence.
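The recovery sequence above can be sketched as a command sequence (ETCDCTL_API=3 plus whatever endpoint and TLS flags your cluster requires are assumed to be set):

```shell
# 1. Compact to the current revision: marks obsolete MVCC revisions reclaimable.
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$REV"

# 2. Defragment to return the reclaimed pages to the filesystem.
#    Defrag blocks the member while it runs, so do one endpoint at a time.
etcdctl defrag --endpoints=etcd-2:2379

# 3. Clear the NOSPACE alarm so the member accepts writes again,
#    then confirm no alarms remain.
etcdctl alarm disarm
etcdctl alarm list
```

The ordering matters: defragmenting before compacting reclaims nothing logically, because the obsolete revisions are still live in the keyspace until compaction marks them reclaimable.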

Contributing Factors

  1. No etcd-specific disk alerting: Node-level disk usage alerts were configured at 80% and 90% of total disk, but etcd's data directory was on a dedicated volume that was not separately monitored. The 80% threshold alert had fired 3 days earlier but was routed to a storage-team queue and treated as a capacity planning notice, not an operational emergency.
  2. CronJob lacked infrastructure classification: The defrag CronJob had no labels, annotations, or documentation connecting it to etcd operations. It was indistinguishable from disposable jobs in automated scans. No change review policy required verification that a CronJob was truly unused before deletion.
  3. Auto-compaction was never configured at cluster bootstrap: The cluster's etcd configuration predated the team's current runbook, which now mandates --auto-compaction-retention. The gap was never caught in a configuration audit because there was no automated check for required etcd flags.
  4. successfulJobsHistoryLimit: 0 hid the CronJob's activity: The CronJob was configured to retain no completed job records. An engineer inspecting the namespace saw no recent jobs associated with it and incorrectly concluded it was inactive.
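Contributing factors 2 and 4 can both be addressed in the CronJob spec itself. A minimal sketch, assuming the label key proposed in the action items; the job name, image, and command are placeholders, not the team's actual manifest:

```shell
# Hypothetical manifest; the real job would run the compact + defrag sequence.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-maintenance
  namespace: kube-system
  labels:
    infra-critical: "true"        # signals "load-bearing, do not delete casually"
spec:
  schedule: "0 3 * * *"
  successfulJobsHistoryLimit: 3   # keep evidence that the job actually runs
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etcd-defrag
              image: registry.example.com/etcd-maintenance:latest  # placeholder
              command: ["/bin/sh", "-c", "echo 'compact + defrag here'"]
EOF
```

With a nonzero history limit, an engineer inspecting the namespace sees recent completed Jobs and cannot mistake the CronJob for a stale one.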

What We Got Lucky About

  1. The cluster maintained quorum throughout. etcd-0 and etcd-1 remained healthy, so the control plane continued to function in a degraded (high-latency) state rather than becoming fully unavailable. Had a second member experienced any issue — disk, network, or process — during the same window, the entire control plane would have lost quorum and become read-only.
  2. The incident occurred mid-afternoon on a Tuesday with full staffing. The on-call engineer was immediately available and experienced with etcd. A weekend occurrence with a less experienced responder could have extended the incident significantly.

Detection

How We Detected

A Prometheus alert on Kubernetes API server request duration fired after p99 latency exceeded 800ms for 3 consecutive minutes. The alert was configured by the Platform Engineering team during a previous incident (PM-017) and was the primary detection mechanism.

Why We Didn't Detect Sooner

The etcd data volume's disk utilization crossed 80% three days before the incident, but that alert routed to a storage team queue as a capacity planning item and was not treated as time-sensitive. There were no etcd-specific alerts on WAL entry count, DB size relative to etcd's configured quota, or NOSPACE alarm state. The NOSPACE alarm itself is exposed via etcdctl endpoint status and the etcd metrics endpoint (etcd_server_quota_backend_bytes vs etcd_mvcc_db_total_size_in_bytes) but no alert was configured on the ratio.
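The missing quota alert can be written directly against the two metrics named above. A sketch of a Prometheus rule file — the 0.70 threshold mirrors action item PM-021-01, while the `for` duration and severity label are assumptions:

```shell
# Hypothetical rule file; wire it into your Prometheus rule_files config.
cat <<'EOF' > etcd-quota-alert.yaml
groups:
  - name: etcd-quota
    rules:
      - alert: EtcdDbNearQuota
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.70
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "etcd DB on {{ $labels.instance }} is above 70% of its backend quota"
          runbook: "runbooks/kubernetes/etcd-disk-full.md"
EOF
```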

Response

What Went Well

  1. The API server latency alert fired promptly and gave the on-call engineer a clear starting point.
  2. Tomás correctly pivoted from API server logs to etcd within 3 minutes, using etcdctl endpoint status as a first diagnostic step — a habit established during team runbook training.
  3. The remediation steps (compact, defrag, clear alarm) were executed in the correct sequence without error, avoiding the risk of defragmenting before compaction (which would have had no effect on logical space).
  4. The deleted CronJob manifest was recoverable from git history within 2 minutes of identifying the gap.

What Went Poorly

  1. The storage alert that fired 3 days earlier was not escalated or cross-referenced against etcd's specific data volume. A human review of that alert at the time could have prevented the incident.
  2. There was no change review step that would have caught the CronJob deletion before it was merged. PR #4401 was approved and merged in under 10 minutes with no infrastructure-impact assessment.
  3. Auto-compaction was not set at cluster bootstrap, and no configuration audit existed to catch missing required etcd flags. The gap persisted for the entire lifetime of the cluster.

Action Items

PM-021-01 (P1, Tomás Reyes, In Progress, due 2025-03-18): Add Prometheus alerts on etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.70 and on etcd_server_is_leader changes
PM-021-02 (P1, Priya Subramanian, In Progress, due 2025-03-18): Enable --auto-compaction-retention=1h on all etcd clusters (prod, staging, dev) via config management; add to bootstrap runbook
PM-021-03 (P2, Dmitri Volkov, Open, due 2025-03-25): Add label infra-critical: "true" to all load-bearing CronJobs; update hygiene ticket template to require validation that any deleted CronJob carries no such label
PM-021-04 (P2, Dmitri Volkov, Open, due 2025-03-25): Set successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 3 on all infrastructure CronJobs so job history is visible
PM-021-05 (P3, Platform Engineering, Open, due 2025-04-08): Write OPA/Kyverno policy to block deletion of CronJobs labeled infra-critical: "true" without team-lead approval annotation
PM-021-06 (P2, Priya Subramanian, Open, due 2025-03-31): Add etcd configuration audit to quarterly cluster health checklist: verify compaction, defrag schedule, quota settings, and member disk headroom
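For the configuration audit (PM-021-06), a simple check might grep each member's static-pod manifest for the required flags. A sketch, using a sample file in place of the real manifest (the /etc/kubernetes/manifests/etcd.yaml path is a kubeadm-style assumption; adjust for your deployment):

```shell
# Sample manifest fragment standing in for /etc/kubernetes/manifests/etcd.yaml.
cat > /tmp/etcd-sample.yaml <<'EOF'
    - --auto-compaction-retention=1h
    - --quota-backend-bytes=8589934592
EOF

# Report presence or absence of each required etcd flag in a manifest.
check_flags() {
  manifest=$1
  for flag in auto-compaction-retention quota-backend-bytes; do
    if grep -q -- "--$flag" "$manifest"; then
      echo "OK: --$flag present"
    else
      echo "MISSING: --$flag"
    fi
  done
}

check_flags /tmp/etcd-sample.yaml
```

Run against each member's real manifest, any MISSING line fails the quarterly checklist.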

Lessons Learned

  1. etcd requires its own observability layer, separate from node monitoring. etcd has domain-specific failure modes (NOSPACE alarm, leader flapping, slow follower) that generic node metrics cannot capture. etcd's own Prometheus metrics endpoint must be scraped and alerted on independently.
  2. "Cosmetic" infrastructure changes need impact assessment. Deleting a CronJob, renaming a resource, or changing a label may appear cosmetic but can silently remove a load-bearing operational dependency. Change templates should require a stated answer to "what breaks if this is gone?"
  3. A degraded quorum is not a safe quorum. Two-of-three etcd quorum with one member in NOSPACE is fragile: any additional failure tips the cluster into full unavailability. Degraded states in distributed consensus systems should be treated as emergencies even when the system appears functional.

Cross-References

  • Failure Pattern: Configuration drift — required operational settings (auto-compaction) absent from cluster bootstrap and undetected by audit
  • Topic Packs: etcd operations, Kubernetes control plane internals, Raft consensus, distributed storage quotas
  • Runbook: runbooks/kubernetes/etcd-disk-full.md
  • Decision Tree: Kubernetes API server high latency → etcd health check → member status → disk/quota → compact/defrag sequence