Portal | Level: L3: Advanced | Topics: etcd | Domain: Kubernetes
Scenario: etcd Database Space Exceeded
The Prompt
"The Kubernetes API server is returning errors. Nobody can create or update any resources. The error message says 'etcdserver: mvcc: database space exceeded.' What's happening and how do you fix it?"
Initial Report
Alert: "kube-apiserver returning 500 errors. All kubectl commands fail with 'etcdserver: mvcc: database space exceeded.' Existing pods are still running but nothing can be changed."
Constraints
- Time pressure: No changes can be made to the cluster. Deployments, scaling, and ConfigMap updates are all blocked.
- Limited access: You have SSH to control plane nodes but kubectl may be unreliable.
Observable Evidence
- kubectl get pods returns: etcdserver: mvcc: database space exceeded
- Existing pods continue running (etcd read-only, not down)
- etcd DB size > 2GB (default quota)
Expected Investigation Path
# 1. Check etcd status (from control plane node)
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 2. Check DB size (DB SIZE column in the step 1 output; look for size at or above the 2GB quota)
# 3. Compact old revisions (fetch the current revision, then compact up to it)
LATEST_REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact "$LATEST_REV" \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 4. Defragment to release the compacted space (repeat on each etcd member; defrag only affects the endpoint it targets)
ETCDCTL_API=3 etcdctl defrag \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 5. Disarm the NOSPACE alarm to allow writes (only after space is reclaimed, otherwise the alarm is raised again)
ETCDCTL_API=3 etcdctl alarm disarm \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 6. Verify DB size reduced
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...
# 7. Verify kubectl works again
kubectl get nodes
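The steps above repeat the same three TLS flags on every call. A minimal sketch of a wrapper that removes the repetition follows; the function name etcdctl_cp and the variable ETCD_PKI are hypothetical, and the certificate paths are simply the ones used in this scenario, so adjust them if your control plane stores them elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wraps etcdctl with the TLS flags used in this scenario.
# ETCD_PKI is an assumed variable name; its default matches the paths above.
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

etcdctl_cp() {
  # Global flags go before the subcommand; "$@" passes the subcommand through.
  ETCDCTL_API=3 etcdctl \
    --cacert="${ETCD_PKI}/ca.crt" \
    --cert="${ETCD_PKI}/healthcheck-client.crt" \
    --key="${ETCD_PKI}/healthcheck-client.key" \
    "$@"
}

# Example calls (require a reachable etcd member, so they are left commented out):
#   etcdctl_cp endpoint status --write-out=table
#   etcdctl_cp defrag
```

With the wrapper sourced into the shell, each step reduces to a single short line, which helps when typing under incident pressure over SSH.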
Root Cause Possibilities
- Too many events: Events normally expire after the apiserver's --event-ttl (default 1h); a raised TTL or a very high event rate lets them pile up
- No auto-compaction: --auto-compaction-retention not set
- Leak of ConfigMaps/Secrets: Operators creating resources without cleanup
- Large objects: ConfigMaps with large data (embedded files)
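To narrow down which of these causes applies, one option is to count etcd keys per resource type. The sketch below assumes the default Kubernetes key layout of /registry/<resource>/...; the function name count_keys_by_resource is hypothetical. For resources stored under an API group prefix the third path segment is the group rather than the resource, so treat the result as an approximation. It counts keys, not bytes, so a few very large objects will not stand out.

```shell
#!/usr/bin/env bash
# Reads etcd key names on stdin (one per line, blank lines ignored) and
# prints "<count> <resource>" sorted by count, highest first.
count_keys_by_resource() {
  awk -F/ 'NF >= 3 && $2 == "registry" { counts[$3]++ }
           END { for (r in counts) print counts[r], r }' |
    sort -k1,1nr -k2,2
}

# Example (requires a reachable etcd member and the TLS flags from the steps above):
#   ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only <TLS flags> \
#     | count_keys_by_resource | head
```

A resource type with a count far above the rest (events, or one operator's ConfigMaps, for example) is a reasonable first suspect.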
What a Strong Answer Includes
- Understanding that etcd went read-only (not down) - existing workloads keep running
- The compact -> defrag -> alarm disarm sequence (disarming first frees no space, so the alarm is raised again)
- Root cause investigation: what filled up etcd?
- Prevention: enable auto-compaction, monitor DB size, alert at 80% quota
- Mention: increase quota if justified (--quota-backend-bytes)
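The "alert at 80% quota" point can be sketched as a small check. This is an illustration rather than a production monitor: the function names db_usage_pct and check_quota and the variable QUOTA_BYTES are hypothetical, and the 2 GiB default is an assumption that should be replaced by whatever --quota-backend-bytes is set to on your cluster.

```shell
#!/usr/bin/env bash
# Assumed quota: 2 GiB unless QUOTA_BYTES is overridden.
QUOTA_BYTES="${QUOTA_BYTES:-2147483648}"

# db_usage_pct DB_BYTES [QUOTA_BYTES] -> integer percentage, rounded down
db_usage_pct() {
  local db_bytes="$1" quota="${2:-$QUOTA_BYTES}"
  echo $(( db_bytes * 100 / quota ))
}

# check_quota DB_BYTES [THRESHOLD_PCT]
# Prints a status line; returns 1 when usage is at or above the threshold.
check_quota() {
  local db_bytes="$1" threshold="${2:-80}" pct
  pct="$(db_usage_pct "$db_bytes")"
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: etcd DB at ${pct}% of quota"
    return 1
  fi
  echo "OK: etcd DB at ${pct}% of quota"
}

# Example (requires a reachable etcd member, jq, and the TLS flags from the steps above):
#   check_quota "$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json <TLS flags> \
#     | jq '.[0].Status.dbSize')"
```

The non-zero return code makes the check usable from cron or a monitoring agent that alerts on exit status.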