Portal | Level: L3: Advanced | Topics: etcd | Domain: Kubernetes

Scenario: etcd Database Space Exceeded

The Prompt

"The Kubernetes API server is returning errors. Nobody can create or update any resources. The error message says 'etcdserver: mvcc: database space exceeded.' What's happening and how do you fix it?"

Initial Report

Alert: "kube-apiserver returning 500 errors. All kubectl commands fail with 'etcdserver: mvcc: database space exceeded.' Existing pods are still running but nothing can be changed."

Constraints

  • Time pressure: No changes can be made to the cluster. Deployments, scaling, and ConfigMap updates are all blocked.
  • Limited access: You have SSH to control plane nodes but kubectl may be unreliable.

Observable Evidence

  • kubectl get pods returns: etcdserver: mvcc: database space exceeded
  • Existing pods continue running (etcd read-only, not down)
  • etcd DB size > 2GB (default quota)
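A quick on-disk check corroborates the etcdctl view. The path below is the kubeadm default data directory and is an assumption; adjust it if etcd runs with a different --data-dir.

```shell
# Size of the etcd data directory on disk (kubeadm default: /var/lib/etcd).
# Override ETCD_DATA_DIR if --data-dir points elsewhere.
ETCD_DATA_DIR=${ETCD_DATA_DIR:-/var/lib/etcd}
du -sh "$ETCD_DATA_DIR" 2>/dev/null || echo "data dir not found: $ETCD_DATA_DIR"
```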

Expected Investigation Path

# 1. Check etcd status (from control plane node)
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 2. Check DB size in the DB SIZE column of the step 1 output
#    (look for a size at or near the 2GB default quota)

# 3. Compact old revisions (compaction works even while the NOSPACE alarm is active)
LATEST_REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key | jq '.[0].Status.header.revision')

ETCDCTL_API=3 etcdctl compact "$LATEST_REV" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 4. Defragment to return the freed space to the filesystem
ETCDCTL_API=3 etcdctl defrag \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 5. Disarm the NOSPACE alarm to allow writes again (only after compacting and
#    defragmenting, otherwise the alarm will immediately re-trigger)
ETCDCTL_API=3 etcdctl alarm disarm \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 6. Verify DB size reduced (same cert flags as step 1)
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...

# 7. Verify kubectl works again
kubectl get nodes
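After recovery, the "alert at 80% of quota" advice can be scripted from the same endpoint status JSON used above. This is a sketch: the DB_SIZE default below is an illustrative sample value, and on a real cluster the commented etcdctl line (with the usual cert flags) supplies the live number.

```shell
# Percentage of the backend quota currently used (default quota: 2GiB).
# On a control plane node, fetch the live value with (cert flags as in step 1):
#   DB_SIZE=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.dbSize')
QUOTA=${QUOTA:-2147483648}        # 2GiB, the default --quota-backend-bytes
DB_SIZE=${DB_SIZE:-1717986918}    # illustrative sample value (~1.6GiB)
awk -v s="$DB_SIZE" -v q="$QUOTA" 'BEGIN {printf "etcd DB at %.0f%% of quota\n", s*100/q}'
# → etcd DB at 80% of quota (with the sample value above)
```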

Root Cause Possibilities

  1. Too many events — Events accumulate faster than they expire (the API server's --event-ttl defaults to 1h but can be misconfigured)
  2. No auto-compaction — etcd's --auto-compaction-retention flag is not set
  3. Leak of ConfigMaps/Secrets — Operators creating resources without cleanup
  4. Large objects — ConfigMaps with large data (embedded files)
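To decide which of these root causes applies, counting live keys by /registry/ prefix shows what dominates the keyspace. The sample keys below are illustrative; on a real cluster the commented etcdctl command (with the usual cert flags) produces the key list.

```shell
# Keys follow /registry/<type>[/<namespace>]/<name>, so field 3 is the resource type.
# On a control plane node, generate the real list with:
#   ETCDCTL_API=3 etcdctl get / --prefix --keys-only --cacert=... --cert=... --key=...
count_by_type() {
  awk -F/ 'NF >= 3 {count[$3]++} END {for (k in count) print count[k], k}' | sort -rn
}

# Illustrative sample input (pipe etcdctl output instead on a real cluster):
printf '%s\n' \
  /registry/events/default/evt-1 \
  /registry/events/default/evt-2 \
  /registry/pods/default/web-1 | count_by_type
# → 2 events
#   1 pods
```

A cluster drowning in Events or leaked ConfigMaps shows up immediately at the top of this list.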

What a Strong Answer Includes

  • Understanding that etcd went read-only (not down) - existing workloads keep running
  • The compact -> defrag -> alarm disarm sequence
  • Root cause investigation: what filled up etcd?
  • Prevention: enable auto-compaction, monitor DB size, alert at 80% quota
  • Mention: increase quota if justified (--quota-backend-bytes)
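The prevention flags above go into etcd's static pod manifest on kubeadm clusters. The path, retention window, and quota value below are illustrative examples, not a prescription.

```shell
# Illustrative flag additions to /etc/kubernetes/manifests/etcd.yaml (kubeadm layout);
# the kubelet restarts the etcd pod automatically when the manifest changes:
#   - --auto-compaction-mode=periodic
#   - --auto-compaction-retention=1h      # keep ~1h of revision history
#   - --quota-backend-bytes=8589934592    # raise quota to 8GB only if justified
```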
