Portal | Level: L3: Advanced | Topics: etcd | Domain: Kubernetes

Scenario: etcd Database Space Exceeded

The Prompt

"The Kubernetes API server is returning errors. Nobody can create or update any resources. The error message says 'etcdserver: mvcc: database space exceeded.' What's happening and how do you fix it?"

Initial Report

Alert: "kube-apiserver returning 500 errors. All kubectl commands fail with 'etcdserver: mvcc: database space exceeded.' Existing pods are still running but nothing can be changed."

Constraints

  • Time pressure: No changes can be made to the cluster. Deployments, scaling, and ConfigMap updates are all blocked.
  • Limited access: You have SSH to control plane nodes but kubectl may be unreliable.

Observable Evidence

  • kubectl get pods returns: etcdserver: mvcc: database space exceeded
  • Existing pods continue running (etcd read-only, not down)
  • etcd DB size > 2GB (default quota)
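A quick on-disk check corroborates the etcdctl view. The path below is the kubeadm default data directory and is an assumption; adjust it if etcd runs with a different --data-dir.

```shell
# Size of the etcd data directory on disk (kubeadm default: /var/lib/etcd).
# Override ETCD_DATA_DIR if --data-dir points elsewhere.
ETCD_DATA_DIR=${ETCD_DATA_DIR:-/var/lib/etcd}
du -sh "$ETCD_DATA_DIR" 2>/dev/null || echo "data dir not found: $ETCD_DATA_DIR"
```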

Expected Investigation Path

# 1. Check etcd status (from control plane node)
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 2. Check DB size in the DB SIZE column of the step 1 output
#    (look for a size at or near the 2GB default quota)

# 3. Compact old revisions (compaction works even while the NOSPACE alarm is active)
LATEST_REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key | jq '.[0].Status.header.revision')

ETCDCTL_API=3 etcdctl compact "$LATEST_REV" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 4. Defragment to return the freed space to the filesystem
ETCDCTL_API=3 etcdctl defrag \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 5. Disarm the NOSPACE alarm to allow writes again (only after compacting and
#    defragmenting, otherwise the alarm will immediately re-trigger)
ETCDCTL_API=3 etcdctl alarm disarm \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 6. Verify DB size reduced (same cert flags as step 1)
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...

# 7. Verify kubectl works again
kubectl get nodes
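After recovery, the "alert at 80% of quota" advice can be scripted from the same endpoint status JSON used above. This is a sketch: the DB_SIZE default below is an illustrative sample value, and on a real cluster the commented etcdctl line (with the usual cert flags) supplies the live number.

```shell
# Percentage of the backend quota currently used (default quota: 2GiB).
# On a control plane node, fetch the live value with (cert flags as in step 1):
#   DB_SIZE=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.dbSize')
QUOTA=${QUOTA:-2147483648}        # 2GiB, the default --quota-backend-bytes
DB_SIZE=${DB_SIZE:-1717986918}    # illustrative sample value (~1.6GiB)
awk -v s="$DB_SIZE" -v q="$QUOTA" 'BEGIN {printf "etcd DB at %.0f%% of quota\n", s*100/q}'
# → etcd DB at 80% of quota (with the sample value above)
```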

Root Cause Possibilities

  1. Too many events — Events accumulate faster than they expire (the API server's --event-ttl defaults to 1h but can be misconfigured)
  2. No auto-compaction — etcd's --auto-compaction-retention flag is not set
  3. Leak of ConfigMaps/Secrets — Operators creating resources without cleanup
  4. Large objects — ConfigMaps with large data (embedded files)
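To decide which of these root causes applies, counting live keys by /registry/ prefix shows what dominates the keyspace. The sample keys below are illustrative; on a real cluster the commented etcdctl command (with the usual cert flags) produces the key list.

```shell
# Keys follow /registry/<type>[/<namespace>]/<name>, so field 3 is the resource type.
# On a control plane node, generate the real list with:
#   ETCDCTL_API=3 etcdctl get / --prefix --keys-only --cacert=... --cert=... --key=...
count_by_type() {
  awk -F/ 'NF >= 3 {count[$3]++} END {for (k in count) print count[k], k}' | sort -rn
}

# Illustrative sample input (pipe etcdctl output instead on a real cluster):
printf '%s\n' \
  /registry/events/default/evt-1 \
  /registry/events/default/evt-2 \
  /registry/pods/default/web-1 | count_by_type
# → 2 events
#   1 pods
```

A cluster drowning in Events or leaked ConfigMaps shows up immediately at the top of this list.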

What a Strong Answer Includes

  • Understanding that etcd went read-only (not down) - existing workloads keep running
  • The compact -> defrag -> alarm disarm sequence
  • Root cause investigation: what filled up etcd?
  • Prevention: enable auto-compaction, monitor DB size, alert at 80% quota
  • Mention: increase quota if justified (--quota-backend-bytes)
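The prevention flags above go into etcd's static pod manifest on kubeadm clusters. The path, retention window, and quota value below are illustrative examples, not a prescription.

```shell
# Illustrative flag additions to /etc/kubernetes/manifests/etcd.yaml (kubeadm layout);
# the kubelet restarts the etcd pod automatically when the manifest changes:
#   - --auto-compaction-mode=periodic
#   - --auto-compaction-retention=1h      # keep ~1h of revision history
#   - --quota-backend-bytes=8589934592    # raise quota to 8GB only if justified
```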
