Portal | Level: L3: Advanced | Topics: etcd | Domain: Kubernetes
etcd Troubleshooting Scenarios¶
Overview¶
etcd is the brain of Kubernetes. Every resource, every secret, every config lives in etcd. When etcd is unhealthy, the entire cluster is at risk. These scenarios cover the most common etcd failures and how to diagnose and fix them.
Prerequisites¶
# etcdctl setup (adjust endpoints for your cluster)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key
# For k3s:
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/client.key
Scenario 1: etcd Health Check¶
Task: Verify etcd cluster health and membership.
# Check endpoint health
etcdctl endpoint health --write-out=table
# Check endpoint status (leader, DB size, raft index)
etcdctl endpoint status --write-out=table
# List members
etcdctl member list --write-out=table
Expected output:
+------------------+---------+--------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+------------------+---------+--------+---------------------------+
| https://...:2379 | true | 2.3ms | |
+------------------+---------+--------+---------------------------+
Red flags:
- Health = false
- TOOK > 100ms (slow disk I/O)
- Different raft indices across members (replication lag)
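The replication-lag check can be scripted: the sketch below compares raft indices across all members. The `max_spread` / `check_replication_lag` helper names and the 10-index warning threshold are illustrative assumptions, not etcd conventions; it assumes the etcdctl env vars from the Prerequisites section are exported.

```shell
# Pure helper: print (max - min) of the integers passed as arguments.
max_spread() {
  local min max n
  min=$1; max=$1
  for n in "$@"; do
    [ "$n" -lt "$min" ] && min=$n
    [ "$n" -gt "$max" ] && max=$n
  done
  echo $((max - min))
}

# Gather raft indices from every member and warn if they diverge.
check_replication_lag() {
  local indices spread
  indices=$(etcdctl endpoint status --cluster -w json | jq '.[].Status.raftIndex')
  spread=$(max_spread $indices)   # word-splitting is intentional here
  if [ "$spread" -gt 10 ]; then
    echo "WARN: raft index spread is $spread (possible replication lag)"
  else
    echo "OK: raft index spread is $spread"
  fi
}
```

A small, transient spread is normal under write load; a spread that grows over repeated samples is the real signal.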
Scenario 2: etcd Database Too Large¶
Symptoms: API server returns "etcdserver: mvcc: database space exceeded"
Diagnosis:
# Check DB size
etcdctl endpoint status --write-out=table
# Look at DB SIZE column (default limit: 2GB)
# Check how many keys (--keys-only prints a blank line after each key,
# so filter blanks out before counting)
etcdctl get / --prefix --keys-only | grep -v '^$' | wc -l
# Find large key ranges
for prefix in /registry/events /registry/pods /registry/secrets /registry/configmaps; do
  count=$(etcdctl get "$prefix" --prefix --keys-only 2>/dev/null | grep -vc '^$')
  echo "$prefix: $count keys"
done
Fix:
# Step 1: Compact old revisions
LATEST_REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compaction $LATEST_REV
# Step 2: Defragment (reclaims the space freed by compaction)
etcdctl defrag --endpoints=https://127.0.0.1:2379
# Step 3: Disarm the NOSPACE alarm raised when the quota was exceeded —
# etcd stays read-only until you do
etcdctl alarm disarm
# Step 4: Verify
etcdctl endpoint status --write-out=table
etcdctl alarm list   # should print nothing
Prevention: Set up auto-compaction: --auto-compaction-retention=1 (1 hour)
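The fix steps above can be folded into a cron-able maintenance sketch that only acts when the DB crosses a usage threshold. The `pct_of_quota` / `maintain` helper names and the 70% threshold are assumptions for illustration; the 2 GiB quota matches the default noted earlier.

```shell
# Pure helper: integer percentage of quota used.
pct_of_quota() {
  echo $(( $1 * 100 / $2 ))
}

QUOTA=$((2 * 1024 * 1024 * 1024))   # default backend quota: 2 GiB
THRESHOLD=70                        # act above 70% usage (assumption)

maintain() {
  local db_size rev pct
  db_size=$(etcdctl endpoint status -w json | jq '.[0].Status.dbSize')
  pct=$(pct_of_quota "$db_size" "$QUOTA")
  echo "DB at ${pct}% of quota"
  if [ "$pct" -ge "$THRESHOLD" ]; then
    rev=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
    etcdctl compaction "$rev"
    etcdctl defrag
    etcdctl alarm disarm   # clear NOSPACE if it was raised
  fi
}
```

Prefer auto-compaction for the steady state; a script like this is a safety net, and defrag briefly blocks reads/writes on the endpoint, so run it off-peak.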
Scenario 3: Slow etcd (High Latency)¶
Symptoms: kubectl commands are slow (>2s), API server timeouts
Diagnosis:
# Check disk I/O latency
etcdctl endpoint health --write-out=table
# TOOK > 100ms = slow disk
# Check etcd metrics
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
https://127.0.0.1:2379/metrics | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration"
# Check system disk I/O
iostat -x 1 5 # Look for high await times
Fix:
- Move etcd data directory to SSD/NVMe
- Use dedicated disk for etcd (not shared with other workloads)
- Reduce etcd load (fewer watchers, smaller objects)
- Defragment: etcdctl defrag
Prometheus alerts:
# WAL fsync latency > 100ms
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1
# Backend commit latency > 250ms
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
Scenario 4: etcd Member Down¶
Symptoms: etcdctl endpoint health shows one member unhealthy
Diagnosis:
# Check member list
etcdctl member list --write-out=table
# Check if etcd process is running
systemctl status etcd # or check the etcd pod
kubectl get pods -n kube-system -l component=etcd
# Check etcd logs
journalctl -u etcd --since "10 min ago" --no-pager
# or
kubectl logs -n kube-system etcd-<node-name> --tail=100
Fix (member rejoining):
# If member can restart, just restart it
systemctl restart etcd
# If member data is corrupted:
# 1. Remove the member
etcdctl member remove <member-id>
# 2. Clear data directory
rm -rf /var/lib/etcd/member
# 3. Re-add the member
etcdctl member add <name> --peer-urls=https://<ip>:2380
# 4. Start etcd with --initial-cluster-state=existing
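The remove / re-add cycle above can be sketched as a single script. The `member_id_by_name` / `replace_member` helpers are illustrative assumptions; it assumes etcdctl env vars are set, the broken member's etcd is already stopped, and the kubeadm data-dir layout from step 2.

```shell
# Pure helper: extract the member ID for a given name from
# `etcdctl member list` default (comma-separated) output on stdin.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name {print $1}'
}

replace_member() {
  local name=$1 peer_url=$2 id
  id=$(etcdctl member list | member_id_by_name "$name")
  etcdctl member remove "$id"
  rm -rf /var/lib/etcd/member          # wipe the stale data dir on that node
  etcdctl member add "$name" --peer-urls="$peer_url"
  # then start etcd on that node with --initial-cluster-state=existing
}
```

`etcdctl member add` prints the `ETCD_INITIAL_CLUSTER*` values the rejoining member should start with; don't skip reading them.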
Scenario 5: etcd Backup and Restore¶
Backup:
# Take a snapshot
etcdctl snapshot save /tmp/etcd-backup-$(date +%Y%m%d-%H%M%S).db
# Verify snapshot
etcdctl snapshot status /tmp/etcd-backup-*.db --write-out=table
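For the "backup recent" goal, snapshots need to be taken on a schedule with old ones pruned. A cron-able sketch, assuming etcdctl env vars are set; `BACKUP_DIR`, `KEEP`, and the helper names are assumptions (the prune helper relies on GNU `head -n -N`).

```shell
BACKUP_DIR=/var/backups/etcd
KEEP=7   # retain the 7 most recent snapshots

# Pure helper: given newline-separated filenames sorted oldest-first on
# stdin, print the ones to delete so only the newest $1 remain.
to_prune() {
  head -n -"$1"
}

backup_and_prune() {
  mkdir -p "$BACKUP_DIR"
  etcdctl snapshot save "$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"
  ls "$BACKUP_DIR"/etcd-*.db | sort | to_prune "$KEEP" | xargs -r rm -f
}
```

The timestamped filenames sort lexicographically in age order, which is what makes the `sort | to_prune` pipeline safe.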
Restore (disaster recovery):
# Stop kube-apiserver and etcd. (With kubeadm static pods there is no
# systemd unit — move their manifests out of /etc/kubernetes/manifests instead.)
systemctl stop kube-apiserver
systemctl stop etcd
# Restore from snapshot
etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd-restored \
--name=node1 \
--initial-cluster=node1=https://10.0.0.1:2380 \
--initial-advertise-peer-urls=https://10.0.0.1:2380
# Replace data directory
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
# Start etcd and kube-apiserver
systemctl start etcd
systemctl start kube-apiserver
# Verify
etcdctl endpoint health
kubectl get nodes
Scenario 6: etcd Certificate Issues¶
Symptoms: "tls: bad certificate" or "transport: authentication handshake failed"
Diagnosis:
# Check cert expiry
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -dates
# Check cert SAN (must include node IP)
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -ext subjectAltName
# Check if certs match
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -modulus | md5sum
openssl rsa -in /etc/kubernetes/pki/etcd/server.key -noout -modulus | md5sum
# Should match
Fix: Rotate certificates using kubeadm or manually regenerate with the correct SANs.
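The expiry check can be looped over every etcd cert at once. A sketch assuming the kubeadm layout used in the diagnosis above and GNU `date -d`; the `days_left` / `check_etcd_certs` names and the 30-day threshold (matching the checklist below) are assumptions.

```shell
# Pure helper: whole days from now-epoch ($1) to expiry-epoch ($2).
days_left() {
  echo $(( ($2 - $1) / 86400 ))
}

check_etcd_certs() {
  local crt end now
  now=$(date +%s)
  for crt in /etc/kubernetes/pki/etcd/*.crt; do
    end=$(date -d "$(openssl x509 -in "$crt" -noout -enddate | cut -d= -f2)" +%s)
    if [ "$(days_left "$now" "$end")" -lt 30 ]; then
      echo "WARN: $crt expires in under 30 days"
    fi
  done
}
```

On kubeadm clusters, `kubeadm certs check-expiration` gives a similar overview for all control-plane certs.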
Scenario 7: Leader Election Thrashing¶
Symptoms: Frequent leader changes, slow API responses
Diagnosis:
# Check leader changes
etcdctl endpoint status --write-out=table
# Note which member is leader
# Monitor leader changes over time
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
https://127.0.0.1:2379/metrics | grep etcd_server_leader_changes_seen_total
Common causes:
- Network latency between members > election timeout
- Disk I/O too slow (WAL writes taking too long)
- CPU contention (etcd competing with other processes)
Fix:
- Ensure dedicated resources for etcd
- Increase --heartbeat-interval and --election-timeout
- Move to faster disks
- Reduce network latency between etcd members
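Since `etcd_server_leader_changes_seen_total` is a counter, "frequent" means it is still climbing: sampling it twice and diffing gives an elections-per-interval number without Prometheus. The `metric` / `delta` / `watch_leader_changes` helpers are assumptions; the cert env vars come from the Prerequisites section.

```shell
# Scrape the leader-change counter from the local member.
metric() {
  curl -s --cacert "$ETCDCTL_CACERT" --cert "$ETCDCTL_CERT" --key "$ETCDCTL_KEY" \
    https://127.0.0.1:2379/metrics \
    | awk '/^etcd_server_leader_changes_seen_total/ {print $2}'
}

# Pure helper: change between two counter samples.
delta() { echo $(( $2 - $1 )); }

# Sample twice, $1 seconds apart (default 300).
watch_leader_changes() {
  local a b
  a=$(metric); sleep "${1:-300}"; b=$(metric)
  echo "leader changes in ${1:-300}s: $(delta "$a" "$b")"
}
```

Anything above zero in a five-minute window while the cluster is otherwise idle points at the causes listed above.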
Scenario 8: Watch Event Overload¶
Symptoms: etcd using excessive memory, slow watch responses
Diagnosis:
# Check watch count
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
https://127.0.0.1:2379/metrics | grep etcd_debugging_mvcc_watcher_total
# Check number of events
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
https://127.0.0.1:2379/metrics | grep etcd_debugging_mvcc_events_total
Common causes: Operators or controllers watching too many resources with tight reconciliation loops.
Fix: Identify and fix misbehaving controllers. Use label selectors to narrow watch scope.
Key etcd Metrics to Monitor¶
| Metric | What it means | Alert threshold |
|---|---|---|
| etcd_disk_wal_fsync_duration_seconds | WAL sync latency | p99 > 100ms |
| etcd_disk_backend_commit_duration_seconds | Backend commit latency | p99 > 250ms |
| etcd_server_leader_changes_seen_total | Leader election changes | > 3/hour |
| etcd_server_proposals_failed_total | Failed raft proposals | > 0 sustained |
| etcd_mvcc_db_total_size_in_bytes | Database size | > 80% of quota |
| etcd_network_peer_round_trip_time_seconds | Peer RTT | p99 > 50ms |
Quick etcd Health Checklist¶
- All members healthy: etcdctl endpoint health
- DB size < 2GB: etcdctl endpoint status
- WAL fsync < 100ms: check metrics
- Leader stable: no frequent changes
- Backup recent: < 24 hours old
- Certs not expiring soon: > 30 days remaining
- Disk space available: > 20% free on etcd data volume
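The scriptable items in the checklist can be run as one pass: a minimal sketch, assuming the etcdctl env vars are set and the data dir is /var/lib/etcd; the `verdict` / `health_report` names are assumptions, and the quota mirrors the 2 GiB default.

```shell
# Pure helper: "ok"/"bad" verdict for value-vs-limit comparisons.
verdict() { [ "$1" -le "$2" ] && echo ok || echo bad; }

health_report() {
  # Member health across the whole cluster
  etcdctl endpoint health --cluster -w table
  # DB size vs the default 2 GiB quota
  local db
  db=$(etcdctl endpoint status -w json | jq '.[0].Status.dbSize')
  echo "db size: $db bytes ($(verdict "$db" $((2*1024*1024*1024))) vs 2GiB quota)"
  # Disk headroom on the etcd data volume (> 20% free expected)
  df -h /var/lib/etcd
}
```

WAL fsync latency, leader stability, backup age, and cert expiry still need the scenario-specific checks above; this only covers what a single etcdctl/df pass can see.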
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
Related Content¶
- Interview: etcd Space Exceeded (Scenario, L3) — etcd
- Runbook: etcd Backup & Restore (Runbook, L2) — etcd
- Runbook: etcd High Latency / Slow Operations (Runbook, L3) — etcd
- Skillcheck: etcd (Assessment, L2) — etcd
- etcd (Topic Pack, L1) — etcd
- etcd Drills (Drill, L2) — etcd
- etcd Flashcards (CLI) (flashcard_deck, L1) — etcd