Portal | Level: L3: Advanced | Topics: etcd | Domain: Kubernetes

etcd Troubleshooting Scenarios

Overview

etcd is the brain of Kubernetes. Every resource, every secret, every config lives in etcd. When etcd is unhealthy, the entire cluster is at risk. These scenarios cover the most common etcd failures and how to diagnose and fix them.

Prerequisites

# etcdctl setup (adjust endpoints for your cluster)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key

# For k3s:
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/client.key

Scenario 1: etcd Health Check

Task: Verify etcd cluster health and membership.

# Check endpoint health
etcdctl endpoint health --write-out=table

# Check endpoint status (leader, DB size, raft index)
etcdctl endpoint status --write-out=table

# List members
etcdctl member list --write-out=table

Expected output:

+------------------+---------+--------+---------------------------+
|    ENDPOINT      | HEALTH  | TOOK   |          ERROR            |
+------------------+---------+--------+---------------------------+
| https://...:2379 |  true   | 2.3ms  |                           |
+------------------+---------+--------+---------------------------+

Red flags:

  • Health = false
  • TOOK > 100ms (slow disk I/O)
  • Different raft indices across members (replication lag; compare them with the sketch below)
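
Raft indices are easiest to compare side by side. A minimal sketch, assuming jq is installed and ETCDCTL_ENDPOINTS lists every member (raftAppliedIndex is reported by etcdctl v3.4+):

# Indices should match across members, or trail by only a few entries
etcdctl endpoint status --write-out=json \
  | jq -r '.[] | "\(.Endpoint)  raftIndex=\(.Status.raftIndex)  applied=\(.Status.raftAppliedIndex)"'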


Scenario 2: etcd Database Too Large

Symptoms: API server returns "etcdserver: mvcc: database space exceeded"

Diagnosis:

# Check DB size
etcdctl endpoint status --write-out=table
# Look at DB SIZE column (default limit: 2GB)

# Check how many keys
etcdctl get / --prefix --keys-only | wc -l

# Find large key ranges
for prefix in /registry/events /registry/pods /registry/secrets /registry/configmaps; do
  count=$(etcdctl get $prefix --prefix --keys-only 2>/dev/null | wc -l)
  echo "$prefix: $count keys"
done

Fix:

# Step 1: Compact old revisions
LATEST_REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compact $LATEST_REV

# Step 2: Defragment
etcdctl defrag --endpoints=https://127.0.0.1:2379

# Step 3: Disarm the NOSPACE alarm raised when the quota was exceeded
# (the cluster stays read-only until it is cleared)
etcdctl alarm disarm

# Step 4: Verify
etcdctl endpoint status --write-out=table
etcdctl alarm list  # should print nothing

Prevention: Enable auto-compaction with --auto-compaction-retention=1 (periodic mode, the default, keeping one hour of history). If more headroom is genuinely needed, raise --quota-backend-bytes (8GB is the commonly cited ceiling).
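
Compaction and defragmentation can also be scripted as routine maintenance. A minimal sketch, assuming a three-member cluster (hypothetical endpoints) and jq installed; defrag blocks the member it runs on, so run it one member at a time:

# Compact to the current revision, then defrag members one by one
REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compact "$REV"
for ep in https://10.0.0.1:2379 https://10.0.0.2:2379 https://10.0.0.3:2379; do
  etcdctl --endpoints="$ep" defrag
  sleep 5  # let the member settle before moving on
done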


Scenario 3: Slow etcd (High Latency)

Symptoms: kubectl commands are slow (>2s), API server timeouts

Diagnosis:

# Check disk I/O latency
etcdctl endpoint health --write-out=table
# TOOK > 100ms = slow disk

# Check etcd metrics
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
  https://127.0.0.1:2379/metrics | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration"

# Check system disk I/O
iostat -x 1 5  # Look for high await times
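
If iostat looks suspect, benchmark the disk directly against etcd's write pattern. A hedged sketch using fio (the 22m/2300-byte values mirror the commonly cited etcd WAL test; /var/lib/etcd-test is a hypothetical scratch directory on the etcd disk):

# fio prints fdatasync latency percentiles; for etcd, p99 should be under 10ms
mkdir -p /var/lib/etcd-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-test --size=22m --bs=2300 --name=etcd-disk-check
rm -rf /var/lib/etcd-test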

Fix:

  • Move the etcd data directory to SSD/NVMe
  • Use a dedicated disk for etcd (not shared with other workloads)
  • Reduce etcd load (fewer watchers, smaller objects)
  • Defragment: etcdctl defrag

Prometheus alerts:

# WAL fsync latency > 100ms
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1

# Backend commit latency > 250ms
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25


Scenario 4: etcd Member Down

Symptoms: etcdctl endpoint health shows one member unhealthy

Diagnosis:

# Check member list
etcdctl member list --write-out=table

# Check if etcd process is running
systemctl status etcd  # or check the etcd pod
kubectl get pods -n kube-system -l component=etcd

# Check etcd logs
journalctl -u etcd --since "10 min ago" --no-pager
# or
kubectl logs -n kube-system etcd-<node-name> --tail=100

Fix (rejoining a member):

# If member can restart, just restart it
systemctl restart etcd

# If member data is corrupted:
# 1. Remove the member
etcdctl member remove <member-id>

# 2. Clear data directory
rm -rf /var/lib/etcd/member

# 3. Re-add the member
etcdctl member add <name> --peer-urls=https://<ip>:2380

# 4. Start etcd with --initial-cluster-state=existing
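
Getting the member ID right is the fiddly part: member list prints hex IDs in table mode but decimal numbers in JSON, and member remove expects the hex form. A small sketch, assuming the dead member is named node2 and jq is installed:

# Look up the member by name and convert its decimal JSON ID to hex
DEC_ID=$(etcdctl member list --write-out=json | jq -r '.members[] | select(.name=="node2") | .ID')
HEX_ID=$(printf '%x' "$DEC_ID")
etcdctl member remove "$HEX_ID"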


Scenario 5: etcd Backup and Restore

Backup:

# Take a snapshot
etcdctl snapshot save /tmp/etcd-backup-$(date +%Y%m%d-%H%M%S).db

# Verify snapshot
etcdctl snapshot status /tmp/etcd-backup-*.db --write-out=table
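
Backups only help if they happen on a schedule. A cron-able sketch with hypothetical paths (copy the snapshots off the node as well):

#!/usr/bin/env bash
set -euo pipefail
BACKUP_DIR=/var/backups/etcd
mkdir -p "$BACKUP_DIR"
SNAP="$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"
etcdctl snapshot save "$SNAP"
etcdctl snapshot status "$SNAP" --write-out=table   # errors out on a corrupt snapshot
ls -1t "$BACKUP_DIR"/etcd-*.db | tail -n +8 | xargs -r rm   # keep the 7 newest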

Restore (disaster recovery):

# Stop kube-apiserver and etcd
# (on kubeadm clusters both run as static pods: move their manifests out of
# /etc/kubernetes/manifests instead of using systemctl)
systemctl stop kube-apiserver
systemctl stop etcd

# Restore from snapshot (on etcd >= 3.5, etcdutl snapshot restore is the
# preferred tool and takes the same flags)
etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd-restored \
  --name=node1 \
  --initial-cluster=node1=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls=https://10.0.0.1:2380

# Replace data directory
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd

# Start etcd and kube-apiserver
systemctl start etcd
systemctl start kube-apiserver

# Verify
etcdctl endpoint health
kubectl get nodes


Scenario 6: etcd Certificate Issues

Symptoms: "tls: bad certificate" or "transport: authentication handshake failed"

Diagnosis:

# Check cert expiry
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -dates

# Check cert SAN (must include node IP)
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -ext subjectAltName

# Check if certs match
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -modulus | md5sum
openssl rsa -in /etc/kubernetes/pki/etcd/server.key -noout -modulus | md5sum
# Should match

Fix: Rotate certificates using kubeadm or manually regenerate with the correct SANs.
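
On kubeadm clusters, rotation is mostly automated. A hedged example (the cert names match those listed by kubeadm certs renew --help):

# Check what is close to expiry, then renew the etcd-related certs
kubeadm certs check-expiration
kubeadm certs renew etcd-server
kubeadm certs renew etcd-peer
kubeadm certs renew etcd-healthcheck-client
kubeadm certs renew apiserver-etcd-client
# Restart the etcd and kube-apiserver static pods to pick up the new certs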


Scenario 7: Leader Election Thrashing

Symptoms: Frequent leader changes, slow API responses

Diagnosis:

# Check leader changes
etcdctl endpoint status --write-out=table
# Note which member is leader

# Monitor leader changes over time
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
  https://127.0.0.1:2379/metrics | grep etcd_server_leader_changes_seen_total

Common causes:

  • Network latency between members exceeding the election timeout
  • Disk I/O too slow (WAL writes taking too long)
  • CPU contention (etcd competing with other processes)

Fix:

  • Ensure dedicated resources for etcd
  • Increase --heartbeat-interval and --election-timeout (see the example below)
  • Move to faster disks
  • Reduce network latency between etcd members
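
etcd's defaults are --heartbeat-interval=100 and --election-timeout=1000 (both in milliseconds), and the usual guidance is to keep the election timeout around 10x the heartbeat interval. A hedged example for a higher-latency environment, added to the etcd unit file or static-pod manifest:

# Values in milliseconds; every member must use the same settings
--heartbeat-interval=250
--election-timeout=2500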


Scenario 8: Watch Event Overload

Symptoms: etcd using excessive memory, slow watch responses

Diagnosis:

# Check watch count
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
  https://127.0.0.1:2379/metrics | grep etcd_debugging_mvcc_watcher_total

# Check number of events
curl -s --cacert $ETCDCTL_CACERT --cert $ETCDCTL_CERT --key $ETCDCTL_KEY \
  https://127.0.0.1:2379/metrics | grep etcd_debugging_mvcc_events_total

Common causes: Operators or controllers watching too many resources with tight reconciliation loops.

Fix: Identify and fix misbehaving controllers. Use label selectors to narrow watch scope.
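
The API server side helps pin down who is watching what. A hedged sketch, assuming you can reach the kube-apiserver /metrics endpoint (the apiserver_registered_watchers gauge is broken out per resource kind):

# Show the resource kinds with the most registered watchers
kubectl get --raw /metrics \
  | grep '^apiserver_registered_watchers' \
  | sort -k2 -rn | head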


Key etcd Metrics to Monitor

Metric                                     | What it means           | Alert threshold
-------------------------------------------|-------------------------|----------------
etcd_disk_wal_fsync_duration_seconds       | WAL sync latency        | p99 > 100ms
etcd_disk_backend_commit_duration_seconds  | Backend commit latency  | p99 > 250ms
etcd_server_leader_changes_seen_total      | Leader election changes | > 3/hour
etcd_server_proposals_failed_total         | Failed raft proposals   | > 0 sustained
etcd_mvcc_db_total_size_in_bytes           | Database size           | > 80% of quota
etcd_network_peer_round_trip_time_seconds  | Peer RTT                | p99 > 50ms

Quick etcd Health Checklist

  • All members healthy: etcdctl endpoint health
  • DB size < 2GB: etcdctl endpoint status
  • WAL fsync < 100ms: check metrics
  • Leader stable: no frequent changes
  • Backup recent: < 24 hours old
  • Certs not expiring soon: > 30 days remaining
  • Disk space available: > 20% free on etcd data volume
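
The checklist collapses into a one-shot triage script. A minimal sketch, assuming the Prerequisites environment variables are set, jq is installed, and a kubeadm cert/data layout (adjust the paths for k3s):

#!/usr/bin/env bash
set -e
etcdctl endpoint health --write-out=table
etcdctl endpoint status --write-out=table
DB=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.dbSize')
echo "DB size: $((DB / 1024 / 1024)) MiB"
# --checkend exits non-zero if the cert expires within the window (30 days here)
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -checkend $((30*24*3600)) \
  && echo "etcd server cert: >30 days remaining"
df -h /var/lib/etcd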
