etcd Operations Cheat Sheet

Name origin: etcd combines the Unix /etc directory (where configuration lives) with "d" for distributed: it stores configuration data like /etc, but as a distributed, consistent key-value store. Created by CoreOS (now part of Red Hat) in 2013. It uses the Raft consensus algorithm, which guarantees that a majority of members (a quorum) must agree on every write before it is committed.

Architecture

┌──────────┐   ┌──────────┐   ┌──────────┐
│  etcd-1  │◄─►│  etcd-2  │◄─►│  etcd-3  │  Raft consensus
│ (leader) │   │(follower)│   │(follower)│  (peer port 2380)
└────┬─────┘   └──────────┘   └──────────┘
     │ client traffic (port 2379)
┌────┴──────────┐
│kube-apiserver │
└───────────────┘
  • Minimum 3 members for HA (tolerates 1 failure)
  • 5 members tolerates 2 failures
  • Quorum = floor(n/2) + 1
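The quorum arithmetic above can be checked with shell integer division (cluster sizes here are illustrative):

```shell
# quorum = floor(n/2) + 1; the cluster tolerates n - quorum failures
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerates=$tolerated"
done
```

Note that even-sized clusters add no fault tolerance: 4 members have quorum 3 and still tolerate only 1 failure, same as 3 members.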

Common Cert Paths (kubeadm)

CA_CERT=/etc/kubernetes/pki/etcd/ca.crt
CLIENT_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
CLIENT_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Alias for convenience
alias etcdctl='ETCDCTL_API=3 etcdctl \
  --cacert=$CA_CERT --cert=$CLIENT_CERT --key=$CLIENT_KEY'

Health & Status

# Endpoint health
etcdctl endpoint health --cluster

# Endpoint status (DB size, leader, raft index)
etcdctl endpoint status --cluster --write-out=table

# Member list
etcdctl member list --write-out=table

# Check leader (prints the leader's member ID as seen by each endpoint)
etcdctl endpoint status --write-out=json | jq '.[].Status.leader'
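To map that leader member ID back to an endpoint, select the entry whose own member_id equals the leader field. The JSON below is a hand-written sample in the shape etcdctl emits (endpoints and IDs are made up); in production, pipe the real `endpoint status` output into the same jq filter:

```shell
# Sample of etcdctl endpoint status --write-out=json (fabricated values)
status_json='[
 {"Endpoint":"https://10.0.0.1:2379","Status":{"header":{"member_id":111},"leader":222}},
 {"Endpoint":"https://10.0.0.2:2379","Status":{"header":{"member_id":222},"leader":222}}
]'
# The leader is the member whose own ID matches the leader field
leader=$(printf '%s\n' "$status_json" | jq -r \
  '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')
echo "leader endpoint: $leader"
```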

# Alarm list (check for space alarms)
etcdctl alarm list

Backup & Restore

# Snapshot backup
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db

# Verify snapshot (on etcd >= 3.5, prefer: etcdutl snapshot status)
etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

# Restore (STOP etcd first!)
etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-1 \
  --initial-cluster=etcd-1=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls=https://10.0.0.1:2380

# Then update etcd config to use --data-dir=/var/lib/etcd-restored

Gotcha: etcd snapshot restore creates a new data directory — it does not restore in place. You must stop etcd, point it to the new --data-dir, and restart. For multi-node clusters, you must restore the same snapshot on ALL members with unique --name and --initial-cluster flags. Restoring only one node creates a split-brain situation.
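The per-member restore can be sketched as a loop. Member names, IPs, and paths below are assumptions for a 3-node cluster, and the commands are echoed rather than executed so this stays a dry run:

```shell
# Dry-run: print the restore command each member would run against the
# SAME snapshot, with a unique --name and the full --initial-cluster map
SNAP=/backup/etcd-snapshot.db
CLUSTER=etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380
for pair in etcd-1=https://10.0.0.1:2380 \
            etcd-2=https://10.0.0.2:2380 \
            etcd-3=https://10.0.0.3:2380; do
  name=${pair%%=*}    # member name (text before the first '=')
  peer=${pair#*=}     # that member's peer URL
  echo "etcdctl snapshot restore $SNAP" \
       "--data-dir=/var/lib/etcd-restored-$name" \
       "--name=$name" \
       "--initial-cluster=$CLUSTER" \
       "--initial-advertise-peer-urls=$peer"
done
```

Remove the echo (or run each printed command on its member) to perform the actual restore.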

DB Space Exceeded Recovery

# 1. Check current alarm
etcdctl alarm list
# → memberID:xxx alarm:NOSPACE

# 2. Get latest revision
REV=$(etcdctl endpoint status --write-out=json | \
  jq '.[0].Status.header.revision')

# 3. Compact
etcdctl compact $REV

# 4. Defragment
etcdctl defrag --cluster

# 5. Disarm alarm
etcdctl alarm disarm

# 6. Verify
etcdctl endpoint status --write-out=table

Debug clue: The NOSPACE alarm is the #1 etcd emergency in production Kubernetes. When etcd hits its storage quota (default 2GB), it switches to read-only mode and the API server cannot create, update, or delete any resources. The five-step fix is: check alarm, get revision, compact, defrag, disarm. Memorize this sequence — you will need it at 3 AM.

Performance Tuning

# Check disk latency (etcd needs p99 WAL fsync < 10ms)
# --fdatasync=1 syncs after every write, matching etcd's WAL pattern;
# without it, fio does not measure sync latency at all
fio --name=etcd-test --directory=/tmp/etcd-fio --ioengine=sync \
  --rw=write --fdatasync=1 --bs=2300 --size=22m
# Read the "fdatasync" latency percentiles in the output

# Key metrics to monitor
etcd_disk_wal_fsync_duration_seconds    # Should be < 10ms p99
etcd_disk_backend_commit_duration_seconds
etcd_server_proposals_failed_total       # Should be 0
etcd_network_peer_round_trip_time_seconds
etcd_mvcc_db_total_size_in_bytes        # Monitor against quota
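As a sketch of monitoring that last metric against the quota, here is the ratio computed from sample /metrics lines. The values are fabricated (in production you would scrape the etcd metrics endpoint), and etcd_server_quota_backend_bytes is the metric reporting the configured quota:

```shell
# Fabricated /metrics excerpt: ~1.6 GB used of a ~2.1 GB (default) quota
metrics='etcd_mvcc_db_total_size_in_bytes 1.6e+09
etcd_server_quota_backend_bytes 2.147483648e+09'

db=$(printf '%s\n' "$metrics"    | awk '/^etcd_mvcc_db_total_size_in_bytes/ {print $2}')
quota=$(printf '%s\n' "$metrics" | awk '/^etcd_server_quota_backend_bytes/ {print $2}')
# awk handles the scientific notation; round to whole percent
pct=$(awk -v d="$db" -v q="$quota" 'BEGIN {printf "%.0f", 100 * d / q}')
echo "db is ${pct}% of quota"
```

Alerting somewhere around 80% leaves time to compact and defragment before the NOSPACE alarm fires.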

Member Management

# Add a new member, then start it with --initial-cluster-state=existing
etcdctl member add etcd-4 --peer-urls=https://10.0.0.4:2380

# Remove a member
etcdctl member remove <member-id>

# Update peer URLs
etcdctl member update <member-id> --peer-urls=https://new-ip:2380

Common Key Prefixes in Kubernetes

# List all keys (careful in production!)
etcdctl get / --prefix --keys-only --limit=20

# Common prefixes:
# /registry/pods/
# /registry/services/
# /registry/deployments/
# /registry/secrets/
# /registry/configmaps/
# /registry/events/          ← often the biggest

# Count keys by type
etcdctl get /registry --prefix --keys-only | \
  awk -F'/' '{print $3}' | sort | uniq -c | sort -rn | head
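The counting pipeline can be sanity-checked against a fabricated key listing (keys below are illustrative, not from a real cluster):

```shell
# Field 3 of /registry/<type>/<ns>/<name> is the resource type
printf '%s\n' \
  /registry/pods/default/web-1 \
  /registry/pods/default/web-2 \
  /registry/events/default/evt-1 \
  /registry/services/default/web |
  awk -F'/' '{print $3}' | sort | uniq -c | sort -rn
```

"2 pods" ranks first; events and services follow with one key each.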

Quota & Compaction Settings

--quota-backend-bytes=8589934592    # 8 GiB (default 2 GiB; 8 GiB is the suggested max)
--auto-compaction-retention=1       # Keep 1 hour of history
--auto-compaction-mode=periodic     # or "revision"
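The quota flag takes bytes, so it is worth double-checking the arithmetic for the 8 GiB value above:

```shell
# 8 GiB expressed in bytes: 8 * 1024^3
echo $(( 8 * 1024 * 1024 * 1024 ))
# → 8589934592
```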

Troubleshooting Quick Reference

Symptom                     Likely Cause                     Action
database space exceeded     No compaction, too many events   Compact + defrag + alarm disarm
leader changed repeatedly   Disk too slow, network issues    Check fsync latency, check network
request timed out           Overloaded, slow disk            Check disk I/O, reduce watch count
member not found            Stale member list                member list, remove stale entries
permission denied           Wrong certs                      Verify cert paths, check expiry