etcd Operations Cheat Sheet
Name origin: etcd = "/etc" + "d" for "distributed". It stores configuration data, like the /etc directory in Unix, but as a distributed, consistent key-value store. Created by CoreOS (now part of Red Hat) in 2013. It uses the Raft consensus algorithm, under which a majority of members (a quorum) must agree on every write before it is committed.
Architecture
┌──────────┐   ┌──────────┐   ┌──────────┐
│  etcd-1  │◄──┤  etcd-2  │◄──┤  etcd-3  │
│ (leader) │──►│(follower)│──►│(follower)│
└────┬─────┘   └──────────┘   └──────────┘
     │ Raft consensus
     ▼
┌────────────────┐
│ kube-apiserver │
└────────────────┘
- Minimum 3 members for HA (tolerates 1 failure)
- 5 members tolerates 2 failures
- Quorum = floor(n/2) + 1 (for example, 2 of 3 or 3 of 5)
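The quorum arithmetic can be sketched with shell integer math, which already truncates the division. The function names `quorum` and `fault_tolerance` are illustrative and not part of etcd.

```shell
# Quorum and failure tolerance for an n-member cluster.
# Shell integer division truncates, so $1 / 2 is floor(n/2).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while a quorum is still reachable.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Note that an even member count adds no tolerance: 4 members tolerate 1 failure, the same as 3, which is why odd cluster sizes are the usual choice.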
Common Cert Paths (kubeadm)
CA_CERT=/etc/kubernetes/pki/etcd/ca.crt
CLIENT_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
CLIENT_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Alias for convenience
alias etcdctl='ETCDCTL_API=3 etcdctl \
--cacert=$CA_CERT --cert=$CLIENT_CERT --key=$CLIENT_KEY'
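Aliases are not expanded in non-interactive shell scripts by default, so for scripts one alternative is etcdctl's environment variables, which mirror its flags. A minimal sketch, assuming the kubeadm paths above and a local endpoint at `https://127.0.0.1:2379` (the endpoint is an assumption; adjust it to your cluster):

```shell
# etcdctl reads ETCDCTL_<FLAG> environment variables as flag values.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379   # assumed endpoint
```

Use either the alias or these variables, not both: etcdctl may report a conflict when a flag and its matching environment variable are set at the same time.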
Health & Status
# Endpoint health
etcdctl endpoint health --cluster
# Endpoint status (DB size, leader, raft index)
etcdctl endpoint status --cluster --write-out=table
# Member list
etcdctl member list --write-out=table
# Check leader
etcdctl endpoint status --write-out=json | jq '.[].Status.leader'
# Alarm list (check for space alarms)
etcdctl alarm list
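For monitoring scripts, the health output can be reduced to a single number. The sketch below assumes each line of `etcdctl endpoint health` output has the form `<endpoint> is healthy: ...` or `<endpoint> is unhealthy: ...`; the helper name `count_unhealthy` and the sample lines are hypothetical.

```shell
# Count unhealthy endpoints from `etcdctl endpoint health` output on stdin.
# grep -c exits non-zero when nothing matches, so `|| true` keeps the
# function's exit status at 0 while still printing "0".
count_unhealthy() {
  grep -c ' is unhealthy' || true
}

# Hypothetical sample standing in for real output.
sample='https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 2.1ms
https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.4ms
https://10.0.0.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded'

printf '%s\n' "$sample" | count_unhealthy
```

Against a live cluster this would be fed with `etcdctl endpoint health --cluster 2>&1 | count_unhealthy`; the `2>&1` is there because some etcdctl versions write health lines to stderr.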
Backup & Restore
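# Snapshot backup
SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAPSHOT"
# Verify the snapshot just taken (etcd 3.5+ deprecates this in favor of `etcdutl snapshot status`)
etcdctl snapshot status "$SNAPSHOT" --write-out=table
# Restore (STOP etcd first!); on etcd 3.5+ prefer `etcdutl snapshot restore` with the same flags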
etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restored \
--name=etcd-1 \
--initial-cluster=etcd-1=https://10.0.0.1:2380 \
--initial-advertise-peer-urls=https://10.0.0.1:2380
# Then update etcd config to use --data-dir=/var/lib/etcd-restored
Gotcha: etcdctl snapshot restore creates a new data directory; it does not restore in place. You must stop etcd, point it to the new --data-dir, and restart. For multi-node clusters, you must restore the same snapshot on ALL members, each with its own --name and --initial-advertise-peer-urls and with an --initial-cluster value that lists every member. Restoring only one node gives that member a new cluster identity that no longer matches its peers, which leaves the cluster split.
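A sketch of the per-member restore commands for a three-node cluster. The member names and IPs (`etcd-2`, `10.0.0.2`, and so on) are hypothetical extensions of the single-node example. The script only prints the commands; each printed line is meant to be run on the corresponding member. Every member gets the same snapshot and the same full `--initial-cluster` list, while `--name` and the advertised peer URL differ.

```shell
# Hypothetical member list: name=ip pairs.
MEMBERS="etcd-1=10.0.0.1 etcd-2=10.0.0.2 etcd-3=10.0.0.3"
SNAPSHOT=/backup/etcd-snapshot.db

# Build the shared --initial-cluster value: name=https://ip:2380,...
INITIAL_CLUSTER=""
for m in $MEMBERS; do
  name=${m%%=*}
  ip=${m#*=}
  INITIAL_CLUSTER="${INITIAL_CLUSTER:+$INITIAL_CLUSTER,}$name=https://$ip:2380"
done

# Print the restore command for one member ($1 = name, $2 = IP).
restore_command() {
  echo "etcdctl snapshot restore $SNAPSHOT" \
       "--data-dir=/var/lib/etcd-restored" \
       "--name=$1" \
       "--initial-cluster=$INITIAL_CLUSTER" \
       "--initial-advertise-peer-urls=https://$2:2380"
}

for m in $MEMBERS; do
  restore_command "${m%%=*}" "${m#*=}"
done
```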
DB Space Exceeded Recovery
# 1. Check current alarm
etcdctl alarm list
# → memberID:xxx alarm:NOSPACE
# 2. Get latest revision
REV=$(etcdctl endpoint status --write-out=json | \
jq '.[0].Status.header.revision')
# 3. Compact
etcdctl compact $REV
# 4. Defragment
etcdctl defrag --cluster
# 5. Disarm alarm
etcdctl alarm disarm
# 6. Verify
etcdctl endpoint status --write-out=table
Debug clue: The NOSPACE alarm is the #1 etcd emergency in production Kubernetes. When etcd hits its storage quota (default 2GB), it enters a maintenance mode that accepts only key reads and deletes, and the API server can no longer create or update resources. The five-step fix is: check alarm, get revision, compact, defrag, disarm, followed by a verification. Memorize this sequence; you will need it at 3 AM.
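Under pressure it is easy to disarm the alarm after a failed compaction. One way to guard against that is to chain the steps so each runs only if the previous one succeeded. This is a sketch, not an official procedure: the function name `recover_nospace` and the `ETCDCTL` override variable are hypothetical, and the revision is passed in after being read as in step 2.

```shell
# Command used to reach etcd; override it to test or dry-run the sequence.
ETCDCTL=${ETCDCTL:-etcdctl}

# Compact, defragment, disarm, and verify, stopping at the first failure.
# $1 = current revision. $ETCDCTL is left unquoted on purpose so that a
# multi-word override is split into command and arguments.
recover_nospace() {
  $ETCDCTL compact "$1" &&
  $ETCDCTL defrag --cluster &&
  $ETCDCTL alarm disarm &&
  $ETCDCTL endpoint status --write-out=table
}
```

Setting `ETCDCTL="echo etcdctl"` before calling `recover_nospace 12345` prints the four commands instead of running them, which is a cheap way to review the sequence first.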
Performance Tuning
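# Check disk latency (etcd needs < 10ms fsync); run from a directory on the etcd data disk
# --fdatasync=1 syncs after every write, which is what makes fio report sync latency
fio --name=etcd-test --ioengine=sync --rw=write --bs=2300 --fdatasync=1 \
--numjobs=1 --size=22m --runtime=60 --time_based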
# Key metrics to monitor
etcd_disk_wal_fsync_duration_seconds # Should be < 10ms p99
etcd_disk_backend_commit_duration_seconds
etcd_server_proposals_failed_total # Should be 0
etcd_network_peer_round_trip_time_seconds
etcd_mvcc_db_total_size_in_bytes # Monitor against quota
Member Management
# Add a new member
etcdctl member add etcd-4 --peer-urls=https://10.0.0.4:2380
# Remove a member
etcdctl member remove <member-id>
# Update peer URLs
etcdctl member update <member-id> --peer-urls=https://new-ip:2380
Common Key Prefixes in Kubernetes
# List all keys (careful in production!)
etcdctl get / --prefix --keys-only --limit=20
# Common prefixes:
# /registry/pods/
# /registry/services/
# /registry/deployments/
# /registry/secrets/
# /registry/configmaps/
# /registry/events/ ← often the biggest
# Count keys by type (NF skips the blank lines that --keys-only prints between keys)
etcdctl get /registry --prefix --keys-only | \
awk -F'/' 'NF {print $3}' | sort | uniq -c | sort -rn | head
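The counting pipeline can be tried without a cluster. In the sketch below, the sample keys are hypothetical stand-ins for real `etcdctl get --keys-only` output, including the blank line printed after each key; `count_by_type` is an illustrative helper name.

```shell
# Count keys per resource type. Splitting "/registry/<type>/..." on "/"
# makes field 3 the resource type; the NF condition skips blank lines.
count_by_type() {
  awk -F'/' 'NF { print $3 }' | sort | uniq -c | sort -rn
}

# Hypothetical sample keys, each followed by a blank line.
printf '%s\n\n' \
  /registry/pods/default/web-1 \
  /registry/pods/default/web-2 \
  /registry/events/default/web-1.17a \
  /registry/pods/kube-system/dns \
  /registry/secrets/default/token | count_by_type
```

With this sample, `pods` comes out on top with a count of 3. Without the `NF` condition the blank lines would show up as an extra, unnamed bucket.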
Quota & Compaction Settings
--quota-backend-bytes=8589934592 # 8GB (default 2GB)
--auto-compaction-retention=1 # Keep 1 hour of history
--auto-compaction-mode=periodic # or "revision"
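The quota flag takes a raw byte count, so it helps to compute it rather than type it. A small sketch with two hypothetical helpers: one converts GiB to bytes, the other reports how much of the quota a given DB size uses (both sizes in bytes, as shown by endpoint status and `etcd_mvcc_db_total_size_in_bytes`).

```shell
# Convert GiB to the byte value expected by --quota-backend-bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Integer percentage of the quota in use ($1 = DB size, $2 = quota).
quota_used_percent() {
  echo $(( $1 * 100 / $2 ))
}

gib_to_bytes 8                             # 8589934592
quota_used_percent 1610612736 2147483648   # 75 (1.5 GiB of a 2 GiB quota)
```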
Troubleshooting Quick Reference
| Symptom | Likely Cause | Action |
|---|---|---|
| database space exceeded | No compaction, too many events | Compact + defrag + alarm disarm |
| leader changed repeatedly | Disk too slow, network issues | Check fsync latency, check network |
| request timed out | Overloaded, slow disk | Check disk I/O, reduce watch count |
| member not found | Stale member list | member list, remove stale entries |
| permission denied | Wrong certs | Verify cert paths, check expiry |