Portal | Level: L2: Operations | Topics: etcd | Domain: Kubernetes
etcd Operations Drills¶
Remember: Every etcdctl command in a TLS-secured cluster needs three certificate flags: --cacert (CA certificate), --cert (client certificate), --key (client key). Mnemonic: "CCK" (CA, Cert, Key). In kubeadm clusters, these live in /etc/kubernetes/pki/etcd/. Forgetting any one of the three produces a cryptic TLS handshake error.
Gotcha: Always set ETCDCTL_API=3 before running etcdctl commands. On some installations the default API version is v2, which has a completely different command syntax and data model. v2 commands silently succeed but operate on a different data store, leading to confusing results.
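To avoid retyping the three flags, many operators export the API version once and keep the flags in a shell variable. A minimal sketch, assuming a kubeadm cluster; ETCD_FLAGS is just a local naming convention, not an etcdctl feature:

```shell
# Set once per shell session (v3 is required for the commands in these drills)
export ETCDCTL_API=3

# Reusable "CCK" flag bundle (kubeadm default paths)
ETCD_FLAGS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"

# Usage (left unquoted on purpose so the flags split into separate arguments):
# etcdctl $ETCD_FLAGS endpoint health
```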
Drill 1: Check Cluster Health¶
Difficulty: Easy
Q: Write the command to check the health of all etcd members in a kubeadm-managed cluster.
Answer
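A typical answer, assuming the kubeadm default certificate paths (the healthcheck-client pair is one valid client identity; the apiserver-etcd-client certs under /etc/kubernetes/pki/ also work):

```shell
# --cluster discovers all member endpoints and health-checks each one
ETCDCTL_API=3 etcdctl endpoint health --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
```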
Drill 2: Check DB Size¶
Difficulty: Easy
Q: How do you check the current etcd database size and who the leader is?
Answer
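A typical answer, again assuming the kubeadm default certificate paths:

```shell
# Table output includes per-member DB SIZE and an IS LEADER column
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
```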
Output shows: ENDPOINT, ID, VERSION, DB SIZE, IS LEADER, RAFT TERM, RAFT INDEX. Default quota is 2GB. Alert at 80% (1.6GB).
Drill 3: Snapshot Backup¶
Difficulty: Easy
Q: Create a timestamped etcd snapshot backup.
Answer
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-*.db --write-out=table
Drill 4: Space Exceeded Recovery¶
Difficulty: Hard
Q: etcd returns "mvcc: database space exceeded". kubectl commands fail. Existing pods are running. Walk through the full recovery procedure.
Answer
# 1. Get the latest revision
REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
| jq '.[0].Status.header.revision')
# 2. Compact old revisions
ETCDCTL_API=3 etcdctl compact $REV \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 3. Defragment all members
ETCDCTL_API=3 etcdctl defrag --cluster \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 4. Disarm the space alarm
ETCDCTL_API=3 etcdctl alarm disarm \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# 5. Verify
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table ...
kubectl get nodes # Should work again
Drill 5: Snapshot Restore¶
Difficulty: Hard
Q: You have a snapshot at /backup/etcd-snapshot.db. The cluster is completely broken. Walk through the restore process for a single-node etcd.
Answer
# 1. Stop the API server and etcd
# (For kubeadm: move manifests out of /etc/kubernetes/manifests/)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# 2. Wait for etcd to stop
crictl ps | grep etcd # Should show nothing
# 3. Back up current data directory
mv /var/lib/etcd /var/lib/etcd.bak
# 4. Restore snapshot to new data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=<etcd-member-name> \
--initial-cluster=<name>=https://<ip>:2380 \
--initial-advertise-peer-urls=https://<ip>:2380
# 5. Restore manifests
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 6. Wait for etcd and API server to start
crictl ps | grep etcd
kubectl get nodes
Drill 6: Find What's Consuming Space¶
Difficulty: Medium
Q: etcd is at 1.8GB out of 2GB quota. How do you find what's consuming the most space?
Answer
# Count keys by resource type
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
| awk -F'/' '{print $3}' | sort | uniq -c | sort -rn | head -20
Drill 7: Member Replacement¶
Difficulty: Hard
Q: An etcd member etcd-3 has a corrupted disk. How do you remove it and add a fresh member?
Answer
# 1. List members to get the ID
ETCDCTL_API=3 etcdctl member list --write-out=table ...
# 2. Remove the bad member
ETCDCTL_API=3 etcdctl member remove <member-id> ...
# 3. Add the new member
ETCDCTL_API=3 etcdctl member add etcd-3 \
--peer-urls=https://10.0.0.3:2380 ...
# 4. On the new node, start etcd with:
# --initial-cluster-state=existing
# (not "new" — it's joining an existing cluster)
# 5. Verify
ETCDCTL_API=3 etcdctl member list --write-out=table ...
ETCDCTL_API=3 etcdctl endpoint health --cluster ...
Drill 8: Performance Check¶
Difficulty: Medium
Q: etcd is responding slowly. What metrics and checks would you run?
Answer
# 1. Check disk fsync latency (must be < 10ms for etcd)
# PromQL:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
# 2. Check backend commit duration
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
# 3. Check proposal failures (should be 0)
rate(etcd_server_proposals_failed_total[5m])
# 4. Check network latency between members
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
# 5. From the node, check disk I/O
iostat -x 1 5 # Look at await and %util for etcd's disk
# 6. Check if defragmentation is needed
# Compare DB SIZE vs DB SIZE IN USE in endpoint status
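The DB SIZE vs DB SIZE IN USE comparison can be scripted with jq. A sketch on sample endpoint-status JSON; on a live cluster, pipe the real output of `etcdctl endpoint status --write-out=json` (plus the usual cert flags) instead of the sample variable:

```shell
# Sample of the JSON shape emitted by `etcdctl endpoint status --write-out=json`
# (values are made up: a 2 GiB file holding only 512 MiB of live data)
STATUS_JSON='[{"Status":{"dbSize":2147483648,"dbSizeInUse":536870912}}]'

# If live data is far below file size, a defrag will reclaim the difference
echo "$STATUS_JSON" | jq -r \
  '.[0].Status | "live data: \(.dbSizeInUse / .dbSize * 100 | floor)% of file size"'
# -> live data: 25% of file size
```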
Drill 9: Certificate Expiry Check¶
Difficulty: Easy
Q: How do you check when etcd's TLS certificates expire?
Answer
# Check etcd server cert
openssl x509 -noout -dates -in /etc/kubernetes/pki/etcd/server.crt
# Check etcd peer cert
openssl x509 -noout -dates -in /etc/kubernetes/pki/etcd/peer.crt
# Check etcd CA
openssl x509 -noout -dates -in /etc/kubernetes/pki/etcd/ca.crt
# Check all K8s certs at once (kubeadm)
kubeadm certs check-expiration
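Instead of checking certs one by one, a small loop covers the whole directory. A sketch; check_cert_expiry is a hypothetical helper name, and the kubeadm default path is shown in the usage comment:

```shell
# Hypothetical helper: print the expiry date of every .crt in a directory
check_cert_expiry() {
  for crt in "$1"/*.crt; do
    [ -e "$crt" ] || continue                 # skip if the glob matched nothing
    printf '%s: ' "$crt"
    openssl x509 -noout -enddate -in "$crt"   # prints notAfter=<date>
  done
}

# On a kubeadm control-plane node:
# check_cert_expiry /etc/kubernetes/pki/etcd
```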
Drill 10: Automated Backup CronJob¶
Difficulty: Medium
Q: Write a Kubernetes CronJob that backs up etcd every 6 hours and retains the last 7 days of backups.
Answer
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - effect: NoSchedule
              operator: Exists
          containers:
            - name: backup
              image: registry.k8s.io/etcd:3.5.12-0
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
                  # Clean up backups older than 7 days
                  find /backup -name "etcd-*.db" -mtime +7 -delete
              env:
                - name: ETCDCTL_API
                  value: "3"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
Related Content¶
- Interview: etcd Space Exceeded (Scenario, L3) — etcd
- Runbook: etcd Backup & Restore (Runbook, L2) — etcd
- Runbook: etcd High Latency / Slow Operations (Runbook, L3) — etcd
- Scenario: etcd Troubleshooting (Scenario, L3) — etcd
- Skillcheck: etcd (Assessment, L2) — etcd
- etcd (Topic Pack, L1) — etcd
- etcd Flashcards (CLI) (flashcard_deck, L1) — etcd