Portal | Level: L2: Operations | Topics: etcd | Domain: Kubernetes

etcd Operations Drills

Remember: Every etcdctl command in a TLS-secured cluster needs three certificate flags: --cacert (CA certificate), --cert (client certificate), --key (client key). Mnemonic: "CCK" — CA, Cert, Key. In kubeadm clusters, these live in /etc/kubernetes/pki/etcd/. Forgetting any one of the three produces a cryptic TLS handshake error.
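The three flags (plus the API version) can also come from environment variables, which etcdctl v3 reads in place of each flag. A minimal sketch, assuming the standard kubeadm paths:

```shell
# etcdctl maps each flag to an ETCDCTL_* environment variable, so the
# "CCK" trio can be exported once per shell session instead of repeated.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key

# After this, plain commands work, e.g.: etcdctl endpoint health --cluster
echo "ETCDCTL_API=$ETCDCTL_API"
```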

Gotcha: Always set ETCDCTL_API=3 before running etcdctl commands. On etcdctl releases older than 3.4 the default API version is v2, which has completely different command syntax and a separate data model; v2 commands can appear to succeed while operating on a different data store, leading to confusing results. etcdctl 3.4+ defaults to v3, but exporting the variable costs nothing and removes the ambiguity.

Drill 1: Check Cluster Health

Difficulty: Easy

Q: Write the command to check the health of all etcd members in a kubeadm-managed cluster.

Answer
ETCDCTL_API=3 etcdctl endpoint health --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

Drill 2: Check DB Size

Difficulty: Easy

Q: How do you check the current etcd database size and who the leader is?

Answer
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Output shows: ENDPOINT, ID, VERSION, DB SIZE, IS LEADER, RAFT TERM, RAFT INDEX. The default quota (--quota-backend-bytes) is 2 GiB; alert at 80% (~1.6 GiB).
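The 80% check itself is simple integer arithmetic. A sketch with a hard-coded example size; in a live cluster, the dbSize field from `endpoint status --write-out=json` is what you would feed in:

```shell
# Quota-check sketch: DB_SIZE is a hard-coded example value; in practice
# parse it from `etcdctl endpoint status -w json` (field .[0].Status.dbSize).
QUOTA=$((2 * 1024 * 1024 * 1024))   # default --quota-backend-bytes: 2 GiB
DB_SIZE=1800000000                  # example: ~1.8 GB reported by etcd
PCT=$((DB_SIZE * 100 / QUOTA))
echo "etcd DB at ${PCT}% of quota"
if [ "$PCT" -ge 80 ]; then
  echo "over 80%: compact and defragment soon"
fi
```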

Drill 3: Snapshot Backup

Difficulty: Easy

Q: Create a timestamped etcd snapshot backup.

Answer
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify the snapshot (on etcd 3.5+, `etcdutl snapshot status` is the
# non-deprecated equivalent)
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-*.db --write-out=table

Drill 4: Space Exceeded Recovery

Difficulty: Hard

Q: etcd returns "mvcc: database space exceeded". kubectl commands fail. Existing pods are running. Walk through the full recovery procedure.

Answer
# 1. Get the latest revision
REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  | jq '.[0].Status.header.revision')

# 2. Compact old revisions
ETCDCTL_API=3 etcdctl compact $REV \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 3. Defragment all members
ETCDCTL_API=3 etcdctl defrag --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 4. Disarm the space alarm
ETCDCTL_API=3 etcdctl alarm disarm \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 5. Verify
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table ...
kubectl get nodes  # Should work again
Key insight: with the NOSPACE alarm raised, etcd goes read-only rather than down. Existing workloads keep running; only writes (new or updated objects) fail until the alarm is disarmed.
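To reduce the chance of a repeat, etcd supports automatic compaction. A hedged excerpt of what the flags could look like in the static-pod manifest; the retention and quota values are examples, and note that kube-apiserver already requests periodic compaction by default:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (excerpt; values are examples)
spec:
  containers:
  - command:
    - etcd
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=8h      # keep ~8h of revision history
    - --quota-backend-bytes=8589934592    # optionally raise the quota to 8 GiB
```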

Drill 5: Snapshot Restore

Difficulty: Hard

Q: You have a snapshot at /backup/etcd-snapshot.db. The cluster is completely broken. Walk through the restore process for a single-node etcd.

Answer
# 1. Stop the API server and etcd
# (For kubeadm: move manifests out of /etc/kubernetes/manifests/)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# 2. Wait for etcd to stop
crictl ps | grep etcd  # Should show nothing

# 3. Back up current data directory
mv /var/lib/etcd /var/lib/etcd.bak

# 4. Restore snapshot to new data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd \
  --name=<etcd-member-name> \
  --initial-cluster=<name>=https://<ip>:2380 \
  --initial-advertise-peer-urls=https://<ip>:2380

# 5. Restore manifests
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 6. Wait for etcd and API server to start
crictl ps | grep etcd
kubectl get nodes

Drill 6: Find What's Consuming Space

Difficulty: Medium

Q: etcd is at 1.8GB out of 2GB quota. How do you find what's consuming the most space?

Answer
# Count keys by resource type
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  | awk -F'/' '{print $3}' | sort | uniq -c | sort -rn | head -20
Common culprits:
- `events` — usually the biggest (set `--event-ttl=1h` on the API server)
- `secrets` / `configmaps` — operators creating them without cleanup
- `leases` — many nodes or controllers
- Large ConfigMaps with embedded files
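The counting stage of that pipeline can be sanity-checked offline. Keys under /registry follow /registry/&lt;resource&gt;/&lt;namespace&gt;/&lt;name&gt;, so field 3 of the awk split on "/" is the resource type; a sketch on fabricated keys:

```shell
# Demonstrate the counting pipeline on fabricated /registry keys;
# $3 after splitting on "/" is the resource type.
TOP=$(printf '%s\n' \
  /registry/events/default/web-0.17a1 \
  /registry/events/default/web-1.17a2 \
  /registry/pods/default/web-0 \
  /registry/secrets/kube-system/token-abc \
  | awk -F'/' '{print $3}' | sort | uniq -c | sort -rn | head -1)
echo "$TOP"
```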

Drill 7: Member Replacement

Difficulty: Hard

Q: An etcd member etcd-3 has a corrupted disk. How do you remove it and add a fresh member?

Answer
# 1. List members to get the ID
ETCDCTL_API=3 etcdctl member list --write-out=table ...

# 2. Remove the bad member
ETCDCTL_API=3 etcdctl member remove <member-id> ...

# 3. Add the new member
ETCDCTL_API=3 etcdctl member add etcd-3 \
  --peer-urls=https://10.0.0.3:2380 ...

# 4. On the new node, start etcd with:
#    --initial-cluster-state=existing
#    (not "new" — it's joining an existing cluster)

# 5. Verify
ETCDCTL_API=3 etcdctl member list --write-out=table ...
ETCDCTL_API=3 etcdctl endpoint health --cluster ...
Important: Never remove and add at the same time. Remove first, verify quorum, then add.
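For step 4, the new member's startup flags could look roughly like this; every name and IP below is a placeholder, and `member add` prints the exact --initial-cluster string to copy:

```shell
# Example startup flags for the replacement member (names/IPs are placeholders)
etcd --name etcd-3 \
  --initial-cluster etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380 \
  --initial-cluster-state existing \
  --initial-advertise-peer-urls https://10.0.0.3:2380 \
  --listen-peer-urls https://10.0.0.3:2380
```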

Drill 8: Performance Check

Difficulty: Medium

Q: etcd is responding slowly. What metrics and checks would you run?

Answer
# 1. Check disk fsync latency (must be < 10ms for etcd)
# PromQL:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# 2. Check backend commit duration
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# 3. Check proposal failures (should be 0)
rate(etcd_server_proposals_failed_total[5m])

# 4. Check network latency between members
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

# 5. From the node, check disk I/O
iostat -x 1 5  # Look at await and %util for etcd's disk

# 6. Check if defragmentation is needed
# Compare DB SIZE vs DB SIZE IN USE in endpoint status
Common causes of slow etcd:
- Shared disk with other I/O-heavy workloads
- Network latency between members > 10ms
- DB too large (needs compaction + defrag)
- Too many watchers (check `etcd_debugging_mvcc_slow_watcher_total`)

Drill 9: Certificate Expiry Check

Difficulty: Easy

Q: How do you check when etcd's TLS certificates expire?

Answer
# Check etcd server cert
openssl x509 -noout -dates -in /etc/kubernetes/pki/etcd/server.crt

# Check etcd peer cert
openssl x509 -noout -dates -in /etc/kubernetes/pki/etcd/peer.crt

# Check etcd CA
openssl x509 -noout -dates -in /etc/kubernetes/pki/etcd/ca.crt

# Check all K8s certs at once (kubeadm)
kubeadm certs check-expiration

Note: kubeadm automatically renews certificates during a control-plane upgrade. For non-kubeadm clusters, set up monitoring:
# Alert on certs expiring within 30 days
apiserver_client_certificate_expiration_seconds_bucket{le="2592000"} > 0
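For ad-hoc checks, openssl can also do the expiry arithmetic itself via -checkend. A sketch demonstrated on a throwaway self-signed cert; for etcd, point -in at server.crt or peer.crt instead:

```shell
# -checkend N exits 0 if the cert is still valid N seconds from now.
# Generate a throwaway self-signed cert just for the demonstration.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 365 -subj "/CN=demo" 2>/dev/null
if openssl x509 -checkend $((30 * 24 * 3600)) -noout -in /tmp/demo.crt; then
  STATUS="ok"
else
  STATUS="expiring within 30 days"
fi
echo "cert $STATUS"
```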

Drill 10: Automated Backup CronJob

Difficulty: Medium

Q: Write a Kubernetes CronJob that backs up etcd every 6 hours and retains the last 7 days of backups.

Answer
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - effect: NoSchedule
            operator: Exists
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
              # Clean up backups older than 7 days
              find /backup -name "etcd-*.db" -mtime +7 -delete
            env:
            - name: ETCDCTL_API
              value: "3"
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
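The CronJob references an etcd-backup-pvc claim that must exist first. A minimal sketch; the size (and reliance on the default storage class) are assumptions for your environment:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi   # example: 7 days x 4 snapshots/day of your DB size
```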