Tags: k8s, l3, runbook, etcd
Portal | Level: L3: Advanced | Topics: etcd | Domain: Kubernetes
Runbook: etcd High Latency / Slow Operations¶
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | etcd_disk_wal_fsync_duration_seconds_bucket > 10ms or etcd backend commit latency high |
| Severity | P1 |
| Est. Resolution Time | 30-60 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
kubectl get pods -n kube-system -l component=etcd
time kubectl get nodes   # > 2s means the API server is already feeling the etcd latency
Step 1: Check etcd Metrics for Latency¶
Why: etcd exposes Prometheus metrics that show exactly which operation type is slow (WAL fsync, backend commit, apply). This directs you to the right fix without guessing.
# Port-forward to an etcd pod to access its metrics endpoint
kubectl port-forward -n kube-system <ETCD_POD_NAME> 2381:2381 &
# Check latency metrics (99th percentile should be < 10ms for WAL fsync)
curl -s http://localhost:2381/metrics | grep -E "etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_server_proposals"
# Check current leader and member list
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=table
# Kill port-forward when done
kill %1
+------------------+------------------+---------+---------+-----------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER |
+------------------+------------------+---------+---------+-----------+
| https://...:2379 | 8e9e05c52164694d | 3.5.9 | 45 MB | true |
+------------------+------------------+---------+---------+-----------+
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 2453 <-- writes completing in < 10ms (good)
etcd_disk_wal_fsync_duration_seconds_bucket{le="1"} 4821 <-- buckets are cumulative: ~2400 writes took between 10ms and 1s (bad)
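Because histogram buckets are cumulative, the ratio of the le="0.01" bucket to the le="+Inf" bucket is the share of fsyncs that met the 10ms target. A small awk filter can compute this from the metrics output; pipe the curl command from above into it (a sketch, not part of the official etcd tooling):

```shell
# Filter: what share of WAL fsyncs (since process start) completed in < 10ms?
# Usage: curl -s http://localhost:2381/metrics | <this snippet>
awk '
  index($0, "wal_fsync_duration_seconds_bucket") {
    if (index($0, "le=\"0.01\"")) fast  = $2   # cumulative count of fsyncs < 10ms
    if (index($0, "le=\"+Inf\"")) total = $2   # cumulative count of all fsyncs
  }
  END { if (total > 0) printf "%.1f%% of WAL fsyncs completed in < 10ms\n", 100 * fast / total }
'
```

On a healthy cluster this should report well above 99%; anything under ~95% points at the disk checks in Step 2.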
Step 2: Check Disk I/O on etcd Nodes¶
Why: etcd is extremely sensitive to disk latency. A disk that is too slow, shared with other workloads, or in a degraded state will push fsync latency past etcd's expectations, causing slow applies and even leader elections. etcd requires low-latency NVMe or SSD storage — spinning disks and network-attached storage (EBS gp2, NFS) are common culprits.
# SSH into the etcd node
ssh <SSH_USER>@<ETCD_NODE_IP>
# Check disk I/O wait (anything > 5% is concerning for etcd)
iostat -x 1 5
# Check which processes are using the disk
iotop -o -n 5
# Measure raw disk latency with a simple write test (1000 synchronous 4KB writes)
dd if=/dev/zero of=/var/lib/etcd/test-latency bs=4096 count=1000 oflag=dsync 2>&1
rm -f /var/lib/etcd/test-latency
# Check disk health (replace /dev/sda with the etcd data disk's device)
smartctl -a /dev/sda
If another process is saturating the disk (from iotop): Identify and stop it — etcd must have exclusive or priority access to its disk.
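A convenient property of the dd test above: with count=1000 writes, the elapsed seconds dd reports is numerically the average per-write latency in milliseconds (seconds x 1000 ms / 1000 writes). A sketch that does the conversion, assuming you captured dd's summary line to dd.out (a hypothetical filename, e.g. via `... 2>&1 | tee dd.out`):

```shell
# Convert GNU dd's summary line ("... copied, 2.5 s, 1.6 MB/s") into average
# per-write fsync latency. With count=1000, elapsed seconds == ms per write.
awk -F', ' '/copied/ {
  split($(NF-1), t, " ")   # t[1] = elapsed seconds
  printf "%.2f ms avg per synchronous write (etcd wants p99 < 10ms)\n", t[1]
}' dd.out
```

Anything consistently above a few milliseconds on this average strongly suggests the disk, not etcd, is the problem.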
Step 3: Check etcd Member Health¶
Why: An unhealthy or lagging etcd member causes the cluster to spend time waiting for replicas to catch up. A member that fell behind due to a restart or network partition will create persistent latency until it catches up or is removed.
# Check all member health
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --cluster --write-out=table
# Check for leader changes (frequent leader changes = instability)
kubectl logs -n kube-system <ETCD_POD_NAME> --since=30m | grep -E "leader|election|timeout"
+------------------+--------+------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+------------------+--------+------------+-------+
| https://...:2379 | true | 4.123ms | |
| https://...:2379 | true | 5.456ms | |
| https://...:2379 | true | 4.789ms | |
+------------------+--------+------------+-------+
If HEALTH is false or TOOK > 1s: that member is unhealthy. Check its logs with kubectl logs -n kube-system <ETCD_POD_NAME> --since=30m.
If the member is consistently behind: It may need to be removed and re-joined. This requires platform team involvement — escalate.
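During a noisy incident it is easy to miss a single `false` in the health table. A sketch that scans saved `endpoint health` table output (health.txt is a hypothetical filename, e.g. from redirecting the command above) and prints only the problems:

```shell
# Flag unhealthy or slow members from the etcdctl table output.
# Column layout assumed: | ENDPOINT | HEALTH | TOOK | ERROR |
awk -F'|' 'NF >= 6 {
  gsub(/ /, "", $2); gsub(/ /, "", $3); gsub(/ /, "", $4)
  if ($3 == "false")       print "UNHEALTHY:", $2
  else if ($4 ~ /[0-9]s$/) print "SLOW (took " $4 "):", $2   # seconds, not ms
}' health.txt
```

No output means all members passed within sub-second latency.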
Step 4: Check for Large Objects and Watch Streams¶
Why: etcd's performance degrades as the database grows — >2GB is a warning sign, and 8GB is the suggested maximum (and a common --quota-backend-bytes setting). Large objects (e.g., giant Secrets or ConfigMaps storing Helm release state) and excessive watch streams (common with many operators) both slow etcd.
# Check etcd DB size
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=table
# Check number of keys per prefix (find large key namespaces)
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get / --prefix --keys-only | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20
# Check for excessively large individual keys (Helm releases stored as secrets are common culprits)
kubectl get secrets -A | grep helm | wc -l
If Helm release history is bloating etcd: inspect it with helm history <RELEASE_NAME> -n <NAMESPACE>, helm uninstall stale releases, or configure --history-max on the Helm operator to cap retained revisions.
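The DB SIZE column from the endpoint status table can be checked mechanically rather than by eye. A sketch that flags sizes past the 2GB caution threshold, assuming the table output was saved to status.txt (a hypothetical filename):

```shell
# Warn when the DB SIZE column of the etcdctl status table reports >= 2 GB.
# Column layout assumed: | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER |
awk -F'|' 'NF >= 6 && $5 ~ /GB/ {
  size = $5 + 0   # numeric prefix of e.g. " 3.1 GB "
  if (size >= 2) printf "WARNING: DB size %.1f GB exceeds the 2GB caution threshold\n", size
}' status.txt
```

This pairs naturally with a cron-style check feeding the proactive alert suggested under Post-Incident.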
Step 5: Defragment etcd If Needed¶
Why: etcd does not return disk space to the OS automatically after deleting keys. Over time, the on-disk representation becomes fragmented, increasing latency. Defragmenting compacts the DB and reclaims space. CRITICAL: Defragment one member at a time. Defragging all members simultaneously will cause a quorum loss and an outage.
# Get the list of all etcd endpoints
kubectl get pods -n kube-system -l component=etcd -o wide
# Defragment ONE member at a time — start with a non-leader
# (Check which is the leader from Step 1 output and do that one last)
kubectl exec -n kube-system <ETCD_POD_NAME_NON_LEADER_1> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag
# Wait for defrag to complete (check DB size returned to normal) before doing next member
# Repeat for each non-leader, then finally the leader
# After each defrag, verify the member is still healthy
kubectl exec -n kube-system <ETCD_POD_NAME_NON_LEADER_1> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
Never pass the --cluster flag to defrag — that defragments all members simultaneously and can cause quorum loss.
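To decide whether a defrag is worth the risk (and to confirm it helped afterward), compare allocated vs logically-used space: `etcdctl endpoint status -w json` reports both dbSize and dbSizeInUse (etcd >= 3.4), and a large gap is reclaimable by defrag. A sketch, assuming the JSON was saved to status.json (hypothetical filename):

```shell
# Estimate how many bytes a defrag would reclaim from etcdctl's JSON status.
# dbSize = bytes allocated on disk; dbSizeInUse = bytes logically in use.
db=$(grep -o '"dbSize":[0-9]*' status.json | head -1 | cut -d: -f2)
inuse=$(grep -o '"dbSizeInUse":[0-9]*' status.json | head -1 | cut -d: -f2)
echo "allocated: ${db} bytes, in use: ${inuse} bytes, reclaimable: $((db - inuse)) bytes"
```

If reclaimable space is a small fraction of the total, defrag will not help much and the latency cause is likely elsewhere (Steps 2 and 6).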
Step 6: Check Network Between etcd Members¶
Why: etcd uses Raft consensus and requires low-latency communication between all members. High network latency (>5ms between members in the same datacenter) or packet loss causes election timeouts, leader churn, and perceived high latency in API operations.
# SSH into one etcd node and ping the others
ssh <SSH_USER>@<ETCD_NODE_1_IP>
ping -c 20 <ETCD_NODE_2_IP>
ping -c 20 <ETCD_NODE_3_IP>
# Check for packet loss
mtr --report --report-cycles 20 <ETCD_NODE_2_IP>
# Check network interface errors
ip -s link show <NETWORK_INTERFACE>
--- 10.0.1.5 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss
rtt min/avg/max/mdev = 0.234/0.312/0.456/0.045 ms
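The two numbers that matter from the ping output are packet loss (must be 0%) and average RTT (should be well under 5ms for same-DC peers). A sketch that extracts both from saved ping output (ping.txt is a hypothetical filename):

```shell
# Pull loss percentage and average RTT out of Linux ping's summary lines.
awk '
  /packet loss/ { for (i = 1; i <= NF; i++) if ($i ~ /%/) loss = $i }
  /^rtt/        { split($4, t, "/"); avg = t[2] }   # min/avg/max/mdev field
  END { print "loss: " loss ", avg rtt: " avg " ms" }' ping.txt
```

Non-zero loss or average RTT in the multi-millisecond range between members explains election timeouts and warrants a network escalation.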
Verification¶
# Confirm etcd latency has returned to normal
kubectl port-forward -n kube-system <ETCD_POD_NAME> 2381:2381 &
curl -s http://localhost:2381/metrics | grep "etcd_disk_wal_fsync_duration_seconds_bucket" | grep 'le="0.01"'
kill %1
# Confirm Kubernetes API responsiveness
time kubectl get nodes
time kubectl get pods -A
Success criteria: kubectl get nodes responds in < 2 seconds, and no new API server timeout errors appear in kubectl logs -n kube-system kube-apiserver-<NODE_NAME>.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | SRE on-call | "Kubernetes etcd high latency in <CLUSTER_NAME>, WAL fsync exceeding target, still investigating" |
| Data loss suspected | Platform Lead | "Data loss risk: etcd member unhealthy, possible data inconsistency between members" |
| Scope expanding beyond namespace | Platform team | "Cluster-wide impact: etcd latency causing API server errors for all namespaces, potential quorum risk" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2 (etcd incidents almost always are)
- Update this runbook if steps were wrong or incomplete
- Review disk type on etcd nodes — upgrade to NVMe SSD if on spinning disk or gp2 EBS
- Set up automated etcd defrag job to run weekly (before DB grows large)
- Review etcd DB size growth trend in Grafana and set a proactive alert at 2GB
- Prune Helm history and check for other large object contributors if DB was bloated
Common Mistakes¶
- Defragmenting all etcd members simultaneously: This is the most dangerous mistake in this runbook. Running etcdctl defrag --cluster, or running defrag on all pods at once, makes all members unavailable at the same time, losing quorum and bringing down the Kubernetes API server. Always defrag one member at a time, confirm it is healthy, then proceed to the next. Leave the current leader for last.
- Not checking disk type — etcd requires low-latency SSD: etcd's performance is entirely bound by disk write latency. Engineers often diagnose etcd latency as a "software problem" when the etcd node was simply provisioned on an AWS gp2 EBS volume (which has variable IOPS and high latency variance) or on a shared disk. The first question to ask is: what type of disk is the etcd node using? Any spinning disk or general-purpose cloud volume will cause intermittent latency spikes. Dedicated NVMe SSD with consistent sub-1ms latency is required for production etcd.
- Ignoring etcd DB size growth: Engineers treat etcd latency as a one-time incident to resolve and move on without checking why the DB grew large. If Helm history, events, or stale secrets are accumulating, the DB will fill up again within weeks. Always identify and address the growth cause after a defrag.
Cross-References¶
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: node-not-ready.md — if control plane nodes are unhealthy
- Related Runbook: deploy-stuck.md — if API slowness is preventing deployments from completing
- Related Runbook: pod-crashloop.md — if API latency is causing readiness probe timeouts and CrashLoops
Wiki Navigation¶
Related Content¶
- Interview: etcd Space Exceeded (Scenario, L3) — etcd
- Runbook: etcd Backup & Restore (Runbook, L2) — etcd
- Scenario: etcd Troubleshooting (Scenario, L3) — etcd
- Skillcheck: etcd (Assessment, L2) — etcd
- etcd (Topic Pack, L1) — etcd
- etcd Drills (Drill, L2) — etcd
- etcd Flashcards (CLI) (flashcard_deck, L1) — etcd