
Runbook: etcd High Latency / Slow Operations

Domain: Kubernetes
Alert: etcd WAL fsync p99 > 10ms (from etcd_disk_wal_fsync_duration_seconds_bucket) or etcd backend commit latency high
Severity: P1
Est. Resolution Time: 30-60 minutes
Escalation Timeout: 30 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: kubectl access, cluster-admin or namespace-admin, kubeconfig configured

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get pods -n kube-system -l component=etcd
If output shows all etcd pods Running: etcd is up but slow — continue with the steps below.
If output shows one or more etcd pods not Running: this is an etcd availability issue, not just latency — escalate immediately. etcd quorum loss can cause cluster-wide API failures.
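The scope check can be scripted for paging automation; a minimal sketch, assuming kubectl's default column layout (STATUS is the third field when `--no-headers` is used):

```shell
# Count etcd pods that are NOT in the Running state.
# Pipe in: kubectl get pods -n kube-system -l component=etcd --no-headers
count_not_running() {
  awk '$3 != "Running" { bad++ } END { print bad + 0 }'
}

# Example usage against a live cluster:
# kubectl get pods -n kube-system -l component=etcd --no-headers | count_not_running
```

A result of 0 means "latency path" (continue below); anything above 0 means "availability path" (escalate).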

Step 1: Check etcd Metrics for Latency

Why: etcd exposes Prometheus metrics that show exactly which operation type is slow (WAL fsync, backend commit, apply). This directs you to the right fix without guessing.

# Port-forward to an etcd pod to access its metrics endpoint
kubectl port-forward -n kube-system <ETCD_POD_NAME> 2381:2381 &

# Check latency metrics (99th percentile should be < 10ms for WAL fsync)
curl -s http://localhost:2381/metrics | grep -E "etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_server_proposals"

# Check current leader and member list
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# Kill port-forward when done
kill %1
Expected output (healthy etcd):
+------------------+------------------+---------+---------+-----------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER |
+------------------+------------------+---------+---------+-----------+
| https://...:2379 | 8e9e05c52164694d | 3.5.9   | 45 MB   | true      |
+------------------+------------------+---------+---------+-----------+
Expected output (high latency metrics — values in seconds):
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 2453   <-- 2453 writes completed in < 10ms
etcd_disk_wal_fsync_duration_seconds_bucket{le="1"}    4821   <-- buckets are cumulative: 4821 - 2453 = 2368 writes took between 10ms and 1s (bad)
If WAL fsync latency > 10ms at the 99th percentile: the disk on the etcd nodes is too slow — continue to Step 2.
If backend commit latency is high but WAL fsync is fine: etcd is compacting or has a large DB — continue to Step 4.
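Because the bucket counters are cumulative, an upper bound on the p99 is simply the smallest bucket whose count covers 99% of the `+Inf` total. A sketch that computes this directly from the raw /metrics text, so you do not need a Prometheus server on hand:

```shell
# Estimate an upper bound on the WAL fsync p99 from raw /metrics output on stdin.
# Histogram buckets are cumulative; the +Inf bucket holds the total sample count.
wal_fsync_p99_bound() {
  grep '^etcd_disk_wal_fsync_duration_seconds_bucket' \
    | sed 's/^.*le="\([^"]*\)"}[[:space:]]*/\1 /' \
    | awk '
        $1 == "+Inf" { total = $2; next }
        { n++; le[n] = $1; cnt[n] = $2 }
        END {
          for (i = 1; i <= n; i++)
            if (cnt[i] >= 0.99 * total) { print "p99 <= " le[i] "s"; exit }
          print "p99 > " le[n] "s"
        }'
}

# Example usage with the port-forward from above:
# curl -s http://localhost:2381/metrics | wal_fsync_p99_bound
```

If the printed bound is at or below 0.01s, WAL fsync is healthy; a bound of 0.1s or higher confirms the disk path in Step 2.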

Step 2: Check Disk I/O on etcd Nodes

Why: etcd is extremely sensitive to disk latency. A disk that is too slow, shared with other workloads, or degraded will make WAL fsyncs outlast etcd's heartbeat interval, causing slow-fsync warnings, missed heartbeats, and leader elections. etcd requires low-latency NVMe or SSD — spinning disks and network-attached storage (EBS gp2, NFS) are common culprits.

# SSH into the etcd node
ssh <SSH_USER>@<ETCD_NODE_IP>

# Check disk I/O wait (anything > 5% is concerning for etcd)
iostat -x 1 5

# Check which processes are using the disk
iotop -o -n 5

# Measure raw disk latency with a simple write test
dd if=/dev/zero of=/var/lib/etcd/test-latency bs=4096 count=1000 oflag=dsync 2>&1
rm -f /var/lib/etcd/test-latency
Expected output (good disk — from dd test):
1000+0 records in
1000+0 records out
4096000 bytes (4.1 MB, 3.9 MiB) copied, 0.52 s, 7.9 MB/s
If throughput is < 5 MB/s or I/O wait is consistently high: the disk is too slow for etcd.
  - Cloud: upgrade the EBS volume from gp2 to gp3 (or io1 for dedicated IOPS) — this usually requires platform team action.
  - On-prem: the disk may be failing — check SMART data with smartctl -a /dev/sda.
If another process is saturating the disk (from iotop): identify and stop it — etcd must have exclusive or priority access to its disk.
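The dd numbers translate directly into per-write latency, which is the figure etcd actually cares about: each dsync write is roughly one fsync. A small helper (a sketch; the function name is ours) to do the arithmetic:

```shell
# Convert a dd dsync result into average per-write sync latency.
# etcd wants the p99 under 10 ms, so an *average* anywhere near 10 ms
# already means the disk is unsuitable.
avg_fsync_ms() {  # avg_fsync_ms <elapsed_seconds> <write_count>
  awk -v t="$1" -v n="$2" 'BEGIN { printf "%.2f ms/write\n", (t / n) * 1000 }'
}

# From the example output above (1000 writes in 0.52 s):
# avg_fsync_ms 0.52 1000
```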

Step 3: Check etcd Member Health

Why: An unhealthy or lagging etcd member causes the cluster to spend time waiting for replicas to catch up. A member that fell behind due to a restart or network partition will create persistent latency until it catches up or is removed.

# Check all member health
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster --write-out=table

# Check for leader changes (frequent leader changes = instability)
kubectl logs -n kube-system <ETCD_POD_NAME> --since=30m | grep -E "leader|election|timeout"
Expected output (all members healthy):
+------------------+--------+------------+-------+
|     ENDPOINT     | HEALTH |    TOOK    | ERROR |
+------------------+--------+------------+-------+
| https://...:2379 | true   | 4.123ms    |       |
| https://...:2379 | true   | 5.456ms    |       |
| https://...:2379 | true   | 4.789ms    |       |
+------------------+--------+------------+-------+
If one member shows false or TOOK > 1s: That member is unhealthy. Check its logs:
kubectl logs -n kube-system <UNHEALTHY_ETCD_POD_NAME> --tail=50
If the member is consistently behind: It may need to be removed and re-joined. This requires platform team involvement — escalate.
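Leader churn is easier to judge as a number than by eyeballing log output. A sketch that counts election events in a log stream — the exact message text varies between etcd versions, so the "elected leader" substring is an assumption you may need to adjust:

```shell
# Count leader-election events in an etcd log stream on stdin.
# More than 1-2 elections per hour suggests disk or network instability.
count_elections() {
  grep -c 'elected leader'
}

# Example usage:
# kubectl logs -n kube-system <ETCD_POD_NAME> --since=30m | count_elections
```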

Step 4: Check for Large Objects and Watch Streams

Why: etcd's performance degrades as the database grows (>2GB is a warning sign; etcd's default backend quota is 2GB unless raised via --quota-backend-bytes, and 8GB is the recommended maximum). Large objects (e.g., giant ConfigMaps or Secrets storing Helm release state) and excessive watch streams (common with many operators) both slow etcd.

# Check etcd DB size
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# Check number of keys per prefix (find large key namespaces)
kubectl exec -n kube-system <ETCD_POD_NAME> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get / --prefix --keys-only | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20

# Check for excessively large individual keys (Helm releases stored as Secrets are common culprits;
# Helm v3 names them sh.helm.release.v1.<RELEASE>.v<REVISION>)
kubectl get secrets -A | grep 'sh\.helm\.release' | wc -l
Expected output (DB size):
DB SIZE: 245 MB   <-- healthy
DB SIZE: 5.8 GB   <-- large, defrag likely needed
If DB size > 2GB: continue to Step 5 to defragment.
If the Helm secret count is > 500: old Helm release history is bloating etcd. List revisions with helm history <RELEASE_NAME> -n <NAMESPACE>, uninstall releases that are no longer needed, and cap future history with --history-max on helm upgrade (or on the Helm operator).
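To apply the 2GB threshold without mental unit conversion, the DB SIZE string printed by etcdctl can be checked mechanically; a sketch (the function name is ours, and it only handles the MB/GB units etcdctl typically prints):

```shell
# Flag the DB SIZE value from 'etcdctl endpoint status' against the
# 2 GB defrag-warning threshold.
db_size_check() {  # db_size_check "<number> <unit>", e.g. db_size_check "5.8 GB"
  awk -v s="$1" 'BEGIN {
    split(s, a, " ")
    mult = (a[2] == "GB") ? 1e9 : (a[2] == "MB") ? 1e6 : 1e3
    print (a[1] * mult > 2e9) ? "defrag recommended" : "size ok"
  }'
}
```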

Step 5: Defragment etcd If Needed

Why: etcd does not return disk space to the OS automatically after deleting keys. Over time, the on-disk representation becomes fragmented, increasing latency. Defragmenting compacts the DB and reclaims space. CRITICAL: Defragment one member at a time. Defragging all members simultaneously will cause a quorum loss and an outage.

# Get the list of all etcd endpoints
kubectl get pods -n kube-system -l component=etcd -o wide

# Defragment ONE member at a time — start with a non-leader
# (Check which is the leader from Step 1 output and do that one last)
kubectl exec -n kube-system <ETCD_POD_NAME_NON_LEADER_1> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag

# Wait for defrag to complete (check DB size returned to normal) before doing next member
# Repeat for each non-leader, then finally the leader

# After each defrag, verify the member is still healthy
kubectl exec -n kube-system <ETCD_POD_NAME_NON_LEADER_1> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
Expected output (defrag complete):
Finished defragmenting etcd member[https://127.0.0.1:2379]
If defrag fails with a timeout: the member is too busy or the disk is too slow. Wait 2 minutes and retry once; if it fails twice, escalate. Do NOT run defrag with the --cluster flag — that defragments all members simultaneously and can cause quorum loss.
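The one-member-at-a-time invariant is the easiest thing to get wrong under pressure, so the sequencing is worth sketching as a script. Here $ETCDCTL_CMD is a stand-in for the full kubectl exec invocation shown above, injected so the control flow is the focus:

```shell
# Defrag members strictly one at a time, gating each on a health check.
# $ETCDCTL_CMD <pod> <defrag|health> is a placeholder for the real
# 'kubectl exec -n kube-system <pod> -- etcdctl ...' command in this step.
defrag_one_at_a_time() {  # args: pod names, non-leaders first, leader LAST
  for pod in "$@"; do
    "$ETCDCTL_CMD" "$pod" defrag || { echo "ABORT: defrag failed on $pod"; return 1; }
    "$ETCDCTL_CMD" "$pod" health || { echo "ABORT: $pod unhealthy after defrag"; return 1; }
    echo "defragged $pod"
  done
}
```

The abort-on-first-failure behavior is the point: if any member fails defrag or its health check, stop — never proceed to the next member with one already degraded.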

Step 6: Check Network Between etcd Members

Why: etcd uses Raft consensus and requires low-latency communication between all members. High network latency (>5ms between members in the same datacenter) or packet loss causes election timeouts, leader churn, and perceived high latency in API operations.

# SSH into one etcd node and ping the others
ssh <SSH_USER>@<ETCD_NODE_1_IP>
ping -c 20 <ETCD_NODE_2_IP>
ping -c 20 <ETCD_NODE_3_IP>

# Check for packet loss
mtr --report --report-cycles 20 <ETCD_NODE_2_IP>

# Check network interface errors
ip -s link show <NETWORK_INTERFACE>
Expected output (healthy network):
--- 10.0.1.5 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss
rtt min/avg/max/mdev = 0.234/0.312/0.456/0.045 ms
If average latency > 2ms between etcd nodes in the same AZ: there may be a network issue — check security groups, network ACLs, and whether the etcd nodes are in the same placement group.
If packet loss > 0%: network instability is causing etcd leader elections and high latency — escalate to the network/platform team immediately.
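For scripted checks, the loss percentage can be pulled out of ping's summary line; a sketch, assuming the Linux iputils ping output format shown above:

```shell
# Extract the packet-loss percentage from ping's summary line on stdin.
ping_loss_pct() {
  grep -o '[0-9.]*% packet loss' | cut -d'%' -f1
}

# Example usage:
# ping -c 20 <ETCD_NODE_2_IP> | ping_loss_pct
```

Any value above 0 is grounds for immediate escalation per the rule above.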

Verification

# Confirm etcd latency has returned to normal
kubectl port-forward -n kube-system <ETCD_POD_NAME> 2381:2381 &
curl -s http://localhost:2381/metrics | grep "etcd_disk_wal_fsync_duration_seconds_bucket" | grep 'le="0.01"'
kill %1

# Confirm Kubernetes API responsiveness
time kubectl get nodes
time kubectl get pods -A
Success looks like: WAL fsync p99 < 10ms, kubectl get nodes responds in < 2 seconds, and no new API server timeout errors in kubectl logs -n kube-system kube-apiserver-<NODE_NAME>.
If still broken: escalate — see below.
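The "< 2 seconds" criterion can be enforced mechanically rather than by reading `time` output; a sketch of a wall-clock gate (second-granularity, via `date +%s`):

```shell
# Succeed only if a command finishes within the given number of seconds.
check_fast() {  # check_fast <max_seconds> <command...>
  max="$1"; shift
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  elapsed=$(( $(date +%s) - start ))
  [ "$elapsed" -le "$max" ]
}

# Example usage:
# check_fast 2 kubectl get nodes && echo "API responsive" || echo "API still slow"
```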

Escalation

Condition | Who to Page | What to Say
Not resolved in 30 min | SRE on-call | "Kubernetes etcd high latency in <CLUSTER_NAME>, WAL fsync p99 at <VALUE>ms, API degraded, runbook exhausted"
Data loss suspected | Platform Lead | "Data loss risk: etcd member unhealthy, possible data inconsistency between members"
Scope expanding beyond namespace | Platform team | "Cluster-wide impact: etcd latency causing API server errors for all namespaces, potential quorum risk"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2 (etcd incidents almost always are)
  • Update this runbook if steps were wrong or incomplete
  • Review disk type on etcd nodes — upgrade to NVMe SSD if on spinning disk or gp2 EBS
  • Set up automated etcd defrag job to run weekly (before DB grows large)
  • Review etcd DB size growth trend in Grafana and set a proactive alert at 2GB
  • Prune Helm history and check for other large object contributors if DB was bloated

Common Mistakes

  1. Defragmenting all etcd members simultaneously: This is the most dangerous mistake in this runbook. Running etcdctl defrag --cluster or running defrag on all pods at once causes all members to be unavailable simultaneously, losing quorum, and bringing down the Kubernetes API server. Always defrag one member at a time, confirm it is healthy, then proceed to the next. Leave the current leader for last.
  2. Not checking disk type — etcd requires low-latency SSD: etcd's performance is entirely bound by disk write latency. Engineers often diagnose etcd latency as a "software problem" when it is simply that the etcd node was provisioned on an AWS gp2 EBS volume (which has variable IOPS and high latency variance) or on a shared disk. The first question to ask is: what type of disk is the etcd node using? Any spinning disk or general-purpose cloud volume will cause intermittent latency spikes. Dedicated NVMe SSD with consistent sub-1ms latency is required for production etcd.
  3. Ignoring etcd DB size growth: Engineers treat etcd latency as a one-time incident to resolve and move on without checking why the DB grew large. If Helm history, events, or stale secrets are accumulating, the DB will fill up again within weeks. Always identify and address the growth cause after a defrag.
