etcd Footguns

Mistakes that cause cluster-wide outages, data loss, or silent degradation of Kubernetes control plane operations.


1. No backups

You run a 3-node etcd cluster with no snapshot schedule. All three nodes lose their disks in a datacenter event. All Kubernetes cluster state is gone. Every deployment, service, secret, and RBAC rule must be recreated from scratch.

What happens: Total, unrecoverable cluster state loss.

Why: etcd is the single source of truth for all Kubernetes objects. No backup means no recovery.

How to avoid: Snapshot hourly, store off-cluster (S3, GCS, NFS), retain 7 days minimum. Test restores regularly. An untested backup is not a backup.
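A minimal backup script along these lines can run from cron on a control plane node. The endpoint, certificate paths (kubeadm defaults), and bucket name are placeholders; `etcdutl` ships with etcd 3.5+, while older versions use `etcdctl snapshot status` instead.

```sh
#!/bin/sh
# Hourly etcd snapshot, verified and shipped off-cluster.
# Endpoint, cert paths, and bucket name are placeholders for your environment.
set -eu

SNAP="/var/backups/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db"

ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable before trusting it (an untested backup is not a backup).
etcdutl snapshot status "$SNAP"

# Ship it off-cluster; the bucket name is hypothetical.
aws s3 cp "$SNAP" "s3://my-etcd-backups/$(basename "$SNAP")"

# Retain 7 days locally.
find /var/backups/etcd -name 'snapshot-*.db' -mtime +7 -delete
```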


2. Running etcd on spinning disks

Under load, WAL fsync latency spikes above 100ms. The leader cannot replicate fast enough. Followers time out, trigger a new election. The new leader also struggles. The cluster enters an election loop. The API server returns errors.

What happens: Repeated leader elections, API server latency spikes, and potential quorum loss.

Why: etcd requires fast synchronous writes. Spinning disks cannot sustain the fsync latency etcd needs under write load.

How to avoid: Use SSDs (NVMe preferred). Dedicate the disk to etcd only. Monitor the p99 of etcd_disk_wal_fsync_duration_seconds and alert above 10ms.

Debug clue: The etcd metric etcd_disk_wal_fsync_duration_seconds is the single best health indicator. If p99 exceeds 10ms, the cluster is at risk of leader instability. On AWS, use io1/io2 EBS volumes with provisioned IOPS, not gp3 — the baseline IOPS on gp3 can be insufficient under write bursts.
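Before putting etcd on a disk, you can benchmark its fsync latency with fio, which mimics etcd's WAL write pattern (small sequential writes, each followed by fdatasync). The directory and sizing values below follow the commonly published etcd disk test; treat them as a sketch, not a certification.

```sh
# Benchmark fdatasync latency on the candidate etcd data disk.
# --fdatasync=1 syncs after every write, matching etcd's WAL behavior.
fio --name=etcd-wal-test --directory=/var/lib/etcd-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
# In the output, check the fsync/fdatasync latency percentiles:
# the 99th percentile should be comfortably under 10ms.
```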


3. Even number of cluster members

You run a 4-member etcd cluster thinking "more is better." Quorum requires 3 out of 4 (same as a 5-member cluster). You tolerate only 1 failure, same as a 3-member cluster, but with more network overhead.

What happens: No additional fault tolerance compared to 3 members, but more latency and complexity.

Why: Quorum is floor(n/2) + 1. For 4 members, quorum is 3. For 3 members, quorum is 2. Both tolerate 1 failure.

How to avoid: Always run an odd number of members: 3 for most clusters, 5 for large or critical clusters. Never run 1 in production, and never an even number.
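The arithmetic is easy to check directly. This shell snippet computes quorum and tolerated failures for cluster sizes 1 through 5, showing that 4 members tolerate no more failures than 3:

```sh
# Quorum for an n-member Raft cluster is floor(n/2) + 1.
# Fault tolerance is n minus quorum.
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

Running it prints quorum=2, tolerated=1 for both... no: 3 members give quorum 2 with 1 tolerated failure, while 4 members give quorum 3 with still only 1 tolerated failure; only at 5 members does tolerance rise to 2.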


4. Forgetting to compact and defrag

The database grows over months because nobody runs compaction. etcd keeps all key revisions. Eventually, it hits the backend quota (2GB by default; 8GB is the suggested maximum) and raises the NOSPACE alarm. All writes are rejected. The API server returns 500s.

What happens: etcd stops accepting writes. No new pods, no scaling, no config changes.

Why: Without compaction, old revisions accumulate. Without defrag, freed space is not reclaimed on disk.

How to avoid: Kubernetes runs auto-compaction every 5 minutes by default. Monitor database size. If it grows steadily, check compaction config. Run defrag periodically on one member at a time.
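An inspection and defrag pass looks roughly like this; the member endpoint is a placeholder, and the usual --cacert/--cert/--key flags are omitted for brevity:

```sh
# Check database size per member and look for active alarms.
etcdctl endpoint status --write-out=table    # DB SIZE column shows on-disk size
etcdctl alarm list                           # NOSPACE here means the quota is hit

# Defrag ONE member at a time; defrag blocks that member while it runs.
etcdctl defrag --endpoints=https://10.0.0.1:2379

# Only after space is back under the quota, clear the alarm.
etcdctl alarm disarm
```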


5. Removing two etcd members simultaneously

You are replacing nodes and remove two members from a 3-node cluster before adding replacements. Quorum is lost. The remaining member cannot accept writes. The cluster is stuck.

What happens: Quorum loss. The API server becomes read-only or unresponsive.

Why: A 3-member cluster needs 2 members for quorum. Removing 2 leaves 1, below quorum.

How to avoid: Always add the new member before removing the old one. Replace one member at a time. Verify cluster health between each replacement.
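The safe sequence can be sketched with etcdctl; the member name, IP, and the old member ID are placeholders taken from your own `etcdctl member list` output:

```sh
# 1. Register the replacement member first.
etcdctl member add etcd-4 --peer-urls=https://10.0.0.4:2380

# 2. Start etcd on the new node with --initial-cluster-state=existing,
#    then confirm every member is healthy before touching the old one.
etcdctl endpoint health --cluster

# 3. Only now remove the member being replaced.
etcdctl member remove <old-member-id>    # ID from `etcdctl member list`
```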

War story: A team replaced two out of three etcd nodes simultaneously during a "quick" cluster upgrade. Quorum was lost, the API server went read-only, and all Deployments froze. Recovery required --force-new-cluster on the surviving member, which lost uncommitted writes including a recent RBAC change that had been applied but not yet replicated.


6. Certificate expiry

Your etcd TLS certificates expire silently. Members cannot communicate. The cluster partitions. The API server cannot reach etcd. Everything stops.

What happens: Complete cluster failure with cryptic TLS handshake errors in logs.

Why: etcd mutual TLS requires valid certificates. Expired certs are rejected immediately.

How to avoid: Monitor certificate expiry dates. Set alerts 30 days before expiration. On kubeadm clusters, run kubeadm certs renew all and restart the control plane components (the static pods for kube-apiserver, kube-controller-manager, kube-scheduler, and etcd). Automate rotation.
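A simple expiry check can be scripted with openssl; the certificate path below is the kubeadm default and is an assumption for other distributions:

```sh
# Warn if the etcd server cert expires within 30 days (2592000 seconds).
CERT=/etc/kubernetes/pki/etcd/server.crt
openssl x509 -in "$CERT" -noout -enddate
if ! openssl x509 -in "$CERT" -noout -checkend 2592000; then
  echo "WARNING: $CERT expires within 30 days" >&2
fi
```

Run the same check against the peer and client certificates too; any one of them expiring is enough to partition the cluster.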


7. Using --force-new-cluster without understanding it

You lose quorum and panic. You start a surviving member with --force-new-cluster. It works, but you now have a single-member cluster with potentially inconsistent data. You add new members, but some data from the other members' last uncommitted writes is lost.

What happens: Data loss of uncommitted writes. Single point of failure until new members are added.

Why: --force-new-cluster resets the cluster to a single member, discarding any writes that were not committed to this specific member's log.

How to avoid: Prefer restoring from a snapshot. Only use --force-new-cluster as an absolute last resort when no backup exists. Document what was lost.


8. Sharing etcd's disk with other workloads

etcd shares a disk with the OS, kubelet logs, and container image storage. A log rotation failure fills the disk. etcd cannot write its WAL. The cluster goes down.

What happens: Disk full kills etcd. API server becomes unavailable.

Why: etcd needs reliable, predictable disk I/O. Competing workloads cause latency spikes and can exhaust disk space.

How to avoid: Dedicate a disk or partition to etcd data. Mount etcd's data directory on a separate volume. Monitor disk utilization with alerts at 80%.
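One way to set this up, assuming a spare NVMe device (the device name is a placeholder):

```sh
# Put etcd's data directory on its own volume so other workloads
# cannot fill or slow its disk.
mkfs.ext4 /dev/nvme1n1
echo '/dev/nvme1n1 /var/lib/etcd ext4 defaults,noatime 0 2' >> /etc/fstab
mount /var/lib/etcd

df -h /var/lib/etcd    # wire this into an alert that fires at 80% utilization
```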


9. Restoring a snapshot on only one member

You restore a snapshot on one member but forget to restore on the other two. The restored member has different data than the others. The cluster cannot reach consensus.

What happens: Split state. Members disagree on data. The cluster may fail to elect a leader.

Why: Snapshot restore creates a new cluster identity. All members must restore from the same snapshot and join as a new cluster.

How to avoid: Always restore on every member. Stop all members first. Each member gets the same snapshot but with its own --name and --initial-advertise-peer-urls.
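The per-member restore, sketched with etcdutl (etcd 3.5+; older versions use `etcdctl snapshot restore`). Names, IPs, and paths are placeholders; only --name and --initial-advertise-peer-urls change from member to member:

```sh
# Run on member etcd-1; repeat on etcd-2 and etcd-3 with their own
# --name and --initial-advertise-peer-urls, using the SAME snapshot file.
etcdutl snapshot restore snapshot.db \
  --name etcd-1 \
  --data-dir /var/lib/etcd \
  --initial-cluster etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380 \
  --initial-cluster-token restored-cluster \
  --initial-advertise-peer-urls https://10.0.0.1:2380
```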


10. Running etcd members across availability zones with default timeouts

Your 3 etcd members span 3 AZs with 5-15ms cross-AZ latency. The default heartbeat interval is 100ms and election timeout is 1000ms. Under load, network jitter causes spurious leader elections.

What happens: Frequent leader elections degrade write performance and cause intermittent API server errors.

Why: Default timeouts assume local network latency (<1ms). Cross-AZ latency is 10-50x higher.

How to avoid: Increase --heartbeat-interval (e.g., to 500ms) and --election-timeout (e.g., to 5000ms) for cross-AZ deployments, keeping in mind that a longer election timeout also delays failover after a genuine leader crash. Better yet, keep all etcd members in the same AZ.
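As a configuration sketch, the tuned flags from this section look like this (values are in milliseconds; etcd's tuning guidance is to set the heartbeat near the round-trip time between members and the election timeout to roughly 10x the heartbeat):

```sh
# Cross-AZ tuning example; other flags (name, peer URLs, certs) omitted.
etcd --heartbeat-interval=500 \
     --election-timeout=5000 \
     ...
```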