etcd - Primer

Why This Matters

etcd is the brain of every Kubernetes cluster. Every object you create — every Pod, Service, ConfigMap, Secret, Deployment, RBAC rule — is persisted in etcd. When etcd is healthy, the cluster works. When etcd is slow, the API server is slow. When etcd is down, the cluster is brain-dead: no scheduling, no scaling, no self-healing. Understanding etcd is not optional for anyone who operates Kubernetes in production.

What etcd Stores

etcd is a distributed key-value store that holds all Kubernetes cluster state. The API server is the only component that talks to etcd directly.

Fun fact: etcd was created by CoreOS in 2013. The name is a play on the Unix /etc directory (where configuration lives) + "d" for distributed. It uses the Raft consensus algorithm, published by Diego Ongaro and John Ousterhout at Stanford in 2014, which was explicitly designed to be more understandable than Paxos (the previous gold standard for consensus).

Stored data includes:

  • All Kubernetes resource definitions (Pods, Deployments, Services, etc.)
  • Cluster configuration (RBAC policies, admission webhooks, resource quotas)
  • Secrets and ConfigMaps (encrypted at rest if configured)
  • Lease objects (leader election, node heartbeats)
  • Service account tokens
  • Custom Resource Definitions and their instances

etcd does NOT store: container images, logs, metrics, or persistent volume data. It stores the metadata that describes where these things live.

Keys follow the pattern /registry/<resource-type>/<namespace>/<name>. For example: /registry/pods/default/nginx-7d9fc.
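On a kubeadm control plane node you can browse these keys directly. Values are stored as binary protobuf rather than JSON, so listing keys is the practical view (certificate paths are the standard kubeadm locations; the namespace is illustrative):

```shell
# List all pod keys in the default namespace. --keys-only avoids dumping
# the raw protobuf values to the terminal.
etcdctl get /registry/pods/default --prefix --keys-only \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```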

Raft Consensus

etcd uses the Raft consensus algorithm to replicate data across cluster members. This guarantees strong consistency — a read always returns the most recent write acknowledged by the cluster.

Key Raft concepts:

  • Leader election: one member is the leader at any time. All writes go through the leader.
  • Log replication: the leader replicates write-ahead log entries to followers.
  • Quorum: a majority of members must acknowledge a write for it to commit. For a 3-member cluster, quorum is 2. For 5 members, quorum is 3.
  • Term: a monotonically increasing number that identifies a leader's reign. A new election increments the term.

Cluster Size   Quorum   Tolerated Failures
1              1        0
3              2        1
5              3        2
7              4        3

Always run an odd number of etcd members. Even numbers (e.g., 4) require the same quorum as the next odd number (3) but tolerate fewer failures — all cost, no benefit.

Remember: Quorum formula: floor(N/2) + 1. For 3 members: quorum = 2. For 5 members: quorum = 3. Mnemonic: "majority rules": you always need more than half to agree. A 4-member cluster needs 3 for quorum (the same as a 5-member cluster) but tolerates only 1 failure (vs 2 for a 5-member cluster). This is why even numbers are strictly worse.
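The quorum table follows directly from the formula; a quick shell check:

```shell
# Quorum = floor(N/2) + 1; tolerated failures = N - quorum.
# Note that 4 members buy no more fault tolerance than 3.
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
```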

Gotcha: Running a single etcd member in production means zero fault tolerance. Any disk failure, kernel panic, or OOM kill loses your entire cluster state. A 3-member cluster is the minimum for production — 5 members if you need to tolerate 2 simultaneous failures (e.g., during rolling upgrades across availability zones).

Under the hood: Raft was explicitly designed as a more understandable alternative to Paxos. The Raft paper by Diego Ongaro (Stanford, 2014) includes the memorable subtitle "In Search of an Understandable Consensus Algorithm." In user studies, Raft was significantly easier for students to learn than Paxos. This understandability is a practical advantage: when your etcd cluster is in a degraded state at 3 AM, you need to reason about quorum and leader election — and Raft makes that reasoning tractable.

etcdctl Essentials

etcdctl is the CLI tool for interacting with etcd. Set the API version explicitly: etcd 3.4+ defaults to v3, but older builds default to the incompatible v2 API:

export ETCDCTL_API=3

Basic Operations

etcdctl put /app/config/db_host "postgres.internal"   # write a key
etcdctl get /app/config/db_host                        # read a key
etcdctl get /app/config/ --prefix                      # list keys by prefix
etcdctl watch /app/config/ --prefix                    # watch for changes
etcdctl del /app/config/db_host                        # delete a key
etcdctl member list --write-out=table                  # list cluster members

TLS Authentication

Production clusters require TLS client certificates. Standard kubeadm paths:

etcdctl --endpoints=https://etcd-0:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list

For managed Kubernetes (EKS, GKE, AKS), etcd is managed by the provider and not directly accessible.
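Typing the TLS flags on every invocation gets tedious. etcdctl also reads each flag from an ETCDCTL_-prefixed environment variable, so you can export them once per shell session (endpoint and paths match the kubeadm example above):

```shell
# Each etcdctl flag maps to an uppercased ETCDCTL_* variable.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://etcd-0:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

etcdctl member list --write-out=table   # no TLS flags needed now
```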

Backup and Restore

etcd backups are the single most critical backup in a Kubernetes cluster. Without them, a total etcd failure means rebuilding the entire cluster from scratch.

Snapshot Save

etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table  # verify

Snapshot Restore

Restoring is destructive — it creates a new data directory. Stop the API server and etcd on all control plane nodes first, then restore on each member:

etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

Critical: restore must be done on every member, each with its own --name and --initial-advertise-peer-urls. Then point etcd config to the new data directory and restart.
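A sketch of the remaining members' restores, using the IPs from the example above. The snapshot file and --initial-cluster are identical everywhere; only --name and --initial-advertise-peer-urls change:

```shell
# Same on every node: snapshot file and full initial-cluster list.
CLUSTER=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380

# On node 10.0.1.11 (etcd-1):
etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-1 \
  --initial-cluster=$CLUSTER \
  --initial-advertise-peer-urls=https://10.0.1.11:2380

# On node 10.0.1.12 (etcd-2): same command with --name=etcd-2 and
# --initial-advertise-peer-urls=https://10.0.1.12:2380.
```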

Backup Schedule

Minimum: hourly snapshots with 7-day retention. Store off-cluster (S3, GCS, NFS). Test restores regularly — an untested backup is not a backup.
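A minimal hourly-backup sketch tying these rules together. The staging directory, S3 bucket name, and script path are assumptions to adapt:

```shell
#!/usr/bin/env bash
# etcd-backup.sh -- run hourly from cron: 0 * * * * /usr/local/bin/etcd-backup.sh
set -euo pipefail

BACKUP_DIR=/backup/etcd    # local staging directory (assumption)
SNAP="$BACKUP_DIR/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
mkdir -p "$BACKUP_DIR"

etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify before shipping; set -e aborts the script if the snapshot is bad.
etcdctl snapshot status "$SNAP" --write-out=table

# Ship off-cluster (bucket name is an assumption), then prune local
# copies older than the 7-day retention window.
aws s3 cp "$SNAP" "s3://my-etcd-backups/$(basename "$SNAP")"
find "$BACKUP_DIR" -name 'etcd-snapshot-*.db' -mtime +7 -delete
```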

Cluster Health

etcdctl endpoint health --cluster                      # quick health check
etcdctl endpoint status --write-out=table --cluster    # detailed status per member

Key Health Indicators

Indicator                   Healthy   Warning
Round-trip latency          < 10ms    > 50ms
Leader changes (per hour)   0-1       > 3
Proposal failures           0         Any
Database size               < 4GB     > 6GB (default limit 8GB)
WAL fsync duration          < 10ms    > 50ms

Compaction and Defragmentation

etcd keeps a history of all key revisions. Without compaction, the database grows indefinitely.

Compaction

rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
etcdctl compact $rev    # removes old revisions up to current

The API server requests automatic compaction every 5 minutes by default (kube-apiserver flag --etcd-compaction-interval). Manual compaction is needed only for troubleshooting or recovery.

Defragmentation

Compaction marks space as free but does not reclaim it on disk. Defragmentation reclaims the space. It blocks the member during execution — run on one member at a time.

etcdctl defrag --endpoints=https://etcd-0:2379

Authentication

etcd supports built-in RBAC (etcdctl user add, etcdctl role grant-permission), but in Kubernetes clusters, authentication is handled via TLS client certificates. The API server presents its client cert to etcd, validated against the etcd CA. Built-in RBAC is more relevant for standalone etcd deployments.

War story: A common etcd disaster: an operator runs etcdctl compact and etcdctl defrag on all members simultaneously. Defrag blocks the member for the duration — on a 4 GB database, this can take 30+ seconds. If all members are blocked at once, the cluster has no quorum and the API server returns errors. Always defrag one member at a time, waiting for it to rejoin before moving to the next.
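The safe procedure from the war story can be sketched as a loop; endpoint names are illustrative:

```shell
# Defragment one member at a time, and wait for it to report healthy
# before touching the next one, so quorum is never at risk.
for ep in https://etcd-0:2379 https://etcd-1:2379 https://etcd-2:2379; do
  echo "defragmenting $ep ..."
  etcdctl defrag --endpoints="$ep"
  # endpoint health exits nonzero while the member is still unhealthy.
  until etcdctl endpoint health --endpoints="$ep"; do sleep 2; done
done
```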

Performance Tuning

Disk

etcd is extremely sensitive to disk latency. The write-ahead log (WAL) requires fast, synchronous writes.

  • Use SSDs (NVMe preferred). Spinning disks will cause leader elections under load.
  • Dedicated disk for etcd data — do not share with OS or other workloads.
  • Monitor wal_fsync_duration_seconds — if p99 exceeds 10ms, disk is the bottleneck.

Network

  • Dedicated network for peer traffic (port 2380) when possible.
  • Keep etcd members in the same availability zone or datacenter. Cross-AZ latency directly impacts write latency.
  • --heartbeat-interval default is 100ms. For high-latency networks, increase to 500ms.
  • --election-timeout default is 1000ms. Should be 5-10x the heartbeat interval.
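As a sketch, the two timing flags on the etcd command line for a higher-latency network (values illustrative, keeping the 5-10x ratio):

```shell
# Both values are in milliseconds. If election-timeout is too close to
# heartbeat-interval, followers call spurious elections under jitter.
etcd --heartbeat-interval=500 --election-timeout=5000
```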

Memory and Quotas

Set --quota-backend-bytes (default 2GB, max recommended 8GB). If the database exceeds the quota, etcd enters alarm mode and rejects all writes. Fix: compact, defrag, then etcdctl alarm disarm.

Disaster Recovery

Single Member Failure

The cluster continues with quorum. Replace the failed member: etcdctl member remove <id>, then etcdctl member add <name> --peer-urls=..., and start the new node with --initial-cluster-state=existing.
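The replacement flow can be sketched as follows; the member ID, name, and IPs are illustrative:

```shell
# 1) Identify and remove the dead member (ID comes from the list output).
etcdctl member list --write-out=table
etcdctl member remove 8e9e05c52164694d

# 2) Register the replacement before starting it.
etcdctl member add etcd-3 --peer-urls=https://10.0.1.13:2380

# 3) On the new node, start etcd joining the existing cluster:
etcd --name=etcd-3 \
     --initial-cluster-state=existing \
     --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-3=https://10.0.1.13:2380
```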

Quorum Loss

If a majority of members fail (e.g., 2 out of 3), the cluster cannot accept writes. Options:

  1. Restore from snapshot (preferred): restore on new nodes using the latest backup.
  2. Force new cluster: start a single surviving member with --force-new-cluster. This resets the cluster to a single-member state. Then add new members. This is a last resort — it can cause data inconsistency.

Total Loss

If all members are lost and no snapshot exists, the Kubernetes cluster state is gone. You must rebuild the cluster from scratch and redeploy all workloads. This is why backups are non-negotiable.

Common Failure Modes

Quorum Loss

Symptoms: API server returns errors, no new pods scheduled, etcdctl endpoint health times out. Cause: majority of etcd members down. Fix: restore from snapshot or force new cluster from surviving member.

Disk Full

Symptoms: etcd rejects writes, mvcc: database space exceeded in logs, API server returns 500s. Fix: compact, defrag, clear alarm, increase quota or disk size.

Certificate Expiry

Symptoms: etcd members cannot communicate, TLS handshake errors in logs. Fix: rotate certificates before expiry. kubeadm clusters: kubeadm certs renew all. Monitor expiry dates:

openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -enddate

Slow Disk

Symptoms: frequent leader elections, high wal_fsync_duration_seconds, API server latency spikes. Fix: move to SSD, isolate etcd on dedicated disk, reduce competing I/O.

Split Brain (Network Partition)

Symptoms: different clients see different data, two members claim to be leader. In practice, Raft prevents true split-brain — the minority partition loses quorum and becomes read-only. But stale reads from the minority partition can confuse monitoring. Fix: resolve the network partition, members will reconcile automatically.

Debug clue: The mvcc: database space exceeded error means etcd has hit its --quota-backend-bytes limit. The fix sequence is always the same: 1) etcdctl alarm list (confirm the NOSPACE alarm), 2) get the current revision, 3) etcdctl compact <revision>, 4) etcdctl defrag --endpoints=<one-at-a-time>, 5) etcdctl alarm disarm. Memorize this sequence — it is one of the most common etcd emergencies.
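That five-step sequence as a runnable sketch (a single endpoint is shown for the defrag step; in a multi-member cluster, defrag each member one at a time):

```shell
# 1) Confirm the NOSPACE alarm.
etcdctl alarm list

# 2) Get the current revision from endpoint status JSON.
rev=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')

# 3) Compact away all revisions older than the current one.
etcdctl compact "$rev"

# 4) Defragment to actually reclaim the disk space (one member at a time).
etcdctl defrag --endpoints=https://etcd-0:2379

# 5) Clear the alarm so writes resume.
etcdctl alarm disarm
```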

Quick Reference

etcdctl endpoint health --cluster                       # health check
etcdctl member list --write-out=table                   # member list
etcdctl endpoint status --write-out=table               # database size + leader
etcdctl snapshot save /backup/snapshot.db               # backup
etcdctl snapshot status /backup/snapshot.db -w table    # verify backup
etcdctl alarm list                                      # check alarms

Key Takeaways

  1. etcd holds ALL Kubernetes state. Protect it accordingly.
  2. Always run odd-numbered clusters (3 or 5 members). Never 1 in production.
  3. Back up etcd hourly, store off-cluster, and test restores regularly.
  4. SSD storage is mandatory — disk latency is the most common etcd performance problem.
  5. Monitor database size, WAL fsync duration, and leader changes as primary health signals.
  6. Compaction and defragmentation are routine maintenance — not optional.
  7. Certificate expiry is a silent killer. Automate rotation or monitor expiry dates aggressively.
  8. Quorum loss is recoverable from snapshots. Total loss without backups is catastrophic.
