
etcd: The Database That Runs Kubernetes

  • lesson
  • etcd
  • raft-consensus
  • distributed-systems
  • kubernetes-control-plane
  • backup/restore
  • disk-performance
  • monitoring

Topics: etcd, Raft consensus, distributed systems, Kubernetes control plane, backup/restore, disk performance, monitoring
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (Kubernetes basics explained where needed)


The Mission

It's 9:47 AM on a Wednesday. Developers are pinging your team in Slack: kubectl get pods is taking 30 seconds. kubectl apply sometimes works, sometimes times out. The Kubernetes dashboard is blank. Deployments are stuck. Nothing is scaling.

You check the API server logs:

etcdserver: request timed out
etcdserver: leader changed

Two lines. The entire cluster's brain is misfiring.

This isn't a pod problem. It isn't a networking problem. It's etcd — the database that every single Kubernetes operation reads from and writes to. And right now, it's drowning.

By the end of this lesson you'll understand:

  • What etcd actually stores (and what it doesn't)
  • How Raft consensus works — the algorithm that keeps etcd's data consistent across nodes
  • The etcdctl commands that diagnose problems in minutes instead of hours
  • Why your disk choice is the single most important etcd decision
  • How to back up and restore etcd (the one backup that matters most)
  • Why clusters should have 3 or 5 members, never 2 or 4
  • The Prometheus metrics that predict etcd trouble before it hits


Part 1: What etcd Actually Is

Let's start with what you're looking at. etcd is a distributed key-value store. Think of it as a giant hash map that lives across multiple servers, where every server has an identical copy of the data, and they all agree on what the current state is before any write is accepted.

In a Kubernetes cluster, the API server is the only component that talks to etcd directly. Every kubectl command you run goes through the API server, which reads from or writes to etcd. The scheduler, the controller manager, kubelets — they all talk to the API server. None of them touch etcd.

Name Origin: The name "etcd" is a mashup of the Unix /etc directory (the traditional home of system configuration files) and "d" for distributed. Pronounced "et-see-dee." It describes exactly what it does: distributed /etc — configuration storage spread across machines.

What's in etcd (and what isn't)

Stored: Pod/Deployment/Service definitions, ConfigMaps, Secrets, RBAC policies, namespaces, node heartbeat leases, CRDs and their instances, network policies — everything that defines the cluster's desired state.

NOT stored: Container images, application logs, Prometheus metrics, persistent volume data. etcd holds the metadata — the blueprint, not the building.

Keys follow the pattern /registry/<resource-type>/<namespace>/<name>. A pod named nginx-7d9fc in the default namespace lives at /registry/pods/default/nginx-7d9fc.
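To make the layout concrete, here's a throwaway sketch (`registry_key` is made up for illustration, not an etcdctl command) that composes such a key. Cluster-scoped objects, like namespaces, simply omit the namespace segment.

```shell
# registry_key: illustrative helper, not part of etcdctl.
# Composes the etcd key for a namespaced Kubernetes object.
registry_key() {
  # usage: registry_key <resource-type> <namespace> <name>
  printf '/registry/%s/%s/%s\n' "$1" "$2" "$3"
}

registry_key pods default nginx-7d9fc
# → /registry/pods/default/nginx-7d9fc
```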

Mental Model: etcd is to Kubernetes what a hotel's reservation system is to the hotel. The system knows every guest, every room assignment, every checkout time. If the reservation system goes down, nobody can check in, check out, or switch rooms — even though the physical hotel is still standing and guests are still in their rooms. Existing guests are fine (running pods keep running). But nothing new can happen.


Part 2: The Incident — Finding the Bottleneck

Back to our 9:47 AM crisis. The API server is timing out. Let's diagnose.

First, you need etcdctl and the TLS certificates that etcd requires for authentication. On a kubeadm cluster, the certs live in a predictable place:

export ETCDCTL_API=3

# Store the cert flags in a variable — you'll use these constantly
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"
| Flag | What it is | Why it's needed |
|------|------------|-----------------|
| --cacert | Certificate Authority cert | Verifies the etcd server's identity |
| --cert | Client certificate | Proves you're allowed to talk to etcd |
| --key | Client private key | Cryptographic proof you own the cert |

Gotcha: If you're on a managed Kubernetes service (EKS, GKE, AKS), you can't directly access etcd at all. The cloud provider manages it for you. The commands in this lesson apply to self-managed clusters (kubeadm, k3s, bare metal, etc.).

Step 1: Is etcd alive?

etcdctl endpoint health --cluster $ETCD_CERTS

Healthy output:

https://10.0.1.10:2379 is healthy: successfully committed proposal: took = 2.34ms
https://10.0.1.11:2379 is healthy: successfully committed proposal: took = 3.12ms
https://10.0.1.12:2379 is healthy: successfully committed proposal: took = 2.87ms

But during our incident, you see:

https://10.0.1.10:2379 is healthy: successfully committed proposal: took = 487.23ms
https://10.0.1.11:2379 is healthy: successfully committed proposal: took = 512.08ms
https://10.0.1.12:2379 is healthy: successfully committed proposal: took = 1.203s

Those took values should be under 10ms. Half a second means etcd is suffocating.

Remember: The health check mnemonic is H-S-M — Health, Status, Members. Three commands, three angles: endpoint health (latency), endpoint status (DB size + who's leader), member list (quorum). If you can't remember the exact commands, remember the three letters.

Step 2: Who's the leader, and is it stable?

etcdctl endpoint status --write-out=table --cluster $ETCD_CERTS
+---------------------------+------------------+---------+---------+-----------+------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+------------+
| https://10.0.1.10:2379    | 8e9e05c52164694d |  3.5.12 |  3.8 GB |     false |    4892841 |
| https://10.0.1.11:2379    | a7fa3b248c0217a  |  3.5.12 |  3.8 GB |      true |    4892841 |
| https://10.0.1.12:2379    | c1d2e3f4a5b6c7d8 |  3.5.12 |  3.8 GB |     false |    4892838 |
+---------------------------+------------------+---------+---------+-----------+------------+

Two things jump out: DB size is 3.8 GB (getting high — the default quota is 2 GB, and even with it raised, 8 GB is the recommended max). And one follower is 3 raft indexes behind the others — a sign of replication lag.

Step 3: Check for alarms

etcdctl alarm list $ETCD_CERTS

If etcd has hit its storage quota, you'll see:

memberID:8e9e05c52164694d alarm:NOSPACE
memberID:a7fa3b248c0217a alarm:NOSPACE

The NOSPACE alarm means etcd has stopped accepting writes. The entire Kubernetes cluster is now read-only. No new pods. No config changes. No scaling.

Step 4: What's eating the space?

etcdctl get /registry --prefix --keys-only $ETCD_CERTS | \
  awk -F/ '{print $3}' | sort | uniq -c | sort -rn | head -10
   18247 events
    3891 leases
    1204 pods
     847 configmaps
     523 secrets
     412 deployments

18,000+ events. That's usually the culprit. Kubernetes Events are verbose and accumulate fast, especially in busy clusters.

Under the Hood: Kubernetes creates Event objects for almost everything: pod scheduled, image pulled, container started, health check passed, volume mounted. In a 200-pod cluster with frequent deployments, that's thousands of events per hour. Events have a default TTL of 1 hour, but until compaction runs, their old revisions consume etcd space.
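You can dry-run the Step 4 audit pipeline on synthetic keys, no cluster required, to see how it groups and counts by resource type (the sample keys below are made up):

```shell
# Simulate the space audit on a handful of made-up registry keys.
printf '%s\n' \
  /registry/events/default/nginx-7d9fc.17c2a \
  /registry/events/default/nginx-7d9fc.17c2b \
  /registry/pods/default/nginx-7d9fc \
  /registry/leases/kube-node-lease/node-1 |
  awk -F/ '{print $3}' |    # field 3 of each key is the resource type
  sort | uniq -c | sort -rn
```

With two event keys in the sample, `events` sorts to the top with a count of 2, exactly the shape of the real output above.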


Flashcard Check #1

Cover the answers.

Q1: What key pattern does Kubernetes use to store objects in etcd?

/registry/<resource-type>/<namespace>/<name>. Example: /registry/pods/default/nginx-7d9fc.

Q2: Can the Kubernetes scheduler talk directly to etcd?

No. Only the API server communicates with etcd. All other components (scheduler, controller manager, kubelet) go through the API server.

Q3: If etcd goes down but nothing restarts, do running pods keep running?

Yes. Running pods continue because the kubelet keeps them alive locally. But nothing new can happen — no scaling, no scheduling, no config changes.


Part 3: Raft Consensus — How etcd Keeps Data Consistent

Every write to etcd must be agreed upon by a majority of cluster members before it's considered committed. This is the Raft consensus algorithm, and understanding it explains almost every etcd behavior you'll encounter in production.

The restaurant analogy

Imagine a restaurant chain with 3 locations. They share one menu. When the head chef (the leader) wants to add a new dish:

  1. The head chef writes the new dish on a card and sends copies to the other two locations
  2. Each location chef reviews it and says "yes, I've added it to my menu"
  3. Once the head chef hears back from at least one other chef (giving a majority of 2 out of 3), the dish is officially on the menu
  4. The head chef tells any remaining location about the committed change on the next check-in

If the head chef gets hit by a bus, the two remaining locations hold an emergency meeting. One of them becomes the new head chef. They have all the committed menu changes because no change was committed without a majority agreeing. The chain keeps running.

This is Raft. The "menu" is the data. The "dish card" is a log entry. The "majority" is quorum.

Trivia: Raft was created by Diego Ongaro and John Ousterhout at Stanford in 2013. The paper's full title is "In Search of an Understandable Consensus Algorithm" — they explicitly designed it as an alternative to Paxos, which was notoriously difficult to understand and implement correctly. In user studies, students learned Raft significantly faster than Paxos. This matters at 3 AM when your cluster is degraded and you need to reason about quorum — Raft is designed to be tractable under pressure.

The three roles

Every etcd member is in one of three states:

  FOLLOWER ──(timeout, no heartbeat)──→ CANDIDATE ──(wins vote)──→ LEADER
     ↑                                       │                        │
     └───(discovers current leader)──────────┘                        │
     └───(receives heartbeat)─────────────────────────────────────────┘
  • Leader: Handles all writes. Sends heartbeats to followers. There is exactly one leader.
  • Follower: Receives replicated data from the leader. Can serve reads (in some configs).
  • Candidate: A follower that hasn't heard from a leader and is trying to become one.

How a write happens

kubectl apply → API server → etcd leader
  1. Leader writes to its WAL (uncommitted)
  2. Leader sends AppendEntries to followers
  3. Followers write to their WAL, acknowledge
  4. Leader waits for quorum acknowledgment
  5. Leader commits, responds to API server

Step 4 is why disk latency matters so much. Slow follower WAL writes = slow quorum = slow API server = slow kubectl apply.

The quorum math

This is the most important arithmetic in distributed systems:

Quorum = floor(N/2) + 1

Translation: you need more than half the members to agree.

| Cluster Size | Quorum | Can Tolerate | Notes |
|--------------|--------|--------------|-------|
| 1 | 1 | 0 failures | Dev only. Any failure = total loss |
| 2 | 2 | 0 failures | Worse than 1. Both must agree, but either failing kills quorum |
| 3 | 2 | 1 failure | Production minimum |
| 4 | 3 | 1 failure | Same tolerance as 3, more overhead. Never do this |
| 5 | 3 | 2 failures | For critical clusters, cross-AZ deployments |
| 7 | 4 | 3 failures | Rare. More members = higher write latency |
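The quorum arithmetic is small enough to verify in a few lines of shell. A throwaway sketch, not an etcd tool:

```shell
# Quorum and fault tolerance for an N-member cluster.
quorum()    { echo $(( $1 / 2 + 1 )); }          # floor(N/2) + 1
tolerates() { echo $(( $1 - ($1 / 2 + 1) )); }   # members minus quorum

for n in 1 2 3 4 5 7; do
  printf '%d members: quorum=%d, tolerates %d failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerates "$n")"
done
```

Running it shows the even-number trap directly: 4 members need the same quorum (3) as 5 members, but tolerate only 1 failure.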

Gotcha: A 2-member cluster is strictly worse than a single member for availability. With 1 member, you need 1 to agree (always available while it's up). With 2 members, you need 2 to agree — if either fails, the cluster can't accept writes. You've doubled your failure surface for zero benefit.

Interview Bridge: "Why should etcd clusters always have an odd number of members?" is a common Kubernetes interview question. The answer: even numbers require the same quorum as the next odd number but tolerate fewer failures. A 4-member cluster needs 3 for quorum (same as 5) but tolerates only 1 failure (vs 2 for 5). Even sizes add cost without improving fault tolerance.


Part 4: The Fix — Disk Latency Was the Killer

Back to our incident. The endpoint health check showed 500ms+ response times. Let's find out why.

Checking disk performance

# Quick synthetic test on the etcd data directory
dd if=/dev/zero of=/var/lib/etcd/test bs=512 count=1000 oflag=dsync 2>&1 | tail -1
rm /var/lib/etcd/test

This performs 1,000 synchronous 512-byte writes. If the test takes more than a few seconds (each write averaging over a couple of milliseconds), your disk is likely too slow for etcd under load.

The real metric to watch is the WAL fsync duration — how long it takes etcd to write its write-ahead log to disk and get confirmation that the data is durable. In Prometheus:

histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
| WAL fsync p99 | Status | Action |
|---------------|--------|--------|
| < 10ms | Healthy | Normal operation |
| 10–50ms | Degraded | Investigate disk I/O, check for competing workloads |
| 50–100ms | Critical | Leader elections likely. Move to SSD immediately |
| > 100ms | Emergency | Cluster is in an election loop. Expect API server errors |

War Story: A team running Kubernetes on AWS used gp3 EBS volumes for their etcd data directory. gp3 provides 3,000 baseline IOPS — plenty for most workloads. But during a spike in Kubernetes API activity (a CI/CD pipeline running 200 parallel jobs), etcd's write-ahead log needed sustained sequential writes faster than gp3 could deliver. WAL fsync latency spiked above 100ms. The leader couldn't replicate fast enough. Followers timed out and triggered elections. The new leader also couldn't keep up. The cluster entered an election loop — leader elected, overwhelmed, new election, repeat — for 12 minutes. The fix: moving etcd to io2 volumes with provisioned IOPS. The deeper fix: giving etcd its own dedicated volume separate from the OS and kubelet.

The fix we applied

In our incident, etcd was sharing a disk with the operating system, kubelet, and container image storage. A large image pull flooded the disk I/O, starving etcd's WAL writes.

# Check what's sharing the disk with etcd
df -h /var/lib/etcd
lsof +D /var/lib/etcd/../   # what else is writing to this filesystem

The immediate fix was migrating etcd's data to a dedicated NVMe volume. The longer-term fix was adding monitoring on WAL fsync latency with alerts at 10ms.

Performance tuning reference

| Parameter | Default | Production | Why |
|-----------|---------|------------|-----|
| Storage | OS default | Dedicated NVMe SSD | WAL fsync is the #1 bottleneck |
| --heartbeat-interval | 100ms | 500ms cross-AZ | Must exceed network RTT |
| --election-timeout | 1000ms | 5–10x heartbeat | Too low = spurious elections |
| --quota-backend-bytes | 2 GB | 8 GB | Default is too small for production |

Gotcha: Cross-AZ etcd members with default timeouts cause spurious elections. Network latency of 5–15ms between AZs eats into the 100ms heartbeat interval. Either increase timeouts or (better) keep all etcd members in the same AZ.


Part 5: Compaction and Defragmentation — Keeping etcd Healthy

etcd doesn't just store the current value of a key — it keeps a history of every revision. When you update a ConfigMap 50 times, etcd has 50 versions of that key. This is what enables watches (the API server can ask "what changed since revision 4892800?"), but it also means the database grows indefinitely without maintenance.

Compaction: deleting old history

Compaction tells etcd "I don't need any revisions older than X." Kubernetes runs auto-compaction every 5 minutes by default, but understanding the manual process matters for emergencies:

# Get the current revision
REV=$(etcdctl endpoint status --write-out=json $ETCD_CERTS | \
  jq '.[0].Status.header.revision')
echo "Current revision: $REV"

# Compact everything up to this revision
etcdctl compact $REV $ETCD_CERTS

After compaction, the old revisions are marked as free — but the database file on disk doesn't shrink. That's where defragmentation comes in.

Defragmentation: reclaiming disk space

# IMPORTANT: defrag blocks ALL reads and writes on the target member
# Run on one member at a time, starting with non-leaders

# Defrag member 1 (non-leader)
etcdctl defrag --endpoints=https://10.0.1.10:2379 $ETCD_CERTS

# Wait for it to rejoin, verify health
etcdctl endpoint health --cluster $ETCD_CERTS

# Defrag member 2 (non-leader)
etcdctl defrag --endpoints=https://10.0.1.12:2379 $ETCD_CERTS

# Finally, defrag the leader (will trigger a leader election)
etcdctl defrag --endpoints=https://10.0.1.11:2379 $ETCD_CERTS

Gotcha: Never defrag all members simultaneously. A team once scripted etcdctl defrag --endpoints=<all-three-members> thinking it would be faster. Defrag blocks all reads and writes on the target member for the duration — on a 4 GB database, that can be 30+ seconds. With all three members blocked, the cluster lost quorum and the API server returned errors for 45 seconds. Always defrag one at a time, non-leaders first, leader last.
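One way to make the ordering mechanical is a tiny helper that emits endpoints non-leaders first, leader last. This is a sketch (the function name is made up); you would feed its output to `etcdctl defrag` one endpoint at a time, checking `endpoint health` in between:

```shell
# Emit a defrag order: every non-leader endpoint first, the leader last.
order_for_defrag() {
  # usage: order_for_defrag <leader-endpoint> <endpoint>...
  local leader=$1; shift
  local ep
  for ep in "$@"; do
    if [ "$ep" != "$leader" ]; then echo "$ep"; fi
  done
  echo "$leader"
}

# Leader (from endpoint status) is 10.0.1.11 in our incident.
order_for_defrag https://10.0.1.11:2379 \
  https://10.0.1.10:2379 https://10.0.1.11:2379 https://10.0.1.12:2379
```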

The NOSPACE emergency sequence

When etcd hits its quota and enters alarm mode, memorize this sequence:

# 1. Confirm the alarm
etcdctl alarm list $ETCD_CERTS

# 2. Get current revision
REV=$(etcdctl endpoint status --write-out=json $ETCD_CERTS | \
  jq '.[0].Status.header.revision')

# 3. Compact old revisions
etcdctl compact $REV $ETCD_CERTS

# 4. Defrag (one member at a time!)
etcdctl defrag --endpoints=https://10.0.1.10:2379 $ETCD_CERTS

# 5. Disarm the alarm
etcdctl alarm disarm $ETCD_CERTS

# 6. Verify writes work again
etcdctl put /test/healthcheck "ok" $ETCD_CERTS
etcdctl del /test/healthcheck $ETCD_CERTS

Remember: The NOSPACE fix sequence: A-C-D-D — Alarm (list), Compact, Defrag, Disarm. Four steps, always in this order. You'll probably need this at 3 AM at least once.

Under the Hood: etcd uses bbolt (a fork of BoltDB) as its storage engine — a copy-on-write B+ tree. When a key is updated, the old page isn't overwritten; a new page is written and the old one is marked free. This is what makes defragmentation necessary: deleted data leaves "holes" in the file that can only be reclaimed by rewriting the entire database into a new, compact file. It's the same reason SQLite has VACUUM.


Flashcard Check #2

Q4: What's the default etcd database quota, and what happens when it's exceeded?

Default is 2 GB. When exceeded, etcd enters alarm mode and rejects all writes. The Kubernetes cluster becomes read-only — no new pods, no config changes, no scaling. Fix: compact, defrag, alarm disarm.

Q5: Why must you defrag one etcd member at a time?

Defrag blocks all reads and writes on the target member. If all members are defragging simultaneously, the cluster has no quorum and the API server returns errors.

Q6: What's the relationship between compaction and defragmentation?

Compaction removes old key revision history, marking space as free. Defragmentation reclaims that freed space on disk. You need both: compact first, then defrag.


Part 6: Backup and Restore — The One Backup That Matters Most

If you back up one thing in your entire infrastructure, back up etcd. Without an etcd backup, a total cluster failure means rebuilding everything from scratch — every Deployment, Service, Secret, RBAC rule, and CRD, from memory or (hopefully) from your GitOps repo.

Creating a snapshot

etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 $ETCD_CERTS

Always verify the snapshot immediately:

etcdctl snapshot status /backup/etcd-20260323-094700.db --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9f8c2d1a |   482918 |       1247 |    4.2 MB  |
+----------+----------+------------+------------+

War Story: A team had etcd backups running hourly via a cron job. For six months, the backups ran without complaint. When they finally needed to restore after a catastrophic failure, every snapshot was a zero-byte file. The backup script was writing to a volume that had quietly filled up months ago, and etcdctl snapshot save with no disk space produces an empty file without a non-zero exit code in some versions. The team rebuilt their entire cluster from their Helm charts and GitOps repo — a 14-hour process. Now they run etcdctl snapshot status after every backup and alert if the file size is below 1 MB.

Backup schedule

Minimum: hourly snapshots, 7-day retention, stored off-cluster. The key additions beyond the basic snapshot save command: verify with snapshot status after every save, and copy to object storage (S3, GCS) automatically. A cron job on each control plane node works. Test restores monthly — an untested backup is not a backup.
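The "alert if the file size is below 1 MB" check from the war story is easy to script. A minimal sketch (function name and threshold are illustrative) that a backup cron job could call right after `snapshot save` and `snapshot status`:

```shell
# Fail (non-zero exit) if a snapshot file is missing or suspiciously small.
verify_snapshot_size() {
  # usage: verify_snapshot_size <file> [min-bytes]
  local file=$1 min_bytes=${2:-1048576}   # default floor: 1 MB
  [ -f "$file" ] && [ "$(wc -c < "$file")" -ge "$min_bytes" ]
}

verify_snapshot_size /backup/etcd-20260323-094700.db || echo "ALERT: bad snapshot"
```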

Restoring from a snapshot

Restore is destructive — you're creating a new cluster from the snapshot. Every member must be restored. The steps:

# 1. Stop API server and etcd on ALL control plane nodes (kubeadm)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# 2. Restore on EACH member (same snapshot, different --name and --initial-advertise-peer-urls)
etcdctl snapshot restore /backup/etcd-20260323-094700.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster="etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380" \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# 3. Point etcd config to new data dir, restore static pod manifests, verify
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
etcdctl endpoint health --cluster $ETCD_CERTS

Gotcha: Restoring on only one member out of three is a common mistake. The restored member has a different cluster identity than the others. The cluster won't elect a leader. Always restore on every member from the same snapshot.


Part 7: etcd in kubeadm Clusters

In kubeadm, etcd runs as a static pod (manifest at /etc/kubernetes/manifests/etcd.yaml). Data lives in /var/lib/etcd/member/. When kubectl isn't responding (because etcd is down), use crictl logs $(crictl ps --name etcd -q) to read etcd logs directly.

Certificate lifecycle — the silent killer

kubeadm etcd certificates expire after 1 year. No warning — one day everything works, the next day members refuse to talk with TLS handshake errors.

# Check all etcd cert expiry dates
for cert in /etc/kubernetes/pki/etcd/*.crt; do
  echo "$cert: $(openssl x509 -in "$cert" -noout -enddate)"
done

# Renew and restart
kubeadm certs renew all
systemctl restart kubelet
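To turn those enddate strings into something alertable, a small helper can compute the days remaining. A sketch assuming GNU `date` (the `-d` parsing flag; BSD/macOS `date` differs):

```shell
# Days until a certificate expires (assumes GNU date for -d parsing).
days_until_expiry() {
  # usage: days_until_expiry <cert-path>
  local end
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

for cert in /etc/kubernetes/pki/etcd/*.crt; do
  echo "$cert: $(days_until_expiry "$cert") days left"
done
```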

Gotcha: Set a monitoring alert that fires 30 days before certificate expiry. Or set a calendar reminder. Anything is better than discovering expiry during a production outage.


Part 8: Monitoring etcd with Prometheus

etcd exposes metrics on port 2381 (or 2379 with a /metrics endpoint, depending on configuration). These are the metrics that predict trouble before users notice.

The critical four

# 1. WAL fsync latency — the single best health indicator
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
# Alert if > 10ms

# 2. Leader changes — stability indicator
increase(etcd_server_leader_changes_seen_total[1h])
# Alert if > 3 per hour

# 3. Database size — capacity planning
etcd_mvcc_db_total_size_in_bytes
# Alert if > 6GB (default max is 8GB)

# 4. Proposal failures — consensus health
rate(etcd_server_proposals_failed_total[5m])
# Alert if any failures
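The four checks above translate directly into a Prometheus alerting rules file. A sketch (group and alert names are illustrative; tune the thresholds and `for:` durations to your cluster):

```yaml
# Illustrative alerting rules for the four etcd metrics above.
groups:
  - name: etcd-health
    rules:
      - alert: EtcdSlowWalFsync
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "etcd WAL fsync p99 above 10ms: disk too slow"
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
        labels: {severity: warning}
        annotations:
          summary: "More than 3 etcd leader changes in the last hour"
      - alert: EtcdDatabaseNearQuota
        expr: etcd_mvcc_db_total_size_in_bytes > 6e9
        labels: {severity: critical}
        annotations:
          summary: "etcd DB above 6 GB: compact and defrag before NOSPACE"
      - alert: EtcdProposalFailures
        expr: rate(etcd_server_proposals_failed_total[5m]) > 0
        labels: {severity: critical}
        annotations:
          summary: "etcd proposals failing: consensus problem"
```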

What rising raftTerm tells you

etcdctl endpoint status --write-out=json --cluster $ETCD_CERTS | \
  jq '.[].Status.raftTerm'

Every leader election increments the raft term. A raftTerm that's jumping by 2–3+ per hour means the cluster is unstable — either the network between control plane nodes is flaky, or the disk is too slow for the heartbeat interval. Investigate both immediately.


Flashcard Check #3

Q7: What is the single most important Prometheus metric for etcd health?

etcd_disk_wal_fsync_duration_seconds. If the p99 exceeds 10ms, the cluster is at risk of leader instability. This metric captures the bottleneck that causes most etcd performance problems.

Q8: How do you tell if etcd leader elections are happening too frequently?

Check the raftTerm via etcdctl endpoint status — if it jumps by more than 2–3 per hour, the cluster is unstable. In Prometheus, monitor increase(etcd_server_leader_changes_seen_total[1h]).

Q9: What's the correct sequence when etcd hits its storage quota?

A-C-D-D: Alarm list, Compact, Defrag (one member at a time), Disarm.


Part 9: Common Failure Modes at a Glance

| Failure | Symptoms | Fix |
|---------|----------|-----|
| Quorum loss | API server errors, no scheduling, health check timeouts | Restore from snapshot; last resort: --force-new-cluster on a survivor |
| Slow disk | Frequent leader elections, WAL fsync > 50ms | Dedicated SSD/NVMe. On AWS, io1/io2 with provisioned IOPS |
| Cert expiry | TLS handshake errors, members can't communicate | kubeadm certs renew all && systemctl restart kubelet |
| Space exceeded | mvcc: database space exceeded, API 500s | A-C-D-D: alarm list, compact, defrag, disarm |
| Network partition | Stale reads from minority side, brief dual-leader in monitoring | Resolve partition; Raft prevents true split-brain, members reconcile |

War Story: A team upgrading etcd on a 3-member cluster removed two members before adding replacements. Quorum was immediately lost. The API server went read-only. Recovery required --force-new-cluster on the survivor, which lost an uncommitted RBAC change. The correct procedure: add the new member first, verify it's healthy, then remove the old one. One at a time. Always.


Part 10: The History — From CoreOS to CNCF

2013: Brandon Philips and CoreOS create etcd as the config store for CoreOS Linux. They choose Raft because, unlike Paxos, they can actually understand and debug it.

2014: Kubernetes launches. Google's team picks etcd over ZooKeeper (too complex, Java dependency) and Consul (newer, less proven).

2016: etcd v3 ships — a complete rewrite. REST API replaced with gRPC, storage gets MVCC and proper watches. Kubernetes only supports v3.

2018: Red Hat acquires CoreOS for ~$250 million, and etcd joins the CNCF as an incubating project; it graduates to a top-level project in 2020, alongside Kubernetes and Prometheus. IBM acquires Red Hat for $34 billion in 2019.

Trivia: Every write goes through the etcd leader — writes are not distributed across members. Adding more members (3 to 5 to 7) improves fault tolerance but actually increases write latency because more acknowledgments are needed for quorum.


Exercises

Exercise 1: Read etcd's state (quick win, 2 minutes)

You have a kubeadm cluster. List the top 5 resource types consuming space in etcd.

# Fill in the missing parts:
etcdctl get /registry --prefix --keys-only $ETCD_CERTS | \
  _____ | sort | uniq -c | sort -rn | head -5
Answer
etcdctl get /registry --prefix --keys-only $ETCD_CERTS | \
  awk -F/ '{print $3}' | sort | uniq -c | sort -rn | head -5
The `awk -F/ '{print $3}'` splits each key by `/` and prints the third field — the resource type (pods, deployments, configmaps, etc.).

Exercise 2: Quorum math (think, don't code)

For each scenario, determine: can the cluster accept writes?

  1. 3-member cluster, 1 member down
  2. 3-member cluster, 2 members down
  3. 5-member cluster, 2 members down
  4. 5-member cluster, 3 members down
  5. 4-member cluster, 2 members down
Answers

  1. **Yes.** 2 of 3 alive. Quorum = 2. Satisfied.
  2. **No.** 1 of 3 alive. Quorum = 2. Not satisfied. Read-only.
  3. **Yes.** 3 of 5 alive. Quorum = 3. Satisfied.
  4. **No.** 2 of 5 alive. Quorum = 3. Not satisfied.
  5. **No.** 2 of 4 alive. Quorum = 3 (floor(4/2) + 1 = 3). Not satisfied. This is why 4-member clusters are worse than 3 — same quorum as 5, but less fault tolerance.

Exercise 3: Diagnose the incident (judgment call)

etcdctl endpoint status shows all three members at 7.9 GB DB size. etcdctl alarm list shows NOSPACE on all members. One follower is 13 raft indexes behind the leader.

What's wrong, and what do you do — in order?

Answer

Database nearly at the 8 GB max. NOSPACE alarm = read-only cluster. Fix in order:

  1. Get current revision
  2. `etcdctl compact $REV`
  3. Defrag non-leaders first, one at a time, verify health between each
  4. Defrag leader last (triggers election)
  5. `etcdctl alarm disarm`
  6. Verify writes work
  7. Investigate which resource type is consuming space; increase quota if needed

The 13-index lag on the follower is minor — replication delay from disk pressure. It'll catch up after defrag frees I/O.

Exercise 4: Design a backup strategy (design)

Your company runs a 5-node etcd cluster backing a production Kubernetes cluster with 500 pods and frequent deployments. Design: backup frequency, retention, storage location, verification, and restore testing cadence.

Answer (one good approach)

  • **Frequency:** Every 30 minutes (hourly is minimum; more frequent for active clusters)
  • **Retention:** 7 days locally, 30 days in S3/GCS with versioning
  • **Storage:** Dedicated local volume, replicated to object storage. Never on the same disk as etcd.
  • **Verification:** `etcdctl snapshot status` after every save; alert if total keys = 0 or file size < 1 MB
  • **Restore testing:** Monthly, to a separate isolated cluster. An untested backup is not a backup.

Cheat Sheet

Health Assessment (run these first)

| What | Command |
|------|---------|
| Cluster health + latency | `etcdctl endpoint health --cluster` |
| DB size + leader + raft index | `etcdctl endpoint status -w table --cluster` |
| Member list + peer URLs | `etcdctl member list -w table` |
| Check alarms | `etcdctl alarm list` |
| Certificate expiry | `openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -enddate` |

Backup & Restore

| What | Command |
|------|---------|
| Create snapshot | `etcdctl snapshot save /path/snapshot.db` |
| Verify snapshot | `etcdctl snapshot status /path/snapshot.db -w table` |
| Restore (per member) | `etcdctl snapshot restore snapshot.db --data-dir=/var/lib/etcd-new --name=<name> --initial-cluster=<...> --initial-advertise-peer-urls=<url>` |

Maintenance

| What | Command |
|------|---------|
| Compact to current revision | `REV=$(etcdctl endpoint status -w json \| jq '.[0].Status.header.revision'); etcdctl compact $REV` |
| Defrag (one member at a time!) | `etcdctl defrag --endpoints=<single-endpoint>` |
| Clear NOSPACE alarm | `etcdctl alarm disarm` |
| Audit space by resource type | `etcdctl get /registry --prefix --keys-only \| awk -F/ '{print $3}' \| sort \| uniq -c \| sort -rn` |

Prometheus Metrics to Alert On

| Metric | Threshold | Meaning |
|--------|-----------|---------|
| etcd_disk_wal_fsync_duration_seconds | p99 > 10ms | Disk too slow |
| etcd_server_leader_changes_seen_total | > 3/hr | Cluster unstable |
| etcd_mvcc_db_total_size_in_bytes | > 6 GB | Approaching quota |
| etcd_server_proposals_failed_total | Any > 0 | Consensus failures |

Takeaways

  1. etcd is the single source of truth for Kubernetes. When etcd is slow, the API server is slow. When etcd is down, the cluster is brain-dead. Protect it accordingly.

  2. Disk latency is the #1 etcd killer. WAL fsync > 10ms means trouble. Use dedicated SSDs. Monitor wal_fsync_duration_seconds religiously.

  3. Always run 3 or 5 members, never 2 or 4. Even numbers give you the same quorum requirement as the next odd number but tolerate fewer failures. It's all cost, no benefit.

  4. Back up etcd and verify the backups. Hourly minimum. Store off-cluster. Run snapshot status after every save. Test restores. An unverified backup is a false promise.

  5. The NOSPACE fix is A-C-D-D. Alarm list, Compact, Defrag (one at a time!), Disarm. You'll need this at 3 AM. Memorize it.

  6. Raft consensus means majority rules. Writes need quorum to commit. Leader handles all writes. More members = more fault tolerance but higher write latency.


Related Lessons

  • The Split-Brain Nightmare — deep dive into network partitions and consensus
  • Understanding Distributed Systems Without a PhD — CAP theorem, consistency models, and the fundamentals that underpin etcd
  • What Happens When You kubectl apply — end-to-end trace through API server, etcd, scheduler, and kubelet
  • The Backup Nobody Tested — why untested backups fail and how to build backup confidence