etcd: The Database That Runs Kubernetes
- lesson
- etcd
- raft-consensus
- distributed-systems
- kubernetes-control-plane
- backup/restore
- disk-performance
- monitoring

# etcd — The Database That Runs Kubernetes

Topics: etcd, Raft consensus, distributed systems, Kubernetes control plane, backup/restore, disk performance, monitoring
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (Kubernetes basics explained where needed)
The Mission¶
It's 9:47 AM on a Wednesday. Developers are pinging your team in Slack: kubectl get pods
is taking 30 seconds. kubectl apply sometimes works, sometimes times out. The Kubernetes
dashboard is blank. Deployments are stuck. Nothing is scaling.
You check the API server logs:
Two lines. The entire cluster's brain is misfiring.
This isn't a pod problem. It isn't a networking problem. It's etcd — the database that every single Kubernetes operation reads from and writes to. And right now, it's drowning.
By the end of this lesson you'll understand: - What etcd actually stores (and what it doesn't) - How Raft consensus works — the algorithm that keeps etcd's data consistent across nodes - The etcdctl commands that diagnose problems in minutes instead of hours - Why your disk choice is the single most important etcd decision - How to back up and restore etcd (the one backup that matters most) - Why clusters should have 3 or 5 members, never 2 or 4 - The Prometheus metrics that predict etcd trouble before it hits
Part 1: What etcd Actually Is¶
Let's start with what you're looking at. etcd is a distributed key-value store. Think of it as a giant hash map that lives across multiple servers, where every server has an identical copy of the data, and they all agree on what the current state is before any write is accepted.
In a Kubernetes cluster, the API server is the only component that talks to etcd directly.
Every kubectl command you run goes through the API server, which reads from or writes to
etcd. The scheduler, the controller manager, kubelets — they all talk to the API server. None
of them touch etcd.
Name Origin: The name "etcd" is a mashup of the Unix /etc directory (the traditional home of system configuration files) and "d" for distributed. Pronounced "et-see-dee." It describes exactly what it does: a distributed /etc — configuration storage spread across machines.
What's in etcd (and what isn't)¶
Stored: Pod/Deployment/Service definitions, ConfigMaps, Secrets, RBAC policies, namespaces, node heartbeat leases, CRDs and their instances, network policies — everything that defines the cluster's desired state.
NOT stored: Container images, application logs, Prometheus metrics, persistent volume data. etcd holds the metadata — the blueprint, not the building.
Keys follow the pattern /registry/<resource-type>/<namespace>/<name>. A pod named
nginx-7d9fc in the default namespace lives at /registry/pods/default/nginx-7d9fc.
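You can see this pattern for yourself by listing keys under a prefix. A sketch using the kubeadm certificate paths shown later in this lesson (run on a control plane node; your paths may differ):

```shell
ETCDCTL_API=3 etcdctl get /registry/pods/default --prefix --keys-only \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

`--keys-only` matters here: the values are protobuf-encoded Kubernetes objects, not human-readable JSON.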
Mental Model: etcd is to Kubernetes what a hotel's reservation system is to the hotel. The system knows every guest, every room assignment, every checkout time. If the reservation system goes down, nobody can check in, check out, or switch rooms — even though the physical hotel is still standing and guests are still in their rooms. Existing guests are fine (running pods keep running). But nothing new can happen.
Part 2: The Incident — Finding the Bottleneck¶
Back to our 9:47 AM crisis. The API server is timing out. Let's diagnose.
First, you need etcdctl and the TLS certificates that etcd requires for authentication.
On a kubeadm cluster, the certs live in a predictable place:
export ETCDCTL_API=3
# Store the cert flags in a variable — you'll use these constantly
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
| Flag | What it is | Why it's needed |
|---|---|---|
| `--cacert` | Certificate Authority cert | Verifies the etcd server's identity |
| `--cert` | Client certificate | Proves you're allowed to talk to etcd |
| `--key` | Client private key | Cryptographic proof you own the cert |
Gotcha: If you're on a managed Kubernetes service (EKS, GKE, AKS), you can't directly access etcd at all. The cloud provider manages it for you. The commands in this lesson apply to self-managed clusters (kubeadm, k3s, bare metal, etc.).
Step 1: Is etcd alive?¶
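The check itself, using the `$ETCD_CERTS` variable defined above (`--cluster` queries every member, not just the endpoint you connect to):

```shell
etcdctl endpoint health --cluster $ETCD_CERTS
```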
Healthy output:
https://10.0.1.10:2379 is healthy: successfully committed proposal: took = 2.34ms
https://10.0.1.11:2379 is healthy: successfully committed proposal: took = 3.12ms
https://10.0.1.12:2379 is healthy: successfully committed proposal: took = 2.87ms
But during our incident, you see:
https://10.0.1.10:2379 is healthy: successfully committed proposal: took = 487.23ms
https://10.0.1.11:2379 is healthy: successfully committed proposal: took = 512.08ms
https://10.0.1.12:2379 is healthy: successfully committed proposal: took = 1.203s
Those took values should be under 10ms. Half a second means etcd is suffocating.
Remember: The health check mnemonic is H-S-M — Health, Status, Members. Three commands, three angles: endpoint health (latency), endpoint status (DB size + who's leader), member list (quorum). If you can't remember the exact commands, remember the three letters.
Step 2: Who's the leader, and is it stable?¶
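The status command (`-w table` produces the readable layout below):

```shell
etcdctl endpoint status -w table --cluster $ETCD_CERTS
```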
+---------------------------+------------------+---------+---------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+------------+
| https://10.0.1.10:2379 | 8e9e05c52164694d | 3.5.12 | 3.8 GB | false | 4892841 |
| https://10.0.1.11:2379 | a7fa3b248c0217a | 3.5.12 | 3.8 GB | true | 4892841 |
| https://10.0.1.12:2379 | c1d2e3f4a5b6c7d8 | 3.5.12 | 3.8 GB | false | 4892838 |
+---------------------------+------------------+---------+---------+-----------+------------+
Two things jump out: DB size is 3.8 GB (getting high — the default quota is 2 GB, and even with it raised, 8 GB is the recommended max). And one follower is 3 raft indexes behind the others — a sign of replication lag.
Step 3: Check for alarms¶
If etcd has hit its storage quota, you'll see:
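A sketch of the check and its typical output (the member ID shown is illustrative):

```shell
etcdctl alarm list $ETCD_CERTS
# memberID:10276657743932975437 alarm:NOSPACE
```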
The NOSPACE alarm means etcd has stopped accepting writes. The entire Kubernetes cluster is
now read-only. No new pods. No config changes. No scaling.
Step 4: What's eating the space?¶
etcdctl get /registry --prefix --keys-only $ETCD_CERTS | \
awk -F/ '{print $3}' | sort | uniq -c | sort -rn | head -10
18,000+ events. That's usually the culprit. Kubernetes Events are verbose and accumulate fast, especially in busy clusters.
Under the Hood: Kubernetes creates Event objects for almost everything: pod scheduled, image pulled, container started, health check passed, volume mounted. In a 200-pod cluster with frequent deployments, that's thousands of events per hour. Events have a default TTL of 1 hour, but until compaction runs, their old revisions consume etcd space.
Flashcard Check #1¶
Cover the answers.
Q1: What key pattern does Kubernetes use to store objects in etcd?
/registry/<resource-type>/<namespace>/<name>. Example:/registry/pods/default/nginx-7d9fc.
Q2: Can the Kubernetes scheduler talk directly to etcd?
No. Only the API server communicates with etcd. All other components (scheduler, controller manager, kubelet) go through the API server.
Q3: If etcd goes down but nothing restarts, do running pods keep running?
Yes. Running pods continue because the kubelet keeps them alive locally. But nothing new can happen — no scaling, no scheduling, no config changes.
Part 3: Raft Consensus — How etcd Keeps Data Consistent¶
Every write to etcd must be agreed upon by a majority of cluster members before it's considered committed. This is the Raft consensus algorithm, and understanding it explains almost every etcd behavior you'll encounter in production.
The restaurant analogy¶
Imagine a restaurant chain with 3 locations. They share one menu. When the head chef (the leader) wants to add a new dish:
- The head chef writes the new dish on a card and sends copies to the other two locations
- Each location chef reviews it and says "yes, I've added it to my menu"
- Once the head chef hears back from at least one other chef (giving a majority of 2 out of 3), the dish is officially on the menu
- The head chef tells any remaining location about the committed change on the next check-in
If the head chef gets hit by a bus, the two remaining locations hold an emergency meeting. One of them becomes the new head chef. They have all the committed menu changes because no change was committed without a majority agreeing. The chain keeps running.
This is Raft. The "menu" is the data. The "dish card" is a log entry. The "majority" is quorum.
Trivia: Raft was created by Diego Ongaro and John Ousterhout at Stanford in 2013. The paper's full title is "In Search of an Understandable Consensus Algorithm" — they explicitly designed it as an alternative to Paxos, which was notoriously difficult to understand and implement correctly. In user studies, students learned Raft significantly faster than Paxos. This matters at 3 AM when your cluster is degraded and you need to reason about quorum — Raft is designed to be tractable under pressure.
The three roles¶
Every etcd member is in one of three states:
FOLLOWER ──(timeout, no heartbeat)──→ CANDIDATE ──(wins vote)──→ LEADER
↑ │ │
└───(discovers current leader)──────────┘ │
└───(receives heartbeat)─────────────────────────────────────────┘
- Leader: Handles all writes. Sends heartbeats to followers. There is exactly one leader.
- Follower: Receives replicated data from the leader. Can serve reads (in some configs).
- Candidate: A follower that hasn't heard from a leader and is trying to become one.
How a write happens¶
kubectl apply → API server → etcd leader
1. Leader writes to its WAL (uncommitted)
2. Leader sends AppendEntries to followers
3. Followers write to their WAL, acknowledge
4. Leader waits for quorum acknowledgment
5. Leader commits, responds to API server
Step 4 is why disk latency matters so much. Slow follower WAL writes = slow quorum = slow
API server = slow kubectl apply.
The quorum math¶
This is the most important arithmetic in distributed systems:
Quorum = floor(N/2) + 1
Translation: you need more than half the members to agree.
| Cluster Size | Quorum | Can Tolerate | Notes |
|---|---|---|---|
| 1 | 1 | 0 failures | Dev only. Any failure = total loss |
| 2 | 2 | 0 failures | Worse than 1. Both must agree, but either failing kills quorum |
| 3 | 2 | 1 failure | Production minimum |
| 4 | 3 | 1 failure | Same tolerance as 3, more overhead. Never do this |
| 5 | 3 | 2 failures | For critical clusters, cross-AZ deployments |
| 7 | 4 | 3 failures | Rare. More members = higher write latency |
Gotcha: A 2-member cluster is strictly worse than a single member for availability. With 1 member, you need 1 to agree (always available while it's up). With 2 members, you need 2 to agree — if either fails, the cluster can't accept writes. You've doubled your failure surface for zero benefit.
Interview Bridge: "Why should etcd clusters always have an odd number of members?" is a common Kubernetes interview question. The answer: even numbers require the same quorum as the next odd number but tolerate fewer failures. A 4-member cluster needs 3 for quorum (same as 5) but tolerates only 1 failure (vs 2 for 5). Even sizes add cost without improving fault tolerance.
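The quorum arithmetic in the table above can be verified in a few lines of shell (integer division performs the floor):

```shell
# Quorum for an N-member cluster: floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 1 2 3 4 5 7; do
  q=$(quorum "$n")
  # A cluster survives as long as a quorum of members remains: N - quorum failures
  echo "members=$n quorum=$q tolerates=$(( n - q )) failures"
done
```

Running it reproduces the table: 3 and 4 members both tolerate 1 failure, 5 members tolerate 2.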
Part 4: The Fix — Disk Latency Was the Killer¶
Back to our incident. The endpoint health check showed 500ms+ response times. Let's find
out why.
Checking disk performance¶
# Quick synthetic test on the etcd data directory
dd if=/dev/zero of=/var/lib/etcd/test bs=512 count=1000 oflag=dsync 2>&1 | tail -1
rm /var/lib/etcd/test
If those 1,000 synced writes take more than about 10 seconds — i.e., over 10ms per fsync — your disk is too slow for etcd under load.
The real metric to watch is the WAL fsync duration — how long it takes etcd to write its write-ahead log to disk and get confirmation that the data is durable. In Prometheus:
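The query is the same one used in Part 8's monitoring section:

```
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
```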
| WAL fsync p99 | Status | Action |
|---|---|---|
| < 10ms | Healthy | Normal operation |
| 10–50ms | Degraded | Investigate disk I/O, check for competing workloads |
| 50–100ms | Critical | Leader elections likely. Move to SSD immediately |
| > 100ms | Emergency | Cluster is in an election loop. Expect API server errors |
War Story: A team running Kubernetes on AWS used gp3 EBS volumes for their etcd data directory. gp3 provides 3,000 baseline IOPS — plenty for most workloads. But during a spike in Kubernetes API activity (a CI/CD pipeline running 200 parallel jobs), etcd's write-ahead log needed sustained sequential writes faster than gp3 could deliver. WAL fsync latency spiked above 100ms. The leader couldn't replicate fast enough. Followers timed out and triggered elections. The new leader also couldn't keep up. The cluster entered an election loop — leader elected, overwhelmed, new election, repeat — for 12 minutes. The fix: moving etcd to io2 volumes with provisioned IOPS. The deeper fix: giving etcd its own dedicated volume separate from the OS and kubelet.
The fix we applied¶
In our incident, etcd was sharing a disk with the operating system, kubelet, and container image storage. A large image pull flooded the disk I/O, starving etcd's WAL writes.
# Check what's sharing the disk with etcd
df -h /var/lib/etcd
lsof +D /var/lib/etcd/../ # what else is writing to this filesystem
The immediate fix was migrating etcd's data to a dedicated NVMe volume. The longer-term fix was adding monitoring on WAL fsync latency with alerts at 10ms.
Performance tuning reference¶
| Parameter | Default | Production | Why |
|---|---|---|---|
| Storage | OS default | Dedicated NVMe SSD | WAL fsync is the #1 bottleneck |
| `--heartbeat-interval` | 100ms | 500ms cross-AZ | Must exceed network RTT |
| `--election-timeout` | 1000ms | 5–10x heartbeat | Too low = spurious elections |
| `--quota-backend-bytes` | 2 GB | 8 GB | Default is too small for production |
Gotcha: Cross-AZ etcd members with default timeouts cause spurious elections. Network latency of 5–15ms between AZs eats into the 100ms heartbeat interval. Either increase timeouts or (better) keep all etcd members in the same AZ.
Part 5: Compaction and Defragmentation — Keeping etcd Healthy¶
etcd doesn't just store the current value of a key — it keeps a history of every revision. When you update a ConfigMap 50 times, etcd has 50 versions of that key. This is what enables watches (the API server can ask "what changed since revision 4892800?"), but it also means the database grows indefinitely without maintenance.
Compaction: deleting old history¶
Compaction tells etcd "I don't need any revisions older than X." Kubernetes runs auto-compaction every 5 minutes by default, but understanding the manual process matters for emergencies:
# Get the current revision
REV=$(etcdctl endpoint status --write-out=json $ETCD_CERTS | \
jq '.[0].Status.header.revision')
echo "Current revision: $REV"
# Compact everything up to this revision
etcdctl compact $REV $ETCD_CERTS
After compaction, the old revisions are marked as free — but the database file on disk doesn't shrink. That's where defragmentation comes in.
Defragmentation: reclaiming disk space¶
# IMPORTANT: defrag blocks ALL reads and writes on the target member
# Run on one member at a time, starting with non-leaders
# Defrag member 1 (non-leader)
etcdctl defrag --endpoints=https://10.0.1.10:2379 $ETCD_CERTS
# Wait for it to rejoin, verify health
etcdctl endpoint health --cluster $ETCD_CERTS
# Defrag member 2 (non-leader)
etcdctl defrag --endpoints=https://10.0.1.12:2379 $ETCD_CERTS
# Finally, defrag the leader (will trigger a leader election)
etcdctl defrag --endpoints=https://10.0.1.11:2379 $ETCD_CERTS
Gotcha: Never defrag all members simultaneously. A team once scripted etcdctl defrag --endpoints=<all-three-members> thinking it would be faster. Defrag blocks all reads and writes on the target member for the duration — on a 4 GB database, that can be 30+ seconds. With all three members blocked, the cluster lost quorum and the API server returned errors for 45 seconds. Always defrag one at a time, non-leaders first, leader last.
The NOSPACE emergency sequence¶
When etcd hits its quota and enters alarm mode, memorize this sequence:
# 1. Confirm the alarm
etcdctl alarm list $ETCD_CERTS
# 2. Get current revision
REV=$(etcdctl endpoint status --write-out=json $ETCD_CERTS | \
jq '.[0].Status.header.revision')
# 3. Compact old revisions
etcdctl compact $REV $ETCD_CERTS
# 4. Defrag (one member at a time!)
etcdctl defrag --endpoints=https://10.0.1.10:2379 $ETCD_CERTS
# 5. Disarm the alarm
etcdctl alarm disarm $ETCD_CERTS
# 6. Verify writes work again
etcdctl put /test/healthcheck "ok" $ETCD_CERTS
etcdctl del /test/healthcheck $ETCD_CERTS
Remember: The NOSPACE fix sequence: A-C-D-D — Alarm (list), Compact, Defrag, Disarm. Four steps, always in this order. You'll probably need this at 3 AM at least once.
Under the Hood: etcd uses bbolt (a fork of BoltDB) as its storage engine — a copy-on-write B+ tree. When a key is updated, the old page isn't overwritten; a new page is written and the old one is marked free. This is what makes defragmentation necessary: deleted data leaves "holes" in the file that can only be reclaimed by rewriting the entire database into a new, compact file. It's the same reason SQLite has VACUUM.
Flashcard Check #2¶
Q4: What's the default etcd database quota, and what happens when it's exceeded?
Default is 2 GB. When exceeded, etcd enters alarm mode and rejects all writes. The Kubernetes cluster becomes read-only — no new pods, no config changes, no scaling. Fix: compact, defrag, alarm disarm.
Q5: Why must you defrag one etcd member at a time?
Defrag blocks all reads and writes on the target member. If all members are defragging simultaneously, the cluster has no quorum and the API server returns errors.
Q6: What's the relationship between compaction and defragmentation?
Compaction removes old key revision history, marking space as free. Defragmentation reclaims that freed space on disk. You need both: compact first, then defrag.
Part 6: Backup and Restore — The One Backup That Matters Most¶
If you back up one thing in your entire infrastructure, back up etcd. Without an etcd backup, a total cluster failure means rebuilding everything from scratch — every Deployment, Service, Secret, RBAC rule, and CRD, from memory or (hopefully) from your GitOps repo.
Creating a snapshot¶
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 $ETCD_CERTS
Always verify the snapshot immediately:
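The verification command (using the same snapshot path as the restore example later in this lesson):

```shell
etcdctl snapshot status /backup/etcd-20260323-094700.db -w table
```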
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9f8c2d1a | 482918 | 1247 | 4.2 MB |
+----------+----------+------------+------------+
War Story: A team had etcd backups running hourly via a cron job. For six months, the backups ran without complaint. When they finally needed to restore after a catastrophic failure, every snapshot was a zero-byte file. The backup script was writing to a volume that had quietly filled up months ago, and etcdctl snapshot save with no disk space produces an empty file without a non-zero exit code in some versions. The team rebuilt their entire cluster from their Helm charts and GitOps repo — a 14-hour process. Now they run etcdctl snapshot status after every backup and alert if the file size is below 1 MB.
Backup schedule¶
Minimum: hourly snapshots, 7-day retention, stored off-cluster. The key additions beyond
the basic snapshot save command: verify with snapshot status after every save, and copy
to object storage (S3, GCS) automatically. A cron job on each control plane node works.
Test restores monthly — an untested backup is not a backup.
Restoring from a snapshot¶
Restore is destructive — you're creating a new cluster from the snapshot. Every member must be restored. The steps:
# 1. Stop API server and etcd on ALL control plane nodes (kubeadm)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# 2. Restore on EACH member (same snapshot, different --name and --initial-advertise-peer-urls)
etcdctl snapshot restore /backup/etcd-20260323-094700.db \
--data-dir=/var/lib/etcd-restored \
--name=etcd-0 \
--initial-cluster="etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380" \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# 3. Point etcd config to new data dir, restore static pod manifests, verify
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
etcdctl endpoint health --cluster $ETCD_CERTS
Gotcha: Restoring on only one member out of three is a common mistake. The restored member has a different cluster identity than the others. The cluster won't elect a leader. Always restore on every member from the same snapshot.
Part 7: etcd in kubeadm Clusters¶
In kubeadm, etcd runs as a static pod (manifest at /etc/kubernetes/manifests/etcd.yaml).
Data lives in /var/lib/etcd/member/. When kubectl isn't responding (because etcd is down),
use crictl logs $(crictl ps --name etcd -q) to read etcd logs directly.
Certificate lifecycle — the silent killer¶
kubeadm etcd certificates expire after 1 year. No warning — one day everything works, the next day members refuse to talk with TLS handshake errors.
# Check all etcd cert expiry dates
for cert in /etc/kubernetes/pki/etcd/*.crt; do
echo "$cert: $(openssl x509 -in "$cert" -noout -enddate)"
done
# Renew and restart
kubeadm certs renew all
systemctl restart kubelet
Gotcha: Set a monitoring alert that fires 30 days before certificate expiry. Or set a calendar reminder. Anything is better than discovering expiry during a production outage.
Part 8: Monitoring etcd with Prometheus¶
etcd exposes metrics on port 2381 (or 2379 with a /metrics endpoint, depending on
configuration). These are the metrics that predict trouble before users notice.
The critical four¶
# 1. WAL fsync latency — the single best health indicator
histogram_quantile(0.99,
rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
# Alert if > 10ms
# 2. Leader changes — stability indicator
increase(etcd_server_leader_changes_seen_total[1h])
# Alert if > 3 per hour
# 3. Database size — capacity planning
etcd_mvcc_db_total_size_in_bytes
# Alert if > 6GB (default max is 8GB)
# 4. Proposal failures — consensus health
rate(etcd_server_proposals_failed_total[5m])
# Alert if any failures
What rising raftTerm tells you¶
Every leader election increments the raft term. A raftTerm that's jumping by 2–3+ per hour means the cluster is unstable — either the network between control plane nodes is flaky, or the disk is too slow for the heartbeat interval. Investigate both immediately.
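A quick way to eyeball the term from the command line — a sketch that assumes `jq` is installed and the JSON field names emitted by etcd 3.5's etcdctl:

```shell
etcdctl endpoint status --cluster -w json $ETCD_CERTS | \
  jq '.[] | {endpoint: .Endpoint, term: .Status.raftTerm}'
```

Run it twice a few minutes apart; a stable cluster shows the same term both times.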
Flashcard Check #3¶
Q7: What is the single most important Prometheus metric for etcd health?
etcd_disk_wal_fsync_duration_seconds. If the p99 exceeds 10ms, the cluster is at risk of leader instability. This metric captures the bottleneck that causes most etcd performance problems.
Q8: How do you tell if etcd leader elections are happening too frequently?
Check the raftTerm via etcdctl endpoint status — if it jumps by more than 2–3 per hour, the cluster is unstable. In Prometheus, monitor increase(etcd_server_leader_changes_seen_total[1h]).
Q9: What's the correct sequence when etcd hits its storage quota?
A-C-D-D: Alarm list, Compact, Defrag (one member at a time), Disarm.
Part 9: Common Failure Modes at a Glance¶
| Failure | Symptoms | Fix |
|---|---|---|
| Quorum loss | API server errors, no scheduling, health check timeouts | Restore from snapshot; last resort: --force-new-cluster on a survivor |
| Slow disk | Frequent leader elections, WAL fsync > 50ms | Dedicated SSD/NVMe. On AWS, io1/io2 with provisioned IOPS |
| Cert expiry | TLS handshake errors, members can't communicate | kubeadm certs renew all && systemctl restart kubelet |
| Space exceeded | mvcc: database space exceeded, API 500s | A-C-D-D: alarm list, compact, defrag, disarm |
| Network partition | Stale reads from minority side, brief dual-leader in monitoring | Resolve partition; Raft prevents true split-brain, members reconcile |
War Story: A team upgrading etcd on a 3-member cluster removed two members before adding replacements. Quorum was immediately lost. The API server went read-only. Recovery required --force-new-cluster on the survivor, which lost an uncommitted RBAC change. The correct procedure: add the new member first, verify it's healthy, then remove the old one. One at a time. Always.
Part 10: The History — From CoreOS to CNCF¶
2013: Brandon Philips and CoreOS create etcd as the config store for CoreOS Linux. They choose Raft because, unlike Paxos, they can actually understand and debug it.
2014: Kubernetes launches. Google's team picks etcd over ZooKeeper (too complex, Java dependency) and Consul (newer, less proven).
2016: etcd v3 ships — a complete rewrite. REST API replaced with gRPC, storage gets MVCC and proper watches. Kubernetes only supports v3.
2018: Red Hat acquires CoreOS for ~$250 million. etcd graduates to a CNCF top-level project, alongside Kubernetes and Prometheus. IBM acquires Red Hat for $34 billion in 2019.
Trivia: Every write goes through the etcd leader — writes are not distributed across members. Adding more members (3 to 5 to 7) improves fault tolerance but actually increases write latency because more acknowledgments are needed for quorum.
Exercises¶
Exercise 1: Read etcd's state (quick win, 2 minutes)¶
You have a kubeadm cluster. List the top 5 resource types consuming space in etcd.
# Fill in the missing parts:
etcdctl get /registry --prefix --keys-only $ETCD_CERTS | \
_____ | sort | uniq -c | sort -rn | head -5
Answer

The `awk -F/ '{print $3}'` splits each key by `/` and prints the third field — the resource type (pods, deployments, configmaps, etc.).

Exercise 2: Quorum math (think, don't code)¶
For each scenario, determine: can the cluster accept writes?
- 3-member cluster, 1 member down
- 3-member cluster, 2 members down
- 5-member cluster, 2 members down
- 5-member cluster, 3 members down
- 4-member cluster, 2 members down
Answers

1. **Yes.** 2 of 3 alive. Quorum = 2. Satisfied.
2. **No.** 1 of 3 alive. Quorum = 2. Not satisfied. Read-only.
3. **Yes.** 3 of 5 alive. Quorum = 3. Satisfied.
4. **No.** 2 of 5 alive. Quorum = 3. Not satisfied.
5. **No.** 2 of 4 alive. Quorum = 3 (floor(4/2) + 1 = 3). Not satisfied. This is why 4-member clusters are worse than 3 — same quorum as 5, but less fault tolerance.

Exercise 3: Diagnose the incident (judgment call)¶
etcdctl endpoint status shows all three members at 7.9 GB DB size. etcdctl alarm list
shows NOSPACE on all members. One follower is 13 raft indexes behind the leader.
What's wrong, and what do you do — in order?
Answer

Database nearly at the 8 GB max. NOSPACE alarm = read-only cluster. Fix in order:

1. Get the current revision
2. `etcdctl compact $REV`
3. Defrag each member, one at a time, non-leaders first
4. `etcdctl alarm disarm`
5. Verify writes with a test key

The follower that's 13 raft indexes behind points to a slow disk or network on that member — check its WAL fsync latency once the alarm is cleared.

Exercise 4: Design a backup strategy (design)¶
Your company runs a 5-node etcd cluster backing a production Kubernetes cluster with 500 pods and frequent deployments. Design: backup frequency, retention, storage location, verification, and restore testing cadence.
Answer (one good approach)

- **Frequency:** Every 30 minutes (hourly is minimum; more frequent for active clusters)
- **Retention:** 7 days locally, 30 days in S3/GCS with versioning
- **Storage:** Dedicated local volume, replicated to object storage. Never on the same disk as etcd.
- **Verification:** `etcdctl snapshot status` after every save; alert if total keys = 0 or file size < 1 MB
- **Restore testing:** Monthly, to a separate isolated cluster. An untested backup is not a backup.

Cheat Sheet¶
Health Assessment (run these first)¶
| What | Command |
|---|---|
| Cluster health + latency | etcdctl endpoint health --cluster |
| DB size + leader + raft index | etcdctl endpoint status -w table --cluster |
| Member list + peer URLs | etcdctl member list -w table |
| Check alarms | etcdctl alarm list |
| Certificate expiry | openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -enddate |
Backup & Restore¶
| What | Command |
|---|---|
| Create snapshot | etcdctl snapshot save /path/snapshot.db |
| Verify snapshot | etcdctl snapshot status /path/snapshot.db -w table |
| Restore (per member) | etcdctl snapshot restore snapshot.db --data-dir=/var/lib/etcd-new --name=<name> --initial-cluster=<...> --initial-advertise-peer-urls=<url> |
Maintenance¶
| What | Command |
|---|---|
| Compact to current revision | REV=$(etcdctl endpoint status -w json \| jq '.[0].Status.header.revision'); etcdctl compact $REV |
| Defrag (one member at a time!) | etcdctl defrag --endpoints=<single-endpoint> |
| Clear NOSPACE alarm | etcdctl alarm disarm |
| Audit space by resource type | etcdctl get /registry --prefix --keys-only \| awk -F/ '{print $3}' \| sort \| uniq -c \| sort -rn |
Prometheus Metrics to Alert On¶
| Metric | Threshold | Meaning |
|---|---|---|
| `etcd_disk_wal_fsync_duration_seconds` p99 | > 10ms | Disk too slow |
| `etcd_server_leader_changes_seen_total` | > 3/hr | Cluster unstable |
| `etcd_mvcc_db_total_size_in_bytes` | > 6 GB | Approaching quota |
| `etcd_server_proposals_failed_total` | Any > 0 | Consensus failures |
Takeaways¶
- **etcd is the single source of truth for Kubernetes.** When etcd is slow, the API server is slow. When etcd is down, the cluster is brain-dead. Protect it accordingly.
- **Disk latency is the #1 etcd killer.** WAL fsync > 10ms means trouble. Use dedicated SSDs. Monitor `wal_fsync_duration_seconds` religiously.
- **Always run 3 or 5 members, never 2 or 4.** Even numbers give you the same quorum requirement as the next odd number but tolerate fewer failures. It's all cost, no benefit.
- **Back up etcd and verify the backups.** Hourly minimum. Store off-cluster. Run `snapshot status` after every save. Test restores. An unverified backup is a false promise.
- **The NOSPACE fix is A-C-D-D.** Alarm list, Compact, Defrag (one at a time!), Disarm. You'll need this at 3 AM. Memorize it.
- **Raft consensus means majority rules.** Writes need quorum to commit. Leader handles all writes. More members = more fault tolerance but higher write latency.
Related Lessons¶
- The Split-Brain Nightmare — deep dive into network partitions and consensus
- Understanding Distributed Systems Without a PhD — CAP theorem, consistency models, and the fundamentals that underpin etcd
- What Happens When You kubectl apply — end-to-end trace through API server, etcd, scheduler, and kubelet
- The Backup Nobody Tested — why untested backups fail and how to build backup confidence