

etcd Operations Skill Check

Rate yourself 0-2 on each item: 0 = never done, 1 = done with help, 2 = confident

Health & Monitoring

  • Check etcd cluster health with etcdctl endpoint health
  • Check the member list and leader status with etcdctl member list and etcdctl endpoint status
  • Monitor etcd metrics (db size, WAL fsync duration, leader changes)
  • Set up alerts for etcd latency and disk space
  • Understand quorum requirements (floor(N/2)+1 of N voting members, so a 3-member cluster tolerates 1 failure)
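
The quorum rule in the last item can be checked with a few lines of arithmetic. This is a minimal sketch; the function names quorum and fault_tolerance are illustrative and not part of any etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one voting member")
    return members // 2 + 1  # integer division, i.e. floor(N/2) + 1


def fault_tolerance(members: int) -> int:
    """How many voting members can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

A 4-member cluster needs 3 votes and tolerates 1 failure, the same as a 3-member cluster, which is why odd member counts are the usual choice.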

Backup & Restore

  • Take a snapshot with etcdctl snapshot save
  • Verify a snapshot with etcdctl snapshot status
  • Restore a cluster from a snapshot (single node and multi-node)
  • Set up automated periodic etcd backups
  • Test restore procedure in a non-production environment
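
A periodic backup job usually comes down to a timestamped snapshot save followed by a status check. The sketch below only builds the command lines and does not run them. The helper name backup_commands, the file naming scheme, and the example directory are assumptions for illustration, and TLS flags such as --cacert, --cert, and --key are left out. On newer etcd releases the snapshot status and restore subcommands are provided by etcdutl, so check which tool your version expects.

```python
import shlex
from datetime import datetime, timezone
from pathlib import PurePosixPath


def backup_commands(endpoint: str, backup_dir: str, now: datetime):
    """Return the snapshot path and the etcdctl commands to save and verify it."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical naming scheme: etcd-<UTC timestamp>.db
    snapshot = str(PurePosixPath(backup_dir) / f"etcd-{stamp}.db")
    save = ["etcdctl", f"--endpoints={endpoint}", "snapshot", "save", snapshot]
    status = ["etcdctl", "snapshot", "status", snapshot, "--write-out=table"]
    return snapshot, [save, status]


# Example values are assumptions; substitute your own endpoint and directory.
path, commands = backup_commands(
    "https://127.0.0.1:2379", "/var/backups/etcd", datetime.now(timezone.utc)
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) from a cron job or systemd timer once the printed commands look right.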

Space Management

  • Check current database size and quota
  • Compact etcd revision history
  • Defragment etcd to reclaim disk space
  • Handle the "mvcc: database space exceeded" error raised when the quota is reached
  • Disarm the NOSPACE alarm after space recovery
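
The space-recovery items above are normally run in a fixed order: read the current revision, compact up to it, defragment, then disarm the alarm. The sketch below derives that sequence from the JSON printed by etcdctl endpoint status --write-out=json. The helper name space_recovery_commands is hypothetical, the sample JSON is trimmed to the fields used, and the commands are only printed, not executed.

```python
import json
import shlex


def space_recovery_commands(endpoint_status_json: str, endpoint: str):
    """Build the compact, defrag, and alarm commands for one endpoint."""
    status = json.loads(endpoint_status_json)
    # Assumes the usual layout: a list with one entry per endpoint.
    revision = status[0]["Status"]["header"]["revision"]
    base = ["etcdctl", f"--endpoints={endpoint}"]
    return [
        base + ["compact", str(revision)],  # drop history older than the current revision
        base + ["defrag"],                  # release the freed space back to the filesystem
        base + ["alarm", "disarm"],         # clear NOSPACE once space is recovered
        base + ["alarm", "list"],           # confirm that no alarm remains
    ]


# Trimmed sample output; the values are made up for illustration.
sample = '[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":1516},"dbSize":2147483648}}]'
for cmd in space_recovery_commands(sample, "127.0.0.1:2379"):
    print(shlex.join(cmd))
```

Defragmentation blocks the member while it runs, so on a multi-member cluster it is usually applied to one member at a time.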

Cluster Operations

  • Add a new member to an etcd cluster
  • Remove a failed member from the cluster
  • Understand learner (non-voting) members
  • Migrate etcd from stacked to external topology
  • Handle leader election issues
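
Replacing a failed member ties several of the items above together: remove the dead member first, add the replacement as a learner, and promote it once it has caught up. This is a sketch under stated assumptions: the helper names, member ID, member name, and peer URL are all hypothetical, and the commands are printed rather than run. They should be issued against a healthy member's endpoint.

```python
import shlex


def replace_member_commands(endpoint: str, failed_id: str, new_name: str, new_peer_url: str):
    """Commands to remove a failed member and add its replacement as a learner."""
    base = ["etcdctl", f"--endpoints={endpoint}"]
    return [
        base + ["member", "list", "--write-out=table"],  # find the failed member's ID
        base + ["member", "remove", failed_id],
        base + ["member", "add", new_name, f"--peer-urls={new_peer_url}", "--learner"],
    ]


def promote_command(endpoint: str, learner_id: str):
    """Command to promote a caught-up learner to a voting member."""
    return ["etcdctl", f"--endpoints={endpoint}", "member", "promote", learner_id]


# Illustrative values only.
for cmd in replace_member_commands(
    "https://10.0.0.11:2379", "8211f1d0f64f3269", "etcd-3", "https://10.0.0.13:2380"
):
    print(shlex.join(cmd))
```

The learner's ID is only known after member add returns, which is why promotion is a separate step. A learner does not count toward quorum until it is promoted.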

Troubleshooting

  • Diagnose slow etcd responses (disk latency, network)
  • Investigate "request timed out" errors
  • Debug split-brain scenarios
  • Check etcd logs for warning patterns
  • Use etcdctl get / etcdctl watch for debugging

Scoring

Score   Level
0-6     Beginner: study the etcd primer and practice backups
7-12    Intermediate: practice restore and space management
13-18   Advanced: handle cluster member operations
19+     Expert: lead etcd incident response
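
Tallying the self-ratings against the bands above is simple arithmetic. The sketch below is one way to do it, assuming the ratings are collected as a list of 0, 1, or 2 values with one entry per checklist item. The name skill_level is illustrative.

```python
# Upper bound of each band, taken from the scoring table.
LEVELS = [(6, "Beginner"), (12, "Intermediate"), (18, "Advanced")]


def skill_level(ratings):
    """Return (total score, level name) for a list of 0-2 self-ratings."""
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("each rating must be 0, 1, or 2")
    total = sum(ratings)
    for upper, name in LEVELS:
        if total <= upper:
            return total, name
    return total, "Expert"


# Five items rated 2, five rated 1, the remaining fifteen rated 0.
print(skill_level([2] * 5 + [1] * 5 + [0] * 15))  # (15, 'Advanced')
```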
