Ops Archaeology: The Cluster That Disagrees With Itself
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L3
Estimated time: 40 min
Domains: etcd, Distributed Systems, Terraform State, Configuration Management
Artifact 1: CLI Output
$ etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT                  | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd-0:2379       | 8e9e05c52164694d | 3.5.12  | 4.1 MB  | false     | false      | 47        | 289341     | 289341             |        |
| https://etcd-1:2379       | 3a57933972cb8131 | 3.5.12  | 4.1 MB  | true      | false      | 47        | 289341     | 289341             |        |
| https://etcd-2:2379       | bf9071f4639c75cc | 3.5.12  | 4.1 MB  | false     | false      | 47        | 289341     | 289341             |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
$ kubectl get configmap feature-flags -n platform -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: platform
  resourceVersion: "289102"
data:
  enable_new_checkout: "true"
  enable_dark_mode: "true"
  max_cart_items: "50"
$ kubectl get configmap feature-flags -n platform -o yaml --context=read-replica
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: platform
  resourceVersion: "288847"
data:
  enable_new_checkout: "false"
  enable_dark_mode: "true"
  max_cart_items: "25"
Artifact 2: Metrics
# etcd cluster metrics
etcd_server_has_leader 1
etcd_server_leader_changes_seen_total 12
etcd_network_peer_round_trip_time_seconds{To="etcd-0",quantile="0.99"} 0.008
etcd_network_peer_round_trip_time_seconds{To="etcd-2",quantile="0.99"} 0.011
etcd_server_proposals_committed_total 289341
etcd_server_proposals_failed_total 847
# Application-level metrics
checkout_flow_version{version="v2_new"} 1
checkout_flow_version{version="v1_legacy"} 0
# Terraform state
# Last successful apply: 2024-12-14T22:15:03Z
# Resources managed: 47
# State file size: 182KB
Artifact 3: Infrastructure Code
# From: terraform/platform/feature-flags.tf
resource "kubernetes_config_map" "feature_flags" {
  metadata {
    name      = "feature-flags"
    namespace = "platform"
  }
  data = {
    enable_new_checkout = "false"
    enable_dark_mode    = "true"
    max_cart_items      = "25"
  }
}
# terraform.tfstate shows:
# kubernetes_config_map.feature_flags:
# data.enable_new_checkout = "false"
# data.max_cart_items = "25"
# resourceVersion = "288847"
Artifact 4: Log Lines
[2024-12-15T03:22:14Z] etcd-1 | {"level":"warn","msg":"leadership transferred","from":"8e9e05c52164694d","to":"3a57933972cb8131","term":46}
[2024-12-15T03:22:18Z] api-server | W1215 03:22:18.847291 event.go:364] Server is becoming the leader, discarding stale reads
[2024-12-15T09:45:02Z] platform | INFO Feature flag 'enable_new_checkout' = true — serving new checkout flow
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?
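As a possible starting point for the "what would you check first" question, the sketch below lines up the three views of the feature-flags ConfigMap that appear in the artifacts and reports only the keys on which they disagree. It is an illustration, not a diagnosis: the dictionaries are copied by hand from Artifact 1 and Artifact 3, and the names `PRIMARY`, `READ_REPLICA`, `TERRAFORM`, and `diff_views` are hypothetical helpers, not part of the system under investigation.

```python
from typing import Dict, Mapping

# Flag values copied by hand from the artifacts above (assumption: in practice
# these would be loaded from `kubectl ... -o json` output and the .tf file).
PRIMARY = {
    "enable_new_checkout": "true",
    "enable_dark_mode": "true",
    "max_cart_items": "50",
}
READ_REPLICA = {
    "enable_new_checkout": "false",
    "enable_dark_mode": "true",
    "max_cart_items": "25",
}
TERRAFORM = {
    "enable_new_checkout": "false",
    "enable_dark_mode": "true",
    "max_cart_items": "25",
}


def diff_views(views: Mapping[str, Mapping[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return, per key, the value seen in each named view, but only for keys
    whose values are not identical everywhere. Absent keys show as "<missing>"."""
    all_keys = set()
    for data in views.values():
        all_keys.update(data)

    drift: Dict[str, Dict[str, str]] = {}
    for key in sorted(all_keys):
        values = {name: data.get(key, "<missing>") for name, data in views.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


if __name__ == "__main__":
    views = {
        "primary": PRIMARY,
        "read-replica": READ_REPLICA,
        "terraform": TERRAFORM,
    }
    for key, values in diff_views(views).items():
        print(key, values)
```

Seeing which views agree with each other, and which one stands alone, narrows down where to look next. Comparing the `resourceVersion` values across the same views is a natural follow-up.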