Ops Archaeology: The Cluster That Disagrees With Itself

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L3
Estimated time: 40 min
Domains: etcd, Distributed Systems, Terraform State, Configuration Management

Artifact 1: CLI Output

$ etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd-0:2379       | 8e9e05c52164694d | 3.5.12  | 4.1 MB  |     false |      false |        47 |     289341 |             289341 |        |
| https://etcd-1:2379       | 3a57933972cb8131 | 3.5.12  | 4.1 MB  |      true |      false |        47 |     289341 |             289341 |        |
| https://etcd-2:2379       | bf9071f4639c75cc | 3.5.12  | 4.1 MB  |     false |      false |        47 |     289341 |             289341 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

$ kubectl get configmap feature-flags -n platform -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: platform
  resourceVersion: "289102"
data:
  enable_new_checkout: "true"
  enable_dark_mode: "true"
  max_cart_items: "50"

$ kubectl get configmap feature-flags -n platform -o yaml --context=read-replica
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: platform
  resourceVersion: "288847"
data:
  enable_new_checkout: "false"
  enable_dark_mode: "true"
  max_cart_items: "25"

Artifact 2: Metrics

# etcd cluster metrics
etcd_server_has_leader 1
etcd_server_leader_changes_seen_total 12
etcd_network_peer_round_trip_time_seconds{To="etcd-0",quantile="0.99"} 0.008
etcd_network_peer_round_trip_time_seconds{To="etcd-2",quantile="0.99"} 0.011
etcd_server_proposals_committed_total 289341
etcd_server_proposals_failed_total 847

# Application-level metrics
checkout_flow_version{version="v2_new"} 1
checkout_flow_version{version="v1_legacy"} 0

# Terraform state
# Last successful apply: 2024-12-14T22:15:03Z
# Resources managed: 47
# State file size: 182KB
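One way to put the etcd counters above in proportion is to relate failed proposals to committed ones. The sketch below uses only the two values shown in this artifact; the variable names are illustrative. Both counters are cumulative since the member started, and the artifact gives no time window, so the ratio says nothing about when the failures happened.

```python
# Counters copied from Artifact 2; names are chosen for this sketch.
proposals_committed = 289_341
proposals_failed = 847

# Failed proposals per committed proposal over the counter's lifetime.
failed_ratio = proposals_failed / proposals_committed

print(f"failed per committed proposal: {failed_ratio:.2%}")
```

This works out to roughly 0.29%. Judging whether that is a current problem would need the rate over a recent window, which these snapshots do not provide.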

Artifact 3: Infrastructure Code

# From: terraform/platform/feature-flags.tf
resource "kubernetes_config_map" "feature_flags" {
  metadata {
    name      = "feature-flags"
    namespace = "platform"
  }

  data = {
    enable_new_checkout = "false"
    enable_dark_mode    = "true"
    max_cart_items      = "25"
  }
}

# terraform.tfstate shows:
# kubernetes_config_map.feature_flags:
#   data.enable_new_checkout = "false"
#   data.max_cart_items = "25"
#   resourceVersion = "288847"

Artifact 4: Log Lines

[2024-12-15T03:22:14Z] etcd-1     | {"level":"warn","msg":"leadership transferred","from":"8e9e05c52164694d","to":"3a57933972cb8131","term":46}
[2024-12-15T03:22:18Z] api-server | W1215 03:22:18.847291 event.go:364] Server is becoming the leader, discarding stale reads
[2024-12-15T09:45:02Z] platform   | INFO  Feature flag 'enable_new_checkout' = true — serving new checkout flow

Your Mission

Reconstruct: What does this system do? What are its components and purpose?
Diagnose: What is currently broken or degraded, and why?
Propose: What would you do to fix it? What would you check first?
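A possible first step for the Diagnose question is to line up the three views of `feature-flags` that the artifacts expose and list only the keys that disagree. The sketch below is a minimal comparison, not a diagnosis: the values are copied by hand from Artifact 1 and Artifact 3, and the view labels and the `disagreements` helper are hypothetical names introduced for illustration.

```python
# Values copied from the artifacts; the view labels are chosen for this sketch.
views = {
    "default context": {
        "enable_new_checkout": "true",
        "enable_dark_mode": "true",
        "max_cart_items": "50",
    },
    "read-replica context": {
        "enable_new_checkout": "false",
        "enable_dark_mode": "true",
        "max_cart_items": "25",
    },
    "terraform declared": {
        "enable_new_checkout": "false",
        "enable_dark_mode": "true",
        "max_cart_items": "25",
    },
}


def disagreements(views):
    """Return {key: {view_name: value}} for every key whose value differs across views."""
    keys = sorted({key for data in views.values() for key in data})
    result = {}
    for key in keys:
        values = {name: data.get(key) for name, data in views.items()}
        if len(set(values.values())) > 1:
            result[key] = values
    return result


if __name__ == "__main__":
    for key, values in disagreements(views).items():
        print(key)
        for name, value in values.items():
            print(f"  {name}: {value}")
```

The comparison only shows which keys differ and in which view. Explaining why they differ still requires the resourceVersion values, the timestamps in the logs, and the time of the last Terraform apply.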