
Progressive Hints

Hint 1 (after 5 min)

Look at the two kubectl get configmap outputs. The same ConfigMap (feature-flags) has different values depending on which context you query. From the default context: enable_new_checkout: "true", max_cart_items: "50". From the read-replica context: enable_new_checkout: "false", max_cart_items: "25". The resourceVersion is also different (289102 vs 288847).
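The comparison this hint asks for can be sketched as a small Python check. The dict shapes below are illustrative stand-ins for the two kubectl outputs, not actual kubectl JSON; the values come from the hint itself.

```python
# Hypothetical snapshots of the feature-flags ConfigMap as seen from the
# two kubectl contexts; values taken from the hint.
default_ctx = {
    "resourceVersion": "289102",
    "data": {"enable_new_checkout": "true", "max_cart_items": "50"},
}
read_replica_ctx = {
    "resourceVersion": "288847",
    "data": {"enable_new_checkout": "false", "max_cart_items": "25"},
}

def diff_configmaps(a, b):
    """Return the keys whose values differ between the two data sections."""
    keys = set(a["data"]) | set(b["data"])
    return {k: (a["data"].get(k), b["data"].get(k))
            for k in keys if a["data"].get(k) != b["data"].get(k)}

print(diff_configmaps(default_ctx, read_replica_ctx))
# The resourceVersions disagree as well:
print(default_ctx["resourceVersion"] != read_replica_ctx["resourceVersion"])  # True
```

Every managed key differs between the two contexts, which is the first signal that the two views are not of the same object state.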

Hint 2 (after 10 min)

The etcd cluster currently looks healthy: a leader is elected and all members are in sync at raft index 289341. But etcd_server_leader_changes_seen_total is 12, which is unusually high, and etcd_server_proposals_failed_total is 847, meaning hundreds of proposals failed. The log shows a leadership transfer at 03:22, so there was a period of instability. Now compare the Terraform state: it records enable_new_checkout = "false" at resourceVersion: "288847", but the live ConfigMap has enable_new_checkout = "true" at resourceVersion: "289102". Someone manually changed the ConfigMap after Terraform applied it.
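The metric reading in this hint amounts to a threshold check. A minimal sketch, using the counter values from the hint; the threshold numbers are illustrative assumptions, not official etcd guidance:

```python
# Counter values quoted in the hint.
metrics = {
    "etcd_server_leader_changes_seen_total": 12,
    "etcd_server_proposals_failed_total": 847,
}

def instability_signals(m, max_leader_changes=3, max_failed_proposals=0):
    """Flag signs of past instability even when the cluster looks healthy now.

    The thresholds are assumed values for illustration only.
    """
    signals = []
    if m["etcd_server_leader_changes_seen_total"] > max_leader_changes:
        signals.append("frequent leader elections")
    if m["etcd_server_proposals_failed_total"] > max_failed_proposals:
        signals.append("failed raft proposals")
    return signals

print(instability_signals(metrics))
```

Both counters trip their thresholds here, which is why a currently healthy endpoint status is not enough to rule out a recent disruption window.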

Hint 3 (after 15 min)

The full picture: this is a platform whose feature flags are managed in a Kubernetes ConfigMap. Terraform defines the ConfigMap with enable_new_checkout = "false". During a window of etcd instability (12 leader changes, 847 failed proposals, possibly triggered by a network partition), someone manually edited the ConfigMap with kubectl edit, changing enable_new_checkout to true and max_cart_items to 50. The etcd cluster has since healed and currently shows consistent state, but:

1. The live ConfigMap no longer matches the Terraform state (drift).
2. The read-replica context may be hitting a stale API server cache or a different cluster entirely.
3. The application is serving the new checkout flow (v2_new), which was not supposed to be enabled yet.
4. The next terraform apply will revert the manual change, potentially breaking the live checkout flow.
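The drift described above can be sketched as a comparison between what Terraform recorded and what the cluster serves now. The dict shapes and field names are illustrative assumptions, not the real terraform state or kubectl output formats; the values come from the hints.

```python
# What the Terraform state recorded at apply time (values from the hints).
terraform_state = {"enable_new_checkout": "false", "resourceVersion": "288847"}
# What the live ConfigMap holds after the manual kubectl edit.
live = {"enable_new_checkout": "true", "max_cart_items": "50",
        "resourceVersion": "289102"}

def detect_drift(desired, actual):
    """Keys Terraform manages whose live value no longer matches the state.

    resourceVersion is excluded: it changes on every write and is not a
    managed field, so it indicates drift but is not itself drifted config.
    """
    return {k: {"state": v, "live": actual.get(k)}
            for k, v in desired.items()
            if k != "resourceVersion" and actual.get(k) != v}

drift = detect_drift(terraform_state, live)
if drift:
    print("next terraform apply would revert:", drift)
```

This is the core of why the situation is dangerous: the same comparison that reveals the drift also tells you exactly which live values the next terraform apply will silently overwrite.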