Answer Key: The Cluster That Disagrees With Itself¶
The System¶
A platform infrastructure where feature flags are managed as a Kubernetes ConfigMap, controlled by Terraform:
```
[Terraform] --manages--> [ConfigMap: feature-flags]
                                  |
                        [platform namespace]
                                  |
            [api-servers] --read--> [etcd cluster (3 members)]
                  |                           |
         [Application pods]             raft consensus
                  |
      checkout flow controlled by
      enable_new_checkout flag
```
Terraform is the source of truth for infrastructure. The feature-flags ConfigMap controls application behavior (checkout flow version, cart limits). The etcd cluster stores the Kubernetes state.
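The Terraform side of this setup might look roughly like the following. This is a sketch, not the actual configuration from the incident; the resource name and provider are assumptions based on the standard hashicorp/kubernetes provider:

```hcl
# Hypothetical sketch of how the feature-flags ConfigMap could be
# managed by Terraform; resource and attribute names are assumed.
resource "kubernetes_config_map" "feature_flags" {
  metadata {
    name      = "feature-flags"
    namespace = "platform"
  }

  data = {
    enable_new_checkout = "false"
    max_cart_items      = "25"
  }
}
```

With this in place, Terraform's state file records the values it last applied, which is exactly what makes the drift below detectable.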
What's Broken¶
Root cause: Terraform state drift caused by a manual ConfigMap edit during (or after) a period of etcd instability.
The sequence of events:
1. Dec 14, 22:15 — Terraform applies the ConfigMap with enable_new_checkout: "false" and max_cart_items: "25" (resourceVersion 288847)
2. Dec 15, 03:22 — etcd experiences instability: leadership transfers (term 46 to 47), 847 failed proposals. This may have been caused by a network partition, disk I/O issues, or maintenance
3. During or after the instability — someone manually edits the ConfigMap: kubectl edit configmap feature-flags -n platform, changing enable_new_checkout to "true" and max_cart_items to "50" (creating resourceVersion 289102)
4. Current state — etcd cluster is now healthy and consistent (all members at raft index 289341), but:
- The live ConfigMap says enable_new_checkout: "true" (resourceVersion 289102)
- Terraform state records enable_new_checkout: "false" (resourceVersion 288847)
- The application is serving the new checkout flow (which was not supposed to be live)
- The read-replica context shows the Terraform version (possibly a stale cache, different cluster, or the state before the manual edit)
Key clue: Two different resourceVersion values for the same ConfigMap from different contexts, combined with Terraform state showing different values than the live resource.
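That clue can be checked mechanically. Below is a minimal sketch of the comparison, using hardcoded stand-ins for what `kubectl` would return from each context; the JSON values are illustrative, not captured output:

```shell
# Stand-ins for the output of:
#   kubectl get configmap feature-flags -n platform -o jsonpath='{.data}'
# run against the default context and against --context=read-replica.
live='{"enable_new_checkout":"true","max_cart_items":"50"}'
replica='{"enable_new_checkout":"false","max_cart_items":"25"}'

# If the two contexts disagree, something (cache lag, wrong cluster,
# or a pre-edit snapshot) is serving a stale view.
if [ "$live" = "$replica" ]; then
  echo "contexts agree"
else
  echo "DRIFT: contexts disagree"
fi
```

The same shape of check works for comparing the Terraform state value against the live object.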
The Fix¶
Immediate (resolve the inconsistency)¶
- Determine the intended state: is the new checkout supposed to be enabled?
- If the manual edit was unauthorized, revert the live ConfigMap to the values recorded in Terraform state.
- If the manual edit was intentional, update the Terraform configuration to match the live values, then run `terraform plan` to confirm that config, state, and cluster agree.
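The two reconciliation paths might look like this. This is a sketch only: it assumes the values from step 1 of the timeline are the intended ones and that the job has cluster and Terraform credentials:

```shell
# Option A (unauthorized edit): revert the live object to Terraform's values.
kubectl patch configmap feature-flags -n platform --type merge \
  -p '{"data":{"enable_new_checkout":"false","max_cart_items":"25"}}'

# Option B (intentional edit): update the data block in the .tf file
# to match the live values, then apply so state catches up:
terraform apply

# Either way, finish by confirming there is no remaining drift:
terraform plan   # expect "No changes."
```

Note that with Option A, the patch bumps `resourceVersion` again; the goal is agreement on values, not on a particular `resourceVersion`.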
Permanent¶
- Prevent manual edits to Terraform-managed resources with a ValidatingAdmissionWebhook or an OPA/Gatekeeper policy.
- Add Terraform drift detection to CI so divergence between state and live resources is caught automatically.
- Investigate the stale read-replica context: determine whether it is a lagging cache, a different cluster, or a view from before the manual edit.
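The drift-detection step can lean on `terraform plan`'s documented `-detailed-exitcode` flag (exit 0 = no changes, 2 = changes pending). A sketch of a scheduled CI step, assuming the job already has provider credentials and an initialized working directory:

```shell
#!/usr/bin/env bash
# CI drift check sketch: fail the pipeline when live infrastructure
# has diverged from what Terraform config and state describe.
set -u

terraform plan -detailed-exitcode -no-color
case $? in
  0) echo "no drift" ;;
  2) echo "drift detected: live resources differ from Terraform"; exit 1 ;;
  *) echo "terraform plan failed"; exit 1 ;;
esac
```

Running this on a schedule would have flagged the 03:22 manual edit within one cycle instead of leaving it to be discovered in production.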
Verification¶
```shell
# Check Terraform state matches live
terraform plan
# Should show "No changes. Your infrastructure matches the configuration."

# Verify ConfigMap is consistent across contexts
kubectl get configmap feature-flags -n platform -o jsonpath='{.data}'
kubectl get configmap feature-flags -n platform -o jsonpath='{.data}' --context=read-replica

# Verify etcd health
etcdctl endpoint health --cluster

# Check application behavior matches intended feature flags
curl -s https://api.megacorp.io/api/v1/checkout/config
```
Artifact Decoder¶
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | ConfigMap has different values from different contexts; etcd cluster looks healthy now | etcd status is clean — the partition has healed, hiding the history |
| Metrics | 12 leader changes and 847 failed proposals indicate past instability; checkout v2 is active | Current metrics look healthy; has_leader=1 and consistent raft index hide the drift |
| IaC Snippet | Terraform state says enable_new_checkout=false but the live object says true (state drift) | The Terraform code looks straightforward; the drift is between state and reality |
| Log Lines | Leadership transfer at 03:22 shows instability window; app confirms new checkout is live | The etcd leadership log looks like a routine election, not a crisis |
Skills Demonstrated¶
- Detecting Terraform state drift and understanding its implications
- Recognizing the signs of past etcd instability (leader changes, failed proposals)
- Understanding Kubernetes resourceVersion semantics
- Evaluating the risks of manual changes to Terraform-managed resources
- Thinking through the resolution of conflicting sources of truth