Answer Key: The Cluster That Disagrees With Itself

The System

A platform infrastructure where feature flags are managed as a Kubernetes ConfigMap, controlled by Terraform:

[Terraform] --manages--> [ConfigMap: feature-flags]
                               |
                          [platform namespace]
                               |
                    [api-servers] --read--> [etcd cluster (3 members)]
                         |                      |
                    [Application pods]     raft consensus
                         |
                    checkout flow controlled by
                    enable_new_checkout flag

Terraform is the source of truth for infrastructure. The feature-flags ConfigMap controls application behavior (checkout flow version, cart limits). The etcd cluster stores the Kubernetes state.
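The scenario does not show how the flag reaches the application; a common pattern (an assumption here) is injecting the ConfigMap as environment variables via envFrom, with the checkout code branching on the value:

```shell
# Hypothetical consumption pattern: the pod spec injects the feature-flags
# ConfigMap with envFrom, so the flag appears as an environment variable.
ENABLE_NEW_CHECKOUT="${ENABLE_NEW_CHECKOUT:-false}"   # default to the old flow

if [ "$ENABLE_NEW_CHECKOUT" = "true" ]; then
  echo "serving checkout v2"
else
  echo "serving checkout v1"
fi
```

With this pattern, a manual ConfigMap edit changes application behavior on the next pod restart, without any deployment or Terraform run.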

What's Broken

Root cause: Terraform state drift caused by a manual ConfigMap edit during (or after) a period of etcd instability.

The sequence of events:

  1. Dec 14, 22:15 — Terraform applies the ConfigMap with enable_new_checkout: "false" and max_cart_items: "25" (resourceVersion 288847)
  2. Dec 15, 03:22 — etcd experiences instability: leadership transfers (term 46 to 47), 847 failed proposals. This may have been caused by a network partition, disk I/O issues, or maintenance
  3. During or after the instability — someone manually edits the ConfigMap (kubectl edit configmap feature-flags -n platform), changing enable_new_checkout to "true" and max_cart_items to "50" (creating resourceVersion 289102)
  4. Current state — the etcd cluster is now healthy and consistent (all members at raft index 289341), but:
     - The live ConfigMap says enable_new_checkout: "true" (resourceVersion 289102)
     - Terraform state records enable_new_checkout: "false" (resourceVersion 288847)
     - The application is serving the new checkout flow (which was not supposed to be live)
     - The read-replica context shows the Terraform version (possibly a stale cache, a different cluster, or the state before the manual edit)

Key clue: Two different resourceVersion values for the same ConfigMap from different contexts, combined with Terraform state showing different values than the live resource.
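The clue can be checked mechanically. The sketch below compares the live resourceVersion against the one recorded in Terraform state; the hard-coded values are the ones from this scenario, standing in for what kubectl and the state file would return:

```shell
# In practice these values come from:
#   kubectl get configmap feature-flags -n platform \
#     -o jsonpath='{.metadata.resourceVersion}'
#   terraform state show kubernetes_config_map.feature_flags
live_rv=289102      # resourceVersion of the live ConfigMap
tfstate_rv=288847   # resourceVersion recorded in Terraform state

if [ "$live_rv" != "$tfstate_rv" ]; then
  echo "DRIFT: live resourceVersion $live_rv != Terraform state $tfstate_rv"
else
  echo "in sync"
fi
```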

The Fix

Immediate (resolve the inconsistency)

  1. Determine the intended state — is the new checkout supposed to be enabled?

    # Check with the team: was the manual edit intentional?
    # managedFields records which client last wrote each field;
    # an interactive edit shows up with manager "kubectl-edit":
    kubectl get configmap feature-flags -n platform \
      -o yaml --show-managed-fields
    # If the edit was intentional, update Terraform to match (step 3)
    # If not, revert it (step 2)
    

  2. If the manual edit was unauthorized, revert:

    terraform apply -target=kubernetes_config_map.feature_flags
    # This will reset the ConfigMap to Terraform's desired state
    

  3. If the manual edit was intentional, update Terraform:

    resource "kubernetes_config_map" "feature_flags" {
      metadata {
        name      = "feature-flags"
        namespace = "platform"
      }
      data = {
        enable_new_checkout = "true"     # Updated from manual change
        enable_dark_mode    = "true"
        max_cart_items      = "50"       # Updated from manual change
      }
    }
    
    Then:
    terraform plan    # Should show no changes (already in sync)
    terraform apply
    

Permanent

  1. Prevent manual edits with a ValidatingAdmissionWebhook or OPA/Gatekeeper policy:

    # Assumes a ConstraintTemplate defining the custom K8sTerraformManaged
    # kind (with Rego that rejects writes to labeled resources from clients
    # other than the Terraform service account) is already installed.
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sTerraformManaged
    metadata:
      name: no-manual-configmap-edits
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["ConfigMap"]
        namespaces: ["platform"]
      parameters:
        labels: ["app.kubernetes.io/managed-by"]
        requiredValue: "terraform"
    

  2. Add Terraform drift detection to CI:

    # Run nightly or on schedule
    terraform plan -detailed-exitcode
    # Exit code 2 = drift detected
    

  3. Investigate the stale read-replica context:

    kubectl config get-contexts read-replica
    # Is this pointing to a different cluster, a cached view, or a stale API server?
    

Verification

# Check Terraform state matches live
terraform plan
# Should show "No changes. Your infrastructure matches the configuration."

# Verify ConfigMap is consistent across contexts
kubectl get configmap feature-flags -n platform -o jsonpath='{.data}'
kubectl get configmap feature-flags -n platform -o jsonpath='{.data}' --context=read-replica

# Verify etcd health
etcdctl endpoint health --cluster

# Check application behavior matches intended feature flags
curl -s https://api.megacorp.io/api/v1/checkout/config
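The last check can be turned into a pass/fail assertion. The response shape below is an assumption (the scenario does not show the endpoint's payload); in practice the string would come from the curl call above:

```shell
# Hypothetical payload; substitute:
#   response=$(curl -s https://api.megacorp.io/api/v1/checkout/config)
response='{"checkout_version":"v2","max_cart_items":50}'

# Extract the version field with sed (avoids a jq dependency).
version=$(printf '%s' "$response" | sed -n 's/.*"checkout_version":"\([^"]*\)".*/\1/p')

if [ "$version" = "v2" ]; then
  echo "new checkout flow is live"
else
  echo "old checkout flow is live"
fi
```

Compare the printed result against the intended state decided in the fix: if the team chose to keep the old flow, "new checkout flow is live" means the revert has not taken effect.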

Artifact Decoder

  CLI Output
      Revealed: the ConfigMap has different values from different contexts; the etcd cluster looks healthy now
      Misleading: etcd status is clean — the partition has healed, hiding the history

  Metrics
      Revealed: 12 leader changes and 847 failed proposals indicate past instability; checkout v2 is active
      Misleading: current metrics look healthy; has_leader=1 and a consistent raft index hide the drift

  IaC Snippet
      Revealed: Terraform state says enable_new_checkout=false but the live resource says true — state drift
      Misleading: the Terraform code looks straightforward; the drift is between state and reality

  Log Lines
      Revealed: the leadership transfer at 03:22 marks the instability window; app logs confirm the new checkout is live
      Misleading: the etcd leadership log looks like a routine election, not a crisis

Skills Demonstrated

  • Detecting Terraform state drift and understanding its implications
  • Recognizing the signs of past etcd instability (leader changes, failed proposals)
  • Understanding Kubernetes resourceVersion semantics
  • Evaluating the risks of manual changes to Terraform-managed resources
  • Thinking through the resolution of conflicting sources of truth

Prerequisite Topic Packs