Answer Key: The Cluster That Disagrees With Itself

The System

A platform infrastructure where feature flags are managed as a Kubernetes ConfigMap, controlled by Terraform:

[Terraform] --manages--> [ConfigMap: feature-flags]
                               |
                          [platform namespace]
                               |
                    [api-servers] --read--> [etcd cluster (3 members)]
                         |                      |
                    [Application pods]     raft consensus
                         |
                    checkout flow controlled by
                    enable_new_checkout flag

Terraform is the source of truth for infrastructure. The feature-flags ConfigMap controls application behavior (checkout flow version, cart limits). The etcd cluster stores the Kubernetes state.
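The scenario does not show how the flag reaches the application; a common pattern (an assumption here) is injecting the ConfigMap as environment variables via envFrom, with the checkout code branching on the value:

```shell
# Hypothetical consumption pattern: the pod spec injects the feature-flags
# ConfigMap with envFrom, so the flag appears as an environment variable.
ENABLE_NEW_CHECKOUT="${ENABLE_NEW_CHECKOUT:-false}"   # default to the old flow

if [ "$ENABLE_NEW_CHECKOUT" = "true" ]; then
  echo "serving checkout v2"
else
  echo "serving checkout v1"
fi
```

With this pattern, a manual ConfigMap edit changes application behavior on the next pod restart, without any deployment or Terraform run.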

What's Broken

Root cause: Terraform state drift caused by a manual ConfigMap edit during (or after) a period of etcd instability.

The sequence of events:

  1. Dec 14, 22:15 — Terraform applies the ConfigMap with enable_new_checkout: "false" and max_cart_items: "25" (resourceVersion 288847)
  2. Dec 15, 03:22 — etcd experiences instability: leadership transfers (term 46 to 47), 847 failed proposals. This may have been caused by a network partition, disk I/O issues, or maintenance
  3. During or after the instability — someone manually edits the ConfigMap (kubectl edit configmap feature-flags -n platform), changing enable_new_checkout to "true" and max_cart_items to "50" (creating resourceVersion 289102)
  4. Current state — the etcd cluster is now healthy and consistent (all members at raft index 289341), but:
     - The live ConfigMap says enable_new_checkout: "true" (resourceVersion 289102)
     - Terraform state records enable_new_checkout: "false" (resourceVersion 288847)
     - The application is serving the new checkout flow (which was not supposed to be live)
     - The read-replica context shows the Terraform version (possibly a stale cache, a different cluster, or the state before the manual edit)

Key clue: Two different resourceVersion values for the same ConfigMap from different contexts, combined with Terraform state showing different values than the live resource.
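The clue can be checked mechanically. The sketch below compares the live resourceVersion against the one recorded in Terraform state; the hard-coded values are the ones from this scenario, standing in for what kubectl and the state file would return:

```shell
# In practice these values come from:
#   kubectl get configmap feature-flags -n platform \
#     -o jsonpath='{.metadata.resourceVersion}'
#   terraform state show kubernetes_config_map.feature_flags
live_rv=289102      # resourceVersion of the live ConfigMap
tfstate_rv=288847   # resourceVersion recorded in Terraform state

if [ "$live_rv" != "$tfstate_rv" ]; then
  echo "DRIFT: live resourceVersion $live_rv != Terraform state $tfstate_rv"
else
  echo "in sync"
fi
```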

The Fix

Immediate (resolve the inconsistency)

  1. Determine the intended state — is the new checkout supposed to be enabled?

    # Check with the team: was the manual edit intentional?
    # managedFields records which client last wrote each field;
    # an interactive edit shows up with manager "kubectl-edit":
    kubectl get configmap feature-flags -n platform \
      -o yaml --show-managed-fields
    # If the edit was intentional, update Terraform to match (step 3)
    # If not, revert it (step 2)
    

  2. If the manual edit was unauthorized, revert:

    terraform apply -target=kubernetes_config_map.feature_flags
    # This will reset the ConfigMap to Terraform's desired state
    

  3. If the manual edit was intentional, update Terraform:

    resource "kubernetes_config_map" "feature_flags" {
      metadata {
        name      = "feature-flags"
        namespace = "platform"
      }
      data = {
        enable_new_checkout = "true"     # Updated from manual change
        enable_dark_mode    = "true"
        max_cart_items      = "50"       # Updated from manual change
      }
    }
    
    Then:
    terraform plan    # Should show no changes (already in sync)
    terraform apply
    

Permanent

  1. Prevent manual edits with a ValidatingAdmissionWebhook or OPA/Gatekeeper policy:

    # Assumes a ConstraintTemplate defining the custom K8sTerraformManaged
    # kind (with Rego that rejects writes to labeled resources from clients
    # other than the Terraform service account) is already installed.
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sTerraformManaged
    metadata:
      name: no-manual-configmap-edits
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["ConfigMap"]
        namespaces: ["platform"]
      parameters:
        labels: ["app.kubernetes.io/managed-by"]
        requiredValue: "terraform"
    

  2. Add Terraform drift detection to CI:

    # Run nightly or on schedule
    terraform plan -detailed-exitcode
    # Exit code 2 = drift detected
    

  3. Investigate the stale read-replica context:

    kubectl config get-contexts read-replica
    # Is this pointing to a different cluster, a cached view, or a stale API server?
    

Verification

# Check Terraform state matches live
terraform plan
# Should show "No changes. Your infrastructure matches the configuration."

# Verify ConfigMap is consistent across contexts
kubectl get configmap feature-flags -n platform -o jsonpath='{.data}'
kubectl get configmap feature-flags -n platform -o jsonpath='{.data}' --context=read-replica

# Verify etcd health
etcdctl endpoint health --cluster

# Check application behavior matches intended feature flags
curl -s https://api.megacorp.io/api/v1/checkout/config
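The last check can be turned into a pass/fail assertion. The response shape below is an assumption (the scenario does not show the endpoint's payload); in practice the string would come from the curl call above:

```shell
# Hypothetical payload; substitute:
#   response=$(curl -s https://api.megacorp.io/api/v1/checkout/config)
response='{"checkout_version":"v2","max_cart_items":50}'

# Extract the version field with sed (avoids a jq dependency).
version=$(printf '%s' "$response" | sed -n 's/.*"checkout_version":"\([^"]*\)".*/\1/p')

if [ "$version" = "v2" ]; then
  echo "new checkout flow is live"
else
  echo "old checkout flow is live"
fi
```

Compare the printed result against the intended state decided in the fix: if the team chose to keep the old flow, "new checkout flow is live" means the revert has not taken effect.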

Artifact Decoder

  CLI Output
      Revealed: the ConfigMap has different values from different contexts; the etcd cluster looks healthy now
      Misleading: etcd status is clean — the partition has healed, hiding the history

  Metrics
      Revealed: 12 leader changes and 847 failed proposals indicate past instability; checkout v2 is active
      Misleading: current metrics look healthy; has_leader=1 and a consistent raft index hide the drift

  IaC Snippet
      Revealed: Terraform state says enable_new_checkout=false but the live resource says true — state drift
      Misleading: the Terraform code looks straightforward; the drift is between state and reality

  Log Lines
      Revealed: the leadership transfer at 03:22 marks the instability window; app logs confirm the new checkout is live
      Misleading: the etcd leadership log looks like a routine election, not a crisis

Skills Demonstrated

  • Detecting Terraform state drift and understanding its implications
  • Recognizing the signs of past etcd instability (leader changes, failed proposals)
  • Understanding Kubernetes resourceVersion semantics
  • Evaluating the risks of manual changes to Terraform-managed resources
  • Thinking through the resolution of conflicting sources of truth

Prerequisite Topic Packs