
Thinking Out Loud: Terraform

A senior SRE's internal monologue while working through a real Terraform task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

A developer merged a Terraform PR that passed plan in CI, but when it was applied in the CD pipeline, it tried to destroy and recreate the production RDS database. A manual approval step caught the apply before it ran, but we need to understand why the plan showed a safe change while the apply tried to destroy the database.

The Monologue

Terraform plan showed safe changes, apply wants to destroy RDS. This is one of the most dangerous Terraform scenarios. Let me understand what happened before anyone runs any more applies.

First, let me look at the exact plan output from CI.

cd terraform/environments/prod
terraform plan -out=plan.tfplan
terraform show plan.tfplan | grep -A 5 "aws_db_instance"

The CI plan shows ~ update in-place for the RDS instance — a parameter change, not a destroy. But the apply in the CD pipeline is showing - destroy then + create. Different behavior for the same code. Why?

The most common reason: the state changed between plan and apply. Someone ran a terraform apply between the CI plan and the CD apply, which changed the state file. The CD pipeline planned against one state and is applying against another.

Let me check the state lock table; each lock item's Info field records who ran what operation, and when.

aws dynamodb query --table-name terraform-locks \
  --key-condition-expression "LockID = :lid" \
  --expression-attribute-values '{":lid":{"S":"terraform-state/prod/terraform.tfstate"}}' \
  --output json | jq '.Items[-5:]'

Mental Model: Plan/Apply State Drift

terraform plan generates a plan against the CURRENT state at plan time. If the state changes between plan and apply (another apply, manual state edit, import), the plan is stale. This is why terraform plan -out=plan.tfplan and terraform apply plan.tfplan should use the SAME plan file. Never run terraform apply without using the plan file from the most recent terraform plan against the current state.
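One way to enforce this in a pipeline (a sketch, not this repo's actual workflow — job and artifact names here are made up): have the CI job upload the plan file as an artifact, and have the CD job download and apply that exact file. If the state moves in between, Terraform refuses with a "Saved plan is stale" error rather than silently applying something else.

```yaml
# Illustrative GitHub Actions fragment (hypothetical names; checkout,
# runs-on, and terraform init steps elided for brevity).
jobs:
  plan:
    steps:
      - run: terraform plan -out=plan.tfplan
      - uses: actions/upload-artifact@v4
        with: { name: prod-plan, path: plan.tfplan }
  apply:
    needs: plan
    steps:
      - uses: actions/download-artifact@v4
        with: { name: prod-plan }
      # Errors out if the state changed after the plan was created.
      - run: terraform apply plan.tfplan
```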

Actually, let me check something more fundamental. Let me see what actually changed in the code.

git log --oneline -5 -- terraform/
git diff HEAD~1 -- terraform/environments/prod/

The change modifies engine_version from 14.9 to 15.4. A major PostgreSQL version upgrade. Let me check what Terraform does with RDS engine version changes.

terraform plan 2>&1 | grep -B 2 -A 10 "aws_db_instance"

"forces replacement" — there it is. Changing engine_version from 14.x to 15.x on an RDS instance is a REPLACEMENT operation, not an in-place update. Terraform will destroy the old database and create a new one. But the CI plan showed update-in-place... unless the CI was running against a different state or different variables.
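As an aside, the plan rendering makes the two cases easy to tell apart once you know the symbols. A fabricated excerpt (not from this repo) to illustrate:

```shell
# Fabricated plan excerpt: '~' marks update-in-place, '-/+' marks
# destroy-then-create (a replacement).
cat > /tmp/plan-excerpt.txt <<'EOF'
  ~ resource "aws_db_parameter_group" "main" {
  # aws_db_instance.main must be replaced
-/+ resource "aws_db_instance" "main" {
EOF

# Both the "must be replaced" comment and the '-/+' prefix are greppable:
grep -n -e 'must be replaced' -e '^-/+' /tmp/plan-excerpt.txt
```

A summary-line skim never surfaces either marker; only scanning the plan body (or grepping it) does.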

Wait, let me check the CI log more carefully. Maybe it was running against the staging state, not prod.

grep -A 20 "terraform plan" .github/workflows/terraform.yml

The CI workflow runs terraform plan against the prod workspace. But... it uses TF_VAR_engine_version from a CI variable. Let me check.

grep -i engine_version .github/workflows/terraform.yml

No CI override for engine_version. But let me check the workspace variables.

terraform workspace show
grep engine terraform/environments/prod/terraform.tfvars

The tfvars file has engine_version = "15.4". But let me check if there's a terraform.auto.tfvars or a -var-file that the CI uses differently.

ls terraform/environments/prod/*.tfvars*
cat terraform/environments/prod/prod.auto.tfvars 2>/dev/null

No auto.tfvars. The plan should have shown the replacement. Let me re-run the plan and look carefully.

terraform plan -detailed-exitcode 2>&1 | tee /tmp/plan-output.txt
grep -i "replace\|destroy\|forces" /tmp/plan-output.txt

"aws_db_instance.main must be replaced" — it shows up in the current plan. So the CI plan DID show a replacement, but the developer or the CI log viewer missed it. The word "replaced" was buried in a large plan output.

Mental Model: Always Read the Full Plan

terraform plan output can be long. Developers (and CI systems) often scan for "Plan: X to add, Y to change, Z to destroy" at the bottom. But a REPLACEMENT shows as "1 to add, 1 to destroy" — the same as a legitimate create + remove. You MUST check for # must be replaced and forces replacement in the body. In CI, add a grep for "must be replaced" as a gate.
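A minimal sketch of that gate, assuming the plan has been rendered to a text file (plan-output.txt is a made-up name; the excerpt is fabricated here so the check runs standalone):

```shell
# Hypothetical CI gate: fail the job if the rendered plan contains a
# replacement. In a real pipeline the input would come from:
#   terraform show plan.tfplan > plan-output.txt
cat > plan-output.txt <<'EOF'
  # aws_db_instance.main must be replaced
Plan: 1 to add, 0 to change, 1 to destroy.
EOF

if grep -Eq 'must be replaced|forces replacement' plan-output.txt; then
  echo "BLOCKED: plan contains a resource replacement"
  # in CI this would be: exit 1
fi
```

Note that the summary line alone ("1 to add, 1 to destroy") would pass a naive check; the gate has to grep the plan body.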

Okay, so the plan was correct — it showed replacement. The developer just didn't read it carefully. But the question remains: how do we prevent this from destroying the production database?

The correct approach for RDS major version upgrades is to use the AWS-native upgrade mechanism, not Terraform replacement. Let me check if there's a way to do this in Terraform without a replace.

terraform state show aws_db_instance.main | grep engine_version

Current state: engine_version = "14.9". I need to upgrade to 15.4 using AWS's in-place upgrade, then update the Terraform state to match. Here's the plan:

  1. Use the AWS console or CLI to perform the major version upgrade (in-place, with a snapshot)
  2. After the upgrade completes, update the Terraform code to engine_version = "15.4"
  3. Run terraform plan — it should show no changes
  4. If there's drift, use terraform apply -refresh-only to sync state

# Step 1: AWS-native upgrade
aws rds modify-db-instance \
  --db-instance-identifier prod-main-db \
  --engine-version 15.4 \
  --allow-major-version-upgrade \
  --apply-immediately \
  --no-auto-minor-version-upgrade

Wait — --apply-immediately on production? That'll cause downtime. Let me schedule it for the maintenance window instead.

aws rds modify-db-instance \
  --db-instance-identifier prod-main-db \
  --engine-version 15.4 \
  --allow-major-version-upgrade \
  --no-apply-immediately

This will apply during the next maintenance window. After the upgrade, I'll update the Terraform code and run a plan to verify state convergence.

Now, for the preventive control — how do we prevent Terraform from ever replacing the RDS instance?

# Check whether the RDS module already has a lifecycle block
grep -A 5 lifecycle terraform/modules/rds/main.tf

There's no lifecycle block. I need to add one.

resource "aws_db_instance" "main" {
  # ... existing config ...

  lifecycle {
    prevent_destroy = true
  }
}

prevent_destroy = true makes Terraform refuse to destroy this resource, even if the plan says it needs replacement. This is a safety net for stateful resources that should never be destroyed through Terraform.

# Also add an ignore_changes for engine_version since we handle upgrades out-of-band
resource "aws_db_instance" "main" {
  lifecycle {
    prevent_destroy = true
    ignore_changes = [engine_version]
  }
}

Wait — ignore_changes = [engine_version] means Terraform will never try to change the engine version. That means the Terraform code won't reflect reality after an upgrade unless I manually update it. That's a drift risk. Let me think about this differently.

The better approach: use prevent_destroy as a safety net, but DON'T ignore engine_version. Instead, when we do major upgrades, we do them out-of-band with AWS CLI, then update the Terraform code to match. The plan should show "no changes." If someone tries to do a version change through Terraform, prevent_destroy blocks the replacement.
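Concretely, that leaves the resource looking something like this (a sketch — the surrounding attributes are elided):

```hcl
resource "aws_db_instance" "main" {
  # ... existing config ...

  # engine_version tracks reality: it is updated in code only AFTER the
  # out-of-band AWS upgrade completes, so plan converges to "no changes".
  engine_version = "15.4"

  lifecycle {
    # Safety net: Terraform refuses any plan that would destroy this
    # resource. No ignore_changes, so drift stays visible in plans.
    prevent_destroy = true
  }
}
```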

Mental Model: Stateful Resources Need Lifecycle Guards

Databases, storage volumes, DNS zones — anything that holds data — should have prevent_destroy = true in Terraform. Major version upgrades of databases should be done through the cloud provider's native upgrade mechanism, not through Terraform replace. Terraform should follow reality for these resources, not lead it.

Let me commit the lifecycle change.

cd terraform/modules/rds
# Add prevent_destroy to the RDS resource

One more thing — I need to add a CI check that catches must be replaced in the plan output for production environments. This should fail the pipeline, not just warn.

What Made This Senior-Level

Junior would: Run terraform apply and let it replace the database
Senior does: Recognize that RDS major version upgrades need the AWS-native upgrade mechanism, not a Terraform replace
Why: Terraform replace = destroy + create = data loss for databases

Junior would: Skim the plan summary line at the bottom
Senior does: Search the full plan output for "must be replaced" and "forces replacement"
Why: The summary line doesn't distinguish between safe creates and dangerous replacements

Junior would: Not know about prevent_destroy
Senior does: Add prevent_destroy = true to all stateful resources
Why: This is the single most important lifecycle guard for production databases

Junior would: Fix this instance and hope it doesn't happen again
Senior does: Add a CI check that fails on "must be replaced" in production plans
Why: Process controls prevent the next person from making the same mistake

Key Heuristics Used

  1. Plan/Apply State Drift: Always use terraform plan -out and terraform apply <plan> with the same plan file. State changes between plan and apply invalidate the plan.
  2. Read the Full Plan: Scan for must be replaced and forces replacement in the plan body. The summary line at the bottom can hide dangerous replacements.
  3. Stateful Resources Need Lifecycle Guards: Databases, volumes, and DNS zones should always have prevent_destroy = true. Upgrades should use the provider's native mechanism.

Cross-References

  • Primer — Terraform plan/apply lifecycle, state management, and resource lifecycle
  • Street Ops — State manipulation, import procedures, and drift reconciliation
  • Footguns — forces replacement on databases, state drift between plan and apply, and missing lifecycle blocks