Thinking Out Loud: Terraform¶
A senior SRE's internal monologue while working through a real Terraform task. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
A developer merged a Terraform PR that passed plan in CI, but when it was applied in the CD pipeline, it tried to destroy and recreate the production RDS database. The apply was caught by a manual approval step, but we need to understand why the plan showed a safe change while the apply tried to destroy the database.
The Monologue¶
Terraform plan showed safe changes, apply wants to destroy RDS. This is one of the most dangerous Terraform scenarios. Let me understand what happened before anyone runs any more applies.
First, let me look at the exact plan output from CI.
cd terraform/environments/prod
terraform plan -out=plan.tfplan
terraform show plan.tfplan | grep -A 5 "aws_db_instance"
The CI plan shows ~ update in-place for the RDS instance — a parameter change, not a destroy. But the apply in the CD pipeline is showing - destroy then + create. Different behavior for the same code. Why?
The most common reason: the state changed between plan and apply. Someone ran a terraform apply between the CI plan and the CD apply, which changed the state file. The CD pipeline planned against one state and is applying against another.
Let me check the state lock history.
aws dynamodb query --table-name terraform-locks \
--key-condition-expression "LockID = :lid" \
--expression-attribute-values '{":lid":{"S":"terraform-state/prod/terraform.tfstate"}}' \
--output json | jq '.Items[-5:]'
Mental Model: Plan/Apply State Drift¶
terraform plan generates a plan against the CURRENT state at plan time. If the state changes between plan and apply (another apply, a manual state edit, an import), the plan is stale. This is why terraform plan -out=plan.tfplan and terraform apply plan.tfplan should use the SAME plan file. Never run terraform apply without using the plan file from the most recent terraform plan against the current state.
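As a sketch, the safe workflow looks like this (file name illustrative; Terraform refuses to apply a saved plan if the backing state has changed since the plan was created):

```shell
# Generate the plan and apply EXACTLY that plan file.
# If the state changed in between, Terraform rejects the saved plan as stale.
terraform plan -out=plan.tfplan
terraform apply plan.tfplan
```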
Actually, let me check something more fundamental. Let me see what actually changed in the code.
The change modifies engine_version from 14.9 to 15.4. A major PostgreSQL version upgrade. Let me check what Terraform does with RDS engine version changes.
"forces replacement" — there it is. Changing engine_version from 14.x to 15.x on an RDS instance is a REPLACEMENT operation, not an in-place update. Terraform will destroy the old database and create a new one. But the CI plan showed update-in-place... unless the CI was running against a different state or different variables.
Wait, let me check the CI log more carefully. Maybe it was running against the staging state, not prod.
The CI workflow runs terraform plan against the prod workspace. But... it uses TF_VAR_engine_version from a CI variable. Let me check.
No CI override for engine_version. But let me check the workspace variables.
The tfvars file has engine_version = "15.4". But let me check if there's a terraform.auto.tfvars or a -var-file that the CI uses differently.
ls terraform/environments/prod/*.tfvars*
cat terraform/environments/prod/prod.auto.tfvars 2>/dev/null
No auto.tfvars. The plan should have shown the replacement. Let me re-run the plan and look carefully.
terraform plan -detailed-exitcode 2>&1 | tee /tmp/plan-output.txt
grep -i "replace\|destroy\|forces" /tmp/plan-output.txt
"aws_db_instance.main must be replaced" — it shows up in the current plan. So the CI plan DID show a replacement, but the developer or the CI log viewer missed it. The word "replaced" was buried in a large plan output.
Mental Model: Always Read the Full Plan¶
terraform plan output can be long. Developers (and CI systems) often scan for "Plan: X to add, Y to change, Z to destroy" at the bottom. But a REPLACEMENT shows as "1 to add, 1 to destroy" — the same as a legitimate create + remove. You MUST check for # must be replaced and forces replacement in the body. In CI, add a grep for "must be replaced" as a gate.
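A minimal sketch of that grep gate, assuming the plan text was already captured to a file (the file name plan-output.txt is illustrative):

```shell
# Hypothetical CI gate: fail the pipeline if the captured plan text
# mentions a replacement anywhere in its body.
PLAN_TXT="plan-output.txt"
if grep -Eq "must be replaced|forces replacement" "$PLAN_TXT" 2>/dev/null; then
  echo "FATAL: plan contains a resource replacement; blocking apply" >&2
  exit 1
fi
```

This catches the human-readable marker; it's intentionally crude, but it turns "buried in a large plan output" into a hard pipeline failure.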
Okay, so the plan was correct — it showed replacement. The developer just didn't read it carefully. But the question remains: how do we prevent this from destroying the production database?
The correct approach for RDS major version upgrades is to use the AWS-native upgrade mechanism, not Terraform replacement. Let me check if there's a way to do this in Terraform without a replace.
Current state: engine_version = "14.9". I need to upgrade to 15.4 using AWS's in-place upgrade, then update the Terraform state to match. Here's the plan:
1. Use the AWS console or CLI to perform the major version upgrade (in-place, with snapshot)
2. After the upgrade completes, update the Terraform code to engine_version = "15.4"
3. Run terraform plan — it should show no changes
4. If there's drift, use terraform apply -refresh-only to sync state
# Step 1: AWS-native upgrade
aws rds modify-db-instance \
--db-instance-identifier prod-main-db \
--engine-version 15.4 \
--allow-major-version-upgrade \
--apply-immediately \
--no-auto-minor-version-upgrade
Wait — --apply-immediately on production? That'll cause downtime. Let me schedule it for the maintenance window instead.
aws rds modify-db-instance \
--db-instance-identifier prod-main-db \
--engine-version 15.4 \
--allow-major-version-upgrade \
--no-apply-immediately
This will apply during the next maintenance window. After the upgrade, I'll update the Terraform code and run a plan to verify state convergence.
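To confirm the upgrade is actually queued, the pending modification can be inspected before the window — the new engine version should show up under PendingModifiedValues until RDS applies it (instance identifier as above):

```shell
# EngineVersion appears under PendingModifiedValues while the change
# is waiting for the maintenance window.
aws rds describe-db-instances \
  --db-instance-identifier prod-main-db \
  --query 'DBInstances[0].PendingModifiedValues'
```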
Now, for the preventive control — how do we prevent Terraform from ever replacing the RDS instance?
# Check whether the RDS module already has a lifecycle block
grep -A 5 lifecycle terraform/modules/rds/main.tf
There's no lifecycle block. I need to add one.
resource "aws_db_instance" "main" {
# ... existing config ...
lifecycle {
prevent_destroy = true
}
}
prevent_destroy = true makes Terraform refuse to destroy this resource, even if the plan says it needs replacement. This is a safety net for stateful resources that should never be destroyed through Terraform.
# Also add an ignore_changes for engine_version since we handle upgrades out-of-band
resource "aws_db_instance" "main" {
lifecycle {
prevent_destroy = true
ignore_changes = [engine_version]
}
}
Wait — ignore_changes = [engine_version] means Terraform will never try to change the engine version. That means the Terraform code won't reflect reality after an upgrade unless I manually update it. That's a drift risk. Let me think about this differently.
The better approach: use prevent_destroy as a safety net, but DON'T ignore engine_version. Instead, when we do major upgrades, we do them out-of-band with AWS CLI, then update the Terraform code to match. The plan should show "no changes." If someone tries to do a version change through Terraform, prevent_destroy blocks the replacement.
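Concretely, the guard ends up as a single lifecycle block with no ignore_changes (sketch; surrounding arguments elided):

```hcl
resource "aws_db_instance" "main" {
  # ... existing config; engine_version stays in code and tracks reality ...

  lifecycle {
    prevent_destroy = true  # any plan that would destroy/replace this fails
  }
}
```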
Mental Model: Stateful Resources Need Lifecycle Guards¶
Databases, storage volumes, DNS zones — anything that holds data — should have prevent_destroy = true in Terraform. Major version upgrades of databases should be done through the cloud provider's native upgrade mechanism, not through Terraform replace. Terraform should follow reality for these resources, not lead it.
Let me commit the lifecycle change.
One more thing — I need to add a CI check that catches must be replaced in the plan output for production environments. This should fail the pipeline, not just warn.
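A more robust variant of that check inspects the machine-readable plan instead of grepping text (a sketch, assuming jq is available and an earlier step ran terraform show -json plan.tfplan > plan.json):

```shell
# In the JSON plan, a replacement shows up as a resource change whose
# actions contain both "delete" and "create".
if jq -e '[.resource_changes[]?
           | select(.change.actions | index("delete") and index("create"))]
          | length > 0' plan.json > /dev/null 2>&1; then
  echo "FATAL: plan replaces at least one resource" >&2
  exit 1
fi
```

Grepping the text output works, but the JSON form is stable across Terraform versions and locales, which makes it the better long-term gate.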
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Run terraform apply and let it replace the database | Recognize that RDS major version upgrades need the AWS-native upgrade, not Terraform replace | Terraform replace = destroy + create = data loss for databases |
| Skim the plan summary line at the bottom | Search the full plan output for must be replaced and forces replacement | The summary line doesn't distinguish between safe creates and dangerous replacements |
| Not know about prevent_destroy | Add prevent_destroy = true to all stateful resources | This is the single most important lifecycle guard for production databases |
| Fix this instance and hope it doesn't happen again | Add a CI check that fails on must be replaced in production plans | Process controls prevent the next person from making the same mistake |
Key Heuristics Used¶
- Plan/Apply State Drift: Always use terraform plan -out and terraform apply <plan> with the same plan file. State changes between plan and apply invalidate the plan.
- Read the Full Plan: Scan for must be replaced and forces replacement in the plan body. The summary line at the bottom can hide dangerous replacements.
- Stateful Resources Need Lifecycle Guards: Databases, volumes, and DNS zones should always have prevent_destroy = true. Upgrades should use the provider's native mechanism.
Cross-References¶
- Primer — Terraform plan/apply lifecycle, state management, and resource lifecycle
- Street Ops — State manipulation, import procedures, and drift reconciliation
- Footguns — forces replacement on databases, state drift between plan and apply, and missing lifecycle blocks