Runbook: Terraform Drift Detection Response
| Field | Value |
|---|---|
| Domain | Cloud/Terraform |
| Alert | terraform plan shows unexpected resource changes, or drift detection job reports changes |
| Severity | P2 |
| Est. Resolution Time | 30-60 minutes |
| Escalation Timeout | 45 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | Terraform CLI, cloud provider CLI, state file access, ability to run terraform plan/apply |
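Quick Assessment (30 seconds)
# Run this first: it tells you the scope of the problem
terraform plan -detailed-exitcode 2>&1 | tail -20
# $? would report tail's status, so read terraform's from PIPESTATUS (bash)
# 0 = no changes, 1 = error, 2 = changes present
echo "Exit code: ${PIPESTATUS[0]}"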
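The exit code is what makes the quick assessment scriptable. A minimal sketch, assuming a POSIX-style shell; the function name drift_status is hypothetical, and the mapping follows the documented meaning of -detailed-exitcode (0 = no changes, 1 = error, 2 = changes present):

```shell
# Hypothetical helper: turn terraform's -detailed-exitcode result into a label.
drift_status() {
  local rc=0
  # No pipe here, so $? is terraform's own exit status.
  terraform plan -detailed-exitcode -input=false -no-color >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "clean" ;;   # plan succeeded, nothing to change
    2) echo "drift" ;;   # plan succeeded, changes present
    *) echo "error" ;;   # plan itself failed (credentials, lock, syntax, ...)
  esac
  return "$rc"
}
```

Run drift_status from the Terraform working directory. Treat "error" as a reason to fix the plan run before drawing any conclusion about drift.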
Step 1: Capture the Drift Plan
Why: Before making any decisions, you need a full picture of every drifted resource. Saving the plan to a file ensures you can review it carefully and act on it deterministically.
# Run terraform plan with output saved to a file (-no-color keeps ANSI codes out of the text copy):
terraform plan -no-color -out=drift.tfplan 2>&1 | tee drift-plan-output.txt
# View a human-readable summary of what will change:
cat drift-plan-output.txt
# For a more structured view:
terraform show drift.tfplan
# List the changed resources (first 50 matches):
grep -E "# .* (will be|must be|has been) " drift-plan-output.txt | head -50
Terraform will display the drift as a plan:
# aws_security_group.web will be updated in-place
~ resource "aws_security_group" "web" {
~ ingress {
+ cidr_blocks = ["0.0.0.0/0"] ← someone opened this to the world manually
}
}
# aws_instance.app[0] has been deleted
- resource "aws_instance" "app" { ← someone terminated this instance
}
Plan: 0 to add, 1 to change, 1 to destroy.
If terraform plan fails with credential errors, run aws sts get-caller-identity (AWS) or gcloud auth list (GCP) to confirm your credentials are valid.
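To get an actual count of changes by type from the saved text, something like the following may help. It is a sketch only: summarize_plan is a hypothetical name, and it assumes the plan text uses Terraform's usual action phrases ("will be created", "will be updated in-place", "will be destroyed", "must be replaced").

```shell
# Hypothetical helper: count planned actions in the captured plan text.
summarize_plan() {
  local file=$1 label pattern
  for label in created updated destroyed replaced; do
    case "$label" in
      created)   pattern='will be created' ;;
      updated)   pattern='will be updated in-place' ;;
      destroyed) pattern='will be destroyed' ;;
      replaced)  pattern='must be replaced' ;;
    esac
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    printf '%-10s %s\n' "$label" "$(grep -c -- "# .* $pattern" "$file" || true)"
  done
}
```

Usage: summarize_plan drift-plan-output.txt. A non-zero "destroyed" or "replaced" count is the signal to slow down before Step 5A.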
Step 2: Review Each Changed Resource (Expected or Manual Intervention?)
Why: Not all drift is bad. Some changes are deliberate (a team manually patched a config) and should be codified in Terraform. Others are accidents or security incidents. You must categorize each change before deciding how to respond.
# For each changed resource in the plan, determine:
# Question 1: Was this change authorized?
# Question 2: Is it safe to apply (Terraform would "fix" it back to code)?
# Question 3: Is it safe to import (Terraform code should be updated to match current reality)?
# Categories:
# EXPECTED → someone deliberately made this change and it should be kept
# ACCIDENT → someone made this change by mistake; revert with terraform apply
# SECURITY → unauthorized or dangerous change; escalate before acting
# UNKNOWN → unclear who/why; investigate before acting
# Example: security group with 0.0.0.0/0 added → investigate immediately
# Example: instance count changed → check if autoscaling or manual action
echo "Manually categorize each change in drift-plan-output.txt before proceeding"
A categorized list:
aws_security_group.web: ingress 0.0.0.0/0 added → SECURITY — escalate
aws_instance.app[0]: deleted → UNKNOWN — investigate
aws_s3_bucket.logs: tags changed → EXPECTED — team updated tags manually, should codify
Review destroy operations first (highest risk), then update operations that affect security resources (security groups, IAM, network ACLs).
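That review order can be approximated mechanically. The sketch below is illustrative: prioritize_changes is a hypothetical name, the security-related type prefixes are AWS-specific assumptions to adjust for your provider, and it only orders resources; it does not categorize them for you.

```shell
# Hypothetical helper: list changed resources, highest-risk first.
prioritize_changes() {
  awk '
    # Destroys, replacements and out-of-band deletions come first.
    /^[ \t]*# .* (will be destroyed|must be replaced|has been deleted)/ { print "1-destroy  " $2; next }
    # Then in-place updates to security-sensitive types (assumed AWS names).
    /^[ \t]*# (aws_security_group|aws_iam_|aws_network_acl)[^ ]* will be updated in-place/ { print "2-security " $2; next }
    # Everything else that changes.
    /^[ \t]*# .* will be (created|updated in-place)/ { print "3-other    " $2 }
  ' "$1" | sort
}
```

Usage: prioritize_changes drift-plan-output.txt, then work down the list assigning EXPECTED, ACCIDENT, SECURITY, or UNKNOWN to each entry.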
Step 3: Investigate the Source of the Change
Why: Drift caused by an authorized team member requires a different response than drift caused by an unauthorized actor or a runaway script.
# AWS CloudTrail — find who made the manual change:
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=<RESOURCE_ID> \
--start-time $(date -d '7 days ago' --iso-8601=seconds) \
--output json | \
jq '.Events[] | {Username: .Username, EventName: .EventName, EventTime: .EventTime}'
# GCP Audit Logs — find who changed the resource:
gcloud logging read \
'resource.labels.resource_name="<RESOURCE_NAME>"' \
--limit=10 \
--format="table(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)"
# Azure Activity Log:
az monitor activity-log list \
--resource-id <RESOURCE_ID> \
--start-time $(date -d '7 days ago' --iso-8601=seconds) \
--query "[].{Caller:caller, Operation:operationName.localizedValue, Time:eventTimestamp}"
CloudTrail shows:
{
"Username": "alice",
"EventName": "AuthorizeSecurityGroupIngress",
"EventTime": "2026-03-18T22:15:00Z"
}
This tells you: alice opened ingress from 0.0.0.0/0 yesterday at 10:15 PM UTC. Contact alice for context.
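The date -d '7 days ago' form used in the lookups is GNU-specific and fails on macOS/BSD date. A small sketch of a portable start-time helper, assuming one of those two date implementations is present; days_ago_utc is a hypothetical name:

```shell
# Hypothetical helper: print an ISO 8601 UTC timestamp N days in the past.
days_ago_utc() {
  local days=${1:-7}
  # GNU date understands -d "N days ago"; BSD/macOS date uses -v-Nd instead.
  date -u -d "${days} days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"${days}"d +%Y-%m-%dT%H:%M:%SZ
}
```

Usage: pass --start-time "$(days_ago_utc 7)" to the CloudTrail or Azure commands in place of the inline date call.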
Step 4: Decide Whether to Apply (Revert to Code) or Import (Codify the Change)
Why: There are two valid responses to drift. You must choose consciously — accidentally applying when you should import destroys intentional manual changes.
# DECISION FRAMEWORK:
# Use terraform apply (revert to Terraform code) when:
# - The manual change was accidental or unauthorized
# - The Terraform code represents the desired state
# - A security misconfiguration must be reverted immediately
# Use terraform import + code update (accept the change) when:
# - The manual change was authorized and should be permanent
# - The Terraform code needs to be updated to match reality
# - The change cannot be easily reverted without service impact
# To preview what "apply" will change without executing:
terraform show -no-color drift.tfplan | grep -E "will be|must be"
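# Note: terraform import only works for a resource that is NOT yet tracked in state
# (for example, one created by hand). If the drifted resource is already in state,
# skip the import and update the .tf code instead (Step 5B).
# To import a resource (example: an S3 bucket):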
terraform import aws_s3_bucket.<RESOURCE_NAME> <BUCKET_NAME>
# To import an EC2 instance:
terraform import aws_instance.<RESOURCE_NAME> <INSTANCE_ID>
# To import an AWS security group:
terraform import aws_security_group.<RESOURCE_NAME> <SG_ID>
echo "Document your decision: APPLY or IMPORT, and the reason, in the incident log"
For terraform import:
"aws_s3_bucket.<RESOURCE_NAME>: Importing from ID..."
"aws_s3_bucket.<RESOURCE_NAME>: Import complete! The resource has been imported."
"aws_s3_bucket.<RESOURCE_NAME>: Refreshing state..."
If terraform import fails, the resource ID may be wrong. Check the cloud console for the exact resource identifier format required by the Terraform provider.
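Another common reason for an import failure is that the address is already tracked in state, in which case there is nothing to import and only the .tf code needs updating. A hedged pre-check sketch using terraform state list; needs_import is a hypothetical name:

```shell
# Hypothetical helper: succeed only if ADDRESS is absent from the current state.
needs_import() {
  local addr=$1
  # -F and -x: match the full address literally (addresses can contain [ ]).
  if terraform state list 2>/dev/null | grep -Fxq -- "$addr"; then
    echo "tracked: update the .tf code instead of importing"
    return 1
  fi
  echo "untracked: terraform import $addr <ID> applies"
  return 0
}
```

Usage: needs_import aws_s3_bucket.logs before running the import. A "tracked" answer means the decision is between apply (revert) and a code update, not import.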
Step 5A: Apply to Revert (if drift is accidental/unauthorized)
Why: When Terraform code is the source of truth and the manual change should not persist, apply the saved plan to bring infrastructure back to the desired state.
# Apply the saved drift plan (this will revert the manual changes):
# REVIEW the plan one more time before applying:
terraform show drift.tfplan
# Apply — this will make real changes to infrastructure:
terraform apply drift.tfplan
# Monitor apply output for errors:
# A successful apply ends with:
# "Apply complete! Resources: X added, Y changed, Z destroyed."
"aws_security_group.web: Modifying... [id=sg-XXXXXXXXX]"
"aws_security_group.web: Modifications complete after Xs [id=sg-XXXXXXXXX]"
"Apply complete! Resources: 0 added, 1 changed, 0 destroyed."
If the apply fails partway through, run terraform plan again to see the current state and assess whether you need to retry or manually fix the partial change.
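Because the plan was saved back in Step 1, it can be out of date by the time you apply it. One way to guard against that is a simple age check on the plan file. This is a sketch under assumptions: plan_is_fresh is a hypothetical name, the 10-minute default is arbitrary, and file age is only a proxy for staleness.

```shell
# Hypothetical helper: succeed if FILE exists and was modified within MAX_MIN minutes.
plan_is_fresh() {
  local file=$1 max_min=${2:-10}
  [ -f "$file" ] || return 1
  # find prints the path only when its mtime is newer than max_min minutes.
  [ -n "$(find "$file" -mmin "-${max_min}")" ]
}
```

Usage: plan_is_fresh drift.tfplan 10 && terraform apply drift.tfplan. If the check fails, re-run the plan and review it again rather than applying the old file.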
Step 5B: Update Terraform Code to Match Reality (if change is intentional)
Why: When the manual change should be kept, the Terraform code must be updated to match — otherwise the next terraform plan will show the same drift again and risk being accidentally reverted.
# After running terraform import (Step 4), update the .tf file to match:
# Example: if someone added tags to an S3 bucket, add those tags to the resource block in .tf
# Then verify the code matches reality (plan should show no changes):
terraform plan -detailed-exitcode
echo "Exit code: $?"
# Should print exit code 0 (no changes) if your code matches the imported state
# If plan shows remaining differences, update the .tf file until plan is clean
After updating .tf files and running terraform plan:
"No changes. Your infrastructure matches the configuration."
exit code: 0
If you cannot get terraform plan to show no changes after importing, the Terraform provider may not support all attributes of the resource. Document the unsupported attributes as a known drift exception.
Step 6: Protect Critical Resources from Future Drift
Why: Some resources are so critical that accidental modification or deletion would be catastrophic. Terraform's lifecycle block can prevent this.
# Add lifecycle protection to critical resources in your .tf files:
# Example in a .tf file (do not run as a command — edit the file):
#
# resource "aws_s3_bucket" "critical_data" {
# bucket = "my-critical-data-bucket"
#
# lifecycle {
# prevent_destroy = true # blocks terraform destroy
# ignore_changes = [tags] # ignores tag drift (if tags are managed separately)
# }
# }
# After adding lifecycle blocks, verify the plan still looks correct:
terraform plan -detailed-exitcode
terraform plan exit code 0: no unexpected changes after adding lifecycle blocks.
Attempting to destroy a prevent_destroy resource now produces:
"Error: Instance cannot be destroyed"
"Resource <NAME> has lifecycle.prevent_destroy set, but the plan calls for this resource to be destroyed."
If prevent_destroy causes a legitimate destroy operation to fail, you must explicitly remove the prevent_destroy = true line, apply, then re-add it. This is intentional friction.
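To find candidates that still lack protection, a coarse file-level scan can be a starting point. The sketch below is a heuristic only: files_missing_protection is a hypothetical name, it works per file rather than per resource block, and a commented-out prevent_destroy line would count as present.

```shell
# Hypothetical helper: list .tf files in DIR that declare a resource of TYPE
# but contain no "prevent_destroy = true" anywhere in the file.
files_missing_protection() {
  local dir=$1 rtype=$2 f
  for f in "$dir"/*.tf; do
    [ -f "$f" ] || continue
    if grep -q "resource \"${rtype}\"" "$f" &&
       ! grep -q 'prevent_destroy[[:space:]]*=[[:space:]]*true' "$f"; then
      echo "$f"
    fi
  done
}
```

Usage: files_missing_protection . aws_s3_bucket. Review each listed file by hand before deciding which resources are critical enough to protect.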
Verification
# Confirm the issue is resolved
terraform plan -detailed-exitcode 2>&1 | tail -5
# Read terraform's status, not tail's (bash); 0 means no remaining drift
echo "Exit code: ${PIPESTATUS[0]}"
terraform plan output shows "No changes. Your infrastructure matches the configuration."
If still broken: Escalate — see below.
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 45 min | Platform/Infra on-call | "P2: Terraform drift in |
| Security misconfiguration drifted | Security on-call | "Security drift: |
| Evidence of unauthorized change | Security on-call | "Security incident: Terraform drift caused by suspected unauthorized manual change to |
| Scope expanding | Platform/Infra on-call | "Drift affecting multiple critical resources — review required before applying any plan" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Enable drift detection as a scheduled CI job (run terraform plan -detailed-exitcode daily)
- Add prevent_destroy = true to all stateful, critical resources
- Configure CloudTrail/audit log alerts for manual changes to Terraform-managed resources
- Review team processes — if drift is frequent, investigate why engineers are making manual changes
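For the scheduled drift-detection job, a loop over the Terraform root directories is one possible shape. This is a sketch, not a finished CI job: scan_roots and the directory names are hypothetical, each directory is assumed to be initialized already, and notification is left to the CI system reacting to the non-zero exit.

```shell
# Hypothetical helper: plan each root directory and report drift or errors.
scan_roots() {
  local dir rc failed=0
  for dir in "$@"; do
    rc=0
    # -chdir runs the plan as if invoked from inside $dir.
    terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color >/dev/null 2>&1 || rc=$?
    case "$rc" in
      0) echo "OK     $dir" ;;
      2) echo "DRIFT  $dir"; failed=1 ;;
      *) echo "ERROR  $dir"; failed=1 ;;
    esac
  done
  return "$failed"
}
```

Usage: scan_roots envs/dev envs/prod as the job's main step. The job fails if any root shows drift or cannot be planned, which is the trigger for this runbook.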
Common Mistakes
- Applying a drift plan without reviewing each change: A drift plan may include destroy operations that would delete data. Always review terraform show drift.tfplan before running terraform apply drift.tfplan.
- Not determining who made the manual change: Understanding the source tells you if it was authorized (should be codified) or unauthorized (security incident). Skipping this step means you may revert a deliberate, necessary change.
- Importing without updating Terraform code to match actual state: After terraform import, the resource is in the state file but the .tf code is still wrong. If the code doesn't match, the next terraform plan will show a diff. Update the code until plan is clean.
- Using a stale plan file: If you run terraform plan -out=drift.tfplan and then significant time passes, the plan may no longer reflect current state. Always run terraform plan again before applying if more than a few minutes have passed.
- Applying during peak traffic without coordination: A drift plan may change security groups, load balancers, or DNS. Notify the team and check if there is a maintenance window before applying in production.
Cross-References
- Topic Pack: training/library/topic-packs/cloud-terraform/ (deep background on Terraform state and drift)
- Related Runbook: terraform-state-lock.md (if you can't run terraform plan due to a stuck lock)
- Related Runbook: capacity-limit.md (if terraform apply fails due to quota limits)
- Related Runbook: ../security/unauthorized-access.md (if the drift was caused by unauthorized access)