
Runbook: Terraform Drift Detection Response

Domain: Cloud/Terraform
Alert: terraform plan shows unexpected resource changes, or drift detection job reports changes
Severity: P2
Est. Resolution Time: 30-60 minutes
Escalation Timeout: 45 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: Terraform CLI, cloud provider CLI, state file access, ability to run terraform plan/apply

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
terraform plan -detailed-exitcode 2>&1 | tail -20
echo "Exit code: ${PIPESTATUS[0]}"   # bash: plain $? would report tail's exit code, not terraform's
If output shows:
  • Exit code 0 → no drift; the alert may be a false positive. Verify the alert source.
  • Exit code 2 → drift confirmed (changes detected); proceed to Step 1.
  • Exit code 1 → Terraform error (not drift); check for a configuration or credentials issue.
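The branching above can be wrapped in a small helper so on-call scripts handle the result consistently. A sketch; the function name classify_drift_exit is illustrative, not a standard tool:

```shell
# Interpret the -detailed-exitcode result; call right after `terraform plan -detailed-exitcode`
# (classify_drift_exit is a hypothetical helper name)
classify_drift_exit() {
  case "$1" in
    0) echo "NO-DRIFT: verify the alert source" ;;
    2) echo "DRIFT: changes detected, proceed to Step 1" ;;
    *) echo "ERROR: terraform itself failed, check configuration or credentials" ;;
  esac
}

classify_drift_exit 2   # → DRIFT: changes detected, proceed to Step 1
```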

Step 1: Capture the Drift Plan

Why: Before making any decisions, you need a full picture of every drifted resource. Saving the plan to a file ensures you can review it carefully and act on it deterministically.

# Run terraform plan with output saved to a file:
terraform plan -out=drift.tfplan 2>&1 | tee drift-plan-output.txt

# View a human-readable summary of what will change:
cat drift-plan-output.txt

# For a more structured view:
terraform show drift.tfplan

# List changed resources and their planned actions:
grep -E "^  # |will be|must be" drift-plan-output.txt | head -50
Expected output:
Terraform will display the drift as a plan:
  # aws_security_group.web will be updated in-place
  ~ resource "aws_security_group" "web" {
      ~ ingress {
          + cidr_blocks = ["0.0.0.0/0"]   ← someone opened this to the world manually
        }
    }

  # aws_instance.app[0] has been deleted
  - resource "aws_instance" "app" {        ← someone terminated this instance
    }

Plan: 1 to add, 1 to change, 0 to destroy.
(Terraform reports the deleted instance under "Objects have changed outside of Terraform" and plans to recreate it — hence "1 to add", not "1 to destroy".)
If this fails: If terraform plan fails with credential errors, check aws sts get-caller-identity (AWS) or gcloud auth list (GCP) to confirm your credentials are valid.

Step 2: Review Each Changed Resource — Expected or Manual Intervention?

Why: Not all drift is bad. Some changes are deliberate (a team manually patched a config) and should be codified in Terraform. Others are accidents or security incidents. You must categorize each change before deciding how to respond.

# For each changed resource in the plan, determine:
# Question 1: Was this change authorized?
# Question 2: Is it safe to apply (Terraform would "fix" it back to code)?
# Question 3: Is it safe to import (Terraform code should be updated to match current reality)?

# Categories:
# EXPECTED  → someone deliberately made this change and it should be kept
# ACCIDENT  → someone made this change by mistake; revert with terraform apply
# SECURITY  → unauthorized or dangerous change; escalate before acting
# UNKNOWN   → unclear who/why; investigate before acting

# Example: security group with 0.0.0.0/0 added → investigate immediately
# Example: instance count changed → check if autoscaling or manual action

echo "Manually categorize each change in drift-plan-output.txt before proceeding"
Expected output:
A categorized list:
  aws_security_group.web: ingress 0.0.0.0/0 added → SECURITY — escalate
  aws_instance.app[0]: deleted → UNKNOWN — investigate
  aws_s3_bucket.logs: tags changed → EXPECTED — team updated tags manually, should codify
If this fails: If the plan is too large to review manually, focus on destroy operations first (highest risk), then update operations that affect security resources (security groups, IAM, network ACLs).
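For large plans, the machine-readable output makes this triage scriptable. A sketch, assuming jq is installed; the JSON below is a hypothetical excerpt shaped like real `terraform show -json drift.tfplan` output, written to a temp file so the filter can be demonstrated:

```shell
# Hypothetical excerpt matching the shape of `terraform show -json drift.tfplan`:
cat > /tmp/plan.json <<'EOF'
{"resource_changes":[
  {"address":"aws_instance.app[0]","change":{"actions":["delete"]}},
  {"address":"aws_s3_bucket.logs","change":{"actions":["update"]}}
]}
EOF

# List only the resources the plan would destroy — review these first:
jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address' /tmp/plan.json
# → aws_instance.app[0]
```

In practice, pipe the real plan straight in: `terraform show -json drift.tfplan | jq -r '…'`.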

Step 3: Investigate the Source of the Change

Why: Drift caused by an authorized team member requires a different response than drift caused by an unauthorized actor or a runaway script.

# AWS CloudTrail — find who made the manual change
# (`date -d` is GNU; on macOS substitute: date -v-7d +%Y-%m-%dT%H:%M:%S):
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=<RESOURCE_ID> \
  --start-time $(date -d '7 days ago' --iso-8601=seconds) \
  --output json | \
  jq '.Events[] | {Username: .Username, EventName: .EventName, EventTime: .EventTime}'

# GCP Audit Logs — find who changed the resource:
gcloud logging read \
  'resource.labels.resource_name="<RESOURCE_NAME>"' \
  --limit=10 \
  --format="table(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)"

# Azure Activity Log:
az monitor activity-log list \
  --resource-id <RESOURCE_ID> \
  --start-time $(date -d '7 days ago' --iso-8601=seconds) \
  --query "[].{Caller:caller, Operation:operationName.localizedValue, Time:eventTimestamp}"
Expected output:
CloudTrail shows:
  {
    "Username": "alice",
    "EventName": "AuthorizeSecurityGroupIngress",
    "EventTime": "2026-03-18T22:15:00Z"
  }
This tells you: alice opened ingress from 0.0.0.0/0 yesterday at 22:15 UTC — contact alice for context.
If this fails: If audit logs are not available (not enabled or retention expired), document the gap and proceed with drift remediation based on current policy (Terraform code = source of truth unless there is a business reason to deviate).

Step 4: Decide — Apply (Revert to Code) or Import (Codify the Change)

Why: There are two valid responses to drift. You must choose consciously — accidentally applying when you should import destroys intentional manual changes.

# DECISION FRAMEWORK:
# Use terraform apply (revert to Terraform code) when:
#   - The manual change was accidental or unauthorized
#   - The Terraform code represents the desired state
#   - A security misconfiguration must be reverted immediately

# Use terraform import + code update (accept the change) when:
#   - The manual change was authorized and should be permanent
#   - The Terraform code needs to be updated to match reality
#   - The change cannot be easily reverted without service impact

# To preview what "apply" will change without executing:
terraform show drift.tfplan | grep -E "will be|must be"

# To import a resource (example: an S3 bucket):
terraform import aws_s3_bucket.<RESOURCE_NAME> <BUCKET_NAME>

# To import an EC2 instance:
terraform import aws_instance.<RESOURCE_NAME> <INSTANCE_ID>

# To import an AWS security group:
terraform import aws_security_group.<RESOURCE_NAME> <SG_ID>

echo "Document your decision: APPLY or IMPORT, and the reason, in the incident log"
Expected output:
For terraform import:
  "aws_s3_bucket.<RESOURCE_NAME>: Importing from ID \"<BUCKET_NAME>\"..."
  "aws_s3_bucket.<RESOURCE_NAME>: Import prepared!"
  "aws_s3_bucket.<RESOURCE_NAME>: Refreshing state... [id=<BUCKET_NAME>]"
  "Import successful!"
If this fails: If terraform import fails, the resource ID may be wrong. Check the cloud console for the exact resource identifier format required by the Terraform provider.
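On Terraform 1.5 or newer, the declarative import block plus config generation can replace the CLI import and produce the matching .tf code in one step. A sketch; the resource address aws_s3_bucket.logs and bucket name my-logs-bucket are placeholders:

```shell
# Terraform 1.5+ alternative to `terraform import` (address and ID are placeholders):
cat > import.tf <<'EOF'
import {
  to = aws_s3_bucket.logs   # resource address in your configuration
  id = "my-logs-bucket"     # for aws_s3_bucket, the import ID is the bucket name
}
EOF

# Then have Terraform generate matching HCL so the code matches reality:
#   terraform plan -generate-config-out=generated.tf
```

This avoids the most common import mistake (state imported but code never updated), because the generated file starts from the resource's actual attributes.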

Step 5A: Apply to Revert (if drift is accidental/unauthorized)

Why: When Terraform code is the source of truth and the manual change should not persist, apply the saved plan to bring infrastructure back to the desired state.

# Apply the saved drift plan (this will revert the manual changes):
# REVIEW the plan one more time before applying:
terraform show drift.tfplan

# Apply — this will make real changes to infrastructure:
terraform apply drift.tfplan

# Monitor apply output for errors:
# A successful apply ends with:
# "Apply complete! Resources: X added, Y changed, Z destroyed."
Expected output:
"aws_security_group.web: Modifying... [id=sg-XXXXXXXXX]"
"aws_security_group.web: Modifications complete after Xs [id=sg-XXXXXXXXX]"
"Apply complete! Resources: 0 added, 1 changed, 0 destroyed."
If this fails: If apply fails mid-way, Terraform may have partially applied changes. Run terraform plan again to see the current state and assess whether you need to retry or manually fix the partial change.

Step 5B: Update Terraform Code to Match Reality (if change is intentional)

Why: When the manual change should be kept, the Terraform code must be updated to match — otherwise the next terraform plan will show the same drift again and risk being accidentally reverted.

# After running terraform import (Step 4), update the .tf file to match:
# Example: if someone added tags to an S3 bucket, add those tags to the resource block in .tf

# Then verify the code matches reality (plan should show no changes):
terraform plan -detailed-exitcode
echo "Exit code: $?"
# Should print exit code 0 (no changes) if your code matches the imported state

# If plan shows remaining differences, update the .tf file until plan is clean
Expected output:
After updating .tf files and running terraform plan:
  "No changes. Your infrastructure matches the configuration."
  exit code: 0
If this fails: If you cannot get terraform plan to show no changes after importing, the Terraform provider may not support all attributes of the resource. Document the unsupported attributes as a known drift exception.

Step 6: Protect Critical Resources from Future Drift

Why: Some resources are so critical that accidental modification or deletion would be catastrophic. Terraform's lifecycle block can prevent this.

# Add lifecycle protection to critical resources in your .tf files:
# Example in a .tf file (do not run as a command — edit the file):
#
# resource "aws_s3_bucket" "critical_data" {
#   bucket = "my-critical-data-bucket"
#
#   lifecycle {
#     prevent_destroy = true          # blocks terraform destroy
#     ignore_changes  = [tags]        # ignores tag drift (if tags are managed separately)
#   }
# }

# After adding lifecycle blocks, verify the plan still looks correct:
terraform plan -detailed-exitcode
Expected output:
terraform plan exit code 0: no unexpected changes after adding lifecycle blocks.
Attempting to destroy a prevent_destroy resource now produces:
  "Error: Instance cannot be destroyed"
  "Resource <NAME> has lifecycle.prevent_destroy set, but the plan calls for this resource to be destroyed."
If this fails: If prevent_destroy causes a legitimate destroy operation to fail, you must explicitly remove the prevent_destroy = true line, apply, then re-add it. This is intentional friction.

Verification

# Confirm the issue is resolved
terraform plan -detailed-exitcode 2>&1 | tail -5
echo "Exit code: ${PIPESTATUS[0]}"   # bash: exit code of terraform, not tail
Success looks like: exit code 0, with terraform plan reporting "No changes. Your infrastructure matches the configuration."
If still broken: escalate — see below.

Escalation

  • Not resolved in 45 min → page Platform/Infra on-call: "P2: Terraform drift in <ENVIRONMENT> cannot be resolved — need senior help; changes include destroys"
  • Security misconfiguration drifted → page Security on-call: "Security drift: <RESOURCE> has been misconfigured (e.g., security group 0.0.0.0/0 added) — investigating source"
  • Evidence of unauthorized change → page Security on-call: "Security incident: Terraform drift caused by suspected unauthorized manual change to <RESOURCE>"
  • Scope expanding → page Platform/Infra on-call: "Drift affecting multiple critical resources — review required before applying any plan"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Enable drift detection as a scheduled CI job (run terraform plan -detailed-exitcode daily)
  • Add prevent_destroy = true to all stateful, critical resources
  • Configure CloudTrail/audit log alerts for manual changes to Terraform-managed resources
  • Review team processes — if drift is frequent, investigate why engineers are making manual changes
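The scheduled drift-detection job above can be as small as a wrapper script that alerts only on exit code 2. A sketch for cron or CI; the paging command is a placeholder for whatever alerting tool your team uses:

```shell
# Write a minimal daily drift-check script (sketch; adapt paths and alerting to your setup):
cat > drift-check.sh <<'EOF'
#!/usr/bin/env bash
# Daily drift check — exits 0 on no drift, 2 on drift, 1 on terraform errors.
set -uo pipefail
terraform plan -detailed-exitcode -input=false -lock=false > /tmp/drift.txt 2>&1
code=$?
case "$code" in
  0) echo "no drift" ;;
  2) echo "DRIFT DETECTED"
     tail -20 /tmp/drift.txt
     # your-paging-tool "Terraform drift detected"   # placeholder alert hook
     exit 2 ;;
  *) echo "terraform error (exit $code)"; exit 1 ;;
esac
EOF
chmod +x drift-check.sh
```

`-lock=false` keeps the read-only check from contending with real applies for the state lock.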

Common Mistakes

  1. Applying a drift plan without reviewing each change: A drift plan may include destroy operations that would delete data. Always review terraform show drift.tfplan before running terraform apply drift.tfplan.
  2. Not determining who made the manual change: Understanding the source tells you if it was authorized (should be codified) or unauthorized (security incident). Skipping this step means you may revert a deliberate, necessary change.
  3. Importing without updating Terraform code to match actual state: After terraform import, the resource is in the state file but the .tf code is still wrong. If the code doesn't match, the next terraform plan will show a diff — update the code until plan is clean.
  4. Using a stale plan file: If you run terraform plan -out=drift.tfplan and then significant time passes, the plan may no longer reflect current state. Always run terraform plan again before applying if more than a few minutes have passed.
  5. Applying during peak traffic without coordination: A drift plan may change security groups, load balancers, or DNS — notify the team and check if there is a maintenance window before applying in production.
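The stale-plan mistake (item 4) can be caught mechanically with a freshness guard before any apply. A sketch; the 10-minute threshold is an assumption, and the touch line exists only to make the demo self-contained:

```shell
# Guard: refuse to apply drift.tfplan if it is older than 10 minutes (threshold is an assumption)
plan_file=drift.tfplan
touch "$plan_file"   # demo only — in practice the file comes from `terraform plan -out=drift.tfplan`
mtime=$(stat -c %Y "$plan_file" 2>/dev/null || stat -f %m "$plan_file")   # GNU stat, then BSD/macOS
age=$(( $(date +%s) - mtime ))
if [ "$age" -gt 600 ]; then
  echo "STALE (${age}s old): re-run terraform plan before applying"
else
  echo "FRESH (${age}s old): safe to review and apply"
fi
```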

Cross-References

  • Topic Pack: training/library/topic-packs/cloud-terraform/ (deep background on Terraform state and drift)
  • Related Runbook: terraform-state-lock.md — if you can't run terraform plan due to a stuck lock
  • Related Runbook: capacity-limit.md — if terraform apply fails due to quota limits
  • Related Runbook: ../security/unauthorized-access.md — if the drift was caused by unauthorized access
