
Runbook: Terraform Drift Detection Response

Domain: Cloud/Terraform
Alert: terraform plan shows unexpected resource changes, or drift detection job reports changes
Severity: P2
Est. Resolution Time: 30-60 minutes
Escalation Timeout: 45 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: Terraform CLI, cloud provider CLI, state file access, ability to run terraform plan/apply

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
terraform plan -detailed-exitcode 2>&1 | tail -20
echo "Exit code: ${PIPESTATUS[0]}"   # bash: plain $? would report tail's exit code, not terraform's
If output shows:
  • Exit code 0 → no drift; the alert may be a false positive. Verify the alert source.
  • Exit code 2 → drift confirmed (changes detected); proceed to Step 1.
  • Exit code 1 → Terraform error (not drift); check for a configuration or credentials issue.
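The branching above can be wrapped in a small helper so on-call scripts handle the result consistently. A sketch; the function name classify_drift_exit is illustrative, not a standard tool:

```shell
# Interpret the -detailed-exitcode result; call right after `terraform plan -detailed-exitcode`
# (classify_drift_exit is a hypothetical helper name)
classify_drift_exit() {
  case "$1" in
    0) echo "NO-DRIFT: verify the alert source" ;;
    2) echo "DRIFT: changes detected, proceed to Step 1" ;;
    *) echo "ERROR: terraform itself failed, check configuration or credentials" ;;
  esac
}

classify_drift_exit 2   # → DRIFT: changes detected, proceed to Step 1
```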

Step 1: Capture the Drift Plan

Why: Before making any decisions, you need a full picture of every drifted resource. Saving the plan to a file ensures you can review it carefully and act on it deterministically.

# Run terraform plan with output saved to a file:
terraform plan -out=drift.tfplan 2>&1 | tee drift-plan-output.txt

# View a human-readable summary of what will change:
cat drift-plan-output.txt

# For a more structured view:
terraform show drift.tfplan

# List changed resources and their planned actions:
grep -E "^  # |will be|must be" drift-plan-output.txt | head -50
Expected output:
Terraform will display the drift as a plan:
  # aws_security_group.web will be updated in-place
  ~ resource "aws_security_group" "web" {
      ~ ingress {
          + cidr_blocks = ["0.0.0.0/0"]   ← someone opened this to the world manually
        }
    }

  # aws_instance.app[0] has been deleted
  - resource "aws_instance" "app" {        ← someone terminated this instance
    }

Plan: 1 to add, 1 to change, 0 to destroy.
(Terraform reports the deleted instance under "Objects have changed outside of Terraform" and plans to recreate it — hence "1 to add", not "1 to destroy".)
If this fails: If terraform plan fails with credential errors, check aws sts get-caller-identity (AWS) or gcloud auth list (GCP) to confirm your credentials are valid.

Step 2: Review Each Changed Resource — Expected or Manual Intervention?

Why: Not all drift is bad. Some changes are deliberate (a team manually patched a config) and should be codified in Terraform. Others are accidents or security incidents. You must categorize each change before deciding how to respond.

# For each changed resource in the plan, determine:
# Question 1: Was this change authorized?
# Question 2: Is it safe to apply (Terraform would "fix" it back to code)?
# Question 3: Is it safe to import (Terraform code should be updated to match current reality)?

# Categories:
# EXPECTED  → someone deliberately made this change and it should be kept
# ACCIDENT  → someone made this change by mistake; revert with terraform apply
# SECURITY  → unauthorized or dangerous change; escalate before acting
# UNKNOWN   → unclear who/why; investigate before acting

# Example: security group with 0.0.0.0/0 added → investigate immediately
# Example: instance count changed → check if autoscaling or manual action

echo "Manually categorize each change in drift-plan-output.txt before proceeding"
Expected output:
A categorized list:
  aws_security_group.web: ingress 0.0.0.0/0 added → SECURITY — escalate
  aws_instance.app[0]: deleted → UNKNOWN — investigate
  aws_s3_bucket.logs: tags changed → EXPECTED — team updated tags manually, should codify
If this fails: If the plan is too large to review manually, focus on destroy operations first (highest risk), then update operations that affect security resources (security groups, IAM, network ACLs).
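For large plans, the machine-readable output makes this triage scriptable. A sketch, assuming jq is installed; the JSON below is a hypothetical excerpt shaped like real `terraform show -json drift.tfplan` output, written to a temp file so the filter can be demonstrated:

```shell
# Hypothetical excerpt matching the shape of `terraform show -json drift.tfplan`:
cat > /tmp/plan.json <<'EOF'
{"resource_changes":[
  {"address":"aws_instance.app[0]","change":{"actions":["delete"]}},
  {"address":"aws_s3_bucket.logs","change":{"actions":["update"]}}
]}
EOF

# List only the resources the plan would destroy — review these first:
jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address' /tmp/plan.json
# → aws_instance.app[0]
```

In practice, pipe the real plan straight in: `terraform show -json drift.tfplan | jq -r '…'`.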

Step 3: Investigate the Source of the Change

Why: Drift caused by an authorized team member requires a different response than drift caused by an unauthorized actor or a runaway script.

# AWS CloudTrail — find who made the manual change
# (`date -d` is GNU; on macOS substitute: date -v-7d +%Y-%m-%dT%H:%M:%S):
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=<RESOURCE_ID> \
  --start-time $(date -d '7 days ago' --iso-8601=seconds) \
  --output json | \
  jq '.Events[] | {Username: .Username, EventName: .EventName, EventTime: .EventTime}'

# GCP Audit Logs — find who changed the resource:
gcloud logging read \
  'resource.labels.resource_name="<RESOURCE_NAME>"' \
  --limit=10 \
  --format="table(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)"

# Azure Activity Log:
az monitor activity-log list \
  --resource-id <RESOURCE_ID> \
  --start-time $(date -d '7 days ago' --iso-8601=seconds) \
  --query "[].{Caller:caller, Operation:operationName.localizedValue, Time:eventTimestamp}"
Expected output:
CloudTrail shows:
  {
    "Username": "alice",
    "EventName": "AuthorizeSecurityGroupIngress",
    "EventTime": "2026-03-18T22:15:00Z"
  }
This tells you: alice opened ingress from 0.0.0.0/0 yesterday at 22:15 UTC — contact alice for context.
If this fails: If audit logs are not available (not enabled or retention expired), document the gap and proceed with drift remediation based on current policy (Terraform code = source of truth unless there is a business reason to deviate).

Step 4: Decide — Apply (Revert to Code) or Import (Codify the Change)

Why: There are two valid responses to drift. You must choose consciously — accidentally applying when you should import destroys intentional manual changes.

# DECISION FRAMEWORK:
# Use terraform apply (revert to Terraform code) when:
#   - The manual change was accidental or unauthorized
#   - The Terraform code represents the desired state
#   - A security misconfiguration must be reverted immediately

# Use terraform import + code update (accept the change) when:
#   - The manual change was authorized and should be permanent
#   - The Terraform code needs to be updated to match reality
#   - The change cannot be easily reverted without service impact

# To preview what "apply" will change without executing:
terraform show drift.tfplan | grep -E "will be|must be"

# To import a resource (example: an S3 bucket):
terraform import aws_s3_bucket.<RESOURCE_NAME> <BUCKET_NAME>

# To import an EC2 instance:
terraform import aws_instance.<RESOURCE_NAME> <INSTANCE_ID>

# To import an AWS security group:
terraform import aws_security_group.<RESOURCE_NAME> <SG_ID>

echo "Document your decision: APPLY or IMPORT, and the reason, in the incident log"
Expected output:
For terraform import:
  "aws_s3_bucket.<RESOURCE_NAME>: Importing from ID \"<BUCKET_NAME>\"..."
  "aws_s3_bucket.<RESOURCE_NAME>: Import prepared!"
  "aws_s3_bucket.<RESOURCE_NAME>: Refreshing state... [id=<BUCKET_NAME>]"
  "Import successful!"
If this fails: If terraform import fails, the resource ID may be wrong. Check the cloud console for the exact resource identifier format required by the Terraform provider.
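On Terraform 1.5 or newer, the declarative import block plus config generation can replace the CLI import and produce the matching .tf code in one step. A sketch; the resource address aws_s3_bucket.logs and bucket name my-logs-bucket are placeholders:

```shell
# Terraform 1.5+ alternative to `terraform import` (address and ID are placeholders):
cat > import.tf <<'EOF'
import {
  to = aws_s3_bucket.logs   # resource address in your configuration
  id = "my-logs-bucket"     # for aws_s3_bucket, the import ID is the bucket name
}
EOF

# Then have Terraform generate matching HCL so the code matches reality:
#   terraform plan -generate-config-out=generated.tf
```

This avoids the most common import mistake (state imported but code never updated), because the generated file starts from the resource's actual attributes.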

Step 5A: Apply to Revert (if drift is accidental/unauthorized)

Why: When Terraform code is the source of truth and the manual change should not persist, apply the saved plan to bring infrastructure back to the desired state.

# Apply the saved drift plan (this will revert the manual changes):
# REVIEW the plan one more time before applying:
terraform show drift.tfplan

# Apply — this will make real changes to infrastructure:
terraform apply drift.tfplan

# Monitor apply output for errors:
# A successful apply ends with:
# "Apply complete! Resources: X added, Y changed, Z destroyed."
Expected output:
"aws_security_group.web: Modifying... [id=sg-XXXXXXXXX]"
"aws_security_group.web: Modifications complete after Xs [id=sg-XXXXXXXXX]"
"Apply complete! Resources: 0 added, 1 changed, 0 destroyed."
If this fails: If apply fails mid-way, Terraform may have partially applied changes. Run terraform plan again to see the current state and assess whether you need to retry or manually fix the partial change.

Step 5B: Update Terraform Code to Match Reality (if change is intentional)

Why: When the manual change should be kept, the Terraform code must be updated to match — otherwise the next terraform plan will show the same drift again and risk being accidentally reverted.

# After running terraform import (Step 4), update the .tf file to match:
# Example: if someone added tags to an S3 bucket, add those tags to the resource block in .tf

# Then verify the code matches reality (plan should show no changes):
terraform plan -detailed-exitcode
echo "Exit code: $?"
# Should print exit code 0 (no changes) if your code matches the imported state

# If plan shows remaining differences, update the .tf file until plan is clean
Expected output:
After updating .tf files and running terraform plan:
  "No changes. Your infrastructure matches the configuration."
  exit code: 0
If this fails: If you cannot get terraform plan to show no changes after importing, the Terraform provider may not support all attributes of the resource. Document the unsupported attributes as a known drift exception.

Step 6: Protect Critical Resources from Future Drift

Why: Some resources are so critical that accidental modification or deletion would be catastrophic. Terraform's lifecycle block can prevent this.

# Add lifecycle protection to critical resources in your .tf files:
# Example in a .tf file (do not run as a command — edit the file):
#
# resource "aws_s3_bucket" "critical_data" {
#   bucket = "my-critical-data-bucket"
#
#   lifecycle {
#     prevent_destroy = true          # blocks terraform destroy
#     ignore_changes  = [tags]        # ignores tag drift (if tags are managed separately)
#   }
# }

# After adding lifecycle blocks, verify the plan still looks correct:
terraform plan -detailed-exitcode
Expected output:
terraform plan exit code 0: no unexpected changes after adding lifecycle blocks.
Attempting to destroy a prevent_destroy resource now produces:
  "Error: Instance cannot be destroyed"
  "Resource <NAME> has lifecycle.prevent_destroy set, but the plan calls for this resource to be destroyed."
If this fails: If prevent_destroy causes a legitimate destroy operation to fail, you must explicitly remove the prevent_destroy = true line, apply, then re-add it. This is intentional friction.

Verification

# Confirm the issue is resolved
terraform plan -detailed-exitcode 2>&1 | tail -5
echo "Exit code: ${PIPESTATUS[0]}"   # bash: exit code of terraform, not tail
Success looks like: exit code 0, with terraform plan reporting "No changes. Your infrastructure matches the configuration."
If still broken: escalate — see below.

Escalation

  • Not resolved in 45 min → page Platform/Infra on-call: "P2: Terraform drift in <ENVIRONMENT> cannot be resolved — need senior help; changes include destroys"
  • Security misconfiguration drifted → page Security on-call: "Security drift: <RESOURCE> has been misconfigured (e.g., security group 0.0.0.0/0 added) — investigating source"
  • Evidence of unauthorized change → page Security on-call: "Security incident: Terraform drift caused by suspected unauthorized manual change to <RESOURCE>"
  • Scope expanding → page Platform/Infra on-call: "Drift affecting multiple critical resources — review required before applying any plan"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Enable drift detection as a scheduled CI job (run terraform plan -detailed-exitcode daily)
  • Add prevent_destroy = true to all stateful, critical resources
  • Configure CloudTrail/audit log alerts for manual changes to Terraform-managed resources
  • Review team processes — if drift is frequent, investigate why engineers are making manual changes
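The scheduled drift-detection job above can be as small as a wrapper script that alerts only on exit code 2. A sketch for cron or CI; the paging command is a placeholder for whatever alerting tool your team uses:

```shell
# Write a minimal daily drift-check script (sketch; adapt paths and alerting to your setup):
cat > drift-check.sh <<'EOF'
#!/usr/bin/env bash
# Daily drift check — exits 0 on no drift, 2 on drift, 1 on terraform errors.
set -uo pipefail
terraform plan -detailed-exitcode -input=false -lock=false > /tmp/drift.txt 2>&1
code=$?
case "$code" in
  0) echo "no drift" ;;
  2) echo "DRIFT DETECTED"
     tail -20 /tmp/drift.txt
     # your-paging-tool "Terraform drift detected"   # placeholder alert hook
     exit 2 ;;
  *) echo "terraform error (exit $code)"; exit 1 ;;
esac
EOF
chmod +x drift-check.sh
```

`-lock=false` keeps the read-only check from contending with real applies for the state lock.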

Common Mistakes

  1. Applying a drift plan without reviewing each change: A drift plan may include destroy operations that would delete data. Always review terraform show drift.tfplan before running terraform apply drift.tfplan.
  2. Not determining who made the manual change: Understanding the source tells you if it was authorized (should be codified) or unauthorized (security incident). Skipping this step means you may revert a deliberate, necessary change.
  3. Importing without updating Terraform code to match actual state: After terraform import, the resource is in the state file but the .tf code is still wrong. If the code doesn't match, the next terraform plan will show a diff — update the code until plan is clean.
  4. Using a stale plan file: If you run terraform plan -out=drift.tfplan and then significant time passes, the plan may no longer reflect current state. Always run terraform plan again before applying if more than a few minutes have passed.
  5. Applying during peak traffic without coordination: A drift plan may change security groups, load balancers, or DNS — notify the team and check if there is a maintenance window before applying in production.
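The stale-plan mistake (item 4) can be caught mechanically with a freshness guard before any apply. A sketch; the 10-minute threshold is an assumption, and the touch line exists only to make the demo self-contained:

```shell
# Guard: refuse to apply drift.tfplan if it is older than 10 minutes (threshold is an assumption)
plan_file=drift.tfplan
touch "$plan_file"   # demo only — in practice the file comes from `terraform plan -out=drift.tfplan`
mtime=$(stat -c %Y "$plan_file" 2>/dev/null || stat -f %m "$plan_file")   # GNU stat, then BSD/macOS
age=$(( $(date +%s) - mtime ))
if [ "$age" -gt 600 ]; then
  echo "STALE (${age}s old): re-run terraform plan before applying"
else
  echo "FRESH (${age}s old): safe to review and apply"
fi
```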

Cross-References

  • Topic Pack: training/library/topic-packs/cloud-terraform/ (deep background on Terraform state and drift)
  • Related Runbook: terraform-state-lock.md — if you can't run terraform plan due to a stuck lock
  • Related Runbook: capacity-limit.md — if terraform apply fails due to quota limits
  • Related Runbook: ../security/unauthorized-access.md — if the drift was caused by unauthorized access
