---
title: The Terraform State Disaster
tags: [lesson, terraform-state, locking, drift, workspaces, import, recovery, remote-backends, l2]
---

# The Terraform State Disaster

Topics: Terraform state, locking, drift, workspaces, import, recovery, remote backends
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic Terraform awareness helpful but not required
## The Mission

```shell
$ terraform apply
Error: Error locking state: Error acquiring the state lock

$ terraform plan
Error: Failed to load state: the state file is empty or corrupt
```
Terraform state is the most critical file in your infrastructure. It maps your .tf
configuration to real cloud resources. Without it, Terraform doesn't know what it manages —
and the recovery is painful. This lesson covers how state works, the disasters that happen
when it breaks, and how to prevent and recover from each one.
## What State Actually Is

Terraform state is a JSON file that maps your config to real infrastructure:

```json
{
  "resources": [{
    "type": "aws_instance",
    "name": "web",
    "instances": [{
      "attributes": {
        "id": "i-0abc123def456789",
        "ami": "ami-0abcdef1234567890",
        "instance_type": "t3.medium",
        "private_ip": "10.0.1.50"
      }
    }]
  }]
}
```
Without this file, Terraform sees no existing resources. `terraform plan` shows "create
everything from scratch" — even though the infrastructure already exists in AWS.

**Mental Model:** Terraform state is a photograph of your infrastructure. If someone moves the furniture after the photo, Terraform doesn't know. It compares the photo (state) to the blueprint (config) and proposes changes to make reality match the blueprint. If the photo is lost, Terraform thinks the room is empty and tries to buy all new furniture.
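Because state is plain JSON, you can extract an inventory of resource IDs from it with a few lines of shell, which pays off later if you ever have to re-import. A minimal sketch, with assumptions labeled: the `state_inventory` helper is hypothetical, and the inline `STATE` sample mirrors the JSON above; in real use you would pipe `terraform state pull` into it.

```shell
# Hypothetical helper: print "type.name -> cloud ID" for every resource
# in a state file read from stdin (uses python3 for JSON parsing).
state_inventory() {
  python3 -c '
import json, sys
state = json.load(sys.stdin)
for r in state.get("resources", []):
    for inst in r.get("instances", []):
        print("%s.%s -> %s" % (r["type"], r["name"], inst["attributes"]["id"]))
'
}

# Sample state shaped like the JSON above; real usage would be:
#   terraform state pull | state_inventory
STATE='{"resources":[{"type":"aws_instance","name":"web","instances":[{"attributes":{"id":"i-0abc123def456789"}}]}]}'
echo "$STATE" | state_inventory
# → aws_instance.web -> i-0abc123def456789
```

Keeping that output in version control or a runbook gives you the resource-ID inventory that a from-scratch import recovery requires.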
## Disaster 1: Local State, Multiple Operators

The default: `terraform.tfstate` is a local file. Two engineers run `terraform apply`
simultaneously. Both read the same state, both make changes, and one overwrites the other's
state file. Resources are orphaned — Terraform no longer knows they exist.

**Fix:** Remote state with locking.
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/main.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # Lock table
    encrypt        = true
  }
}
```
**Gotcha:** The DynamoDB lock table requires a partition key named exactly `LockID`, of type String. If you name it `id`, `lock_id`, or anything else, state locking silently fails — you get no error, no locking protection, and don't realize it until two people apply at once.
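For reference, the bucket and lock table can be created once with the AWS CLI. This is a sketch, assuming the bucket and table names from the backend block above; note the partition key is exactly `LockID`:

```shell
# One-time bootstrap (names match the example backend above)
aws s3api create-bucket --bucket mycompany-terraform-state --region us-east-1

# Versioning gives you state backups for free (see Disaster 4)
aws s3api put-bucket-versioning --bucket mycompany-terraform-state \
  --versioning-configuration Status=Enabled

# Lock table: partition key MUST be named LockID, type String (S)
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```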
## Disaster 2: terraform destroy on the Wrong Workspace

```shell
$ terraform workspace select staging   # Thought it was staging
$ terraform destroy                    # Actually production
```
**War Story:** An engineer needed to destroy a staging environment. They ran `terraform destroy` in what they thought was the staging workspace. The prompt said "Do you really want to destroy all resources?" They typed "yes." It was the production workspace. The PostgreSQL RDS instance, with the production database, was deleted. No automated backups were enabled (they'd been disabled "temporarily" 6 months ago). Recovery took 11 hours and involved restoring from a 24-hour-old manual snapshot — losing a full day of data.
**Prevention:**

```hcl
# Add workspace-aware safety checks
locals {
  is_production = terraform.workspace == "production"
}

# Prevent accidental destroy of critical resources
resource "aws_db_instance" "main" {
  # ...
  deletion_protection = local.is_production
  skip_final_snapshot = !local.is_production
}
```
Or better: don't use workspaces for environment separation. Use separate state files with separate backends:
```
# Separate directories, separate state, no confusion
infrastructure/
  production/ → own backend, own state
  staging/    → own backend, own state
  dev/        → own backend, own state
```
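If you are stuck with workspaces, a small wrapper can at least refuse to destroy production. A sketch under stated assumptions: the `check_workspace` helper is hypothetical, and in practice you would feed it the output of `terraform workspace show`.

```shell
# Hypothetical guard: refuse to run destroy against the production workspace.
check_workspace() {
  ws="$1"
  if [ "$ws" = "production" ]; then
    echo "refusing to destroy workspace '$ws'" >&2
    return 1
  fi
  echo "workspace '$ws' is safe to destroy"
}

# Real usage (not run here):
#   check_workspace "$(terraform workspace show)" && terraform destroy
check_workspace staging
# → workspace 'staging' is safe to destroy
```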
## Disaster 3: State Drift — Someone Edited Infrastructure Manually

Someone logged into the AWS console and changed an instance type from `t3.medium` to
`t3.large`. The state file still says `t3.medium`. The `.tf` config says `t3.medium`.
Next `terraform plan`:

```
# aws_instance.web will be updated in-place
~ resource "aws_instance" "web" {
    ~ instance_type = "t3.large" -> "t3.medium"  # Reverts the manual change!
  }
```
Terraform will revert the manual change because its job is to make reality match config.
```shell
# Detect drift
terraform plan  # Shows what would change to match config

# Accept the manual change into state (refresh state from reality)
terraform apply -refresh-only

# Or, if the resource is missing from state entirely, import it
terraform import aws_instance.web i-0abc123def456789
```
**Gotcha:** `terraform apply -refresh-only` updates state to match reality but doesn't update your `.tf` files. Now state and config disagree, and the next `terraform plan` will show changes. You need to also update the `.tf` file to match, or the drift returns.
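Drift can also be caught automatically: `terraform plan -detailed-exitcode` exits 0 when there are no changes, 1 on error, and 2 when the plan is non-empty, which a scheduled CI job can turn into an alert. A minimal sketch, where the `classify_plan_exit` helper is illustrative rather than part of Terraform:

```shell
# Map terraform plan -detailed-exitcode results to a human-readable verdict.
classify_plan_exit() {
  case "$1" in
    0) echo "no changes - state, config, and reality agree" ;;
    2) echo "non-empty plan - drift or pending changes" ;;
    *) echo "plan failed - investigate before doing anything" ;;
  esac
}

# Real usage in a nightly CI job (not run here):
#   terraform plan -detailed-exitcode -input=false
#   classify_plan_exit $?
```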
## Disaster 4: Corrupted or Lost State

The state file is empty, corrupt, or deleted entirely. Terraform thinks no infrastructure exists.

```shell
# Step 1: DO NOT RUN terraform apply — it will try to create everything
# (and fail because resources already exist, or worse, create duplicates)

# Step 2: Check if you have a backup
# S3 backend with versioning:
aws s3api list-object-versions --bucket mycompany-terraform-state --prefix production/main.tfstate

# Restore previous version:
aws s3api get-object --bucket mycompany-terraform-state \
  --key production/main.tfstate \
  --version-id "abc123" \
  terraform.tfstate.restored

# Step 3: If no backup, reconstruct state by importing resources
terraform import aws_instance.web i-0abc123def456789
terraform import aws_db_instance.main mydb-production
terraform import aws_vpc.main vpc-0abc123def
# ... for every resource
```
**Gotcha:** `terraform import` only imports into state — it doesn't generate `.tf` code. You need to write the resource blocks manually, then import. As of Terraform 1.5+, you can instead declare `import` blocks in config and run `terraform plan -generate-config-out=generated.tf` to generate HCL for the imported resources, but the output isn't perfect and needs manual cleanup.
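The config-driven flow looks like this. A sketch using the example IDs from this lesson:

```hcl
# import.tf — Terraform 1.5+ config-driven import
import {
  to = aws_instance.web
  id = "i-0abc123def456789"
}

import {
  to = aws_db_instance.main
  id = "mydb-production"
}
```

Running `terraform plan -generate-config-out=generated.tf` then writes candidate resource blocks for every `import` block into `generated.tf`; review and clean them up before applying.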
## Disaster 5: State Lock Stuck

Someone's `terraform apply` was interrupted (Ctrl+C, network drop, laptop crash). The
DynamoDB lock was acquired but never released. Everyone is blocked:

```
$ terraform plan
Error: Error locking state: Error acquiring the state lock:
Lock Info:
  ID:        abc123-def456
  Path:      production/main.tfstate
  Operation: OperationTypeApply
  Who:       alice@laptop
  Created:   2026-03-22 14:23:01
```

```shell
# Verify Alice isn't actually running something
# If confirmed she's not:
terraform force-unlock abc123-def456

# NEVER force-unlock if someone is running apply — it causes state corruption
```
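Before force-unlocking, you can also inspect the lock item directly in DynamoDB to confirm who holds it and since when. A sketch, assuming the `terraform-locks` table name from the backend example:

```shell
# List current lock items (the table is tiny, so a scan is fine)
aws dynamodb scan --table-name terraform-locks --output json
```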
## The Diagnostic Ladder

```
Terraform broken
│
├── State lock error?
│     terraform force-unlock <ID> (only if nobody is applying!)
│
├── Empty or corrupt state?
│     ├── S3 versioning enabled? → restore previous version
│     └── No backup? → terraform import for every resource
│
├── Plan shows destroying everything?
│     ├── Wrong workspace? → terraform workspace list
│     ├── Wrong backend? → check backend.tf
│     └── State refreshed to empty? → restore from backup
│
├── Plan shows unwanted changes?
│     ├── Manual console change? → terraform apply -refresh-only
│     └── Provider version change? → check .terraform.lock.hcl
│
└── Resources orphaned? (exist in cloud, not in state)
      terraform import <resource_type>.<name> <cloud_id>
```
## Flashcard Check

Q1: What does Terraform state contain?

A JSON mapping between your `.tf` resource blocks and real cloud resource IDs. Without it, Terraform doesn't know what infrastructure it manages.
Q2: Why use remote state with locking?
Local state + multiple operators = state corruption. Remote state (S3) provides shared access; DynamoDB locking prevents simultaneous writes.
Q3: `terraform plan` shows "destroy and recreate" for everything. Why?
State is missing, empty, or pointing to the wrong backend. Terraform thinks no resources exist. Do NOT apply — restore state from backup or import.
Q4: Someone changed an AWS resource manually. What does Terraform do?

Next `plan` proposes reverting the change (making reality match config). Use `terraform apply -refresh-only` to update state, then update `.tf` to match.
Q5: force-unlock — when is it safe?

Only when you've confirmed nobody is running `terraform apply`. If someone is, force-unlock causes state corruption (two writers, no lock).
## Cheat Sheet

| Task | Command |
|---|---|
| Check current workspace | `terraform workspace show` |
| List all workspaces | `terraform workspace list` |
| Refresh state from reality | `terraform apply -refresh-only` |
| Import existing resource | `terraform import TYPE.NAME CLOUD_ID` |
| Force-unlock state | `terraform force-unlock LOCK_ID` |
| Show state contents | `terraform state list` |
| Show specific resource | `terraform state show TYPE.NAME` |
| Move resource in state | `terraform state mv OLD NEW` |
| Remove from state (don't destroy) | `terraform state rm TYPE.NAME` |
| S3 state versions | `aws s3api list-object-versions --bucket BUCKET --prefix KEY` |
## Takeaways

- **State is the source of truth.** Lose it, and Terraform thinks your infrastructure doesn't exist. Enable S3 versioning and DynamoDB locking from day one.
- **Never use local state in a team.** Even for "just this one project." You will forget, and two people will apply simultaneously.
- **Separate state files per environment.** Don't use workspaces to separate prod/staging. Use separate directories with separate backends.
- **`terraform import` is tedious but essential.** When state is lost, importing every resource is the only recovery path. Keep an inventory of your resource IDs.
- **Manual console changes create drift.** Terraform reverts them on next apply. Either stop making manual changes, or `refresh-only` + update your `.tf` files.
## Related Lessons
- Terraform vs Ansible vs Helm — choosing the right IaC tool
- The Hanging Deploy — when deploys that call Terraform go wrong
- Permission Denied — IAM permissions for Terraform operations