---
title: The Terraform State Disaster
tags: [lesson, terraform-state, locking, drift, workspaces, import, recovery, remote-backends, l2]
---

# The Terraform State Disaster

Topics: Terraform state, locking, drift, workspaces, import, recovery, remote backends
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic Terraform awareness helpful but not required
## The Mission

```shell
$ terraform apply
Error: Error locking state: Error acquiring the state lock

$ terraform plan
Error: Failed to load state: the state file is empty or corrupt
```
Terraform state is the most critical file in your infrastructure. It maps your .tf
configuration to real cloud resources. Without it, Terraform doesn't know what it manages —
and the recovery is painful. This lesson covers how state works, the disasters that happen
when it breaks, and how to prevent and recover from each one.
## What State Actually Is

Terraform state is a JSON file that maps your config to real infrastructure:

```json
{
  "resources": [{
    "type": "aws_instance",
    "name": "web",
    "instances": [{
      "attributes": {
        "id": "i-0abc123def456789",
        "ami": "ami-0abcdef1234567890",
        "instance_type": "t3.medium",
        "private_ip": "10.0.1.50"
      }
    }]
  }]
}
```
Without this file, Terraform sees no existing resources. `terraform plan` shows "create
everything from scratch" — even though the infrastructure already exists in AWS.

**Mental Model:** Terraform state is a photograph of your infrastructure. If someone moves the furniture after the photo, Terraform doesn't know. It compares the photo (state) to the blueprint (config) and proposes changes to make reality match the blueprint. If the photo is lost, Terraform thinks the room is empty and tries to buy all new furniture.
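Because state is plain JSON, you can extract an inventory of resource IDs from it with a few lines of shell, which pays off later if you ever have to re-import. A minimal sketch, with assumptions labeled: the `state_inventory` helper is hypothetical, and the inline `STATE` sample mirrors the JSON above; in real use you would pipe `terraform state pull` into it.

```shell
# Hypothetical helper: print "type.name -> cloud ID" for every resource
# in a state file read from stdin (uses python3 for JSON parsing).
state_inventory() {
  python3 -c '
import json, sys
state = json.load(sys.stdin)
for r in state.get("resources", []):
    for inst in r.get("instances", []):
        print("%s.%s -> %s" % (r["type"], r["name"], inst["attributes"]["id"]))
'
}

# Sample state shaped like the JSON above; real usage would be:
#   terraform state pull | state_inventory
STATE='{"resources":[{"type":"aws_instance","name":"web","instances":[{"attributes":{"id":"i-0abc123def456789"}}]}]}'
echo "$STATE" | state_inventory
# → aws_instance.web -> i-0abc123def456789
```

Keeping that output in version control or a runbook gives you the resource-ID inventory that a from-scratch import recovery requires.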
## Disaster 1: Local State, Multiple Operators

The default: `terraform.tfstate` is a local file. Two engineers run `terraform apply`
simultaneously. Both read the same state, both make changes, and one overwrites the other's
state file. Resources are orphaned — Terraform no longer knows they exist.

**Fix:** Remote state with locking.
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/main.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # Lock table
    encrypt        = true
  }
}
```
**Gotcha:** The DynamoDB lock table requires a partition key named exactly `LockID`, of type String. If you name it `id`, `lock_id`, or anything else, state locking silently fails — you get no error, no locking protection, and don't realize it until two people apply at once.
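For reference, the bucket and lock table can be created once with the AWS CLI. This is a sketch, assuming the bucket and table names from the backend block above; note the partition key is exactly `LockID`:

```shell
# One-time bootstrap (names match the example backend above)
aws s3api create-bucket --bucket mycompany-terraform-state --region us-east-1

# Versioning gives you state backups for free (see Disaster 4)
aws s3api put-bucket-versioning --bucket mycompany-terraform-state \
  --versioning-configuration Status=Enabled

# Lock table: partition key MUST be named LockID, type String (S)
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```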
## Disaster 2: terraform destroy on the Wrong Workspace

```shell
$ terraform workspace select staging   # Thought it was staging
$ terraform destroy                    # Actually production
```
**War Story:** An engineer needed to destroy a staging environment. They ran `terraform destroy` in what they thought was the staging workspace. The prompt said "Do you really want to destroy all resources?" They typed "yes." It was the production workspace. The PostgreSQL RDS instance, with the production database, was deleted. No automated backups were enabled (they'd been disabled "temporarily" 6 months ago). Recovery took 11 hours and involved restoring from a 24-hour-old manual snapshot — losing a full day of data.
**Prevention:**

```hcl
# Add workspace-aware safety checks
locals {
  is_production = terraform.workspace == "production"
}

# Prevent accidental destroy of critical resources
resource "aws_db_instance" "main" {
  # ...
  deletion_protection = local.is_production
  skip_final_snapshot = !local.is_production
}
```
Or better: don't use workspaces for environment separation. Use separate state files with separate backends:
```
# Separate directories, separate state, no confusion
infrastructure/
  production/ → own backend, own state
  staging/    → own backend, own state
  dev/        → own backend, own state
```
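If you are stuck with workspaces, a small wrapper can at least refuse to destroy production. A sketch under stated assumptions: the `check_workspace` helper is hypothetical, and in practice you would feed it the output of `terraform workspace show`.

```shell
# Hypothetical guard: refuse to run destroy against the production workspace.
check_workspace() {
  ws="$1"
  if [ "$ws" = "production" ]; then
    echo "refusing to destroy workspace '$ws'" >&2
    return 1
  fi
  echo "workspace '$ws' is safe to destroy"
}

# Real usage (not run here):
#   check_workspace "$(terraform workspace show)" && terraform destroy
check_workspace staging
# → workspace 'staging' is safe to destroy
```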
## Disaster 3: State Drift — Someone Edited Infrastructure Manually

Someone logged into the AWS console and changed an instance type from `t3.medium` to
`t3.large`. The state file still says `t3.medium`. The `.tf` config says `t3.medium`.
Next `terraform plan`:

```
# aws_instance.web will be updated in-place
~ resource "aws_instance" "web" {
    ~ instance_type = "t3.large" -> "t3.medium"  # Reverts the manual change!
  }
```
Terraform will revert the manual change because its job is to make reality match config.
```shell
# Detect drift
terraform plan  # Shows what would change to match config

# Accept the manual change into state (refresh state from reality)
terraform apply -refresh-only

# Or, if the resource is missing from state entirely, import it
terraform import aws_instance.web i-0abc123def456789
```
**Gotcha:** `terraform apply -refresh-only` updates state to match reality but doesn't update your `.tf` files. Now state and config disagree, and the next `terraform plan` will show changes. You need to also update the `.tf` file to match, or the drift returns.
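Drift can also be caught automatically: `terraform plan -detailed-exitcode` exits 0 when there are no changes, 1 on error, and 2 when the plan is non-empty, which a scheduled CI job can turn into an alert. A minimal sketch, where the `classify_plan_exit` helper is illustrative rather than part of Terraform:

```shell
# Map terraform plan -detailed-exitcode results to a human-readable verdict.
classify_plan_exit() {
  case "$1" in
    0) echo "no changes - state, config, and reality agree" ;;
    2) echo "non-empty plan - drift or pending changes" ;;
    *) echo "plan failed - investigate before doing anything" ;;
  esac
}

# Real usage in a nightly CI job (not run here):
#   terraform plan -detailed-exitcode -input=false
#   classify_plan_exit $?
```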
## Disaster 4: Corrupted or Lost State

The state file is empty, corrupt, or deleted entirely. Terraform thinks no infrastructure exists.

```shell
# Step 1: DO NOT RUN terraform apply — it will try to create everything
# (and fail because resources already exist, or worse, create duplicates)

# Step 2: Check if you have a backup
# S3 backend with versioning:
aws s3api list-object-versions --bucket mycompany-terraform-state --prefix production/main.tfstate

# Restore previous version:
aws s3api get-object --bucket mycompany-terraform-state \
  --key production/main.tfstate \
  --version-id "abc123" \
  terraform.tfstate.restored

# Step 3: If no backup, reconstruct state by importing resources
terraform import aws_instance.web i-0abc123def456789
terraform import aws_db_instance.main mydb-production
terraform import aws_vpc.main vpc-0abc123def
# ... for every resource
```
**Gotcha:** `terraform import` only imports into state — it doesn't generate `.tf` code. You need to write the resource blocks manually, then import. As of Terraform 1.5+, you can instead declare `import` blocks in config and run `terraform plan -generate-config-out=generated.tf` to generate HCL for the imported resources, but the output isn't perfect and needs manual cleanup.
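The config-driven flow looks like this. A sketch using the example IDs from this lesson:

```hcl
# import.tf — Terraform 1.5+ config-driven import
import {
  to = aws_instance.web
  id = "i-0abc123def456789"
}

import {
  to = aws_db_instance.main
  id = "mydb-production"
}
```

Running `terraform plan -generate-config-out=generated.tf` then writes candidate resource blocks for every `import` block into `generated.tf`; review and clean them up before applying.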
## Disaster 5: State Lock Stuck

Someone's `terraform apply` was interrupted (Ctrl+C, network drop, laptop crash). The
DynamoDB lock was acquired but never released. Everyone is blocked:

```
$ terraform plan
Error: Error locking state: Error acquiring the state lock:
Lock Info:
  ID:        abc123-def456
  Path:      production/main.tfstate
  Operation: OperationTypeApply
  Who:       alice@laptop
  Created:   2026-03-22 14:23:01
```

```shell
# Verify Alice isn't actually running something
# If confirmed she's not:
terraform force-unlock abc123-def456

# NEVER force-unlock if someone is running apply — it causes state corruption
```
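Before force-unlocking, you can also inspect the lock item directly in DynamoDB to confirm who holds it and since when. A sketch, assuming the `terraform-locks` table name from the backend example:

```shell
# List current lock items (the table is tiny, so a scan is fine)
aws dynamodb scan --table-name terraform-locks --output json
```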
## The Diagnostic Ladder

```
Terraform broken
│
├── State lock error?
│     terraform force-unlock <ID> (only if nobody is applying!)
│
├── Empty or corrupt state?
│     ├── S3 versioning enabled? → restore previous version
│     └── No backup? → terraform import for every resource
│
├── Plan shows destroying everything?
│     ├── Wrong workspace? → terraform workspace list
│     ├── Wrong backend? → check backend.tf
│     └── State refreshed to empty? → restore from backup
│
├── Plan shows unwanted changes?
│     ├── Manual console change? → terraform apply -refresh-only
│     └── Provider version change? → check .terraform.lock.hcl
│
└── Resources orphaned? (exist in cloud, not in state)
      terraform import <resource_type>.<name> <cloud_id>
```
## Flashcard Check

Q1: What does Terraform state contain?

A JSON mapping between your `.tf` resource blocks and real cloud resource IDs. Without it, Terraform doesn't know what infrastructure it manages.
Q2: Why use remote state with locking?
Local state + multiple operators = state corruption. Remote state (S3) provides shared access; DynamoDB locking prevents simultaneous writes.
Q3: `terraform plan` shows "destroy and recreate" for everything. Why?
State is missing, empty, or pointing to the wrong backend. Terraform thinks no resources exist. Do NOT apply — restore state from backup or import.
Q4: Someone changed an AWS resource manually. What does Terraform do?

Next `plan` proposes reverting the change (making reality match config). Use `terraform apply -refresh-only` to update state, then update `.tf` to match.
Q5: force-unlock — when is it safe?

Only when you've confirmed nobody is running `terraform apply`. If someone is, force-unlock causes state corruption (two writers, no lock).
## Cheat Sheet

| Task | Command |
|---|---|
| Check current workspace | `terraform workspace show` |
| List all workspaces | `terraform workspace list` |
| Refresh state from reality | `terraform apply -refresh-only` |
| Import existing resource | `terraform import TYPE.NAME CLOUD_ID` |
| Force-unlock state | `terraform force-unlock LOCK_ID` |
| Show state contents | `terraform state list` |
| Show specific resource | `terraform state show TYPE.NAME` |
| Move resource in state | `terraform state mv OLD NEW` |
| Remove from state (don't destroy) | `terraform state rm TYPE.NAME` |
| S3 state versions | `aws s3api list-object-versions --bucket BUCKET --prefix KEY` |
## Takeaways

- **State is the source of truth.** Lose it, and Terraform thinks your infrastructure doesn't exist. Enable S3 versioning and DynamoDB locking from day one.
- **Never use local state in a team.** Even for "just this one project." You will forget, and two people will apply simultaneously.
- **Separate state files per environment.** Don't use workspaces to separate prod/staging. Use separate directories with separate backends.
- **`terraform import` is tedious but essential.** When state is lost, importing every resource is the only recovery path. Keep an inventory of your resource IDs.
- **Manual console changes create drift.** Terraform reverts them on next apply. Either stop making manual changes, or `refresh-only` + update your `.tf` files.
## Related Lessons
- Terraform vs Ansible vs Helm — choosing the right IaC tool
- The Hanging Deploy — when deploys that call Terraform go wrong
- Permission Denied — IAM permissions for Terraform operations