- devops
- l2
- runbook
- terraform

Portal | Level: L2: Operations | Topics: Terraform | Domain: DevOps & Tooling
# Runbook: Terraform State Lock Stuck
| Field | Value |
|---|---|
| Domain | Cloud/Terraform |
| Alert | terraform plan or apply fails with "Error acquiring the state lock" |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | Terraform CLI, cloud provider CLI (aws/gcloud/az), state backend access (S3+DynamoDB or GCS or Terraform Cloud) |
## Quick Assessment (30 seconds)

```bash
# Run this first — it tells you the scope of the problem
terraform plan 2>&1 | grep -A10 "state lock\|Error acquiring"
```
## Step 1: Identify the Lock — Read All Lock Details
Why: The lock error message contains the Lock ID, the holder, and the operation in progress. You need all of this to verify whether the lock is legitimate or orphaned.
```bash
# Run terraform plan and capture the full lock error:
terraform plan 2>&1 | tee /tmp/tf-lock-error.txt
cat /tmp/tf-lock-error.txt

# The error output will look similar to this:
# Error: Error acquiring the state lock
#
# Error message: ConditionalCheckFailedException: ...
# Lock Info:
#   ID:        <LOCK_ID>
#   Path:      <STATE_FILE_PATH>
#   Operation: OperationTypePlan
#   Who:       <USERNAME>@<HOSTNAME>
#   Version:   <TERRAFORM_VERSION>
#   Created:   <TIMESTAMP>
#   Info:

echo "Note the Lock ID, Who, Created time, and Operation from the error message"
```
Example lock info:

```
Lock Info:
  ID:        f1234abc-5678-def0-1234-abcdef012345
  Path:      terraform/prod/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@build-runner-7
  Version:   1.5.7
  Created:   2026-03-19 14:23:01.123456789 +0000 UTC
```
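If you script this triage, the Lock ID can be pulled out of the captured error automatically. A minimal sketch; the helper name is hypothetical, and the capture file path assumes the `tee` command above:

```bash
# Hypothetical helper: extract the Lock ID from a captured lock-error file
extract_lock_id() {   # arg: path to the captured terraform output
  grep -oE 'ID:[[:space:]]+[0-9a-f-]+' "$1" | head -n1 | awk '{print $2}'
}
```

Usage: `extract_lock_id /tmp/tf-lock-error.txt` prints just the ID, ready to pass to `terraform force-unlock` in Step 3.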
Before touching the backend, verify which identity you are using: `aws sts get-caller-identity` (AWS) or `gcloud auth list` (GCP).
## Step 2: Verify Whether the Lock Holder Is Still Active
Why: Force-unlocking while an active terraform apply is running can corrupt the state file — this is the most dangerous mistake in Terraform operations. Always verify first.
```bash
# Check if the lock holder process is still running.

# Option A — check CI system (most common case):
# Go to your CI system (GitHub Actions / GitLab CI / Jenkins) and check for running jobs
# on the repo or pipeline that manages this Terraform workspace.
# GitHub Actions:
gh run list --status in_progress --limit 10

# Option B — check with your team:
# Ping the person in the "Who" field of the lock info, or check Slack/Teams for
# anyone who recently ran terraform apply.

# Option C — check how old the lock is:
# If "Created" is more than 2 hours ago, it is almost certainly orphaned.
# Normal terraform operations take <30 minutes; anything older is stale.
date -u
# Compare to the lock "Created" timestamp above.
echo "If the lock is >2 hours old AND no CI job is running, it is safe to force-unlock"
```
Safe to force-unlock if ALL of the following are true:
1. No CI jobs for this Terraform workspace are currently running
2. No colleague is running terraform apply locally
3. The lock is older than the maximum expected terraform apply duration (usually 30-60 min)
NOT safe to force-unlock if:
- A CI job is running that matches the lock holder's hostname
- A colleague confirms they are mid-apply
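The age check in Option C can be made mechanical. A sketch over epoch timestamps; on GNU systems `date -u -d "<Created>" +%s` converts the lock's Created field (BSD/macOS `date` needs `-j -f` instead), and the 120-minute threshold is an assumption you should tune to your longest expected apply:

```bash
# Minutes elapsed between two epoch timestamps
lock_age_minutes() {   # args: created_epoch now_epoch
  echo $(( ($2 - $1) / 60 ))
}

# Exit 0 if the lock is older than 120 minutes (assumed threshold)
likely_orphaned() {    # args: created_epoch now_epoch
  [ "$(lock_age_minutes "$1" "$2")" -gt 120 ]
}
```

Example: `likely_orphaned "$(date -u -d '2026-03-19 14:23:01' +%s)" "$(date -u +%s)" && echo "lock looks orphaned"`. This only covers criterion 3 — criteria 1 and 2 still require checking CI and asking colleagues.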
## Step 3: Force-Unlock the State
Why: If the lock holder is definitively gone (orphaned lock from a crashed process), you must remove the lock to allow any Terraform operations to proceed.
```bash
# Force-unlock using the Lock ID from Step 1:
# (Terraform will ask for confirmation — type "yes")
terraform force-unlock <LOCK_ID>

# If you need to skip the confirmation prompt (e.g., in an automated script):
terraform force-unlock -force <LOCK_ID>
```
Terraform will prompt:

```
Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
```

After typing "yes":

```
Terraform state has been successfully unlocked!
```
If `terraform force-unlock` fails with a permissions error, you need write access to the state backend. Check your AWS/GCP credentials and ensure you have the correct IAM permissions for the DynamoDB lock table (AWS) or GCS bucket (GCP).
## Step 4: Verify the Lock Is Gone Using the Backend Directly (Optional but Recommended)
Why: Confirming the lock is gone at the backend level (not just in Terraform's view) gives you confidence that a subsequent terraform plan will succeed.
```bash
# AWS S3 + DynamoDB backend — check the lock table directly:
aws dynamodb get-item \
  --table-name <LOCK_TABLE_NAME> \
  --key '{"LockID": {"S": "<STATE_FILE_PATH>"}}' \
  --region <REGION>

# If the item still exists after force-unlock (rare), delete it manually:
aws dynamodb delete-item \
  --table-name <LOCK_TABLE_NAME> \
  --key '{"LockID": {"S": "<STATE_FILE_PATH>"}}' \
  --region <REGION>

# GCS backend — check for the lock file:
gsutil ls gs://<BUCKET_NAME>/<STATE_PATH>.lock
# If the lock file exists, delete it:
gsutil rm gs://<BUCKET_NAME>/<STATE_PATH>.lock

# Terraform Cloud / Enterprise — unlock via the UI:
# Go to Terraform Cloud → Workspace → Settings → Force Unlock
```
Expected results:
- AWS DynamoDB `get-item`: after force-unlock, should return an empty response (no item). If it still returns an item with the lock data, run the manual `delete-item` step.
- GCS: `gsutil ls` should return "no URLs matched" after unlock.
For the DynamoDB checks you need `dynamodb:GetItem` and `dynamodb:DeleteItem` permissions on the lock table ARN.
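A minimal IAM policy covering those operations might look like the following sketch; the ARN is a placeholder, and `dynamodb:PutItem` is included because Terraform also needs it to take locks during normal operation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateLockTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:<REGION>:<ACCOUNT_ID>:table/<LOCK_TABLE_NAME>"
    }
  ]
}
```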
## Step 5: Re-Run Terraform Plan to Confirm
Why: The ultimate test is that Terraform can now successfully acquire the lock and run — verify before declaring the incident resolved.
```bash
# Run terraform plan to confirm the lock is gone and the workspace is healthy:
terraform plan -out=tfplan

# If you want to apply the plan immediately (only if that was the original intent):
terraform apply tfplan
```
Expected: `terraform plan` succeeds without any lock errors:
- "Refreshing Terraform state in-memory prior to plan..."
- "No changes. Your infrastructure matches the configuration." (or a plan showing expected changes)
- No "Error acquiring the state lock" message.
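If this verification runs in automation, the pass/fail condition can be scripted. A sketch with a hypothetical helper that inspects captured plan output (requires bash for the here-string):

```bash
# Exit 0 if captured terraform output contains no state-lock error
lock_cleared() {   # arg: captured terraform plan output
  ! grep -q "Error acquiring the state lock" <<<"$1"
}
```

Usage: `out=$(terraform plan -out=tfplan 2>&1); lock_cleared "$out" || exit 1`.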
## Step 6: Investigate Why the Lock Was Orphaned
Why: Orphaned locks are usually symptoms of crashed CI runners, network timeouts, or manual Ctrl+C interruptions. Understanding the cause prevents recurrence.
```bash
# Check CI system logs for the job that matches the lock holder:
# Look for job failures, timeout errors, or "runner lost connection" messages
# around the lock's "Created" timestamp.

# Common causes:
# 1. CI runner was killed/timed out mid-apply
# 2. Developer ran terraform apply locally and interrupted it with Ctrl+C
# 3. Network connectivity loss during apply
# 4. Instance running Terraform was terminated (spot instance, autoscaling)

# To prevent future orphaned locks:
# - Set a CI job timeout that matches your longest expected terraform apply
# - Use Terraform Cloud, which handles locking more robustly
# - Add a CI cleanup step that runs terraform force-unlock if the job fails unexpectedly

echo "Document the root cause in the incident log"
```
Expected outcome — a clear root cause, e.g.:
- "CI runner instance was a spot instance that was reclaimed during the apply"
- "Developer confirmed they hit Ctrl+C while running terraform apply locally"
- "GitHub Actions job timed out after 60 minutes — apply was still in progress"
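The first prevention measure (a CI job timeout) can be sketched as a GitHub Actions fragment; the job name, the 45-minute value, and the workflow layout are assumptions to adapt to your pipeline:

```yaml
jobs:
  terraform-apply:
    runs-on: ubuntu-latest
    # Cap how long this job can hold the state lock; if the runner hangs,
    # GitHub kills the job instead of leaving the lock held indefinitely
    timeout-minutes: 45
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve
```

Even with a timeout, a killed job can still leave a lock behind, so pair this with the cleanup step mentioned above.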
## Verification
Success looks like: `terraform plan` runs without any "Error acquiring the state lock" message and completes successfully.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform/Infra on-call | "P2: Terraform state lock stuck for >30 min, force-unlock not working, backend may be degraded" |
| Backend appears down | Cloud/Infra on-call | "Terraform state backend ( |
| Security incident | Security on-call | "Security incident: unauthorized terraform apply in progress on |
| State file may be corrupted | Platform/Infra on-call | "Terraform state file may be corrupted after a failed apply — need expert review before proceeding" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add CI job timeouts to cap the maximum time any Terraform job can hold a lock
- Consider adding a "lock watchdog" alert: if any lock is held for >60 min, page the on-call
- Review if the orphaned lock caused a partial apply — run `terraform plan` to check for drift
- Document the root cause in the team's runbook learnings
## Common Mistakes
- Force-unlocking without verifying no active apply is running: This is the most dangerous mistake — if an apply is in progress and you force-unlock, two concurrent applies can run simultaneously and corrupt the state file. Always check CI and ask colleagues first.
- Not checking CI system for concurrent runs: Automated CI pipelines often trigger multiple runs. A lock held by a CI job that is still running should not be force-unlocked.
- Running terraform apply without a plan first after force-unlock: After a force-unlock, always run `terraform plan` first to understand the state before applying anything.
- Ignoring the lock's age: A 5-minute-old lock may be an active apply. A 3-hour-old lock is almost certainly orphaned. Use the "Created" timestamp to calibrate your urgency.
- Not investigating why the lock was orphaned: Force-unlocking is a workaround. Without fixing the root cause (CI timeout, spot instance interruption, etc.), the same orphaned lock will recur.
## Cross-References
- Topic Pack: `training/library/topic-packs/cloud-terraform/` (deep background on Terraform state management)
- Related Runbook: drift-detection.md — if a partial apply left infrastructure in a drifted state
- Related Runbook: capacity-limit.md — if the terraform apply was failing due to quota limits and the runner timed out