
Runbook: Terraform State Lock Stuck

Domain: Cloud/Terraform
Alert: terraform plan or apply fails with "Error acquiring the state lock"
Severity: P2
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 30 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: Terraform CLI, cloud provider CLI (aws/gcloud/az), state backend access (S3+DynamoDB, GCS, or Terraform Cloud)

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
terraform plan 2>&1 | grep -A10 "state lock\|Error acquiring"
If output shows a Lock ID and an ID/path for the holder → the lock info tells you who holds it; proceed to Step 2 to verify whether that process is still alive.
If output shows "Error: Failed to retrieve state" or backend connectivity errors → this is a different problem (backend unreachable), not a stuck lock; check your cloud credentials and VPN connectivity.

Step 1: Identify the Lock — Read All Lock Details

Why: The lock error message contains the Lock ID, the holder, and the operation in progress. You need all of this to verify whether the lock is legitimate or orphaned.

# Run terraform plan and capture the full lock error:
terraform plan 2>&1 | tee /tmp/tf-lock-error.txt
cat /tmp/tf-lock-error.txt

# The error output will look similar to this:
# Error: Error acquiring the state lock
#
# Error message: ConditionalCheckFailedException: ...
# Lock Info:
#   ID:        <LOCK_ID>
#   Path:      <STATE_FILE_PATH>
#   Operation: OperationTypePlan
#   Who:       <USERNAME>@<HOSTNAME>
#   Version:   <TERRAFORM_VERSION>
#   Created:   <TIMESTAMP>
#   Info:
echo "Note the Lock ID, Who, Created time, and Operation from the error message"
Expected output:
Lock Info:
  ID:        f1234abc-5678-def0-1234-abcdef012345
  Path:      terraform/prod/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@build-runner-7
  Version:   1.5.7
  Created:   2026-03-19 14:23:01.123456789 +0000 UTC
If this fails: If the terraform command hangs instead of erroring, the backend itself may be unreachable. Interrupt with Ctrl+C and check your cloud credentials: aws sts get-caller-identity (AWS) or gcloud auth list (GCP).
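If later steps need the lock details, they can be pulled out of the captured error with awk. A minimal sketch, assuming the /tmp/tf-lock-error.txt file tee'd above and the lock-info layout shown in the sample output:

```shell
# Extract the Lock ID and holder from the captured error output.
# Field names ("ID:", "Who:") match the Lock Info block above.
ERR_FILE=/tmp/tf-lock-error.txt
LOCK_ID=$(awk '$1 == "ID:" {print $2; exit}' "$ERR_FILE" 2>/dev/null)
LOCK_WHO=$(awk '$1 == "Who:" {print $2; exit}' "$ERR_FILE" 2>/dev/null)
echo "Lock ${LOCK_ID:-<none found>} is held by ${LOCK_WHO:-<unknown>}"
```

Keeping LOCK_ID in a variable avoids copy-paste mistakes when you reach the force-unlock step.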

Step 2: Verify Whether the Lock Holder Is Still Active

Why: Force-unlocking while an active terraform apply is running can corrupt the state file — this is the most dangerous mistake in Terraform operations. Always verify first.

# Check if the lock holder process is still running:
# Option A — check CI system (most common case):
# Go to your CI system (GitHub Actions / GitLab CI / Jenkins) and check for running jobs
# on the repo or pipeline that manages this Terraform workspace.
# GitHub Actions:
gh run list --status in_progress --limit 10

# Option B — check with your team:
# Ping the person in the "Who" field of the lock info, or check Slack/Teams for
# anyone who recently ran terraform apply.

# Option C — check how old the lock is:
# If "Created" is more than 2 hours ago, it is almost certainly orphaned.
# Normal terraform operations take <30 minutes; anything older is stale.
date -u
# Compare to the lock "Created" timestamp above.
echo "If the lock is >2 hours old AND no CI job is running, it is safe to force-unlock"
Expected output:
Safe to force-unlock if ALL of the following are true:
  1. No CI jobs for this Terraform workspace are currently running
  2. No colleague is running terraform apply locally
  3. The lock is older than the maximum expected terraform apply duration (usually 30-60 min)

NOT safe to force-unlock if:
  - A CI job is running that matches the lock holder's hostname
  - A colleague confirms they are mid-apply
If this fails: If you cannot determine whether the lock holder is active, ask in your team channel before proceeding. When in doubt, wait 30 minutes and try again.
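The age comparison in Option C can be scripted. A minimal sketch, assuming GNU date (on macOS, use gdate from coreutils) and a Created timestamp pasted from Step 1, trimmed to whole seconds and the numeric offset, since GNU date does not accept the trailing "UTC" that Terraform prints:

```shell
# Compute the lock's age in minutes. LOCK_CREATED is an example value;
# paste the "Created" field from Step 1 (drop the fractional seconds
# and the trailing "UTC").
LOCK_CREATED="2026-03-19 14:23:01 +0000"
now_s=$(date -u +%s)
created_s=$(date -u -d "$LOCK_CREATED" +%s)
age_min=$(( (now_s - created_s) / 60 ))
echo "Lock age: ${age_min} minutes"
if [ "$age_min" -gt 120 ]; then
  echo "Older than 2 hours: likely orphaned (still confirm no CI job is running)"
else
  echo "Recent lock: verify the holder before any force-unlock"
fi
```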

Step 3: Force-Unlock the State

Why: If the lock holder is definitively gone (orphaned lock from a crashed process), you must remove the lock to allow any Terraform operations to proceed.

# Force-unlock using the Lock ID from Step 1:
# (Terraform will ask for confirmation — type "yes")
terraform force-unlock <LOCK_ID>

# If you need to skip the confirmation prompt (e.g., automated script):
terraform force-unlock -force <LOCK_ID>
Expected output:
Terraform will prompt:
  "Do you really want to force-unlock?
   Terraform will remove the lock on the remote state.
   This will allow local Terraform commands to modify this state, even though it
may still be in use. Only 'yes' will be accepted to confirm."

After typing "yes":
  "Terraform state has been successfully unlocked!"
If this fails: If terraform force-unlock fails with a permissions error, you need write access to the state backend. Check your AWS/GCP credentials and ensure you have the correct IAM permissions for the DynamoDB table (AWS) or GCS bucket (GCP).
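Teams that script this step can wrap the command in a small guard against the classic mistakes (empty Lock ID, accidental unattended runs). This wrapper is illustrative only, not a Terraform feature; the function name and DRY_RUN variable are assumptions:

```shell
# Guarded wrapper around force-unlock: refuses an empty Lock ID and
# prints the exact command before executing. Set DRY_RUN=1 to print only.
force_unlock_guarded() {
  lock_id="$1"
  if [ -z "$lock_id" ]; then
    echo "refusing to unlock: no Lock ID supplied" >&2
    return 1
  fi
  echo "about to run: terraform force-unlock $lock_id"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  terraform force-unlock "$lock_id"
}

DRY_RUN=1
force_unlock_guarded "f1234abc-5678-def0-1234-abcdef012345"
```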

Step 4: Verify the Lock Is Released at the Backend

Why: Confirming the lock is gone at the backend level (not just in Terraform's view) gives you confidence that a subsequent terraform plan will succeed.

# AWS S3 + DynamoDB backend — check the lock table directly:
aws dynamodb get-item \
  --table-name <LOCK_TABLE_NAME> \
  --key '{"LockID": {"S": "<STATE_FILE_PATH>"}}' \
  --region <REGION>

# If the item still exists after force-unlock (rare), delete it manually:
aws dynamodb delete-item \
  --table-name <LOCK_TABLE_NAME> \
  --key '{"LockID": {"S": "<STATE_FILE_PATH>"}}' \
  --region <REGION>

# GCS backend — check for the lock file (the GCS backend writes it next
# to the state object with a .tflock suffix):
gsutil ls gs://<BUCKET_NAME>/<STATE_PATH>.tflock

# If the lock file exists and the holder is confirmed gone, delete it:
gsutil rm gs://<BUCKET_NAME>/<STATE_PATH>.tflock

# Terraform Cloud / Enterprise — unlock via the UI:
# Go to Terraform Cloud → Workspace → Settings → Force Unlock
Expected output:
AWS DynamoDB get-item: after force-unlock, should return an empty response (no item).
If get-item still returns an item with the lock data, the manual delete step is needed.
GCS: gsutil ls should report "One or more URLs matched no objects." after unlock.
If this fails: If the DynamoDB table is inaccessible, check that your AWS credentials have dynamodb:GetItem and dynamodb:DeleteItem permissions on the lock table ARN.
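The DynamoDB check can be scripted for repeated use. A sketch, assuming the S3+DynamoDB backend; LOCK_TABLE and STATE_PATH are placeholders for your lock table name and state key, and note that the AWS CLI's text output prints "None" when no item exists:

```shell
# Scripted version of the get-item check above. Placeholders must be
# replaced with your real table name, state key, and region.
LOCK_TABLE="<LOCK_TABLE_NAME>"
STATE_PATH="<STATE_FILE_PATH>"
item=$(aws dynamodb get-item \
  --table-name "$LOCK_TABLE" \
  --key "{\"LockID\": {\"S\": \"$STATE_PATH\"}}" \
  --query Item --output text 2>/dev/null)
# Text output is "None" (not empty) when the item is absent.
if [ -n "$item" ] && [ "$item" != "None" ]; then
  echo "lock item still present -- run the delete-item command above"
else
  echo "lock table clean"
fi
```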

Step 5: Re-Run Terraform Plan to Confirm

Why: The ultimate test is that Terraform can now successfully acquire the lock and run — verify before declaring the incident resolved.

# Run terraform plan to confirm the lock is gone and the workspace is healthy:
terraform plan -out=tfplan

# If you want to apply the plan immediately (only if that was the original intent):
terraform apply tfplan
Expected output:
terraform plan succeeds without any lock errors: the usual "Refreshing state..." lines, then
  "No changes. Your infrastructure matches the configuration." (or a plan showing expected changes)

No "Error acquiring the state lock" message.
If this fails: If the lock error appears again immediately after force-unlock with a NEW lock ID, someone else just acquired the lock. Re-check Step 2 — coordinate with your team.
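The confirmation can be made scriptable with terraform plan's -detailed-exitcode flag: exit 0 means no changes, 2 means changes are pending, and 1 means an error (including lock errors). A sketch:

```shell
# Distinguish "clean", "drifted", and "still broken" without parsing output.
terraform plan -detailed-exitcode -out=tfplan
case $? in
  0) echo "state matches configuration -- incident can be closed" ;;
  2) echo "changes pending -- review tfplan; the interrupted apply may have left drift" ;;
  *) echo "plan still failing -- re-check the lock and backend before proceeding" ;;
esac
```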

Step 6: Investigate Why the Lock Was Orphaned

Why: Orphaned locks are usually symptoms of crashed CI runners, network timeouts, or manual Ctrl+C interruptions. Understanding the cause prevents recurrence.

# Check CI system logs for the job that matches the lock holder:
# Look for job failures, timeout errors, or "runner lost connection" messages
# around the lock's "Created" timestamp.

# Common causes:
# 1. CI runner was killed/timed out mid-apply
# 2. Developer ran terraform apply locally and interrupted it with Ctrl+C
# 3. Network connectivity loss during apply
# 4. Instance running Terraform was terminated (spot instance, autoscaling)

# To prevent future orphaned locks:
# - Set a CI job timeout that matches your longest expected terraform apply
# - Use Terraform Cloud which handles locking more robustly
# - Add a CI cleanup step that runs terraform force-unlock if the job fails unexpectedly
echo "Document the root cause in the incident log"
Expected output:
A clear root cause, e.g.:
  "CI runner instance was a spot instance that was reclaimed during the apply"
  "Developer confirmed they hit Ctrl+C while running terraform apply locally"
  "GitHub Actions job timed out after 60 minutes — apply was still in progress"
If this fails: If the root cause cannot be determined, document "unknown" and add monitoring for future lock durations.
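The first prevention item (a CI job timeout) can also be enforced in the job script itself where the CI system lacks a native timeout. A sketch using GNU coreutils timeout; the 45-minute cap is an example value to tune to your longest expected apply, and TERM lets Terraform shut down cleanly and release its lock:

```shell
# Cap the apply so a hung run cannot hold the state lock indefinitely.
timeout --signal=TERM 45m terraform apply -auto-approve -input=false
rc=$?
case $rc in
  0)   echo "apply completed" ;;
  124) echo "apply timed out after 45 minutes; Terraform shut down cleanly" ;;
  *)   echo "apply failed (exit $rc)" ;;
esac
```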

Verification

# Confirm the issue is resolved
terraform plan 2>&1 | head -5
Success looks like: terraform plan runs without any "Error acquiring the state lock" message and completes successfully. If still broken: Escalate — see below.

Escalation

  • Not resolved in 30 min → page Platform/Infra on-call: "P2: Terraform state lock stuck for >30 min, force-unlock not working, backend may be degraded"
  • Backend appears down → page Cloud/Infra on-call: "Terraform state backend () is unreachable — all Terraform operations blocked"
  • Security incident → page Security on-call: "Security incident: unauthorized terraform apply in progress on — need immediate intervention"
  • State file may be corrupted → page Platform/Infra on-call: "Terraform state file may be corrupted after a failed apply — need expert review before proceeding"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Add CI job timeouts to cap the maximum time any Terraform job can hold a lock
  • Consider adding a "lock watchdog" alert: if any lock is held for >60 min, page the on-call
  • Review if the orphaned lock caused a partial apply — run terraform plan to check for drift
  • Document the root cause in the team's runbook learnings

Common Mistakes

  1. Force-unlocking without verifying no active apply is running: This is the most dangerous mistake — if an apply is in progress and you force-unlock, two concurrent applies can run simultaneously and corrupt the state file. Always check CI and ask colleagues first.
  2. Not checking CI system for concurrent runs: Automated CI pipelines often trigger multiple runs. A lock held by a CI job that is still running should not be force-unlocked.
  3. Running terraform apply without a plan first after force-unlock: After a force-unlock, always run terraform plan first to understand the state before applying anything.
  4. Ignoring the lock's age: A 5-minute-old lock may be an active apply. A 3-hour-old lock is certainly orphaned. Use the "Created" timestamp to calibrate your urgency.
  5. Not investigating why the lock was orphaned: Force-unlocking is a workaround. Without fixing the root cause (CI timeout, spot instance interruption, etc.), the same orphaned lock will recur.

Cross-References

  • Topic Pack: training/library/topic-packs/cloud-terraform/ (deep background on Terraform state management)
  • Related Runbook: drift-detection.md — if a partial apply left infrastructure in a drifted state
  • Related Runbook: capacity-limit.md — if the terraform apply was failing due to quota limits and the runner timed out
