Remediation: Terraform Apply Fails, State Lock Stuck, Root Cause Is DynamoDB Throttle¶
Immediate Fix (Observability — Domain C)¶
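The fix has five steps: stop the offending load test, clear the stuck lock, move the load test onto its own DynamoDB table, add throttle monitoring, and switch the lock table to on-demand billing so the problem does not recur.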
Step 1: Stop the load test¶
# Identify the load test
$ aws ecs list-tasks --cluster load-test-cluster --service-name load-test-coordinator
{
"taskArns": ["arn:aws:ecs:us-east-1:123456789:task/load-test-cluster/abc123"]
}
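# Scale the service to zero first; otherwise the ECS service scheduler replaces the stopped task
$ aws ecs update-service --cluster load-test-cluster --service load-test-coordinator --desired-count 0
$ aws ecs stop-task --cluster load-test-cluster --task abc123 --reason "Disrupting terraform state lock table"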
Step 2: Clear the stuck lock¶
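# Now that throttling has stopped, delete the lock entry directly.
# Confirm first that no Terraform run still holds the lock.
# The S3 backend stores LockID as <bucket>/<key>, with no s3:// prefix;
# delete-item reports success even when the key matches nothing.
$ aws dynamodb delete-item \
--table-name tf-state-locks \
--key '{"LockID": {"S": "tf-state-prod/vpc/terraform.tfstate"}}'
# (succeeds now that throttling is gone)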
# Verify Terraform can acquire the lock
$ terraform plan
Acquiring state lock. This may take a few moments...
# (succeeds)
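Where the lock belongs to a run that has genuinely died, Terraform's own `terraform force-unlock` is an alternative to editing the table by hand. The sketch below is not part of the original runbook. The `run` helper, the `DRY_RUN` switch, and the function names are illustrative assumptions, and the lock ID passed to `release_lock` is the one Terraform prints in its lock error message.

```shell
# Sketch only: helper names and DRY_RUN are illustrative assumptions.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Read the lock item so its Info attribute (operation, user, creation time)
# can be checked before anything is removed.
inspect_lock() {
  table="$1"
  lock_id="$2"
  run aws dynamodb get-item \
    --table-name "$table" \
    --key "{\"LockID\": {\"S\": \"$lock_id\"}}" \
    --consistent-read
}

# Release the lock through Terraform instead of deleting the item directly.
# -force skips the interactive confirmation prompt.
release_lock() {
  lock_uuid="$1"
  run terraform force-unlock -force "$lock_uuid"
}
```

With the default dry run, `inspect_lock tf-state-locks tf-state-prod/vpc/terraform.tfstate` only prints the command it would run. Setting `DRY_RUN=0` executes it against the real table.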
Step 3: Move the load test to its own DynamoDB table¶
# Create a separate table for the load test
$ aws dynamodb create-table \
--table-name load-test-coordination \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
# Update the load test configuration to use the new table
Step 4: Add DynamoDB throttle monitoring¶
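# CloudWatch alarm for throttled requests.
# DynamoDB publishes ThrottledRequests per table and operation, so the alarm must
# name an Operation; PutItem is the call Terraform uses to acquire the lock.
# The metric has no datapoints when nothing is throttled, so missing data is
# treated as not breaching.
$ aws cloudwatch put-metric-alarm \
--alarm-name "tf-state-locks-throttled" \
--metric-name ThrottledRequests \
--namespace AWS/DynamoDB \
--dimensions Name=TableName,Value=tf-state-locks Name=Operation,Value=PutItem \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts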
Step 5: Switch to on-demand billing for the lock table¶
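The original gives no command for this step. A minimal sketch, assuming the lock table is currently in provisioned mode, is to change the billing mode with `update-table`, wait for the table to become active again, and read the mode back. The `run` helper, the `DRY_RUN` switch, and the function name are illustrative assumptions.

```shell
# Sketch only: the run helper and DRY_RUN switch are illustrative assumptions.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

switch_to_on_demand() {
  table="$1"
  # Change the billing mode from provisioned to on-demand.
  run aws dynamodb update-table --table-name "$table" --billing-mode PAY_PER_REQUEST
  # Block until the table reports ACTIVE again.
  run aws dynamodb wait table-exists --table-name "$table"
  # Read back the billing mode to confirm the change.
  run aws dynamodb describe-table --table-name "$table" \
    --query 'Table.BillingModeSummary.BillingMode' --output text
}
```

Running `DRY_RUN=0 switch_to_on_demand tf-state-locks` would apply the change to the lock table used in the earlier steps. Leaving `DRY_RUN` unset only prints the three commands.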
Verification¶
Domain A (DevOps Tooling) — Terraform operations work¶
$ terraform plan
Acquiring state lock. This may take a few moments...
No changes. Your infrastructure matches the configuration.
$ terraform apply -auto-approve
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
Domain B (Cloud) — DynamoDB healthy, no throttling¶
$ aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=tf-state-locks Name=Operation,Value=PutItem \
--start-time 2026-03-19T11:30:00Z \
--end-time 2026-03-19T12:00:00Z \
--period 300 \
--statistics Sum
{
"Datapoints": [
{"Timestamp": "2026-03-19T11:30:00Z", "Sum": 0.0},
{"Timestamp": "2026-03-19T11:35:00Z", "Sum": 0.0}
]
}
Domain C (Observability) — Monitoring in place¶
$ aws cloudwatch describe-alarms --alarm-names "tf-state-locks-throttled"
{
"MetricAlarms": [{
"AlarmName": "tf-state-locks-throttled",
"StateValue": "OK"
}]
}
Prevention¶
- Monitoring: Add CloudWatch alarms for DynamoDB ThrottledRequests and ConsumedWriteCapacityUnits on all infrastructure tables. Alert immediately on any throttling.
- Runbook: Terraform state lock DynamoDB tables must have resource policies that restrict write access to CI runner roles only. No other workloads may use these tables.
- Architecture: Use AWS DynamoDB resource policies or IAM condition keys to restrict which IAM roles can write to the lock table. Consider on-demand billing for lock tables (they are low-traffic but critical). Tag all infrastructure tables with purpose:terraform-state for auditing.
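The write restriction and tagging described above could be sketched as follows. This is one possible shape, not the policy the team actually deployed. The function names, the `run` helper, the `DRY_RUN` switch, and the statement ID are illustrative assumptions, and both ARNs are supplied by the caller.

```shell
# Sketch only: function names, the run helper and DRY_RUN are illustrative assumptions.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print a resource policy that denies item writes to every principal
# except the given CI runner role.
lock_table_policy() {
  table_arn="$1"
  ci_role_arn="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyWritesExceptCiRunner",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": "$table_arn",
      "Condition": {
        "ArnNotEquals": { "aws:PrincipalArn": "$ci_role_arn" }
      }
    }
  ]
}
EOF
}

# Attach the policy to the table, then tag the table for auditing.
protect_lock_table() {
  table_arn="$1"
  ci_role_arn="$2"
  run aws dynamodb put-resource-policy \
    --resource-arn "$table_arn" \
    --policy "$(lock_table_policy "$table_arn" "$ci_role_arn")"
  run aws dynamodb tag-resource \
    --resource-arn "$table_arn" \
    --tags Key=purpose,Value=terraform-state
}
```

A deny this broad also blocks an operator from deleting a stuck lock item by hand. If that path has to stay open, a break-glass role would need to be listed alongside the CI runner role in the `ArnNotEquals` condition.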