Investigation: Terraform Apply Fails, State Lock Stuck, Root Cause Is DynamoDB Throttle¶
Phase 1: DevOps Tooling Investigation (Dead End)¶
Try to force-unlock:
$ terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890
Do you really want to force-unlock?
Terraform will remove the lock on the remote state.
This will allow local Terraform commands to modify this state, even though it
may be still be in use. Only 'yes' will be accepted to confirm.
Enter a value: yes
Error: Error unlocking the state
Error message: ConditionalCheckFailedException: The conditional request failed
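A ConditionalCheckFailedException means DynamoDB rejected a request whose condition expression did not hold. Terraform's S3 backend builds its locking on exactly this primitive (acquire is a put-if-absent on LockID), and release is conditioned the same way. A minimal local sketch of a conditional delete; the dict-based table and names are illustrative, not the real DynamoDB API:

```python
class ConditionalCheckFailed(Exception):
    """Stands in for DynamoDB's ConditionalCheckFailedException."""

def conditional_delete(table: dict, key: str, expected_info: str) -> None:
    # Delete the item only if it exists and its Info matches what the
    # caller believes it holds; otherwise reject the request, the same
    # way DynamoDB fails a conditional write.
    item = table.get(key)
    if item is None or item["Info"] != expected_info:
        raise ConditionalCheckFailed("The conditional request failed")
    del table[key]

locks = {"s3://tf-state-prod/vpc/terraform.tfstate": {"Info": "lock-A"}}

# Unlocking with metadata that no longer matches the stored item is rejected:
try:
    conditional_delete(locks, "s3://tf-state-prod/vpc/terraform.tfstate", "lock-B")
except ConditionalCheckFailed as e:
    print(e)  # -> The conditional request failed

# With matching metadata the delete goes through:
conditional_delete(locks, "s3://tf-state-prod/vpc/terraform.tfstate", "lock-A")
print(locks)  # -> {}
```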
Even force-unlock fails. Check if the DynamoDB item exists:
$ aws dynamodb get-item \
--table-name tf-state-locks \
--key '{"LockID": {"S": "s3://tf-state-prod/vpc/terraform.tfstate"}}' \
--output json
{
"Item": {
"LockID": {"S": "s3://tf-state-prod/vpc/terraform.tfstate"},
"Info": {"S": "{\"ID\":\"a1b2c3d4-...\",\"Operation\":\"OperationTypeApply\",\"Who\":\"ci-runner@github-actions\",\"Version\":\"1.7.3\",\"Created\":\"2026-03-19T10:58:22.000Z\"}"}
}
}
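The Info attribute is itself a JSON document stored as a string; decoding it shows who holds the lock and since when. A quick sketch, with the values copied (and abbreviated) from the get-item output above:

```python
import json

# Info is a JSON string nested inside the DynamoDB item.
info_attr = (
    '{"ID":"a1b2c3d4-...","Operation":"OperationTypeApply",'
    '"Who":"ci-runner@github-actions","Version":"1.7.3",'
    '"Created":"2026-03-19T10:58:22.000Z"}'
)
info = json.loads(info_attr)
print(info["Who"], info["Created"])
# -> ci-runner@github-actions 2026-03-19T10:58:22.000Z
```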
The lock entry exists. Try to delete it directly:
$ aws dynamodb delete-item \
--table-name tf-state-locks \
--key '{"LockID": {"S": "s3://tf-state-prod/vpc/terraform.tfstate"}}'
An error occurred (ProvisionedThroughputExceededException): The level of configured
provisioned throughput for the table was exceeded. Consider increasing your provisioning
level with the UpdateTable API.
DynamoDB is throttling the request. The table is exceeding its provisioned throughput.
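For context: one write capacity unit covers one standard write per second for an item up to 1 KB, and sustained demand above the provisioned rate (beyond a limited burst allowance) is throttled. That makes the adequacy of 5 WCU easy to sanity-check; the traffic rates below are illustrative assumptions, not measurements:

```python
import math

# Back-of-the-envelope DynamoDB capacity check.
# 1 WCU = 1 standard write/second for an item up to 1 KB.
PROVISIONED_WCU = 5

def required_wcu(writes_per_second: float, item_size_kb: float) -> float:
    # Each standard write consumes ceil(item_size / 1 KB) WCUs.
    return writes_per_second * math.ceil(item_size_kb)

# Terraform lock traffic: a handful of sub-1 KB writes per apply.
print(required_wcu(1, 1))   # -> 1, comfortably inside 5 WCU
# Any writer doing ~10 small writes/second already needs double this table.
print(required_wcu(10, 1))  # -> 10
```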
The Pivot¶
The DynamoDB table is being throttled. This is not a Terraform lock problem — it is a DynamoDB capacity problem. Check the table settings:
$ aws dynamodb describe-table --table-name tf-state-locks \
--query 'Table.{ReadCapacity: ProvisionedThroughput.ReadCapacityUnits, WriteCapacity: ProvisionedThroughput.WriteCapacityUnits, ItemCount: ItemCount}'
{
"ReadCapacity": 5,
"WriteCapacity": 5,
"ItemCount": 1
}
5 RCU and 5 WCU for a table with 1 item. That should be plenty. Why is it throttled?
Phase 2: Cloud Investigation (Root Cause)¶
Check the DynamoDB CloudWatch metrics:
$ aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=tf-state-locks \
--start-time 2026-03-19T10:00:00Z \
--end-time 2026-03-19T11:30:00Z \
--period 300 \
--statistics Sum
{
"Datapoints": [
{"Timestamp": "2026-03-19T10:55:00Z", "Sum": 0.0},
{"Timestamp": "2026-03-19T11:00:00Z", "Sum": 847.0},
{"Timestamp": "2026-03-19T11:05:00Z", "Sum": 1204.0},
{"Timestamp": "2026-03-19T11:10:00Z", "Sum": 1189.0},
{"Timestamp": "2026-03-19T11:15:00Z", "Sum": 1156.0}
]
}
Massive throttling starting at 11:00. But Terraform should only make a few requests. What else is hitting this table?
$ aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ConsumedWriteCapacityUnits \
--dimensions Name=TableName,Value=tf-state-locks \
--start-time 2026-03-19T10:00:00Z \
--end-time 2026-03-19T11:30:00Z \
--period 300 \
--statistics Sum
{
"Datapoints": [
{"Timestamp": "2026-03-19T10:55:00Z", "Sum": 2.0},
{"Timestamp": "2026-03-19T11:00:00Z", "Sum": 1520.0},
{"Timestamp": "2026-03-19T11:05:00Z", "Sum": 1480.0}
]
}
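Converting the Sum datapoints back into per-second rates makes the picture concrete (each Sum covers one 300-second period); the values below are copied from the two metric queries above:

```python
PERIOD_S = 300        # --period 300 from the queries above
PROVISIONED_WCU = 5

def per_second(sum_over_period: float) -> float:
    # CloudWatch returned Sum per 300 s period; divide to get a rate.
    return sum_over_period / PERIOD_S

# Values copied from the CloudWatch output above.
consumed = [2.0, 1520.0, 1480.0]                   # ConsumedWriteCapacityUnits
throttled = [0.0, 847.0, 1204.0, 1189.0, 1156.0]   # ThrottledRequests

for s in consumed:
    print(f"{per_second(s):.2f} WCU/s consumed (ceiling: {PROVISIONED_WCU})")
# -> 0.01, then 5.07 and 4.93: the table is pinned at its 5 WCU ceiling.
print(f"peak throttles: {per_second(max(throttled)):.1f} requests/s")
```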
About 1,500 WCUs consumed over a 5-minute window works out to roughly 5 WCU per second, which means the table is pinned at its provisioned ceiling and every request beyond that is throttled. Check CloudTrail for what is making these requests:
$ aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=tf-state-locks \
--start-time 2026-03-19T11:00:00Z \
--max-items 2000 | jq '.Events[].Username' | sort | uniq -c | sort -rn
1487 "arn:aws:iam::123456789:role/load-test-role"
3 "arn:aws:iam::123456789:role/ci-runner-role"
A load test role is hammering the DynamoDB table. Check what it is actually writing:
$ aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=Username,AttributeValue=arn:aws:iam::123456789:role/load-test-role \
--start-time 2026-03-19T11:00:00Z \
--max-items 3 | jq '.Events[0].CloudTrailEvent' -r | jq '.requestParameters'
{
"tableName": "tf-state-locks",
"item": {"LockID": {"S": "load-test-run-..."}}
}
A load test running in the same AWS account is using the tf-state-locks DynamoDB table as a general-purpose coordination lock table, which it was never designed for. The load test started at 11:00 and is writing to the table far faster than its 5 WCU of provisioned throughput can absorb, consuming all the capacity and causing Terraform's lock operations to be throttled.
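The starvation mechanism can be sketched as a shared pool of per-second write capacity. This is a toy model under the assumption of ~10 load-test writes per second, not DynamoDB's actual admission control (which also has burst credits and adaptive capacity):

```python
# Every writer to a table draws from the same provisioned capacity pool.
CAPACITY = 5            # WCU/s provisioned on tf-state-locks
LOAD_TEST_RATE = 10     # assumed load-test writes per second

terraform_ok = terraform_throttled = 0
for second in range(60):
    tokens = CAPACITY
    # The load test's writes drain the shared pool first...
    for _ in range(LOAD_TEST_RATE):
        if tokens > 0:
            tokens -= 1
    # ...so Terraform's single lock write that second finds it empty.
    if tokens > 0:
        terraform_ok += 1
    else:
        terraform_throttled += 1

print(terraform_ok, terraform_throttled)  # -> 0 60
```

Even though Terraform needs only one write per second, it never gets one: the shared pool is exhausted before its request arrives.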
Domain Bridge: Why This Crossed Domains¶
Key insight: The symptom was a stuck Terraform state lock (devops_tooling), the root cause was DynamoDB throughput exhaustion from a load test sharing the same table (cloud), and the fix requires observability and monitoring changes to detect and prevent resource contention. This pattern is common because shared cloud resources (DynamoDB tables, S3 buckets, IAM roles) can pick up unintended consumers; without resource-level monitoring, contention between unrelated systems goes undetected until one of them fails.
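The monitoring gap here is concrete: an alarm on ThrottledRequests for the lock table would have surfaced this within minutes, instead of via a stuck apply. A sketch of the evaluation logic such an alarm applies (N consecutive datapoints above a threshold), fed with this incident's datapoints; the function and parameter names are illustrative:

```python
def alarm_state(sums, threshold=0, periods=2):
    """ALARM when `periods` consecutive datapoints exceed `threshold`,
    the same evaluation shape a CloudWatch alarm uses."""
    streak = 0
    for s in sums:
        streak = streak + 1 if s > threshold else 0
        if streak >= periods:
            return "ALARM"
    return "OK"

# ThrottledRequests sums from the incident window:
print(alarm_state([0.0, 847.0, 1204.0, 1189.0, 1156.0]))  # -> ALARM
# A healthy table never accumulates a streak:
print(alarm_state([0.0, 0.0, 0.0]))                       # -> OK
```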
Root Cause¶
A load test running in the production AWS account was using the Terraform state lock DynamoDB table (tf-state-locks) as a general-purpose coordination store, writing to it fast enough to exhaust the 5 WCU of provisioned capacity. Terraform's lock acquire, hold, and release operations were all throttled, causing locks to get stuck and force-unlock to fail.