Remediation: Terraform Apply Fails, State Lock Stuck, Root Cause Is DynamoDB Throttle

Immediate Fix (Observability — Domain C)

The fix has five steps: stop the offending load test, clear the stuck lock, move the load test to its own table, add throttle monitoring, and switch the lock table to on-demand billing.

Step 1: Stop the load test

# Identify the load test
$ aws ecs list-tasks --cluster load-test-cluster --service-name load-test-coordinator
{
    "taskArns": ["arn:aws:ecs:us-east-1:123456789:task/load-test-cluster/abc123"]
}

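# Scale the service to zero first; otherwise the ECS scheduler replaces the stopped task
$ aws ecs update-service --cluster load-test-cluster --service load-test-coordinator --desired-count 0

$ aws ecs stop-task --cluster load-test-cluster --task abc123 --reason "Disrupting terraform state lock table"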

Step 2: Clear the stuck lock

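# Now that throttling has stopped, delete the lock entry directly.
# The S3 backend stores LockID as <bucket>/<key>, with no s3:// prefix.
# delete-item also succeeds when the key does not exist, so the key must match exactly.
$ aws dynamodb delete-item \
    --table-name tf-state-locks \
    --key '{"LockID": {"S": "tf-state-prod/vpc/terraform.tfstate"}}'
# (succeeds now that throttling is gone)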

# Verify Terraform can acquire the lock
$ terraform plan
Acquiring state lock. This may take a few moments...
# (succeeds)
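
If throttling has not fully drained when the delete is attempted, the call can still fail. One way to handle that is to wrap the command in a retry loop with exponential backoff. The sketch below is a minimal illustration, not part of the original procedure: retry_backoff is a hypothetical helper name, and the attempt count and base delay are assumptions to tune.

```shell
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Run the wrapped command; stop as soon as it succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        # Double the wait before the next attempt.
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Example (assumes LOCK_KEY_JSON holds the --key document for the lock item):
#   retry_backoff 5 2 aws dynamodb delete-item \
#       --table-name tf-state-locks --key "$LOCK_KEY_JSON"
```

With 5 attempts and a 2-second base delay, the waits are 2, 4, 8, and 16 seconds, which is usually enough for a provisioned table to recover once the competing load is gone.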

Step 3: Move the load test to its own DynamoDB table

# Create a separate table for the load test
$ aws dynamodb create-table \
    --table-name load-test-coordination \
    --attribute-definitions AttributeName=LockID,AttributeType=S \
    --key-schema AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

# Update the load test configuration to use the new table

Step 4: Add DynamoDB throttle monitoring

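# CloudWatch alarm for throttled requests.
# DynamoDB typically publishes no datapoints while nothing is throttled,
# so missing data is treated as healthy rather than INSUFFICIENT_DATA.
# Note: CloudWatch lists ThrottledRequests under TableName plus Operation;
# if the alarm never receives data, add Name=Operation,Value=PutItem
# or alarm on WriteThrottleEvents instead.
$ aws cloudwatch put-metric-alarm \
    --alarm-name "tf-state-locks-throttled" \
    --metric-name ThrottledRequests \
    --namespace AWS/DynamoDB \
    --dimensions Name=TableName,Value=tf-state-locks \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts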
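
If more than one table needs the same alarm, the command can be wrapped in a small loop. The following is a sketch only: create_throttle_alarms is a hypothetical helper, and the alarm naming scheme (table name plus "-throttled") is an assumption modeled on the alarm name used in this step. It also passes --treat-missing-data notBreaching, on the assumption that a table with no throttling publishes no datapoints.

```shell
# Hypothetical helper: create one ThrottledRequests alarm per table.
# Usage: create_throttle_alarms SNS_TOPIC_ARN TABLE [TABLE...]
create_throttle_alarms() {
    topic_arn=$1
    shift
    for table in "$@"; do
        # Same thresholds as the single-table alarm: any throttle in 5 minutes.
        aws cloudwatch put-metric-alarm \
            --alarm-name "${table}-throttled" \
            --metric-name ThrottledRequests \
            --namespace AWS/DynamoDB \
            --dimensions "Name=TableName,Value=${table}" \
            --statistic Sum \
            --period 300 \
            --evaluation-periods 1 \
            --threshold 1 \
            --comparison-operator GreaterThanOrEqualToThreshold \
            --treat-missing-data notBreaching \
            --alarm-actions "$topic_arn" || return 1
    done
}

# Example:
#   create_throttle_alarms arn:aws:sns:us-east-1:123456789:ops-alerts \
#       tf-state-locks load-test-coordination
```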

Step 5: Switch to on-demand billing for the lock table

$ aws dynamodb update-table \
    --table-name tf-state-locks \
    --billing-mode PAY_PER_REQUEST
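
The billing mode change is applied asynchronously, so it can be worth confirming that it took effect before relying on it. A minimal sketch, assuming the AWS CLI is configured for the right account and region; is_on_demand is a hypothetical helper name. Tables that have always been provisioned may not report a billing mode summary at all, in which case the query prints None and the check fails.

```shell
# Hypothetical helper: succeed only if the table reports on-demand billing.
# Usage: is_on_demand TABLE_NAME
is_on_demand() {
    mode=$(aws dynamodb describe-table \
        --table-name "$1" \
        --query 'Table.BillingModeSummary.BillingMode' \
        --output text) || return 1
    [ "$mode" = "PAY_PER_REQUEST" ]
}

# Example:
#   is_on_demand tf-state-locks && echo "lock table is on-demand"
```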

Verification

Domain A (DevOps Tooling) — Terraform operations work

$ terraform plan
Acquiring state lock. This may take a few moments...
No changes. Your infrastructure matches the configuration.

$ terraform apply -auto-approve
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Domain B (Cloud) — DynamoDB healthy, no throttling

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name ThrottledRequests \
    --dimensions Name=TableName,Value=tf-state-locks \
    --start-time 2026-03-19T11:30:00Z \
    --end-time 2026-03-19T12:00:00Z \
    --period 300 \
    --statistics Sum
{
    "Datapoints": [
        {"Timestamp": "2026-03-19T11:30:00Z", "Sum": 0.0},
        {"Timestamp": "2026-03-19T11:35:00Z", "Sum": 0.0}
    ]
}

Domain C (Observability) — Monitoring in place

$ aws cloudwatch describe-alarms --alarm-names "tf-state-locks-throttled"
{
    "MetricAlarms": [{
        "AlarmName": "tf-state-locks-throttled",
        "StateValue": "OK"
    }]
}

Prevention

  • Monitoring: Add CloudWatch alarms for DynamoDB ThrottledRequests and ConsumedWriteCapacityUnits on all infrastructure tables. Alert immediately on any throttling.

  • Runbook: Terraform state lock DynamoDB tables must have resource policies that restrict write access to CI runner roles only. No other workloads may use these tables.

  • Architecture: Use DynamoDB resource policies or IAM condition keys to restrict which IAM roles can write to the lock table. Keep lock tables on on-demand billing (they are low-traffic but critical). Tag every state lock table with purpose:terraform-state for auditing.
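
The write restriction described above could be expressed as a DynamoDB resource policy that denies write actions to every principal except the CI runner role. The sketch below only generates the policy document; lock_table_policy is a hypothetical helper, both ARNs are placeholders supplied by the caller, and the list of denied actions is an assumption about what counts as a write. Note that a deny of this shape also blocks operators from deleting a stuck lock by hand, so a break-glass role would need to be added to the condition.

```shell
# Hypothetical helper: print a resource policy that denies writes to the
# lock table for every principal except the given CI runner role.
# Usage: lock_table_policy TABLE_ARN CI_ROLE_ARN
lock_table_policy() {
    table_arn=$1
    ci_role_arn=$2
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyWritesExceptCiRunner",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": "${table_arn}",
      "Condition": {
        "ArnNotEquals": { "aws:PrincipalArn": "${ci_role_arn}" }
      }
    }
  ]
}
EOF
}

# Example (TABLE_ARN and CI_ROLE_ARN are assumed to be set by the caller):
#   aws dynamodb put-resource-policy \
#       --resource-arn "$TABLE_ARN" \
#       --policy "$(lock_table_policy "$TABLE_ARN" "$CI_ROLE_ARN")"
```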