
Runbook: Cloud Capacity Limit Hit

Domain: Cloud/Terraform
Alert: Resource creation failing with a quota/limit error, or a scaling event failing
Severity: P1 (if blocking scaling during an incident), P2 (if blocking new deployments)
Est. Resolution Time: 30-120 minutes
Escalation Timeout: 30 minutes — page if not resolved (quota increases require human action with the cloud provider)
Last Tested: 2026-03-19
Prerequisites: Cloud provider CLI, cloud console access, ability to submit quota increase requests

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
# The quota error appears in the output of the operation that failed (Terraform
# apply, run-instances, a scaling activity), and it names the specific quota —
# e.g. "You have requested more vCPU capacity than your current vCPU limit"
grep -i "LimitExceeded\|QuotaExceeded\|quota" <path-to-terraform-or-deploy-log>
# Note: read-only calls such as `aws ec2 describe-instances` only fail with
# RequestLimitExceeded when the API is being throttled — that is rate limiting,
# not a capacity quota
echo "Read the exact error message — it names the quota that was hit"
If output shows "LimitExceeded" or "QuotaExceeded" with a specific resource type → You've confirmed the limit; note the quota name and proceed to Step 1.
If output shows a different error (insufficient permissions, wrong region, missing VPC) → This is a different problem, not a quota issue — check your Terraform/deployment config.
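The keyword check above can be exercised offline against a saved failure log. A minimal sketch — the log path and its contents are illustrative stand-ins, not real output from your environment:

```shell
# Sketch: extract the quota-related line from a saved deployment log.
# /tmp/deploy.log and its contents are illustrative stand-ins.
cat > /tmp/deploy.log <<'EOF'
Error: creating EC2 Instance: VcpuLimitExceeded: You have requested more
vCPU capacity than your current vCPU limit of 32 allows for the instance
bucket that the specified instance type belongs to.
EOF

# Grab the first line naming a limit/quota error code:
quota_line=$(grep -iE 'LimitExceeded|QuotaExceeded|quota' /tmp/deploy.log | head -1)
echo "$quota_line"
```

The extracted error code (here `VcpuLimitExceeded`) is what you carry into Step 1.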

Step 1: Identify the Exact Quota That Was Hit

Why: Cloud providers have hundreds of different quotas. Knowing the exact quota name allows you to check the current usage, find workarounds, and submit the correct increase request.

# Read the full error message carefully — it will name the quota:
# Examples:
#   AWS: "You have requested more instances (X) than your current instance limit (Y) allows"
#   AWS: "The maximum number of VPCs has been reached"
#   GCP: "Quota 'CPUS_ALL_REGIONS' exceeded. Limit: 24.0, got: 32.0"
#   Azure: "Operation could not be completed as it results in exceeding approved Total Regional vCPUs quota"

# AWS — list all current quotas and their limits for EC2:
aws service-quotas list-service-quotas --service-code ec2 \
  --output json | jq '.Quotas[] | {QuotaName, Value, Adjustable}' | head -50

# Find a specific quota by name (example: vCPU limit):
aws service-quotas list-service-quotas --service-code ec2 \
  --output json | jq '.Quotas[] | select(.QuotaName | contains("vCPU")) | {QuotaName, Value, QuotaCode}'

# GCP — check current quotas:
gcloud compute project-info describe --format=json | jq '.quotas[] | {metric, limit, usage}'

# Azure — check VM quota:
az vm list-usage --location <REGION> --output table | grep -i "vCPU\|cores"
Expected output:
AWS service-quotas output:
  {
    "QuotaName": "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances",
    "Value": 32,
    "QuotaCode": "L-1216C47A",
    "Adjustable": true
  }

GCP quotas output:
  {"metric": "CPUS", "limit": 24, "usage": 24}   ← at the limit
If this fails: If aws service-quotas returns no results, the quota may be a region-specific limit. Add --region <REGION> to the command, or check the AWS console: Service Quotas → Amazon EC2.
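The jq filtering above can be tried against a saved quota dump. A sketch, where /tmp/quotas.json mimics the shape of the `list-service-quotas` response and the values are illustrative:

```shell
# Sketch: filter a saved quota dump for adjustable quotas matching a keyword.
cat > /tmp/quotas.json <<'EOF'
{"Quotas": [
  {"QuotaName": "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances",
   "QuotaCode": "L-1216C47A", "Value": 32, "Adjustable": true},
  {"QuotaName": "EC2-VPC Elastic IPs",
   "QuotaCode": "L-0263D0A3", "Value": 5, "Adjustable": true}
]}
EOF

# Keep only adjustable quotas whose name matches the keyword:
match=$(jq -r '.Quotas[]
               | select(.Adjustable and (.QuotaName | test("Standard")))
               | "\(.QuotaCode) limit=\(.Value)"' /tmp/quotas.json)
echo "$match"
```

The `QuotaCode` this prints is the value you plug into `<QUOTA_CODE>` in later steps.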

Step 2: Check Current Usage vs. the Limit

Why: Before requesting a limit increase, confirm you are actually at the limit (not a configuration error). Also, understanding current usage helps you find resources to free up as an immediate workaround.

# AWS — check current EC2 instance usage by type:
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running,pending,stopping,stopped" \
  --output json | jq '.Reservations[].Instances[] | {InstanceType, State: .State.Name}' | \
  jq -s 'group_by(.InstanceType) | map({type: .[0].InstanceType, count: length})'

# Check a specific service quota's current usage vs limit:
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code <QUOTA_CODE> \
  --output json | jq '.Quota | {QuotaName, Value}'

# GCP — check vCPU usage in a region:
gcloud compute regions describe <REGION> \
  --format="json" | jq '.quotas[] | select(.metric=="CPUS") | {limit, usage}'

# Azure — check current vCPU usage:
az vm list-usage --location <REGION> \
  --output json | jq '.[] | select(.name.value | contains("vCPU")) | {name: .name.localizedValue, current: .currentValue, limit: .limit}'
Expected output:
AWS instance count per type:
  [{"type": "t3.medium", "count": 12}, {"type": "m5.large", "count": 20}]

AWS quota check:
  {"QuotaName": "Running On-Demand Standard instances", "Value": 32}
  → This quota is measured in vCPUs, not instances: multiply each instance
    count by that type's vCPU count (2 for both t3.medium and m5.large) and
    sum. If the total equals 32, you are at the limit.

GCP:
  {"limit": 24, "usage": 24}   ← confirmed at limit
If this fails: If you cannot determine current usage programmatically, check the cloud console quota dashboard: AWS → Service Quotas → EC2; GCP → IAM & Admin → Quotas; Azure → Subscriptions → Usage + quotas.
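Because the Standard-instance quota is counted in vCPUs, instance counts have to be converted. A minimal sketch of that conversion, run against sample JSON shaped like a `describe-instances` response (file and values are illustrative):

```shell
cat > /tmp/instances.json <<'EOF'
{"Reservations": [{"Instances": [
  {"InstanceType": "t3.medium", "State": {"Name": "running"},
   "CpuOptions": {"CoreCount": 1, "ThreadsPerCore": 2}},
  {"InstanceType": "m5.large", "State": {"Name": "running"},
   "CpuOptions": {"CoreCount": 1, "ThreadsPerCore": 2}},
  {"InstanceType": "m5.large", "State": {"Name": "stopped"},
   "CpuOptions": {"CoreCount": 1, "ThreadsPerCore": 2}}
]}]}
EOF

# Sum vCPUs (cores x threads) across running instances only — stopped
# instances do not count against the running-vCPU quota:
total=$(jq '[.Reservations[].Instances[]
             | select(.State.Name == "running")
             | .CpuOptions.CoreCount * .CpuOptions.ThreadsPerCore]
            | add' /tmp/instances.json)
echo "running vCPUs: $total"
```

Compare this total against the quota `Value` from the previous command to confirm you are actually at the limit.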

Step 3: Implement an Immediate Workaround (While Waiting for Quota Increase)

Why: Quota increase requests can take minutes to hours. If the limit is blocking an active incident (scaling needed NOW), a workaround is essential while the increase is being processed.

# Workaround Option A — Use a different AWS region:
# Some quotas are per-region. If us-east-1 is at limit, try us-west-2.
# Check quota in alternate region:
aws service-quotas list-service-quotas --service-code ec2 --region us-west-2 \
  --output json | jq '.Quotas[] | select(.QuotaName | contains("vCPU")) | {QuotaName, Value}'

# Workaround Option B — Use a different instance type:
# Quota limits are often per-instance-family (Standard, High Memory, etc.)
# Check if a different family has quota available:
aws service-quotas list-service-quotas --service-code ec2 \
  --output json | jq '.Quotas[] | select(.QuotaName | contains("Running On-Demand")) | {QuotaName, Value}'

# Workaround Option C — Terminate idle or unused resources to free up quota:
# Find stopped EC2 instances that can be terminated:
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --output json | jq '.Reservations[].Instances[] | {InstanceId, InstanceType, LaunchTime}'

# Terminate specific unused instances (CONFIRM BEFORE RUNNING):
# aws ec2 terminate-instances --instance-ids <INSTANCE_ID_1> <INSTANCE_ID_2>

# Workaround Option D — Right-size: use fewer, larger instances instead of many small ones
# (Depends on your workload — check if this is feasible)
echo "Document which workaround you used and its impact in the incident log"
Expected output:
Option A: quota in alternate region shows headroom (e.g., value: 96 vs 32 in the original region)
Option B: a different instance family has available quota
Option C: a list of stopped instances that can be terminated to free up quota
If this fails: If no workaround is available (all options exhausted), escalate immediately — the business may need to accept reduced capacity while waiting for the quota increase.
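Before picking a workaround, it helps to quantify the gap. A minimal arithmetic sketch — the numbers are placeholders for the values gathered in Steps 1 and 2:

```shell
limit=32    # approved quota limit (from Step 1) — placeholder value
usage=28    # current running vCPUs (from Step 2) — placeholder value
needed=8    # vCPUs the blocked scaling event needs — placeholder value

headroom=$((limit - usage))
shortfall=$((needed - headroom))
if [ "$shortfall" -le 0 ]; then
  echo "fits under the quota: $headroom vCPUs of headroom"
else
  echo "short by $shortfall vCPUs — free that much (Option C) or shift load (Option A/B)"
fi
```

The shortfall tells you how aggressive the workaround has to be: a 4-vCPU gap might be closed by terminating one idle instance, while a large gap points to a second region.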

Step 4: Submit a Quota Increase Request

Why: The quota limit must be raised to support long-term capacity. This requires action in the cloud provider's console and may take time — submit immediately even if a workaround is in place.

# AWS — request quota increase via CLI:
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code <QUOTA_CODE> \
  --desired-value <NEW_LIMIT>

# Or via the AWS Console (often faster for approvals):
# AWS Console → Service Quotas → Amazon EC2 → find the quota → Request quota increase
# Provide business justification in the reason field.

# Check status of submitted request:
aws service-quotas list-requested-service-quota-change-history --service-code ec2 \
  --output json | jq '.RequestedQuotas[] | {QuotaName, DesiredValue, Status, Created}'

# GCP — request quota increase:
# GCP Console → IAM & Admin → Quotas → find the quota → Edit Quotas
# Fill in the new value and business justification.

# Azure — request quota increase:
# Azure Portal → Subscriptions → Usage + quotas → Request increase
# Or: az support tickets create (for VM quota increases)

echo "Note the request ID and expected approval time — some increases are automatic, others require review"
Expected output:
AWS CLI:
  {
    "RequestedQuota": {
      "Id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "Status": "PENDING",
      "QuotaName": "Running On-Demand Standard instances",
      "DesiredValue": 64
    }
  }

Status will transition: PENDING → APPROVED (automatic for small increases) or CASE_OPENED (needs AWS review)
If this fails: If the AWS CLI request fails with "Cannot increase this quota", some quotas can only be increased via a support case. Go to AWS Support → Create case → Service limit increase.
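The status filtering can be rehearsed on a saved copy of the change history. A sketch, where /tmp/requests.json mimics the shape of the change-history response and the IDs/values are illustrative:

```shell
# Sketch: pick out requests that still need attention from saved change history.
cat > /tmp/requests.json <<'EOF'
{"RequestedQuotas": [
  {"Id": "abc123", "QuotaName": "Running On-Demand Standard instances",
   "DesiredValue": 64, "Status": "PENDING"},
  {"Id": "def456", "QuotaName": "EC2-VPC Elastic IPs",
   "DesiredValue": 10, "Status": "APPROVED"}
]}
EOF

# Keep only requests not yet approved (PENDING or opened as a support case):
pending=$(jq -r '.RequestedQuotas[]
                 | select(.Status == "PENDING" or .Status == "CASE_OPENED")
                 | "\(.Id) \(.QuotaName) -> \(.DesiredValue)"' /tmp/requests.json)
echo "$pending"
```

Paste the surviving request IDs into the incident log so the next responder can pick up the follow-up.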

Step 5: Monitor and Unblock the Scaling Event

Why: Once quota is increased (or a workaround is in place), you need to confirm the blocked operation can now proceed.

# After quota increase is approved (or workaround in place), retry the failing operation:
# Terraform:
terraform apply -target=<RESOURCE_TYPE>.<RESOURCE_NAME>

# Kubernetes HPA scale:
kubectl get hpa -n <NAMESPACE>
kubectl describe hpa <HPA_NAME> -n <NAMESPACE> | grep -A5 "Events"

# ASG (Auto Scaling Group):
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name <ASG_NAME> \
  --output json | jq '.Activities[0] | {ActivityId, Description, StatusCode, StatusMessage}'

# Verify the new resources were created (count running/pending instances):
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running,pending" \
  --output json | jq '[.Reservations[].Instances[]] | length'
Expected output:
Terraform apply: "Apply complete! Resources: X added."
Kubernetes HPA: events show "scaled up" rather than "failed to scale"
ASG activity: {"StatusCode": "Successful", "StatusMessage": "..."}
Instance count has increased as expected.
If this fails: If the operation still fails after quota increase approval, the increase may not have propagated yet (AWS can take a few minutes). Wait 5-10 minutes and retry. If it still fails, verify the new quota value is reflected: aws service-quotas get-service-quota --service-code ec2 --quota-code <QUOTA_CODE>.
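Since a quota increase can take a few minutes to propagate, a small retry loop saves manual polling. A sketch — `retry` and the `flaky` demo command are illustrative helpers; substitute your real `terraform apply` or scaling command:

```shell
# Retry a command up to N times with a fixed pause between attempts:
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    echo "attempt $i/$attempts failed; sleeping ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Demo with a stand-in command that fails twice, then succeeds:
rm -f /tmp/tries
flaky() { echo x >> /tmp/tries; [ "$(wc -l < /tmp/tries)" -ge 3 ]; }
retry 5 0 flaky && echo "succeeded on attempt $(wc -l < /tmp/tries | tr -d ' ')"
```

In practice something like `retry 6 60 terraform apply ...` covers the 5-10 minute propagation window mentioned above.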

Step 6: Document the Limit and Add Prevention

Why: Hitting a quota limit silently in a scaling event is a reliability risk. Document the limit and add visibility so you can act before the next incident.

# Add the quota limit as a Terraform variable or output for visibility:
# In your Terraform code (example — add to outputs.tf):
# output "ec2_vcpu_quota_limit" {
#   value       = 64  # current approved limit
#   description = "Current approved vCPU limit in us-east-1 — submit increase request if usage approaches this"
# }

# Add a CloudWatch alarm (AWS) to alert when usage approaches the limit.
# AWS/Usage reports the absolute running-vCPU count, so set --threshold to
# 80% of your approved limit (e.g. 51 for a limit of 64):
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-vCPU-Usage-High" \
  --alarm-description "EC2 vCPU usage approaching quota limit" \
  --namespace "AWS/Usage" \
  --metric-name "ResourceCount" \
  --dimensions Name=Service,Value=EC2 Name=Resource,Value=vCPU Name=Type,Value=Resource Name=Class,Value=Standard/OnDemand \
  --statistic Average \
  --period 300 \
  --threshold 51 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions <SNS_TOPIC_ARN>

echo "Document the limit in team wiki and set an alert at 80% of quota"
Expected output:
CloudWatch alarm created in "OK" state; it will trigger when the running vCPU count reaches the configured threshold (80% of the quota limit).
If this fails: If CloudWatch metric is not available for this quota type, use a Lambda function to periodically check quota usage and publish a custom metric.
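The 80% threshold has to be recomputed whenever the approved limit changes. A one-line arithmetic sketch (the limit value is a placeholder):

```shell
limit=64                           # current approved quota — placeholder value
threshold=$(( limit * 80 / 100 ))  # integer 80% of the limit, for --threshold
echo "alert at $threshold of $limit vCPUs"
```

Keeping this calculation next to the alarm definition (e.g. in Terraform) prevents the threshold drifting out of sync after the next quota increase.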

Verification

# Confirm the issue is resolved
aws service-quotas get-service-quota --service-code ec2 --quota-code <QUOTA_CODE> \
  --output json | jq '.Quota | {QuotaName, Value}'
# Verify the new limit is higher than what you need
Success looks like: Quota increase approved and the new limit value is visible. The previously-failing scaling event or Terraform apply now completes successfully. If still broken: Escalate — see below.

Escalation

Condition → Who to Page → What to Say

  • Not resolved in 30 min (blocking incident) → Platform/Infra on-call → "P1: Quota limit blocking scaling during active incident — need emergency workaround or escalation to cloud provider"
  • Quota increase rejected → Platform/Infra on-call + Engineering Manager → "Cloud provider rejected quota increase request — need management escalation or alternative architecture decision"
  • Security incident → Security on-call → "Security incident: resource creation by unexpected actor is exhausting quota — possible crypto-jacking"
  • No workaround available → Engineering Manager → "Quota limit reached with no workaround — service cannot scale, customer impact possible"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Set quota usage alerts at 70-80% of current limits for all critical quotas
  • Pre-request quota increases proactively for expected growth before hitting limits
  • Document all quota limits and approved values in team wiki
  • Review whether any unused resources can be terminated to free up headroom

Common Mistakes

  1. Submitting a quota increase request without a workaround: Quota increase approvals take time (minutes to hours). If capacity is needed now, implement a workaround (different region, instance type, cleanup) while waiting — don't just wait for the increase.
  2. Not checking all quotas: A vCPU limit is separate from an instance count limit, EIP limit, VPC limit, and security group limit. Fixing one quota may reveal you are also at another limit — check all related quotas together.
  3. Forgetting that quotas are per-region on AWS: Hitting the vCPU limit in us-east-1 does not affect us-west-2. If you have a multi-region deployment, check each region separately.
  4. Not documenting the limit after the incident: Teams regularly re-hit the same quota because the limit was not documented. Add it to your wiki and set an alert at 80% of the limit.
  5. Terminating instances to free up quota without checking if they are in use: A "stopped" instance may be stopped for a reason (DR standby, scheduled maintenance). Confirm before terminating.

Cross-References

  • Topic Pack: training/library/topic-packs/cloud-terraform/ (deep background on cloud quotas and capacity planning)
  • Related Runbook: terraform-state-lock.md — if the terraform apply is stuck due to a lock in addition to quota errors
  • Related Runbook: drift-detection.md — if quota exhaustion led to partial resource creation and resulting drift
  • Related Runbook: ../kubernetes/hpa_not_scaling.md — if the quota limit is preventing Kubernetes HPA from scaling
