On-Call Survival: Cloud/Infrastructure¶
Print this. Pin it. Read it at 3 AM.
Alert: Cloud Provider Partial Outage¶
Severity: P1 (affecting your region/services)
First command:
# Check provider status pages:
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
# Azure: https://status.azure.com
# Check your region specifically — not all regions are affected equally.
Decision tree:
Is your specific service listed on the status page?
├── Yes → This is a provider incident, not your code.
│ → Open an incident: "Cloud provider outage affecting <service> in <region>"
│ → Notify stakeholders. Do NOT attempt to fix cloud-side issues.
│ → Check failover options: can you serve from another region/AZ?
└── No → Provider says healthy but you're seeing issues?
├── AZ-specific? → Check if your resources are spread across AZs.
│ kubectl get nodes -L topology.kubernetes.io/zone (zones are node labels; -o wide does not show them)
│ If single-AZ deployment: escalate to infra for cross-AZ failover.
└── Could be your configuration, not the provider.
→ Check recent Terraform or infra changes: git log --oneline devops/terraform/ | head -5
→ Escalate: "Provider reports healthy but <service> degraded in <region>: <symptoms>"
Escalation trigger: Provider outage lasting > 15 min; no available failover; SLA breach imminent.
Safe actions: Check status pages, check your own resource distribution — read-only.
Dangerous actions: Failover to another region (significant traffic impact), change DNS.
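The single-AZ check in the tree above can be sketched as a tiny parser over the node listing. This is a sketch fed a canned sample (the node names and zones are made up); it assumes the zone appears as the last column, as produced by `kubectl get nodes -L topology.kubernetes.io/zone`:

```shell
#!/usr/bin/env bash
# Sketch: count distinct zones in a node listing and warn on single-AZ.
# Assumes zone is the LAST column (kubectl get nodes -L topology.kubernetes.io/zone).
count_zones() {
  # skip the header row, take the last field (zone), count unique values
  tail -n +2 | awk '{print $NF}' | sort -u | wc -l
}

# Hypothetical output — in practice: kubectl get nodes -L topology.kubernetes.io/zone | count_zones
sample='NAME     STATUS  ROLES   AGE  VERSION   ZONE
node-a   Ready   <none>  10d  v1.29.0   us-east-1a
node-b   Ready   <none>  10d  v1.29.0   us-east-1a'

zones=$(printf '%s\n' "$sample" | count_zones)
if [ "$zones" -lt 2 ]; then
  echo "WARNING: single-AZ deployment ($zones zone) — escalate for cross-AZ failover"
else
  echo "OK: spread across $zones zones"
fi
```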
Alert: Terraform Drift / Infrastructure Mismatch¶
Severity: P2
First command:
terraform plan
What you're looking for: Plan: X to add, Y to change, Z to destroy. Changes to destroy are the highest risk.
Decision tree:
Are there resources marked for destruction?
├── Yes → STOP. Do not apply.
│ Is the destruction expected (e.g., renamed resource)?
│ ├── Yes → Proceed with explicit sign-off from infra lead.
│ └── No → Drift is unintended — someone changed infra outside Terraform.
│ Identify: compare state vs actual in cloud console.
│ Escalate: "terraform plan shows destroy of <resource>, investigation needed"
└── No → Changes only?
├── Small change (tag update, timeout tweak) → Apply with peer review.
└── Large or unclear change → Escalate. Don't apply alone at 3 AM.
Escalation trigger: Destructive changes in plan; state file locked or corrupted; changes to networking (VPC, subnets, security groups).
Safe actions: terraform plan — read-only. Never terraform apply for destructive changes alone.
Dangerous actions: terraform apply with destroys, terraform state rm, terraform force-unlock.
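The "are there destroys?" gate above can be automated by parsing the plan summary line. A minimal sketch, fed a canned summary (in practice pipe `terraform plan -no-color` into it); the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: extract the destroy count from a terraform plan summary line.
destroy_count() {
  grep -Eo 'Plan: [0-9]+ to add, [0-9]+ to change, [0-9]+ to destroy' \
    | grep -Eo '[0-9]+ to destroy' | grep -Eo '[0-9]+'
}

# Canned sample — real input: terraform plan -no-color | destroy_count
sample='Plan: 2 to add, 1 to change, 3 to destroy.'
n=$(printf '%s\n' "$sample" | destroy_count)
if [ "${n:-0}" -gt 0 ]; then
  echo "STOP: plan destroys $n resource(s) — get infra-lead sign-off before apply"
fi
```

Wiring this into CI as a gate before `terraform apply` enforces the "STOP. Do not apply." branch mechanically rather than relying on a 3 AM read of the plan.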
Alert: Cloud Resource Capacity Limit Hit¶
Severity: P1 (can't scale) / P2 (headroom < 20%)
First command:
# AWS EC2: check service quotas
aws service-quotas list-service-quotas --service-code ec2 --query 'Quotas[?UsageMetric!=null].[QuotaName,Value]'
# Kubernetes: are new nodes not joining?
kubectl get nodes
kubectl describe nodes | grep -E "Allocatable|Allocated"
Decision tree:
Is the cluster autoscaler trying but failing to add nodes?
├── Yes → kubectl logs -n kube-system -l app=cluster-autoscaler | tail -50
│ "Quota exceeded"? → Request quota increase in cloud console (takes time).
│ Short-term: can you free capacity (remove idle node groups, stop dev nodes)?
└── No → Is it a specific instance type exhausted in the AZ?
├── Yes → Can you use a different instance type or AZ?
│ Edit node group instance type in Terraform / cluster config.
└── No → General vCPU quota? Request increase AND reduce usage:
Scale down non-critical workloads if possible.
Escalate: "vCPU quota exhausted in <region>, quota increase requested, ETA: unknown"
Escalation trigger: Cannot scale to meet demand; quota request takes > 4 hours; prod service degraded due to capacity.
Safe actions: Check quotas and autoscaler logs — read-only.
Dangerous actions: Scaling down workloads, changing instance types in production, modifying autoscaler config.
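The 20% headroom threshold from the severity line can be computed directly from a quota value and current usage. A sketch with made-up numbers (pull real values from Service Quotas and CloudWatch):

```shell
#!/usr/bin/env bash
# Sketch: integer vCPU headroom as a percentage of quota.
headroom_pct() {
  local quota=$1 used=$2
  echo $(( (quota - used) * 100 / quota ))
}

quota=512 used=430   # hypothetical region vCPU quota and current usage
pct=$(headroom_pct "$quota" "$used")
if [ "$pct" -lt 20 ]; then
  echo "P2: vCPU headroom ${pct}% (< 20%) — request quota increase now"
fi
```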
Alert: Unexpected Cost Spike¶
Severity: P2 (> 2x normal) / P1 (runaway cost, > 10x)
First command:
# AWS: check Cost Explorer (console) or:
aws ce get-cost-and-usage --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY --metrics "UnblendedCost" --group-by Type=DIMENSION,Key=SERVICE
Decision tree:
Is it EC2 / compute cost spike?
├── Yes → Runaway autoscaling? kubectl get nodes; check min/max in autoscaler config.
│ Is there a crypto-mining or runaway workload? Check CPU on all nodes.
│ Scale down unauthorized nodes: kubectl drain <name> --ignore-daemonsets, then kubectl delete node <name>
│ (delete alone removes the node object without gracefully evicting pods)
└── No → Is it data transfer / egress?
├── Yes → Large log export? Metrics shipping to wrong region? Check data pipeline.
└── No → S3 / storage spike?
List largest objects in the suspect bucket: aws s3 ls s3://<bucket> --recursive | sort -k 3 -rn | head
Unexpected objects / versions? Escalate to infra for lifecycle policy review.
Escalate: "Cost spike in <service>: $<amount> vs $<normal>, investigation needed"
Escalation trigger: Cost spike > 5x baseline; potential crypto-mining or compromised workload; cannot identify source.
Safe actions: Read cost reports, check node count, check running workloads — read-only.
Dangerous actions: Delete nodes (evicts pods), remove S3 objects, change autoscaling limits.
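The severity thresholds above (> 2x normal → P2, > 10x → P1) can be sketched as a tiny classifier; the dollar amounts below are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: classify today's cost against baseline using the 2x/10x thresholds.
severity() {
  local today=$1 baseline=$2
  if   [ "$today" -ge $(( baseline * 10 )) ]; then echo P1
  elif [ "$today" -ge $(( baseline * 2 )) ];  then echo P2
  else echo OK
  fi
}

severity 5200 480   # → prints P1 (more than 10x the $480 baseline)
```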
Alert: Terraform State Lock Stuck¶
Severity: P2 (blocking all infra changes)
First command:
terraform plan
What you're looking for: "Error acquiring the state lock" with a Lock ID and holder info.
Decision tree:
Is the lock held by an active Terraform run?
├── Yes → Wait for it to complete. Check CI for a running terraform job.
│ If job appears stuck (> 30 min): cancel the CI job, then force-unlock.
└── No (stale lock, no active run)?
→ Get lock ID: terraform plan 2>&1 | grep "Lock Info" -A 10
→ Force unlock (WITH PEER REVIEW — destructive):
terraform force-unlock <LOCK_ID>
→ Run terraform plan again to verify state is clean.
Escalate: "State lock held by dead process, force-unlocked at <timestamp>"
Escalation trigger: Lock held by unknown process; force-unlock fails; state file appears corrupted.
Safe actions: terraform plan (read-only, shows lock info).
Dangerous actions: terraform force-unlock (can corrupt state if run while active apply), terraform state rm.
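Extracting the Lock ID for `terraform force-unlock` can be scripted instead of eyeballed. A sketch fed a canned error block (the UUID is made up); real input is `terraform plan 2>&1`:

```shell
#!/usr/bin/env bash
# Sketch: pull the Lock ID out of a terraform state-lock error message.
lock_id() {
  grep -E '^[[:space:]]*ID:' | awk '{print $2}'
}

# Canned sample — real input: terraform plan 2>&1 | lock_id
sample='Error: Error acquiring the state lock
Lock Info:
  ID:        1a2b3c4d-0000-1111-2222-333344445555
  Operation: OperationTypeApply'

printf '%s\n' "$sample" | lock_id
```

Still get peer review before passing the result to `terraform force-unlock` — the extraction is safe, the unlock is not.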
Quick Reference¶
Most Useful Commands¶
# Check cloud provider status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
# Terraform plan (what would change)
terraform plan
# List all nodes and their zones
kubectl get nodes -L topology.kubernetes.io/zone
# Cluster autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
# AWS vCPU quota check
aws service-quotas list-service-quotas --service-code ec2 \
--query 'Quotas[?contains(QuotaName,`vCPU`)].[QuotaName,Value]'
# AWS daily cost by service
aws ce get-cost-and-usage \
--time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY --metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
# Node resource allocation summary
kubectl describe nodes | grep -E "Name:|Allocatable:" -A 5
# Check autoscaler min/max settings
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Force unlock Terraform state (use carefully)
terraform force-unlock <LOCK_ID>
Escalation Contacts¶
| Situation | Team | Channel |
|---|---|---|
| Cloud provider outage | Infra lead | #infra-oncall |
| Terraform destructive change | Infra lead | #infra-oncall |
| Quota exhaustion | Infra + management | #infra-oncall |
| Cost spike > 5x | Infra lead + finance | #infra-oncall |
| Suspected compromise / crypto mining | Security + Infra | #security-incidents |
Safe vs Dangerous Actions¶
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Check status pages | terraform apply |
| terraform plan | terraform force-unlock |
| Check node count and zones | Scale down workloads |
| Check autoscaler logs | Change autoscaling limits |
| Read cost reports | Delete cloud resources |
| Check quota usage | Failover to another region |