On-Call Survival: Cloud/Infrastructure

Print this. Pin it. Read it at 3 AM.


Alert: Cloud Provider Partial Outage

Severity: P1 (affecting your region/services)

First command:

# Check provider status pages:
# AWS:    https://health.aws.amazon.com/health/status
# GCP:    https://status.cloud.google.com
# Azure:  https://status.azure.com
# Check your region specifically — not all regions are affected equally.
What you're looking for: Is your region/AZ listed? Which services are affected (EC2, RDS, EKS, etc.)?

Decision tree:

Is your specific service listed on the status page?
├── Yes  This is a provider incident, not your code.
          Open an incident: "Cloud provider outage affecting <service> in <region>"
          Notify stakeholders. Do NOT attempt to fix cloud-side issues.
          Check failover options: can you serve from another region/AZ?
└── No  Provider says healthy but you're seeing issues?
    ├── AZ-specific?  Check whether your resources are spread across AZs.
    │     kubectl get nodes -o wide  (check zones)
    │     If single-AZ deployment: escalate to infra for cross-AZ failover.
    └── Could be your configuration, not the provider.
          Check recent Terraform or infra changes: git log --oneline -- devops/terraform/ | head -5
          Escalate: "Provider reports healthy but <service> degraded in <region>: <symptoms>"
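The AZ-spread check above can be sketched as a one-liner. The node names and zones below are hypothetical sample data; in a live cluster you would pipe real `kubectl get nodes` output instead (and adjust the awk column to wherever the zone appears in your output):

```shell
# Simulated `kubectl get nodes` output (NAME STATUS ZONE); hypothetical data.
nodes=$(cat <<'EOF'
node-a Ready us-east-1a
node-b Ready us-east-1a
node-c Ready us-east-1b
EOF
)
# Count nodes per zone; a single line of output means a single-AZ deployment.
echo "$nodes" | awk '{print $3}' | sort | uniq -c
```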

Escalation trigger: Provider outage lasting > 15 min; no available failover; SLA breach imminent.

Safe actions: Check status pages, check your own resource distribution — read-only.

Dangerous actions: Failover to another region (significant traffic impact), change DNS.


Alert: Terraform Drift / Infrastructure Mismatch

Severity: P2

First command:

cd /workspace/grokdevops/devops/terraform
terraform plan -out=tfplan 2>&1 | tail -40
What you're looking for: Plan: X to add, Y to change, Z to destroy. Changes to destroy are highest risk.

Decision tree:

Are there resources marked for destruction?
├── Yes  STOP. Do not apply.
         Is the destruction expected (e.g., renamed resource)?
         ├── Yes  Proceed with explicit sign-off from infra lead.
         └── No   Drift is unintended: someone changed infra outside Terraform.
                   Identify: compare state vs actual in cloud console.
                   Escalate: "terraform plan shows destroy of <resource>, investigation needed"
└── No  Changes only?
    ├── Small change (tag update, timeout tweak)  Apply with peer review.
    └── Large or unclear change  Escalate. Don't apply alone at 3 AM.
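The destroy check at the top of the tree can be scripted as a guard. The plan summary string below is a hypothetical example; in practice you would capture real output, e.g. `terraform plan -no-color 2>&1 | grep '^Plan:'`:

```shell
# Hypothetical `terraform plan` summary line; replace with real plan output.
summary="Plan: 2 to add, 1 to change, 3 to destroy."
# Extract the destroy count; treat a missing match as zero.
destroys=$(echo "$summary" | sed -n 's/.* \([0-9][0-9]*\) to destroy.*/\1/p')
if [ "${destroys:-0}" -gt 0 ]; then
  echo "STOP: plan destroys $destroys resource(s); do not apply without sign-off"
fi
```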

Escalation trigger: Destructive changes in plan; state file locked or corrupted; changes to networking (VPC, subnets, security groups).

Safe actions: terraform plan — read-only. Never terraform apply for destructive changes alone.

Dangerous actions: terraform apply with destroys, terraform state rm, terraform force-unlock.


Alert: Cloud Resource Capacity Limit Hit

Severity: P1 (can't scale) / P2 (headroom < 20%)

First command:

# AWS EC2: check service quotas
aws service-quotas list-service-quotas --service-code ec2 --query 'Quotas[?UsageMetric!=`null`].[QuotaName,Value]'
# Kubernetes: are new nodes not joining?
kubectl get nodes
kubectl describe nodes | grep -E "Allocatable|Allocated"
What you're looking for: Which quota is exhausted (vCPUs, instances, EIPs, etc.); whether cluster autoscaler is blocked.

Decision tree:

Is the cluster autoscaler trying but failing to add nodes?
├── Yes  kubectl logs -n kube-system -l app=cluster-autoscaler | tail -50
         "Quota exceeded"?  Request quota increase in cloud console (takes time).
         Short-term: can you free capacity (remove idle node groups, stop dev nodes)?
└── No  Is it a specific instance type exhausted in the AZ?
    ├── Yes  Can you use a different instance type or AZ?
             Edit node group instance type in Terraform / cluster config.
    └── No  General vCPU quota? Request increase AND reduce usage:
             Scale down non-critical workloads if possible.
             Escalate: "vCPU quota exhausted in <region>, quota increase requested, ETA: unknown"
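The P2 headroom threshold above can be checked with quick integer arithmetic. The quota and usage numbers here are made up; in practice they would come from `aws service-quotas` and CloudWatch (or a count of running instances):

```shell
# Hypothetical numbers; pull real values from Service Quotas / CloudWatch.
quota=256   # regional vCPU quota
used=220    # vCPUs currently in use
headroom=$(( (quota - used) * 100 / quota ))
echo "headroom: ${headroom}%"
if [ "$headroom" -lt 20 ]; then
  echo "P2: under 20% headroom; request a quota increase now"
fi
```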

Escalation trigger: Cannot scale to meet demand; quota request takes > 4 hours; prod service degraded due to capacity.

Safe actions: Check quotas and autoscaler logs — read-only.

Dangerous actions: Scaling down workloads, changing instance types in production, modifying autoscaler config.


Alert: Unexpected Cost Spike

Severity: P2 (> 2x normal) / P1 (runaway cost, > 10x)

First command:

# AWS: check Cost Explorer (console) or:
aws ce get-cost-and-usage --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY --metrics "UnblendedCost" --group-by Type=DIMENSION,Key=SERVICE
What you're looking for: Which service has the cost spike (EC2, data transfer, S3, RDS).

Decision tree:

Is it EC2 / compute cost spike?
├── Yes  Runaway autoscaling? kubectl get nodes; check min/max in autoscaler config.
         Is there a crypto-mining or runaway workload? Check CPU on all nodes.
         Remove unauthorized nodes (dangerous; get approval): kubectl delete node <name>
           Note: this evicts pods but does not terminate the cloud instance; terminate it via the provider too.
└── No  Is it data transfer / egress?
    ├── Yes  Large log export? Metrics shipping to wrong region? Check data pipeline.
    └── No  S3 / storage spike?
             List largest objects in a bucket: aws s3 ls --recursive s3://<bucket> | sort -k 3 -rn | head
             Unexpected objects / versions? Escalate to infra for lifecycle policy review.
             Escalate: "Cost spike in <service>: $<amount> vs $<normal>, investigation needed"
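The severity thresholds above (P2 at >2x baseline, P1 at >10x) can be applied mechanically. The dollar figures below are hypothetical stand-ins for real `aws ce get-cost-and-usage` totals:

```shell
# Hypothetical daily totals in whole dollars; real values come from Cost Explorer.
baseline=120
today=310
# Integer ratio scaled by 10 to avoid floating point: 25 means 2.5x baseline.
ratio_x10=$(( today * 10 / baseline ))
if [ "$ratio_x10" -ge 100 ]; then
  echo "P1: runaway cost (>10x baseline)"
elif [ "$ratio_x10" -ge 20 ]; then
  echo "P2: cost spike (>2x baseline)"
fi
```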

Escalation trigger: Cost spike > 5x baseline; potential crypto-mining or compromised workload; cannot identify source.

Safe actions: Read cost reports, check node count, check running workloads — read-only.

Dangerous actions: Delete nodes (evicts pods), remove S3 objects, change autoscaling limits.


Alert: Terraform State Lock Stuck

Severity: P2 (blocking all infra changes)

First command:

cd /workspace/grokdevops/devops/terraform/modules/<module>
terraform plan 2>&1 | grep -i lock
What you're looking for: "Error acquiring the state lock" with a Lock ID and holder info.

Decision tree:

Is the lock held by an active Terraform run?
├── Yes  Wait for it to complete. Check CI for a running terraform job.
         If job appears stuck (> 30 min): cancel the CI job, then force-unlock.
└── No (stale lock, no active run)
     Get lock ID: terraform plan 2>&1 | grep "Lock Info" -A 10
     Force unlock (WITH PEER REVIEW; destructive):
       terraform force-unlock <LOCK_ID>
     Run terraform plan again to verify state is clean.
     Escalate: "State lock held by dead process, force-unlocked at <timestamp>"
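Extracting the Lock ID can be scripted so it isn't retyped by hand at 3 AM. The error text below is a hypothetical sample shaped like Terraform's lock error; in practice `$err` would hold real `terraform plan` output:

```shell
# Hypothetical lock error text (the shape Terraform prints on a held lock).
err=$(cat <<'EOF'
Error: Error acquiring the state lock
Lock Info:
  ID:        b1f2c3d4-0000-1111-2222-333344445555
  Operation: OperationTypeApply
  Who:       ci-runner@build-42
EOF
)
# Pull the first ID field; this is the argument for `terraform force-unlock`.
lock_id=$(echo "$err" | awk '/ID:/ {print $2; exit}')
echo "force-unlock target: $lock_id"
```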

Escalation trigger: Lock held by unknown process; force-unlock fails; state file appears corrupted.

Safe actions: terraform plan (read-only, shows lock info).

Dangerous actions: terraform force-unlock (can corrupt state if run during an active apply), terraform state rm.


Quick Reference

Most Useful Commands

# Check cloud provider status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com

# Terraform plan (what would change)
terraform plan

# List all nodes and their zones
kubectl get nodes -o wide

# Cluster autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# AWS vCPU quota check
aws service-quotas list-service-quotas --service-code ec2 \
  --query 'Quotas[?contains(QuotaName,`vCPU`)].[QuotaName,Value]'

# AWS daily cost by service
aws ce get-cost-and-usage \
  --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE

# Node resource allocation summary
kubectl describe nodes | grep -E "Name:|Allocatable:" -A 5

# Cluster autoscaler status (health, recent scale activity); min/max limits
# live in the autoscaler deployment args or node group config, not here
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Force unlock Terraform state (use carefully)
terraform force-unlock <LOCK_ID>

Escalation Contacts

Situation                             Team                  Channel
Cloud provider outage                 Infra lead            #infra-oncall
Terraform destructive change          Infra lead            #infra-oncall
Quota exhaustion                      Infra + management    #infra-oncall
Cost spike > 5x                       Infra lead + finance  #infra-oncall
Suspected compromise / crypto mining  Security + Infra      #security-incidents

Safe vs Dangerous Actions

Safe (do without asking)      Dangerous (get approval)
Check status pages            terraform apply
terraform plan                terraform force-unlock
Check node count and zones    Scale down workloads
Check autoscaler logs         Change autoscaling limits
Read cost reports             Delete cloud resources
Check quota usage             Failover to another region

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]