On-Call Survival: Cloud/Infrastructure

Print this. Pin it. Read it at 3 AM.


Alert: Cloud Provider Partial Outage

Severity: P1 (affecting your region/services)

First command:

# Check provider status pages:
# AWS:    https://health.aws.amazon.com/health/status
# GCP:    https://status.cloud.google.com
# Azure:  https://status.azure.com
# Check your region specifically — not all regions are affected equally.
What you're looking for: Is your region/AZ listed? Which services are affected (EC2, RDS, EKS, etc.)?

Decision tree:

Is your specific service listed on the status page?
├── Yes  This is a provider incident, not your code.
          Open an incident: "Cloud provider outage affecting <service> in <region>"
          Notify stakeholders. Do NOT attempt to fix cloud-side issues.
          Check failover options: can you serve from another region/AZ?
└── No  Provider says healthy but you're seeing issues?
    ├── AZ-specific?  Check whether your resources are spread across AZs.
    │     kubectl get nodes -o wide  (check zones)
    │     If single-AZ deployment: escalate to infra for cross-AZ failover.
    └── Could be your configuration, not the provider.
          Check recent Terraform or infra changes: git log --oneline -- devops/terraform/ | head -5
          Escalate: "Provider reports healthy but <service> degraded in <region>: <symptoms>"
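The AZ-spread check above can be sketched as a one-liner. The node names and zones below are hypothetical sample data; in a live cluster you would pipe real `kubectl get nodes` output instead (and adjust the awk column to wherever the zone appears in your output):

```shell
# Simulated `kubectl get nodes` output (NAME STATUS ZONE); hypothetical data.
nodes=$(cat <<'EOF'
node-a Ready us-east-1a
node-b Ready us-east-1a
node-c Ready us-east-1b
EOF
)
# Count nodes per zone; a single line of output means a single-AZ deployment.
echo "$nodes" | awk '{print $3}' | sort | uniq -c
```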

Escalation trigger: Provider outage lasting > 15 min; no available failover; SLA breach imminent.

Safe actions: Check status pages, check your own resource distribution — read-only.

Dangerous actions: Failover to another region (significant traffic impact), change DNS.


Alert: Terraform Drift / Infrastructure Mismatch

Severity: P2

First command:

cd /workspace/grokdevops/devops/terraform
terraform plan -out=tfplan 2>&1 | tail -40
What you're looking for: Plan: X to add, Y to change, Z to destroy. Changes to destroy are highest risk.

Decision tree:

Are there resources marked for destruction?
├── Yes  STOP. Do not apply.
         Is the destruction expected (e.g., renamed resource)?
         ├── Yes  Proceed with explicit sign-off from infra lead.
         └── No   Drift is unintended: someone changed infra outside Terraform.
                   Identify: compare state vs actual in cloud console.
                   Escalate: "terraform plan shows destroy of <resource>, investigation needed"
└── No  Changes only?
    ├── Small change (tag update, timeout tweak)  Apply with peer review.
    └── Large or unclear change  Escalate. Don't apply alone at 3 AM.
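The destroy check at the top of the tree can be scripted as a guard. The plan summary string below is a hypothetical example; in practice you would capture real output, e.g. `terraform plan -no-color 2>&1 | grep '^Plan:'`:

```shell
# Hypothetical `terraform plan` summary line; replace with real plan output.
summary="Plan: 2 to add, 1 to change, 3 to destroy."
# Extract the destroy count; treat a missing match as zero.
destroys=$(echo "$summary" | sed -n 's/.* \([0-9][0-9]*\) to destroy.*/\1/p')
if [ "${destroys:-0}" -gt 0 ]; then
  echo "STOP: plan destroys $destroys resource(s); do not apply without sign-off"
fi
```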

Escalation trigger: Destructive changes in plan; state file locked or corrupted; changes to networking (VPC, subnets, security groups).

Safe actions: terraform plan — read-only. Never terraform apply for destructive changes alone.

Dangerous actions: terraform apply with destroys, terraform state rm, terraform force-unlock.


Alert: Cloud Resource Capacity Limit Hit

Severity: P1 (can't scale) / P2 (headroom < 20%)

First command:

# AWS EC2: check service quotas
aws service-quotas list-service-quotas --service-code ec2 --query 'Quotas[?UsageMetric!=`null`].[QuotaName,Value]'
# Kubernetes: are new nodes not joining?
kubectl get nodes
kubectl describe nodes | grep -E "Allocatable|Allocated"
What you're looking for: Which quota is exhausted (vCPUs, instances, EIPs, etc.); whether cluster autoscaler is blocked.

Decision tree:

Is the cluster autoscaler trying but failing to add nodes?
├── Yes  kubectl logs -n kube-system -l app=cluster-autoscaler | tail -50
         "Quota exceeded"?  Request quota increase in cloud console (takes time).
         Short-term: can you free capacity (remove idle node groups, stop dev nodes)?
└── No  Is it a specific instance type exhausted in the AZ?
    ├── Yes  Can you use a different instance type or AZ?
             Edit node group instance type in Terraform / cluster config.
    └── No  General vCPU quota? Request increase AND reduce usage:
             Scale down non-critical workloads if possible.
             Escalate: "vCPU quota exhausted in <region>, quota increase requested, ETA: unknown"
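The P2 headroom threshold above can be checked with quick integer arithmetic. The quota and usage numbers here are made up; in practice they would come from `aws service-quotas` and CloudWatch (or a count of running instances):

```shell
# Hypothetical numbers; pull real values from Service Quotas / CloudWatch.
quota=256   # regional vCPU quota
used=220    # vCPUs currently in use
headroom=$(( (quota - used) * 100 / quota ))
echo "headroom: ${headroom}%"
if [ "$headroom" -lt 20 ]; then
  echo "P2: under 20% headroom; request a quota increase now"
fi
```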

Escalation trigger: Cannot scale to meet demand; quota request takes > 4 hours; prod service degraded due to capacity.

Safe actions: Check quotas and autoscaler logs — read-only.

Dangerous actions: Scaling down workloads, changing instance types in production, modifying autoscaler config.


Alert: Unexpected Cost Spike

Severity: P2 (> 2x normal) / P1 (runaway cost, > 10x)

First command:

# AWS: check Cost Explorer (console) or:
aws ce get-cost-and-usage --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY --metrics "UnblendedCost" --group-by Type=DIMENSION,Key=SERVICE
What you're looking for: Which service has the cost spike (EC2, data transfer, S3, RDS).

Decision tree:

Is it EC2 / compute cost spike?
├── Yes  Runaway autoscaling? kubectl get nodes; check min/max in autoscaler config.
         Is there a crypto-mining or runaway workload? Check CPU on all nodes.
         Remove unauthorized nodes (dangerous; get approval): kubectl delete node <name>
           Note: this evicts pods but does not terminate the cloud instance; terminate it via the provider too.
└── No  Is it data transfer / egress?
    ├── Yes  Large log export? Metrics shipping to wrong region? Check data pipeline.
    └── No  S3 / storage spike?
             List largest objects in a bucket: aws s3 ls --recursive s3://<bucket> | sort -k 3 -rn | head
             Unexpected objects / versions? Escalate to infra for lifecycle policy review.
             Escalate: "Cost spike in <service>: $<amount> vs $<normal>, investigation needed"
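The severity thresholds above (P2 at >2x baseline, P1 at >10x) can be applied mechanically. The dollar figures below are hypothetical stand-ins for real `aws ce get-cost-and-usage` totals:

```shell
# Hypothetical daily totals in whole dollars; real values come from Cost Explorer.
baseline=120
today=310
# Integer ratio scaled by 10 to avoid floating point: 25 means 2.5x baseline.
ratio_x10=$(( today * 10 / baseline ))
if [ "$ratio_x10" -ge 100 ]; then
  echo "P1: runaway cost (>10x baseline)"
elif [ "$ratio_x10" -ge 20 ]; then
  echo "P2: cost spike (>2x baseline)"
fi
```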

Escalation trigger: Cost spike > 5x baseline; potential crypto-mining or compromised workload; cannot identify source.

Safe actions: Read cost reports, check node count, check running workloads — read-only.

Dangerous actions: Delete nodes (evicts pods), remove S3 objects, change autoscaling limits.


Alert: Terraform State Lock Stuck

Severity: P2 (blocking all infra changes)

First command:

cd /workspace/grokdevops/devops/terraform/modules/<module>
terraform plan 2>&1 | grep -i lock
What you're looking for: "Error acquiring the state lock" with a Lock ID and holder info.

Decision tree:

Is the lock held by an active Terraform run?
├── Yes  Wait for it to complete. Check CI for a running terraform job.
         If job appears stuck (> 30 min): cancel the CI job, then force-unlock.
└── No (stale lock, no active run)
     Get lock ID: terraform plan 2>&1 | grep "Lock Info" -A 10
     Force unlock (WITH PEER REVIEW; destructive):
       terraform force-unlock <LOCK_ID>
     Run terraform plan again to verify state is clean.
     Escalate: "State lock held by dead process, force-unlocked at <timestamp>"
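Extracting the Lock ID can be scripted so it isn't retyped by hand at 3 AM. The error text below is a hypothetical sample shaped like Terraform's lock error; in practice `$err` would hold real `terraform plan` output:

```shell
# Hypothetical lock error text (the shape Terraform prints on a held lock).
err=$(cat <<'EOF'
Error: Error acquiring the state lock
Lock Info:
  ID:        b1f2c3d4-0000-1111-2222-333344445555
  Operation: OperationTypeApply
  Who:       ci-runner@build-42
EOF
)
# Pull the first ID field; this is the argument for `terraform force-unlock`.
lock_id=$(echo "$err" | awk '/ID:/ {print $2; exit}')
echo "force-unlock target: $lock_id"
```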

Escalation trigger: Lock held by unknown process; force-unlock fails; state file appears corrupted.

Safe actions: terraform plan (read-only, shows lock info).

Dangerous actions: terraform force-unlock (can corrupt state if run during an active apply), terraform state rm.


Quick Reference

Most Useful Commands

# Check cloud provider status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com

# Terraform plan (what would change)
terraform plan

# List all nodes and their zones
kubectl get nodes -o wide

# Cluster autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# AWS vCPU quota check
aws service-quotas list-service-quotas --service-code ec2 \
  --query 'Quotas[?contains(QuotaName,`vCPU`)].[QuotaName,Value]'

# AWS daily cost by service
aws ce get-cost-and-usage \
  --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE

# Node resource allocation summary
kubectl describe nodes | grep -E "Name:|Allocatable:" -A 5

# Cluster autoscaler status (health, recent scale activity); min/max limits
# live in the autoscaler deployment args or node group config, not here
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Force unlock Terraform state (use carefully)
terraform force-unlock <LOCK_ID>

Escalation Contacts

Situation                             Team                  Channel
Cloud provider outage                 Infra lead            #infra-oncall
Terraform destructive change          Infra lead            #infra-oncall
Quota exhaustion                      Infra + management    #infra-oncall
Cost spike > 5x                       Infra lead + finance  #infra-oncall
Suspected compromise / crypto mining  Security + Infra      #security-incidents

Safe vs Dangerous Actions

Safe (do without asking)      Dangerous (get approval)
Check status pages            terraform apply
terraform plan                terraform force-unlock
Check node count and zones    Scale down workloads
Check autoscaler logs         Change autoscaling limits
Read cost reports             Delete cloud resources
Check quota usage             Failover to another region

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]