Anti-Primer: Cloud Ops Basics

Everything that can go wrong, will — and in this story, it does.

The Setup

A team is managing cloud workloads across multiple environments. The cloud bill has been growing 20% month-over-month, and management wants answers. An engineer is tasked with optimizing costs while simultaneously handling a migration project.

The Timeline

Hour 0: No Cost Alerts

The engineer operates without billing alerts, so nobody notices a cost spike until the monthly invoice arrives. The deadline was looming, and skipping alert setup seemed like the fastest path forward. The result: a misconfigured autoscaler runs 200 instances for three weeks, and the bill comes in at 10x the normal amount.

Footgun #1: No Cost Alerts — operating without billing alerts means nobody notices the cost spike until the monthly invoice arrives; by then a misconfigured autoscaler has run 200 instances for three weeks and the bill is 10x the normal amount.
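The primer's fix is mechanical: alert the first time spend crosses 50%, 80%, and 100% of the expected monthly budget. A minimal sketch of that logic, with hypothetical names (`check_budget`, `fired`) standing in for whatever your cloud provider's budget-alert feature does internally:

```python
def check_budget(spend: float, budget: float, fired: set) -> list:
    """Return the budget thresholds (as percentages) newly crossed by
    current spend. Each threshold fires at most once per billing period."""
    alerts = []
    for pct in (50, 80, 100):
        if spend >= budget * pct / 100 and pct not in fired:
            fired.add(pct)
            alerts.append(pct)
    return alerts

fired = set()
print(check_budget(420.0, 1000.0, fired))   # 42% of budget -> []
print(check_budget(830.0, 1000.0, fired))   # crosses 50% and 80% -> [50, 80]
print(check_budget(1050.0, 1000.0, fired))  # crosses 100% -> [100]
```

Run daily against the billing API, this would have flagged the runaway autoscaler within days instead of weeks.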

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Orphaned Resources

The engineer deletes a service but leaves behind its load balancers, storage volumes, and snapshots. Under time pressure, the team chose speed over caution. The result: orphaned resources costing $3,000/month that nobody can identify or confidently delete.

Footgun #2: Orphaned Resources — deleting a service while leaving behind its load balancers, storage volumes, and snapshots creates $3,000/month of orphaned spend that nobody knows how to attribute or safely remove.
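The primer's countermeasures (tag everything, scan regularly) can be sketched as a scan over an exported resource inventory. This uses plain dicts rather than a real cloud SDK; `find_orphans` and the field names are illustrative assumptions:

```python
def find_orphans(inventory: list) -> list:
    """Flag resources that are untagged (no owner) or detached
    (not attached to any running service) as deletion candidates."""
    orphans = []
    for res in inventory:
        untagged = "owner" not in res.get("tags", {})
        detached = not res.get("attached_to")
        if untagged or detached:
            orphans.append(res["id"])
    return orphans

inventory = [
    {"id": "vol-1", "tags": {"owner": "team-a"}, "attached_to": "i-123"},
    {"id": "vol-2", "tags": {}, "attached_to": None},                 # orphan: untagged, detached
    {"id": "lb-1",  "tags": {"owner": "team-b"}, "attached_to": ""},  # orphan: detached
]
print(find_orphans(inventory))  # ['vol-2', 'lb-1']
```

With an owner tag on every resource, the $3,000/month question "can we delete this?" becomes a question you can route to a specific team.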

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Wrong Region Deployment

The engineer deploys resources in a region far from the users to save on cost. Nobody pushed back because the shortcut looked harmless in the moment. The result: latency increases by 200ms, users experience slow page loads, and engagement metrics drop 15%.

Footgun #3: Wrong Region Deployment — deploying far from users to save money adds 200ms of latency; users experience slow page loads and engagement metrics drop 15%.
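"Measure latency before and after" can be done before committing to a region: weight each candidate's measured latency by where your users actually are, then pick the minimum. The region names and latency figures below are made-up illustrations, not benchmarks:

```python
def weighted_latency(region_latency_ms: dict, user_share: dict) -> float:
    """Average latency weighted by the fraction of users in each geography."""
    return sum(region_latency_ms[geo] * share for geo, share in user_share.items())

user_share = {"us": 0.7, "eu": 0.3}          # where the users are
candidates = {
    "us-east-1":  {"us": 30,  "eu": 90},
    "eu-west-1":  {"us": 95,  "eu": 25},
    "ap-south-1": {"us": 210, "eu": 130},    # cheapest, but far from everyone
}
best = min(candidates, key=lambda r: weighted_latency(candidates[r], user_share))
print(best)  # us-east-1
```

The cheap faraway region loses by a wide margin once user distribution is in the math; the cost saving never survives a 15% engagement drop.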

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Manual Console Changes

The engineer makes infrastructure changes through the cloud console instead of through IaC. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: Terraform drift grows until the next apply overwrites the manual changes and the service goes down.

Footgun #4: Manual Console Changes — making infrastructure changes in the cloud console instead of through IaC lets Terraform drift accumulate until the next apply overwrites the manual changes and takes the service down.
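Drift is just a diff between declared and actual state; that is the comparison `terraform plan` performs before an apply. A conceptual sketch (not Terraform itself, and `detect_drift` is a hypothetical name) showing why the console-edited autoscaler was doomed:

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Diff IaC-declared resources against what is actually running.
    A non-empty result means the next apply will change live infrastructure."""
    drift = {}
    for name, want in declared.items():
        have = actual.get(name)
        if have != want:
            drift[name] = {"declared": want, "actual": have}
    for name in actual.keys() - declared.keys():      # created by hand, unknown to IaC
        drift[name] = {"declared": None, "actual": actual[name]}
    return drift

declared = {"web-asg": {"min": 2, "max": 10}}
actual = {
    "web-asg": {"min": 2, "max": 50},         # bumped in the console during an incident
    "debug-instance": {"type": "m5.large"},   # created by hand, never imported
}
print(sorted(detect_drift(declared, actual)))  # ['debug-instance', 'web-asg']
```

The next apply would silently reset `max` from 50 back to 10, which is exactly the outage in the timeline. Running a drift check in CI (with `terraform plan -detailed-exitcode`, for example) surfaces it before the apply does.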

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

  1. No Cost Alerts
     Consequence: a misconfigured autoscaler ran 200 instances for 3 weeks; the bill is 10x the normal amount.
     Prevented by (primer): set billing alerts at 50%, 80%, and 100% of the expected monthly budget.

  2. Orphaned Resources
     Consequence: orphaned resources cost $3,000/month; nobody knows what they are for or if they can be deleted.
     Prevented by (primer): tag all resources; run regular orphan detection scans; use IaC to manage lifecycle.

  3. Wrong Region Deployment
     Consequence: latency increases by 200ms; users experience slow page loads; engagement metrics drop 15%.
     Prevented by (primer): deploy in regions close to users; use CDN for static content; measure latency before and after.

  4. Manual Console Changes
     Consequence: Terraform drift grows; the next apply overwrites manual changes; the service goes down.
     Prevented by (primer): all changes through IaC; read-only console access for non-emergency situations.

Damage Report

  • Downtime: 2-4 hours of degraded or unavailable service
  • Data loss: Potential, depending on the failure mode and backup state
  • Customer impact: Visible errors, degraded performance, or complete outage for affected users
  • Engineering time to remediate: 8-16 engineer-hours across incident response and follow-up
  • Reputation cost: Internal trust erosion; possible external customer-facing apology

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer's section on cost alerts, they would have learned: set billing alerts at 50%, 80%, and 100% of the expected monthly budget.
  • Footgun #2: If the engineer had read the primer's section on orphaned resources, they would have learned: tag all resources, run regular orphan detection scans, and use IaC to manage resource lifecycle.
  • Footgun #3: If the engineer had read the primer's section on region selection, they would have learned: deploy in regions close to users, use a CDN for static content, and measure latency before and after.
  • Footgun #4: If the engineer had read the primer's section on manual console changes, they would have learned: make all changes through IaC, with read-only console access outside of emergencies.

Cross-References