The Cloud Bill Surprise

Topics: FinOps, cost allocation, reserved instances, spot, right-sizing, cost optimization
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Basic cloud familiarity helpful

The Mission

The finance team forwards you an AWS bill: $47,000 this month. Last month was $12,000. "Can you explain this?" No deploys. No traffic spikes. No new services. Somehow your cloud spend quadrupled.

Cloud billing is where engineering meets finance, and most engineers have never been taught to think about it. This lesson teaches you where cloud money goes, how to find waste, and how to prevent bill surprises.

Where Cloud Money Goes

Many cloud bills follow a similar pattern. The shares below are illustrative:

Typical AWS bill breakdown:
  40% — Compute (EC2, EKS nodes, Lambda)
  25% — Databases (RDS, ElastiCache, DynamoDB)
  15% — Storage (S3, EBS, snapshots)
  10% — Network (data transfer, NAT Gateway, Load Balancers)
   5% — Everything else (CloudWatch, Secrets Manager, etc.)
   5% — Forgotten resources nobody knows about

The last 5% is the most dangerous — resources that are running, costing money, and serving no purpose.

The Seven Usual Suspects

1. Forgotten resources

# Unattached EBS volumes (paying for disk nobody uses)
aws ec2 describe-volumes --filters "Name=status,Values=available" \
    --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table

# Old snapshots (often the biggest surprise)
aws ec2 describe-snapshots --owner-ids self \
    --query 'Snapshots[?StartTime<`2025-01-01`].[SnapshotId,VolumeSize,StartTime]' \
    --output table

# Stopped instances still paying for EBS
aws ec2 describe-instances --filters "Name=instance-state-name,Values=stopped" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]' --output table

# Unassociated Elastic IPs (since February 2024 AWS bills every public IPv4
# address by the hour, so an address attached to nothing is pure waste)
aws ec2 describe-addresses --query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]'

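Gotcha: EBS snapshots are incremental, and you are billed for the blocks each snapshot actually stores: the first snapshot holds every used block, and each later one holds only the blocks that changed since the previous snapshot. The deltas still add up. A 500GB volume with daily snapshots retained for a year can accumulate terabytes of snapshot storage if the data churns steadily, and deleting a single snapshot frees only the blocks that no other snapshot references. Nobody looks at snapshot costs until the bill arrives.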
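To get a feel for how snapshot storage piles up, here is a rough estimator. It is a sketch under stated assumptions: the 2% daily change rate is hypothetical, the $0.05/GB-month default is an assumed snapshot price that should be checked against current pricing for your region, and the model treats each day's changed blocks as distinct, so it gives an upper bound rather than a prediction.

```python
def snapshot_storage_gb(used_gb: float, daily_change_fraction: float, days_retained: int) -> float:
    """Upper-bound estimate of stored snapshot data in GB.

    The first snapshot stores every used block; each later daily snapshot
    stores only the blocks changed since the previous one.
    """
    if days_retained < 1:
        return 0.0
    baseline = used_gb
    deltas = used_gb * daily_change_fraction * (days_retained - 1)
    return baseline + deltas


def snapshot_monthly_cost(stored_gb: float, price_per_gb_month: float = 0.05) -> float:
    """Monthly cost in USD; the default price is an assumption, not a quote."""
    return stored_gb * price_per_gb_month


stored = snapshot_storage_gb(used_gb=500, daily_change_fraction=0.02, days_retained=365)
print(f"{stored:,.0f} GB stored, about ${snapshot_monthly_cost(stored):,.0f}/month")
```

With these inputs a single 500GB volume ends up holding roughly 4.1 TB of snapshot data after a year. Multiply by the number of volumes in the account to see why the line item surprises people.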

2. Oversized instances

# Check daily average CPU utilization for one instance over a week
# (repeat for each instance ID you want to review)
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123 \
    --start-time 2026-03-15T00:00:00Z --end-time 2026-03-22T00:00:00Z \
    --period 86400 --statistics Average

# If average CPU is <10%, the instance is probably oversized

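An m5.4xlarge ($0.768/hr) averaging 5% CPU could likely run on a t3.medium ($0.0416/hr), which is about 18x cheaper. Across 50 instances, that is roughly $26,000/month in savings. Check memory before downsizing, though: a t3.medium has 4 GiB against the m5.4xlarge's 64 GiB, and CloudWatch does not report memory usage unless the CloudWatch agent is installed.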
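The right-sizing arithmetic above can be sketched as a small helper. The 730 hours per month is an assumption (the average month length); the hourly prices are the ones quoted in the text and will differ by region and over time.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def price_ratio(current_hourly: float, target_hourly: float) -> float:
    """How many times cheaper the target instance type is."""
    return current_hourly / target_hourly


def rightsizing_savings(current_hourly: float, target_hourly: float,
                        instance_count: int, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly savings in USD from moving every instance to the cheaper type."""
    return (current_hourly - target_hourly) * hours * instance_count


ratio = price_ratio(0.768, 0.0416)
savings = rightsizing_savings(0.768, 0.0416, instance_count=50)
print(f"{ratio:.1f}x cheaper, about ${savings:,.0f}/month saved across 50 instances")
```

The result is only as good as the assumption that every instance can actually move to the smaller type, so treat it as a ceiling on the savings.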

3. NAT Gateway data processing

NAT Gateway charges $0.045/GB of data processed (us-east-1 rate), on top of an hourly charge for the gateway itself. A service that sends a lot of traffic through NAT (pulling container images, calling external APIs, shipping logs) can generate surprising bills.

100 nodes × 10GB/day each through NAT = 1,000 GB/day
1,000 GB × $0.045 × 30 days = $1,350/month just for NAT

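Fix: Use VPC endpoints for AWS services. Gateway endpoints for S3 and DynamoDB are free; interface endpoints (ECR, CloudWatch Logs) carry an hourly and per-GB charge that is well below the NAT rate. Pull container images from ECR through those endpoints, and ship logs to CloudWatch the same way.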
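The NAT calculation above, plus the effect of moving part of the traffic off NAT, can be sketched like this. The 70% offload share is a hypothetical figure for illustration, and the sketch assumes the offloaded traffic goes through free gateway endpoints (S3, DynamoDB) so it drops to zero data processing cost.

```python
def nat_processing_cost(nodes: int, gb_per_node_per_day: float,
                        price_per_gb: float = 0.045, days: int = 30) -> float:
    """Monthly NAT Gateway data processing cost in USD (excludes the hourly charge)."""
    return nodes * gb_per_node_per_day * price_per_gb * days


def cost_after_endpoints(nat_cost: float, fraction_offloaded: float) -> float:
    """Remaining NAT cost once a share of the traffic uses free gateway endpoints."""
    if not 0.0 <= fraction_offloaded <= 1.0:
        raise ValueError("fraction_offloaded must be between 0 and 1")
    return nat_cost * (1.0 - fraction_offloaded)


before = nat_processing_cost(nodes=100, gb_per_node_per_day=10)
after = cost_after_endpoints(before, fraction_offloaded=0.7)  # hypothetical share
print(f"before: ${before:,.0f}/month, after: ${after:,.0f}/month")
```

To find the real offload share, VPC Flow Logs grouped by destination are a reasonable place to start.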

4. Cross-AZ data transfer

Traffic between Availability Zones costs $0.01/GB in each direction. Microservices calling each other across AZs generate this charge on every request.

Service A (us-east-1a) → Service B (us-east-1b)
1 KB request × 1,000 req/sec × 86,400 sec × 30 days = ~2.6 TB/month
2,600 GB × $0.01/GB × 2 (both directions) = $52/month for ONE service pair
20 service pairs = $1,040/month

Fix: Use topology-aware routing in Kubernetes to prefer same-AZ traffic. Or accept the cost as the price of AZ redundancy.
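The cross-AZ estimate above can be reproduced with a short function. It is a sketch that assumes decimal units (1 GB = 1,000,000 KB) and counts only the request payload, doubled for the charge in each direction as in the text.

```python
SECONDS_PER_DAY = 86_400


def cross_az_monthly_cost(payload_kb: float, requests_per_sec: float,
                          price_per_gb_each_direction: float = 0.01,
                          days: int = 30) -> float:
    """Monthly cross-AZ transfer cost in USD for one service pair."""
    gb_per_month = payload_kb * requests_per_sec * SECONDS_PER_DAY * days / 1_000_000
    return gb_per_month * price_per_gb_each_direction * 2


one_pair = cross_az_monthly_cost(payload_kb=1, requests_per_sec=1_000)
print(f"one pair: ${one_pair:,.2f}/month, 20 pairs: ${one_pair * 20:,.2f}/month")
```

Larger payloads scale the number linearly, so a service pair exchanging 50 KB responses costs 50 times as much as the 1 KB case.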

5. Dev/staging environments running 24/7

Production needs to run 24/7. Dev and staging don't. But they usually do.

Dev cluster: 3 × m5.xlarge ($0.192/hr × 3 × 24 × 30) = $414/month
                Only used Mon-Fri 9am-6pm = 45 hours/week out of 168
                Potential savings: 73% → $302/month saved

Fix: Auto-shutdown with Lambda or scheduled scaling. Many teams use a "start on PR, stop after merge" pattern for ephemeral environments.
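The dev-cluster numbers above follow from a simple idle-time calculation, sketched here. The 720-hour month matches the 24 × 30 used in the text; the function names are illustrative.

```python
HOURS_PER_WEEK = 168


def idle_fraction(used_hours_per_week: float) -> float:
    """Share of the week the environment is running but unused."""
    return 1.0 - used_hours_per_week / HOURS_PER_WEEK


def scheduled_shutdown_savings(hourly_rate: float, node_count: int,
                               used_hours_per_week: float,
                               hours_per_month: float = 720) -> float:
    """Monthly savings in USD from stopping the nodes outside working hours."""
    always_on_cost = hourly_rate * node_count * hours_per_month
    return always_on_cost * idle_fraction(used_hours_per_week)


idle = idle_fraction(45)
saved = scheduled_shutdown_savings(0.192, node_count=3, used_hours_per_week=45)
print(f"idle {idle:.0%} of the week, about ${saved:,.0f}/month saved")
```

This counts compute only. Attached EBS volumes keep billing while instances are stopped, so the real saving is slightly lower.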

6. Unoptimized S3 storage classes

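S3 Standard costs $0.023/GB/month. S3 Glacier Instant Retrieval costs about $0.004/GB/month, which is 5.75x cheaper (the other Glacier tiers cost less still, but add retrieval delays). If you have 10TB of logs older than 90 days sitting in Standard: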

10TB in Standard: $230/month
10TB in Glacier:  $40/month
Savings: $190/month ($2,280/year)

Fix: S3 Lifecycle policies automatically transition objects between storage classes.
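As a sketch of what such a lifecycle policy looks like, the snippet below builds a rule that moves objects to Glacier after 90 days and prints it as JSON. The `logs/` prefix and the rule ID are hypothetical. The dictionary follows the shape accepted by `aws s3api put-bucket-lifecycle-configuration`, but verify it against the current API reference before applying it to a real bucket.

```python
import json


def glacier_lifecycle_config(prefix: str, days: int = 90,
                             storage_class: str = "GLACIER") -> dict:
    """Lifecycle configuration with one rule that transitions old objects."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}-after-{days}d",  # hypothetical naming
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


def monthly_storage_savings(gb: float, from_price: float, to_price: float) -> float:
    """Monthly savings in USD from moving `gb` to the cheaper class."""
    return gb * (from_price - to_price)


config = glacier_lifecycle_config("logs/")
print(json.dumps(config, indent=2))
print(f"about ${monthly_storage_savings(10_000, 0.023, 0.004):,.0f}/month saved")
```

Glacier classes have minimum storage durations and retrieval fees, so a rule like this fits data that is rarely read back, such as old logs.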

7. The autoscaler that never scaled down

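War Story: A team configured the Kubernetes Horizontal Pod Autoscaler to scale up when CPU exceeded 50%. During a traffic spike, it scaled from 5 to 40 pods. The scale-down policy had a stabilizationWindowSeconds of 300 (5 minutes), which means the HPA scales down only as far as the highest replica recommendation it computed during the last 5 minutes. But the CPU metric had noise: every 4 minutes, CPU briefly touched 51%, and each blip produced a fresh recommendation to stay at 40 pods. The window never contained 5 continuous minutes of low recommendations, so the HPA never scaled down. 40 pods ran for 3 weeks before anyone noticed the AWS bill had tripled.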
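The failure mode in this story can be reproduced with a toy simulation. This is a simplified model, not the real controller: it evaluates once per minute, keeps the CPU series fixed regardless of replica count, and assumes a 10% tolerance band around the target (the Kubernetes default, stated here as an assumption). The function names and the 6% idle CPU figure are made up for illustration.

```python
import math
from collections import deque


def simulate_hpa(cpu_by_minute, replicas, target_cpu=50.0, window_minutes=5,
                 tolerance=0.1, min_replicas=5):
    """Return the replica count after each minute under a scale-down window."""
    # Pre-fill the window as if the autoscaler had just been at `replicas`.
    recent = deque([replicas] * window_minutes, maxlen=window_minutes)
    history = []
    for cpu in cpu_by_minute:
        ratio = cpu / target_cpu
        if abs(ratio - 1.0) <= tolerance:
            desired = replicas  # close enough to target: recommend no change
        else:
            desired = max(min_replicas, math.ceil(replicas * ratio))
        recent.append(desired)
        # Stabilization: act on the highest recommendation in the window.
        replicas = max(recent)
        history.append(replicas)
    return history


def noisy_cpu(minutes, spike_every, idle_cpu=6.0, spike_cpu=51.0):
    """Mostly idle CPU with a brief blip every `spike_every` minutes."""
    return [spike_cpu if m % spike_every == 0 else idle_cpu for m in range(minutes)]


stuck = simulate_hpa(noisy_cpu(60, spike_every=4), replicas=40)
recovers = simulate_hpa(noisy_cpu(60, spike_every=6), replicas=40)
print("blip every 4 min, final replicas:", stuck[-1])
print("blip every 6 min, final replicas:", recovers[-1])
```

In this model the deployment stays at 40 pods whenever the blips arrive more often than the window length, and drops back to 5 as soon as one full window passes without a blip. That suggests two levers worth testing: a window shorter than the noise period, or a smoother metric.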

Reserved Instances and Savings Plans

On-demand pricing is the "rack rate": full price. For steady workloads, committing to 1 or 3 years yields roughly 20-60% savings, depending on the term and payment option.

| Commitment | Discount | Risk |
| --- | --- | --- |
| On-Demand | 0% | None (pay per hour) |
| 1-year RI (no upfront) | ~20-30% | Committed to instance type/region for 1 year |
| 1-year RI (all upfront) | ~30-40% | Cash locked up for 1 year |
| 3-year RI (all upfront) | ~50-60% | Cash locked up for 3 years |
| Savings Plan | ~20-40% | Committed to $/hr spend, flexible instance types |
| Spot instances | 60-90% | Can be interrupted with 2-minute warning |

Mental Model: Reserved Instances are like a gym membership. You pay less per visit if you commit to a year. But if you stop going to the gym, you're still paying. Only reserve what you're confident you'll use for the full term.

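Savings Plans (AWS) are more flexible: you commit to a dollar amount per hour (e.g., "I'll spend at least $10/hr on compute") and get the discount on whatever usage falls under that commitment. A Compute Savings Plan applies to any instance type in any region; an EC2 Instance Savings Plan gives a deeper discount but is tied to one instance family in one region. Savings Plans suit teams that change instance types frequently.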
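The gym-membership trade-off can be made concrete with a break-even calculation. This is a minimal sketch that ignores upfront payment timing and assumes a flat discount; the $0.192/hr rate is the m5.xlarge price used earlier in the lesson, and the 40% discount is one point from the table's range.

```python
HOURS_PER_YEAR = 8_760


def break_even_utilization(discount: float) -> float:
    """Fraction of the term an instance must run for the commitment to pay off."""
    return 1.0 - discount


def yearly_costs(on_demand_hourly: float, discount: float, utilization: float):
    """Return (committed_cost, on_demand_cost) in USD for one year."""
    committed = on_demand_hourly * (1.0 - discount) * HOURS_PER_YEAR  # paid even when idle
    on_demand = on_demand_hourly * utilization * HOURS_PER_YEAR       # paid only when running
    return committed, on_demand


print(f"break-even at {break_even_utilization(0.40):.0%} utilization")
for utilization in (0.5, 0.8):
    committed, on_demand = yearly_costs(0.192, discount=0.40, utilization=utilization)
    cheaper = "commitment" if committed < on_demand else "on-demand"
    print(f"{utilization:.0%} utilized: committed ${committed:,.0f}, "
          f"on-demand ${on_demand:,.0f} -> {cheaper} wins")
```

With a 40% discount the instance has to run more than 60% of the term before the commitment beats on-demand, which is why only steady baseline load is worth reserving.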

Building Cost Visibility

The first step to cost control: know what's spending money.

Tagging (non-negotiable)

# Terraform: every resource gets cost allocation tags
resource "aws_instance" "web" {
  # ...
  tags = {
    Name        = "web-prod-01"
    Environment = "production"
    Team        = "platform"
    Service     = "api"
    CostCenter  = "engineering"
  }
}

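Without tags, your bill is one number. With tags, you can see:
- How much each team spends
- How much each environment costs
- Which service is most expensive

Tags only appear in Cost Explorer after you activate them as cost allocation tags in the Billing console.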
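One way to enforce tagging is a small check that a pipeline can run over a resource inventory. The sketch below is hypothetical glue code: the required keys mirror the Terraform example above, and the `resources` structure is an assumed input format, not the output of any particular tool.

```python
REQUIRED_TAGS = {"Name", "Environment", "Team", "Service", "CostCenter"}


def missing_tags(tags: dict) -> set:
    """Required tag keys that are absent or blank."""
    present = {key for key, value in tags.items() if str(value).strip()}
    return REQUIRED_TAGS - present


def untagged_resources(resources: list) -> dict:
    """Map resource ID -> sorted list of missing tag keys (only for offenders)."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource.get("tags", {}))
        if gaps:
            report[resource["id"]] = sorted(gaps)
    return report


# Hypothetical inventory for illustration.
resources = [
    {"id": "i-0abc123", "tags": {"Name": "web-prod-01", "Environment": "production",
                                 "Team": "platform", "Service": "api",
                                 "CostCenter": "engineering"}},
    {"id": "vol-0def456", "tags": {"Name": "scratch", "Team": ""}},
]
print(untagged_resources(resources))
```

A CI job could fail the build when the report is non-empty, which turns the tagging rule into something enforced rather than remembered.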

Cost anomaly alerts

# AWS Cost Anomaly Detection (managed service)
# Or DIY: compare today's spend to 7-day average
aws ce get-cost-and-usage \
    --time-period Start=2026-03-22,End=2026-03-23 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=TAG,Key=Service

Set an alert when daily spend exceeds 150% of the 7-day average. This catches runaway resources within 24 hours instead of waiting for the monthly bill.
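The 150% rule can be sketched as a comparison against the trailing average. The daily figures are hypothetical, derived from the lesson's opening scenario ($12,000/month is about $400/day), and fetching real numbers from Cost Explorer is left out.

```python
from statistics import mean


def is_cost_anomaly(previous_days: list, today: float, threshold: float = 1.5) -> bool:
    """True when today's spend exceeds `threshold` times the trailing average."""
    if len(previous_days) != 7:
        raise ValueError("expected exactly 7 days of history")
    baseline = mean(previous_days)
    return today > threshold * baseline


last_week = [400, 410, 395, 405, 390, 400, 400]  # hypothetical daily spend in USD
print(is_cost_anomaly(last_week, today=1_567))   # runaway day
print(is_cost_anomaly(last_week, today=450))     # ordinary variation
```

A fixed multiplier is crude for workloads with strong weekly cycles. Comparing against the same weekday in previous weeks is a common refinement.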

Flashcard Check

Q1: AWS bill jumped 4x with no deploys. What do you check first?

Forgotten resources (unattached EBS, old snapshots, stopped instances with volumes). Then oversized instances, NAT gateway charges, and cross-AZ data transfer.

Q2: Reserved Instance vs Savings Plan — which is more flexible?

Savings Plan. You commit to a $/hr spend, and a Compute Savings Plan applies to any instance type in any region. RIs lock you to a specific instance type and region.

Q3: NAT Gateway costs $0.045/GB. How do you reduce this?

VPC endpoints for AWS services (S3, ECR, CloudWatch). Pull container images via ECR within the VPC. Ship logs via endpoints, not through NAT.

Q4: Dev environment runs 24/7 but is only used Mon-Fri 9-6. Savings?

~73% savings by shutting down outside business hours. Use scheduled scaling or ephemeral environments triggered by PRs.

Q5: 10TB of 90-day-old logs in S3 Standard. How much can you save?

Move to Glacier: $230/month → $40/month (~83% savings). Use S3 Lifecycle policies to automate the transition.

Cheat Sheet

Find Waste

| Waste type | AWS CLI check |
| --- | --- |
| Unattached EBS | `aws ec2 describe-volumes --filters Name=status,Values=available` |
| Old snapshots | `` aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[?StartTime<`DATE`]' `` |
| Stopped instances | `aws ec2 describe-instances --filters Name=instance-state-name,Values=stopped` |
| Unused EIPs | `aws ec2 describe-addresses --query 'Addresses[?AssociationId==null]'` |
| Low-CPU instances | CloudWatch CPUUtilization average over 7 days |

Cost Optimization Priorities

Delete forgotten resources (immediate, free)
Right-size instances (weeks, significant savings)
Reserved/Savings Plans (commit, 30-60% savings)
S3 Lifecycle policies (set and forget)
VPC endpoints (reduce NAT costs)
Scheduled environments (dev/staging off-hours)
Spot instances (for fault-tolerant workloads)

Takeaways

Forgotten resources are the #1 waste. Unattached volumes, old snapshots, stopped instances with EBS. They're invisible until the bill arrives.
Right-sizing saves 50-80%. An instance averaging 5% CPU can often move to a type that costs up to 18x less. Check CloudWatch before choosing instance types.
Network costs sneak up. NAT Gateway and cross-AZ data transfer are per-GB charges that grow with traffic. VPC endpoints eliminate most NAT costs.
Tag everything. Without tags, you can't allocate costs to teams or services. Make tags mandatory in Terraform and CI pipelines.
Alert on cost anomalies daily. Don't wait for the monthly bill. A daily spend alert at 150% of average catches problems within 24 hours.

The Terraform State Disaster — when IaC creates resources you lose track of
Deploy a Web App From Nothing — understanding what each infrastructure layer costs