FinOps Footguns

Mistakes that blow up your cloud bill, waste resources, or create cost surprises that hit at the worst time.


1. Reserved instances for the wrong type

You commit to a 3-year Reserved Instance for m5.xlarge in us-east-1a. Then you migrate to Graviton (m6g.xlarge) for better performance. Your RI is stuck — non-convertible, non-refundable. You're paying for an instance type you no longer use.

Fix: Use Savings Plans instead of RIs — they're more flexible. If using RIs, start with 1-year convertible. Review RI utilization monthly.

War story: Teams routinely buy 3-year non-convertible RIs to maximize savings, then migrate to Graviton, change regions, or downsize within a year. The unused RI burns cash for the remaining term. Rule of thumb: never buy commitments first. Right-size and optimize, observe stable usage for 30-60 days, then commit to what remains.
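The stranded-commitment math above is simple to make concrete. A minimal sketch, with an illustrative monthly RI price rather than a current AWS rate:

```python
# Sunk cost of an abandoned Reserved Instance: you owe the remaining
# term whether or not the instance type is still in use.
# The $150/month price is a placeholder, not a real AWS rate.

def stranded_ri_cost(monthly_ri_cost: float, months_used: int,
                     term_months: int = 36) -> float:
    """Money still owed on an RI after you stop using the instance type."""
    remaining_months = term_months - months_used
    return monthly_ri_cost * remaining_months

# Migrate to Graviton 12 months into a 3-year, $150/month m5.xlarge RI:
wasted = stranded_ri_cost(150.0, months_used=12)  # 24 months * $150 = $3,600
```

This is why "commit last" matters: every month of stable, right-sized usage you observe first directly shrinks the worst-case remainder.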


2. Auto-scaling max set too high

Your HPA maxReplicas: 100 seems safe. A traffic spike or a bug triggers scaling. 100 large pods spin up across new nodes. The cluster autoscaler adds 20 nodes. Your cloud bill spikes $10,000 in a day.

Fix: Set realistic maxReplicas based on budget, not just capacity. Add cost alerts. Use cluster autoscaler maxNodeCount as a safety net. Review scaling events daily.
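A quick way to sanity-check a maxReplicas value is to price out the worst case before setting it. The pod density and node price below are assumptions for illustration:

```python
import math

# Worst-case daily spend if the HPA scales all the way to maxReplicas.
# pods_per_node and node_hourly_price are illustrative assumptions.

def worst_case_daily_cost(max_replicas: int, pods_per_node: int,
                          node_hourly_price: float) -> float:
    """Cost of running enough nodes for max_replicas pods for 24 hours."""
    nodes = math.ceil(max_replicas / pods_per_node)
    return nodes * node_hourly_price * 24

# maxReplicas: 100, 5 large pods per node, $2/hour nodes:
print(worst_case_daily_cost(100, 5, 2.0))  # 20 nodes * $2/h * 24h = $960/day
```

If that worst-case number would blow the monthly budget in a few days, the max is set by hope, not by budget.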


3. NAT Gateway data processing charges

Your pods download 500GB of data per day through the NAT Gateway. At $0.045/GB, that's $22.50/day or $675/month just for NAT data processing — before the actual data transfer costs.

Fix: Use VPC endpoints for AWS services (S3, DynamoDB, ECR) to bypass NAT. Pull images from ECR through the VPC endpoint. Monitor NAT Gateway bytes processed.

Gotcha: S3 traffic from private subnets goes through the NAT Gateway by default, even though S3 is in the same region. A Gateway VPC Endpoint for S3 is free and creates a private route that bypasses NAT entirely. One team documented $907 in a single day from 20TB of S3 traffic through the NAT Gateway at $0.045/GB. The VPC endpoint fix took 5 minutes and saved $1,000+/month.
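The charge reduces to a one-line calculation. The rate below is the published us-east-1 NAT data-processing price; the volume is from the example above:

```python
NAT_PROCESSING_PER_GB = 0.045  # us-east-1 NAT Gateway data-processing rate

def monthly_nat_processing(gb_per_day: float, days: int = 30) -> float:
    """NAT data-processing charge alone, before data-transfer costs."""
    return gb_per_day * NAT_PROCESSING_PER_GB * days

print(monthly_nat_processing(500))  # 500 GB/day -> $675/month
```

Run the same function against your NAT Gateway's BytesOutToDestination metric to see what a Gateway Endpoint would save you.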


4. Cross-AZ data transfer adding up silently

Your API server in AZ-a talks to the database in AZ-b. Every request crosses AZ boundaries at $0.01/GB each way. With 100GB/day of cross-AZ traffic, that's $60/month you didn't budget for.

Fix: Use topology-aware routing to prefer same-AZ communication. Place tightly-coupled services in the same AZ. Monitor cross-AZ data transfer in Cost Explorer.
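The same back-of-the-envelope approach works here; the key detail is that cross-AZ transfer is billed on both the sending and receiving side:

```python
CROSS_AZ_PER_GB_EACH_WAY = 0.01  # charged in each direction

def monthly_cross_az(gb_per_day: float, days: int = 30) -> float:
    """Cross-AZ transfer cost, counting both sides of each GB."""
    return gb_per_day * CROSS_AZ_PER_GB_EACH_WAY * 2 * days

print(monthly_cross_az(100))  # 100 GB/day -> $60/month
```

Small per-GB rates like this are exactly the charges that scale silently with traffic, which is why they only show up once you filter Cost Explorer by usage type.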


5. Forgotten dev/test environments

Your team spins up a "temporary" staging environment with 10 nodes. The project ends. Nobody tears it down. It runs for 8 months at $3,000/month. That's $24,000 nobody noticed.

Fix: Tag all environments with owner and expiry. Schedule automatic shutdown for non-production after business hours. Run weekly orphaned resource reports. Set budget alerts per team.

Remember: the "forgotten environment" trifecta is no owner tag, no expiry tag, and no budget alert. Enforce tagging with SCPs or Azure Policy; on AWS, the aws:RequestTag/owner condition key in an IAM policy can reject creation of untagged resources. Automated shutdown of dev/test environments outside business hours can cut non-prod costs by 65%.
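The weekly orphan report is a few lines of logic once you have tag data. A minimal sketch; the resource dicts stand in for whatever your provider's tagging API returns:

```python
from datetime import date

# Flag anything missing the required tags or already past its expiry date.
# Resource dicts are illustrative stand-ins for real API responses.
REQUIRED_TAGS = {"owner", "expiry"}

def flag_orphans(resources: list, today: date) -> list:
    flagged = []
    for r in resources:
        tags = r.get("tags", {})
        if not REQUIRED_TAGS <= tags.keys():
            flagged.append(r["id"])          # missing owner and/or expiry
        elif date.fromisoformat(tags["expiry"]) < today:
            flagged.append(r["id"])          # past its declared expiry
    return flagged

resources = [
    {"id": "i-staging", "tags": {}},                                      # untagged
    {"id": "i-batch", "tags": {"owner": "data", "expiry": "2024-01-01"}}, # expired
    {"id": "i-prod", "tags": {"owner": "web", "expiry": "2099-01-01"}},   # fine
]
print(flag_orphans(resources, date(2024, 6, 1)))  # ['i-staging', 'i-batch']
```

Pipe the flagged list into a Slack message or ticket queue and the $24,000 staging environment gets noticed in week one, not month eight.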


6. Oversized database instances

You provision db.r5.4xlarge for production RDS "just in case." Average CPU sits at 5%, and the average connection count is 12 out of a 1,000-connection limit. You're paying $2,000/month for a database that could run on a $200/month instance.

Fix: Right-size based on actual metrics, not anticipated load. Monitor CPU, memory, connections, and IOPS for 2 weeks before choosing the instance type. Use AWS Compute Optimizer recommendations.
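One way to turn observed metrics into a size choice is a simple headroom rule: smallest instance whose capacity covers observed peak plus a buffer. The size table below is illustrative, not an AWS catalog:

```python
# vCPU counts per size, smallest first. Illustrative, not an AWS catalog.
SIZES = {"db.r5.large": 2, "db.r5.xlarge": 4,
         "db.r5.2xlarge": 8, "db.r5.4xlarge": 16}

def right_size(peak_vcpus_used: float, headroom: float = 0.3) -> str:
    """Smallest size covering observed peak plus a headroom buffer."""
    needed = peak_vcpus_used * (1 + headroom)
    for name, vcpus in sorted(SIZES.items(), key=lambda kv: kv[1]):
        if vcpus >= needed:
            return name
    return max(SIZES, key=SIZES.get)  # nothing big enough: take the largest

# 5% average CPU on 16 vCPUs suggests a peak around 2 vCPUs:
print(right_size(2.0))  # db.r5.xlarge (2 * 1.3 = 2.6 -> needs 4 vCPUs)
```

In practice you would run this over the p99 of two weeks of CloudWatch data, not a single number, and check memory, connections, and IOPS the same way.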


7. EBS snapshots accumulating forever

You take daily EBS snapshots. Your retention policy keeps all of them. After 2 years, you have 730 snapshots per volume. Each incremental snapshot is small, but the total adds up to terabytes at $0.05/GB/month.

Fix: Set retention policies (keep 7 daily, 4 weekly, 12 monthly). Use AWS Backup with lifecycle policies. Monitor snapshot costs in Cost Explorer.
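The 7/4/12 retention policy above can be sketched as a pure function over snapshot dates (taking Mondays as the weekly snapshot and the 1st of the month as the monthly one, as an assumed convention):

```python
from datetime import date, timedelta

# Grandfather-father-son retention: keep 7 daily, 4 weekly (Mondays),
# and 12 monthly (the 1st) snapshots; everything else is deletable.

def keep(snapshots: list, today: date) -> set:
    daily = {d for d in snapshots if (today - d).days < 7}
    weekly = sorted((d for d in snapshots if d.weekday() == 0), reverse=True)[:4]
    monthly = sorted((d for d in snapshots if d.day == 1), reverse=True)[:12]
    return daily | set(weekly) | set(monthly)

# Two years of daily snapshots, as in the example above:
two_years = [date(2023, 1, 1) + timedelta(days=i) for i in range(730)]
kept = keep(two_years, today=date(2024, 12, 31))
print(len(two_years), "->", len(kept))  # 730 snapshots shrink to about 20
```

The difference between 730 retained snapshots and about 20 is the difference between terabytes of snapshot storage and a rounding error.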


8. Requests and limits wildly mismatched

You set requests: cpu: 2 and limits: cpu: 8 on all pods. The scheduler reserves 2 CPUs per pod but the pod can burst to 8. With 10 pods, the scheduler thinks it needs 20 CPUs but actual usage could be 80. You over-provision nodes to handle the burst that rarely happens.

Fix: Set requests close to actual average usage. Set limits at peak usage. Monitor container_cpu_usage_seconds_total vs requests. Use VPA in recommendation mode.
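The gap between what the scheduler reserves and what pods can actually burst to is easy to quantify, using the numbers from the example above:

```python
# Overcommit ratio: burst capacity the scheduler never accounts for.

def overcommit(pods: int, request_cpu: float, limit_cpu: float):
    """Return (reserved CPUs, burstable CPUs, overcommit ratio)."""
    reserved = pods * request_cpu   # what the scheduler plans around
    burst = pods * limit_cpu        # what the pods can actually consume
    return reserved, burst, burst / reserved

print(overcommit(10, 2, 8))  # (20, 80, 4.0): a 4x gap the scheduler never sees
```

A ratio near 1 means honest requests; a ratio of 4 means you are either over-provisioning nodes for a burst that rarely happens, or risking CPU throttling when it does.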


9. Logging everything at DEBUG level in production

Your application logs every request body, every SQL query, and every cache lookup at DEBUG level. CloudWatch ingestion charges: $0.50/GB. You're generating 100GB/day of logs. That's $1,500/month just for log ingestion.

Fix: Use INFO level in production. Log DEBUG only when actively debugging. Set log retention policies (7-30 days for most logs). Sample high-volume logs.
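One sampling approach is to keep all DEBUG lines for a deterministic fraction of requests, so any sampled request remains fully traceable end to end. The hashing scheme here is one possible design, not a standard:

```python
import hashlib

def should_log_debug(request_id: str, sample_rate: int = 100) -> bool:
    """Deterministically keep DEBUG logs for ~1/sample_rate of requests.

    Hashing the request ID means every DEBUG line for a sampled request
    is kept together, unlike random per-line drops.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0

kept = sum(should_log_debug(f"req-{i}") for i in range(10_000))
print(kept)  # close to 100 of 10,000 requests keep their DEBUG logs
```

At a 1% sample rate, the $1,500/month ingestion bill for DEBUG volume drops to roughly $15, while you still get complete traces for the sampled requests.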


10. Spot instance interruption without graceful handling

You run batch jobs on spot instances to save 70%. The instances get terminated with two minutes' notice. Your job loses 3 hours of progress and restarts from scratch, consuming more total compute hours than on-demand would have.

Fix: Use checkpointing for long-running jobs. Handle the spot termination notice (check the instance metadata endpoint). Design jobs to be resumable, not restartable.

Under the hood: AWS posts a spot interruption notice to instance metadata (http://169.254.169.254/latest/meta-data/spot/instance-action) two minutes before termination. Your job must checkpoint and exit gracefully within that window. If you cannot checkpoint that fast, use mixed Spot + On-Demand groups so critical work finishes on on-demand capacity.
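A minimal sketch of a resumable job, assuming a local JSON checkpoint file (the path and work units are illustrative; in production you would also poll the metadata endpoint above and checkpoint immediately on notice):

```python
import json
import os

CHECKPOINT = "progress.json"  # illustrative path; use durable storage (e.g. S3)

def process(item) -> None:
    pass  # placeholder for the real unit of work

def load_checkpoint() -> int:
    """Resume from the last persisted position, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_item"]
    return 0

def run(items: list, checkpoint_every: int = 10) -> None:
    start = load_checkpoint()  # a replacement instance picks up here
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % checkpoint_every == 0:
            with open(CHECKPOINT, "w") as f:
                json.dump({"next_item": i + 1}, f)

run(list(range(100)))  # an interrupted rerun loses at most 10 items of work
```

The design goal is "resumable, not restartable": an interruption now costs one checkpoint interval of compute instead of the whole run, which is what makes the 70% spot discount actually net positive.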