Portal | Level: L2: Operations | Topics: FinOps | Domain: DevOps & Tooling
Scenario: Cloud Cost Spike Investigation¶
The Prompt¶
"Our AWS bill jumped 40% this month compared to last month. Engineering says nothing has changed. Finance wants an explanation and a fix by end of week. Where do you start?"
Initial Report¶
Finance email: "AWS bill went from $50K to $70K this month. The increase seems to be in EC2 and EBS. No new services were launched that I know of."
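The reported figures are internally consistent, which is worth confirming before digging in; a quick shell arithmetic check:

```shell
# Sanity-check the reported jump: (70000 - 50000) / 50000 = 40%
last_month=50000
this_month=70000
delta_pct=$(( (this_month - last_month) * 100 / last_month ))
echo "${delta_pct}% increase"   # prints "40% increase"
```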
Constraints¶
- Not an outage: no incident-level time pressure, but management expects answers by end of week.
- Political: Teams may be defensive about their resource usage.
- Need data: Must prove the cause with evidence, not guesswork.
Expected Investigation Path¶
# 1. AWS Cost Explorer — identify the service and time
# Filter by service: EC2, EBS, EKS
# Group by: tag (team/service), linked account, usage type
# Time: daily granularity to find when the spike started
# 2. Check for new/larger instances
# Note: --time-period End is exclusive, so End=2024-02-01 covers all of January
# UnblendedCost matches the invoice; BlendedCost averages rates across linked accounts
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE
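To attribute the increase to a team, the same query can group by a cost-allocation tag instead of a dimension. A sketch, assuming a tag key named `team` has been activated as a cost-allocation tag in the Billing console (the tag key is an assumption, use whatever your org tags with):

```shell
# Assumes a "team" cost-allocation tag is activated in Billing; untagged spend
# shows up in an empty group, which is itself a useful finding
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=team
```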
# 3. Check Kubernetes node counts over time
kubectl get nodes --no-headers | wc -l   # --no-headers so the header row isn't counted
# Compare with last month's baseline (node-count metric in Grafana/CloudWatch, if available)
# 4. Check if cluster autoscaler / Karpenter scaled up
# Cluster autoscaler emits TriggeredScaleUp events; events expire after ~1h by default,
# so check the autoscaler/Karpenter controller logs for older history
kubectl get events -A --field-selector reason=TriggeredScaleUp
# 5. Check for resource request inflation
kubectl top nodes
# resource-capacity is the kube-capacity krew plugin, not a built-in subcommand
kubectl resource-capacity --util --sort cpu.request
# 6. Check for orphaned resources
# Unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
--query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}'
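The unattached-volume list above can be turned into a rough savings estimate by summing sizes; the per-GB price below is an assumption (roughly gp3 in us-east-1) and should be checked against your region and volume types:

```shell
# Total GiB of unattached EBS volumes; multiply by ~$0.08/GB-month (assumed gp3,
# us-east-1 pricing) for a rough monthly cost of keeping them around
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'sum(Volumes[].Size)'
```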
# Unused load balancers
aws elbv2 describe-load-balancers --query 'LoadBalancers[].LoadBalancerArn'
# Cross-reference with active services
# 7. Check for spot instance fallback to on-demand
# Karpenter logs or ASG activity showing capacity type changes
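The capacity-type check can also be approximated directly with the EC2 API. A sketch: the `instance-lifecycle` filter only matches spot (and scheduled) instances, so on-demand is the difference between the two counts.

```shell
# Running spot instances (instance-lifecycle is unset for on-demand)
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running Name=instance-lifecycle,Values=spot \
  --query 'length(Reservations[].Instances[])'
# All running instances, for comparison; a shrinking spot fraction vs. last month
# suggests fallback to on-demand
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'length(Reservations[].Instances[])'
```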
Common Root Causes¶
- Cluster autoscaler added nodes — a service increased resource requests without right-sizing
- Spot instances fell back to on-demand — spot capacity unavailable in the region
- Orphaned EBS volumes — PVCs deleted but volumes retained (reclaimPolicy: Retain)
- New environment spun up — someone created a staging cluster and forgot about it
- Log/metric volume explosion — CloudWatch or S3 costs from increased logging
- Data transfer — cross-AZ or cross-region traffic increase
What a Strong Answer Includes¶
- Structured investigation: start with Cost Explorer, then drill down by service/tag/time
- Multiple hypotheses: don't jump to conclusions
- Tagging importance: "If resources aren't tagged, this investigation takes 10x longer"
- Immediate savings: identify and clean up orphaned resources
- Prevention: implement cost alerts, enforce tagging policies, regular FinOps reviews
- Dashboards: set up Kubecost or OpenCost for Kubernetes-specific cost visibility
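The cost-alert bullet can be made concrete with an AWS Budgets alert. A minimal sketch: the $55K limit (just above the old $50K baseline), the account ID, and the email address are all placeholders to replace before running.

```shell
# Hypothetical: monthly cost budget with an email alert at 80% of actual spend.
# Account ID, amount, and subscriber address are placeholders.
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName":"monthly-aws-spend","BudgetLimit":{"Amount":"55000","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"finops@example.com"}]}]'
```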
Wiki Navigation¶
Related Content¶
- FinOps & Cost Optimization (Topic Pack, L2) — FinOps
- FinOps Drills (Drill, L2) — FinOps
- FinOps Flashcards (CLI) (Flashcard Deck, L1) — FinOps
- Skillcheck: FinOps (Assessment, L2) — FinOps
Pages that link here¶
- Cost Optimization & FinOps - Primer
- FinOps & Cost Optimization
- FinOps & Cost Optimization Drills
- FinOps / Cost Optimization - Skill Check
- Interview Gauntlet: Kubernetes or Simpler Orchestrator?
- Interview Gauntlet: Log Aggregation Pipeline
- Interview Gauntlet: Managed Database or Self-Hosted?
- Interview Gauntlet: Multi-Region Kubernetes Deployment
- Interview Scenarios
- Level 7: SRE & Cloud Operations
- Master Curriculum: 40 Weeks
- Track: Cloud & FinOps