Portal | Level: L2: Operations | Topics: FinOps | Domain: DevOps & Tooling

Scenario: Cloud Cost Spike Investigation

The Prompt

"Our AWS bill jumped 40% this month compared to last month. Engineering says nothing has changed. Finance wants an explanation and a fix by end of week. Where do you start?"

Initial Report

Finance email: "AWS bill went from $50K to $70K this month. The increase seems to be in EC2 and EBS. No new services were launched that I know of."

Constraints

  • Not an outage: Nothing is down and no pages are firing, but management wants answers by end of week.
  • Political: Teams may be defensive about their resource usage.
  • Need data: Must prove the cause with evidence, not guesswork.

Expected Investigation Path

# 1. AWS Cost Explorer — identify the service and time
# Filter by service: EC2, EBS, EKS
# Group by: tag (team/service), linked account, usage type
# Time: daily granularity to find when the spike started

# 2. Check for new/larger instances
# Note: the End date is exclusive, so use the first day of the next month
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE
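
Once the daily JSON is in hand, pinpointing the day the spike started is a small scripting exercise. A minimal sketch, assuming the `ResultsByTime`/`Groups`/`Metrics` field names of the real `get-cost-and-usage` response; the dollar figures in the sample are invented:

```python
def daily_totals(results_by_time):
    """Sum the grouped BlendedCost amounts into one total per day."""
    totals = []
    for day in results_by_time:
        amount = sum(float(g["Metrics"]["BlendedCost"]["Amount"])
                     for g in day["Groups"])
        totals.append((day["TimePeriod"]["Start"], amount))
    return totals

def spike_start(totals, threshold=1.5):
    """First day whose cost exceeds `threshold` x the average of prior days."""
    for i in range(1, len(totals)):
        baseline = sum(cost for _, cost in totals[:i]) / i
        if totals[i][1] > threshold * baseline:
            return totals[i][0]
    return None

# Invented sample: steady ~$1.6K/day, then a jump on Jan 15
sample = [{"TimePeriod": {"Start": f"2024-01-{d:02d}"},
           "Groups": [{"Metrics": {"BlendedCost": {"Amount": str(a)}}}]}
          for d, a in [(13, 1600), (14, 1650), (15, 2700), (16, 2750)]]
print(spike_start(daily_totals(sample)))  # -> 2024-01-15
```

Knowing the exact start date narrows the search to deploys, config changes, and scaling events from that day.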

# 3. Check Kubernetes node counts over time
kubectl get nodes --no-headers | wc -l   # --no-headers so the count is exact
# Compare with last month's baseline (e.g. node-count metrics in Prometheus/Grafana)
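
A node-count delta converts directly into a dollar estimate, which tells you whether this hypothesis can even explain the spike. A back-of-envelope sketch with example figures (the $0.384/hr rate is the m5.2xlarge us-east-1 on-demand list price; node counts are invented):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def node_cost_delta(old_nodes, new_nodes, hourly_rate):
    """Monthly on-demand cost change from a node-count change."""
    return (new_nodes - old_nodes) * hourly_rate * HOURS_PER_MONTH

# e.g. cluster grew from 40 to 55 m5.2xlarge nodes at ~$0.384/hr
delta = node_cost_delta(40, 55, 0.384)
print(f"${delta:,.0f}/month")  # 15 nodes * $0.384/hr * 730 h = $4,205/month
```

If the estimate is far below the $20K jump, node growth alone isn't the answer and other hypotheses stay on the table.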

# 4. Check if cluster autoscaler / Karpenter scaled up
# Cluster autoscaler emits TriggeredScaleUp events on pending pods
kubectl get events -A --field-selector reason=TriggeredScaleUp
# For Karpenter, check node/nodeclaim events and the controller logs

# 5. Check for resource request inflation
kubectl top nodes
# resource-capacity is a krew plugin; compares requests with actual utilization
kubectl resource-capacity --util --sort cpu.request
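
The "request inflation" check boils down to comparing what each workload requests with what it actually uses, since the autoscaler provisions for requests, not usage. A sketch with invented numbers; in practice the requests come from `kubectl get pods -o json` and usage from metrics-server or Prometheus:

```python
def overprovisioned(workloads, ratio=3.0):
    """Flag workloads requesting more than `ratio` x their observed CPU usage."""
    return [name for name, (request_m, usage_m) in workloads.items()
            if usage_m > 0 and request_m / usage_m > ratio]

workloads = {               # name: (CPU request in millicores, observed usage)
    "checkout":  (4000, 400),   # 10x over-requested -> autoscaler adds nodes
    "payments":  (1000, 800),
    "reporting": (2000, 300),
}
print(overprovisioned(workloads))  # -> ['checkout', 'reporting']
```

A workload that raised its requests without right-sizing shows up here even though "nothing changed" from the team's point of view.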

# 6. Check for orphaned resources
# Unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}'
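
Pricing out the unattached volumes from that query makes the cleanup case concrete for Finance. A sketch assuming the us-east-1 gp3 list price of $0.08/GiB-month (adjust per region and volume type); the volume sizes are invented:

```python
GP3_PER_GIB_MONTH = 0.08  # us-east-1 gp3 list price, $/GiB-month

def orphan_cost(volumes):
    """Monthly storage cost of a list of {ID, Size-in-GiB} volumes."""
    return sum(v["Size"] for v in volumes) * GP3_PER_GIB_MONTH

orphans = [{"ID": "vol-0abc", "Size": 500},
           {"ID": "vol-0def", "Size": 1000}]
print(f"${orphan_cost(orphans):.2f}/month")  # 1500 GiB * $0.08 = $120.00/month
```

Orphaned volumes rarely explain a $20K jump by themselves, but they are the easiest immediate savings to bank while the investigation continues.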

# Unused load balancers
aws elbv2 describe-load-balancers --query 'LoadBalancers[].LoadBalancerArn'
# Cross-reference with active services

# 7. Check for spot instance fallback to on-demand
# Karpenter logs or ASG activity showing capacity type changes
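
Spot fallback is worth quantifying because the per-node delta is large: spot typically runs 60-90% below on-demand, so even a partial fallback is visible on the bill. A sketch with illustrative prices ($0.384/hr is the m5.2xlarge on-demand list price; the spot rate is an assumed example):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def fallback_delta(node_count, on_demand_rate, spot_rate):
    """Extra monthly spend if `node_count` nodes run on-demand instead of spot."""
    return node_count * (on_demand_rate - spot_rate) * HOURS_PER_MONTH

# e.g. 20 m5.2xlarge nodes: $0.384/hr on-demand vs an assumed $0.12/hr spot
print(f"${fallback_delta(20, 0.384, 0.12):,.0f}/month")  # -> $3,854/month
```

Compare this figure against the Cost Explorer usage-type breakdown (SpotUsage vs BoxUsage) to confirm or rule out the hypothesis.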

Common Root Causes

  1. Cluster autoscaler added nodes — a service increased resource requests without right-sizing
  2. Spot instances fell back to on-demand — spot capacity unavailable in the region
  3. Orphaned EBS volumes — PVCs deleted but volumes retained (reclaimPolicy: Retain)
  4. New environment spun up — someone created a staging cluster and forgot about it
  5. Log/metric volume explosion — CloudWatch or S3 costs from increased logging
  6. Data transfer — cross-AZ or cross-region traffic increase

What a Strong Answer Includes

  • Structured investigation: start with Cost Explorer, then drill down by service/tag/time
  • Multiple hypotheses: don't jump to conclusions
  • Tagging importance: "If resources aren't tagged, this investigation takes 10x longer"
  • Immediate savings: identify and clean up orphaned resources
  • Prevention: implement cost alerts, enforce tagging policies, regular FinOps reviews
  • Dashboards: set up Kubecost or OpenCost for Kubernetes-specific cost visibility
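
The cost-alert idea above reduces to a simple projection: extrapolate month-to-date spend and alert if the projection crosses the budget, which is essentially what AWS Budgets forecast alerts do. A minimal sketch of that logic with example figures:

```python
def projected_overspend(spend_to_date, day_of_month, days_in_month, budget):
    """Linear projection of month-end spend vs budget; positive means over."""
    projected = spend_to_date / day_of_month * days_in_month
    return projected - budget

# e.g. $30K spent by the 15th of a 31-day month against a $50K budget
over = projected_overspend(30_000, 15, 31, 50_000)
print(f"projected overspend: ${over:,.0f}")  # 30K/15 * 31 - 50K = $12,000
```

An alert at this point in the month would have surfaced the spike two weeks before the Finance email arrived.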
