Portal | Level: L2: Operations | Topics: FinOps | Domain: DevOps & Tooling

Scenario: Cloud Cost Spike Investigation

The Prompt

"Our AWS bill jumped 40% this month compared to last month. Engineering says nothing has changed. Finance wants an explanation and a fix by end of week. Where do you start?"

Initial Report

Finance email: "AWS bill went from $50K to $70K this month. The increase seems to be in EC2 and EBS. No new services were launched that I know of."

Constraints

  • Not an outage: Nothing is down and no pages are firing, but management wants answers by end of week.
  • Political: Teams may be defensive about their resource usage.
  • Need data: Must prove the cause with evidence, not guesswork.

Expected Investigation Path

# 1. AWS Cost Explorer — identify the service and time
# Filter by service: EC2, EBS, EKS
# Group by: tag (team/service), linked account, usage type
# Time: daily granularity to find when the spike started

# 2. Check for new/larger instances
# Note: the End date is exclusive, so use the first day of the next month
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE
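
Once the daily JSON is in hand, pinpointing the day the spike started is a small scripting exercise. A minimal sketch, assuming the `ResultsByTime`/`Groups`/`Metrics` field names of the real `get-cost-and-usage` response; the dollar figures in the sample are invented:

```python
def daily_totals(results_by_time):
    """Sum the grouped BlendedCost amounts into one total per day."""
    totals = []
    for day in results_by_time:
        amount = sum(float(g["Metrics"]["BlendedCost"]["Amount"])
                     for g in day["Groups"])
        totals.append((day["TimePeriod"]["Start"], amount))
    return totals

def spike_start(totals, threshold=1.5):
    """First day whose cost exceeds `threshold` x the average of prior days."""
    for i in range(1, len(totals)):
        baseline = sum(cost for _, cost in totals[:i]) / i
        if totals[i][1] > threshold * baseline:
            return totals[i][0]
    return None

# Invented sample: steady ~$1.6K/day, then a jump on Jan 15
sample = [{"TimePeriod": {"Start": f"2024-01-{d:02d}"},
           "Groups": [{"Metrics": {"BlendedCost": {"Amount": str(a)}}}]}
          for d, a in [(13, 1600), (14, 1650), (15, 2700), (16, 2750)]]
print(spike_start(daily_totals(sample)))  # -> 2024-01-15
```

Knowing the exact start date narrows the search to deploys, config changes, and scaling events from that day.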

# 3. Check Kubernetes node counts over time
kubectl get nodes --no-headers | wc -l   # --no-headers so the count is exact
# Compare with last month's baseline (e.g. node-count metrics in Prometheus/Grafana)
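
A node-count delta converts directly into a dollar estimate, which tells you whether this hypothesis can even explain the spike. A back-of-envelope sketch with example figures (the $0.384/hr rate is the m5.2xlarge us-east-1 on-demand list price; node counts are invented):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def node_cost_delta(old_nodes, new_nodes, hourly_rate):
    """Monthly on-demand cost change from a node-count change."""
    return (new_nodes - old_nodes) * hourly_rate * HOURS_PER_MONTH

# e.g. cluster grew from 40 to 55 m5.2xlarge nodes at ~$0.384/hr
delta = node_cost_delta(40, 55, 0.384)
print(f"${delta:,.0f}/month")  # 15 nodes * $0.384/hr * 730 h = $4,205/month
```

If the estimate is far below the $20K jump, node growth alone isn't the answer and other hypotheses stay on the table.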

# 4. Check if cluster autoscaler / Karpenter scaled up
# Cluster autoscaler emits TriggeredScaleUp events on pending pods
kubectl get events -A --field-selector reason=TriggeredScaleUp
# For Karpenter, check node/nodeclaim events and the controller logs

# 5. Check for resource request inflation
kubectl top nodes
# resource-capacity is a krew plugin; compares requests with actual utilization
kubectl resource-capacity --util --sort cpu.request
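
The "request inflation" check boils down to comparing what each workload requests with what it actually uses, since the autoscaler provisions for requests, not usage. A sketch with invented numbers; in practice the requests come from `kubectl get pods -o json` and usage from metrics-server or Prometheus:

```python
def overprovisioned(workloads, ratio=3.0):
    """Flag workloads requesting more than `ratio` x their observed CPU usage."""
    return [name for name, (request_m, usage_m) in workloads.items()
            if usage_m > 0 and request_m / usage_m > ratio]

workloads = {               # name: (CPU request in millicores, observed usage)
    "checkout":  (4000, 400),   # 10x over-requested -> autoscaler adds nodes
    "payments":  (1000, 800),
    "reporting": (2000, 300),
}
print(overprovisioned(workloads))  # -> ['checkout', 'reporting']
```

A workload that raised its requests without right-sizing shows up here even though "nothing changed" from the team's point of view.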

# 6. Check for orphaned resources
# Unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}'
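
Pricing out the unattached volumes from that query makes the cleanup case concrete for Finance. A sketch assuming the us-east-1 gp3 list price of $0.08/GiB-month (adjust per region and volume type); the volume sizes are invented:

```python
GP3_PER_GIB_MONTH = 0.08  # us-east-1 gp3 list price, $/GiB-month

def orphan_cost(volumes):
    """Monthly storage cost of a list of {ID, Size-in-GiB} volumes."""
    return sum(v["Size"] for v in volumes) * GP3_PER_GIB_MONTH

orphans = [{"ID": "vol-0abc", "Size": 500},
           {"ID": "vol-0def", "Size": 1000}]
print(f"${orphan_cost(orphans):.2f}/month")  # 1500 GiB * $0.08 = $120.00/month
```

Orphaned volumes rarely explain a $20K jump by themselves, but they are the easiest immediate savings to bank while the investigation continues.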

# Unused load balancers
aws elbv2 describe-load-balancers --query 'LoadBalancers[].LoadBalancerArn'
# Cross-reference with active services

# 7. Check for spot instance fallback to on-demand
# Karpenter logs or ASG activity showing capacity type changes
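
Spot fallback is worth quantifying because the per-node delta is large: spot typically runs 60-90% below on-demand, so even a partial fallback is visible on the bill. A sketch with illustrative prices ($0.384/hr is the m5.2xlarge on-demand list price; the spot rate is an assumed example):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def fallback_delta(node_count, on_demand_rate, spot_rate):
    """Extra monthly spend if `node_count` nodes run on-demand instead of spot."""
    return node_count * (on_demand_rate - spot_rate) * HOURS_PER_MONTH

# e.g. 20 m5.2xlarge nodes: $0.384/hr on-demand vs an assumed $0.12/hr spot
print(f"${fallback_delta(20, 0.384, 0.12):,.0f}/month")  # -> $3,854/month
```

Compare this figure against the Cost Explorer usage-type breakdown (SpotUsage vs BoxUsage) to confirm or rule out the hypothesis.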

Common Root Causes

  1. Cluster autoscaler added nodes — a service increased resource requests without right-sizing
  2. Spot instances fell back to on-demand — spot capacity unavailable in the region
  3. Orphaned EBS volumes — PVCs deleted but volumes retained (reclaimPolicy: Retain)
  4. New environment spun up — someone created a staging cluster and forgot about it
  5. Log/metric volume explosion — CloudWatch or S3 costs from increased logging
  6. Data transfer — cross-AZ or cross-region traffic increase

What a Strong Answer Includes

  • Structured investigation: start with Cost Explorer, then drill down by service/tag/time
  • Multiple hypotheses: don't jump to conclusions
  • Tagging importance: "If resources aren't tagged, this investigation takes 10x longer"
  • Immediate savings: identify and clean up orphaned resources
  • Prevention: implement cost alerts, enforce tagging policies, regular FinOps reviews
  • Dashboards: set up Kubecost or OpenCost for Kubernetes-specific cost visibility
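
The cost-alert idea above reduces to a simple projection: extrapolate month-to-date spend and alert if the projection crosses the budget, which is essentially what AWS Budgets forecast alerts do. A minimal sketch of that logic with example figures:

```python
def projected_overspend(spend_to_date, day_of_month, days_in_month, budget):
    """Linear projection of month-end spend vs budget; positive means over."""
    projected = spend_to_date / day_of_month * days_in_month
    return projected - budget

# e.g. $30K spent by the 15th of a 31-day month against a $50K budget
over = projected_overspend(30_000, 15, 31, 50_000)
print(f"projected overspend: ${over:,.0f}")  # 30K/15 * 31 - 50K = $12,000
```

An alert at this point in the month would have surfaced the spike two weeks before the Finance email arrived.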
