Cost Optimization & FinOps - Street-Level Ops¶
Practical cost reduction patterns from production clusters.
Quick Cost Audit¶
```shell
# Node count and sizes
kubectl get nodes -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

# Cluster-wide resource allocation
kubectl describe nodes | grep -E "(Name:|Allocated|requests)"

# Top resource consumers
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20

# Count pods per namespace
kubectl get pods -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn

# Find PVCs and their sizes
kubectl get pvc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SIZE:.spec.resources.requests.storage,STATUS:.status.phase'
```
One-liner: find pods that are barely using any CPU:

```shell
kubectl top pods -A --no-headers | awk '$3+0 < 10 {print $1, $2, "cpu="$3}'
```

These are right-sizing candidates.

Debug clue: `kubectl describe nodes | grep -A5 "Allocated"` shows the gap between requested and allocatable. If requests total 90% but actual usage is 30%, you are massively over-provisioned.
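The gap can be condensed into a single over-provisioning ratio. A minimal sketch; the percentages below are illustrative stand-ins for the numbers you read out of `kubectl describe nodes` and `kubectl top nodes`:

```shell
# Illustrative inputs: total requested CPU vs. actual CPU usage, as % of allocatable
requested_pct=90
usage_pct=30

# Ratio > 2x is a strong signal to shrink requests or node count
ratio=$(awk -v r="$requested_pct" -v u="$usage_pct" 'BEGIN { printf "%.1f", r / u }')
echo "requests are ${ratio}x actual usage"
```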
Pattern: The Monthly Cost Review¶
Run this checklist monthly:
- Right-sizing: Compare VPA recommendations to current requests
- Orphaned resources: PVCs, Services (LoadBalancer), unused ConfigMaps
- Node utilization: Target 50-70% average CPU
- Spot coverage: What percentage of workloads are on spot?
- Log volume: Check Loki/CloudWatch ingestion rates
- Reserved capacity: Are reserved instances/savings plans still right-sized?
Pattern: Resource Request Guidelines¶
| Workload type | CPU request | Memory request | CPU limit |
|---|---|---|---|
| Web API | p95 usage | p99 usage + 20% | None or 4x request |
| Background worker | p95 usage | p99 usage + 20% | None |
| Database | Dedicated | Dedicated + buffer | Equal to request |
| Batch job | Average usage | Peak usage | None |
Why no CPU limits for most workloads: CPU limits cause throttling even when the node has idle CPU. This increases latency without saving money. Memory limits are essential (OOMKill is better than node instability).
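The "Web API" row of the table translates to a resource stanza like this. A sketch only: the numbers are illustrative, and you should derive them from your own p95/p99 usage data:

```yaml
# Web API: CPU request at ~p95 usage, memory request at ~p99 + 20%,
# no CPU limit (avoids CFS throttling), memory limit as the hard ceiling.
resources:
  requests:
    cpu: 250m        # ~p95 observed CPU (illustrative)
    memory: 300Mi    # ~p99 observed memory + 20% (illustrative)
  limits:
    memory: 300Mi    # memory limit only; OOMKill beats node instability
```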
Under the hood: CPU throttling happens via CFS (Completely Fair Scheduler) bandwidth control. Even if the node has 50% idle CPU, a pod at its limit gets throttled. Check `container_cpu_cfs_throttled_seconds_total` in Prometheus to find victims.

Default trap: Kubernetes defaults to no resource requests or limits. Without requests, the scheduler cannot bin-pack efficiently, and pods compete freely for CPU during contention: your latency-sensitive API gets starved by a batch job.
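One way to close the default trap is a namespace `LimitRange`, which injects defaults into any container that omits them. A sketch; the namespace name and values are hypothetical:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: team-a            # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container omits limits
        memory: 256Mi          # memory limit only, per the guidance above
```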
Pattern: Namespace Budget Alerts¶
```yaml
# Prometheus alert: namespace cost exceeding budget
groups:
  - name: cost-alerts
    rules:
      - alert: NamespaceCPUBudgetExceeded
        expr: |
          sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"}) > 8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Namespace {{ $labels.namespace }} requesting >8 CPU cores"
      - alert: OrphanedPVCs
        expr: |
          kube_persistentvolumeclaim_status_phase{phase="Bound"} == 1
          unless on (persistentvolumeclaim, namespace)
            kube_pod_spec_volumes_persistentvolumeclaims_info
        for: 24h
        labels:
          severity: info
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} not mounted for 24h"
```
War story: A team ran three `m5.2xlarge` nodes in dev "to match prod." Monthly cost: $2,100. Average CPU usage: 8%. Switching to a single `t3.large` spot instance with scale-to-zero after 7 PM saved $1,900/month, enough to fund their observability tooling.
Anti-Pattern: Oversized Dev Environments¶
Dev clusters that mirror production sizing waste money:
```shell
# Dev should have:
# - Fewer replicas (1 instead of 3)
# - Smaller resource requests (50% of prod)
# - Smaller PVCs
# - No multi-AZ
# - Spot instances only
# - Scale to 0 after hours
```
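The replica and request reductions can be enforced declaratively. A minimal sketch assuming a Kustomize layout with a hypothetical `api` Deployment in the base:

```yaml
# overlays/dev/kustomization.yaml (hypothetical paths and names)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 1               # 1 replica instead of 3
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/cpu
        value: 100m            # ~50% of the prod request
    target:
      kind: Deployment
      name: api
```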
Anti-Pattern: LoadBalancer per Service¶
Each `type: LoadBalancer` Service creates its own cloud load balancer at $15-25/month.
Fix: Use an Ingress controller (one LB for all services):
```shell
# Instead of 10 LoadBalancer Services ($200/month),
# use 1 Ingress controller + 10 Ingress rules ($20/month)
```
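The consolidation looks like this: one controller Service of `type: LoadBalancer`, with plain `ClusterIP` Services behind Ingress rules. A sketch; the host, names, and ingress class are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: apps                   # hypothetical
spec:
  ingressClassName: nginx      # assumes an ingress-nginx controller
  rules:
    - host: api.example.com    # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api      # now a ClusterIP Service, no cloud LB
                port:
                  number: 80
```

Each additional service becomes one more rule on the same Ingress (or a separate Ingress object sharing the controller), not one more load balancer.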
Scale note: In AWS, each LoadBalancer Service also creates an ENI per subnet and a security group. At scale you hit VPC quotas (security groups default to 2,500 per region). NLB is cheaper than ALB for pure TCP, but ALB supports path-based routing, which further reduces LB count.
Remember: FinOps cost-driver mnemonic: CDRN — Compute, Data transfer, Reserved capacity gaps, NAT gateways. Review all four monthly. Data transfer and NAT are the "invisible" costs that blindside teams who only watch compute.
Gotcha: Daemonsets and Sidecars¶
Daemonsets run on every node. Each sidecar (mesh proxy, log collector) adds overhead to every pod.
```shell
# Calculate daemonset overhead
kubectl get ds -A -o custom-columns='NAME:.metadata.name,CPU:.spec.template.spec.containers[*].resources.requests.cpu,MEM:.spec.template.spec.containers[*].resources.requests.memory'
# Multiply by node count for total overhead
```
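The multiplication can be sketched with awk. The per-daemonset request values below are made-up stand-ins for the output of the command above:

```shell
# Sum per-node daemonset requests, then multiply by node count.
# Columns: CPU (millicores)  memory (Mi); values are illustrative.
NODES=10
overhead=$(awk -v nodes="$NODES" '
  { cpu += $1; mem += $2 }
  END { printf "%dm CPU, %dMi memory", cpu * nodes, mem * nodes }' <<'EOF'
100 128
50 64
200 256
EOF
)
echo "cluster-wide daemonset overhead: $overhead"
```

With three daemonsets reserving 350m/448Mi per node, ten nodes carry 3.5 cores and ~4.4Gi of pure agent overhead, before any application pod runs.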
Gotcha: NAT Gateway data processing charges are the silent killer in AWS. At $0.045/GB, a cluster pulling 100GB/day of container images through NAT costs $135/month just in data charges. Use VPC endpoints for ECR/S3 to eliminate this.
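The arithmetic behind that figure, as a one-line sanity check:

```shell
# NAT data-processing cost: GB/day through NAT x 30 days x $0.045/GB
nat_cost=$(awk 'BEGIN { printf "%.0f", 100 * 30 * 0.045 }')
echo "\$${nat_cost}/month"
```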
Gotcha: Orphaned EBS volumes persist after you delete the EC2 instance or PV. At $0.10/GB/month, a forgotten 500GB volume costs $600/year. Find them monthly:

```shell
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}'
```
Quick Savings Calculator¶
```
Monthly node cost: $X
Nodes in cluster: N
Average utilization: U%

Right-sizing:
  U < 40%    -> can likely reduce nodes by 30-40%
  U = 40-60% -> well-optimized
  U > 70%    -> may need more nodes for reliability

Spot savings (for eligible workloads):
  current on-demand cost * eligible_fraction * 0.7

Dev/staging off-hours savings:
  node cost * (14 off-hours / 24 hours) * (5 weekdays / 7 days) = ~42% savings
```
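The calculator runs fine as shell arithmetic. A sketch with illustrative inputs; plug in your own numbers:

```shell
# Hypothetical inputs
node_cost=150      # $/node/month
nodes=10
eligible=0.6       # fraction of workloads safe on spot

# Spot: ~70% discount on the eligible fraction
spot=$(awk -v c="$node_cost" -v n="$nodes" -v e="$eligible" \
  'BEGIN { printf "%.0f", c * n * e * 0.7 }')

# Off-hours: 14h/night on the 5 weekdays, as a share of the full 168h week
offhours=$(awk -v c="$node_cost" -v n="$nodes" \
  'BEGIN { printf "%.0f", c * n * (14 * 5) / (24 * 7) }')

echo "spot savings:      \$${spot}/month"
echo "off-hours savings: \$${offhours}/month"
```

Shutting dev down entirely on weekends adds another 2/7 (~29%) on top of the weekday-nights figure.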
Quick Reference¶
- Cheatsheet: FinOps