

Cost Optimization & FinOps - Primer

Why This Matters

Cloud bills are the new data center lease. An over-provisioned cluster costs thousands per month. An under-provisioned one causes outages. FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending. As an SRE/DevOps engineer, you're often the person who can identify and fix the biggest cost drivers.

The FinOps Framework

Inform -> Optimize -> Operate

Inform:    Visibility into who spends what and why
Optimize:  Right-size, use commitments, eliminate waste
Operate:   Continuous governance, budgets, alerts

Kubernetes Cost Anatomy

Where the Money Goes

Resource                | Cost driver                                   | Optimization lever
------------------------|-----------------------------------------------|-------------------------------------------
Compute (nodes)         | CPU + memory reserved                         | Right-size requests, use spot/preemptible
Storage (PVCs)          | Disk size + IOPS                              | Right-size volumes, delete unused PVCs
Network                 | Cross-AZ traffic, NAT gateway, load balancers | Topology-aware routing, minimize cross-AZ
Control plane           | Managed K8s fee (EKS/GKE/AKS)                 | Consolidate clusters
Observability           | Log/metric ingestion + storage                | Reduce cardinality, set retention

The Request/Limit Gap

[Actually Used: 100m CPU]  [Requested: 500m CPU]  [Limit: 1000m CPU]
     |----- waste ------|

The scheduler reserves node capacity based on REQUESTS, not actual usage — so requests are what you effectively pay for. Over-requesting = paying for idle resources.

This is the single biggest cost optimization opportunity in Kubernetes.
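The gap in the diagram above translates directly into money. A quick back-of-envelope, treating the diagram's figures as steady-state (a sketch; real usage fluctuates):

```shell
# 100m used out of 500m requested: most of the CPU you pay for sits idle
awk 'BEGIN {
  used = 100; requested = 500               # millicores, from the diagram
  printf "idle share of paid CPU: %d%%\n", (1 - used / requested) * 100
}'
# -> idle share of paid CPU: 80%
```

Multiply that 80% by your node bill and the size of the opportunity becomes obvious.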

Right-Sizing

Analyzing Resource Usage

# Current requests vs actual usage
kubectl top pods -n grokdevops
kubectl get pods -n grokdevops -o custom-columns=\
  'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Using Prometheus (more accurate, historical)
# CPU: actual vs requested
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="grokdevops",resource="cpu"}) by (pod)
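To turn those numbers into a worklist, a small awk filter can flag pods using well under their request. This is a sketch: the helper name and the 30% threshold are arbitrary choices, and in practice you would feed it the joined output of `kubectl top pods` and the custom-columns query above rather than the hard-coded sample used here.

```shell
# Columns: pod  cpu_used(millicores)  cpu_requested(millicores)
flag_overprovisioned() {
  awk '$2 / $3 < 0.3 { printf "%s: using %dm of %dm requested\n", $1, $2, $3 }'
}

# Hard-coded sample standing in for real cluster data
flag_overprovisioned <<'EOF'
api 40 500
worker 450 500
cron 10 250
EOF
# -> api: using 40m of 500m requested
# -> cron: using 10m of 250m requested
```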

VPA (Vertical Pod Autoscaler)

VPA analyzes actual usage and recommends (or auto-sets) resource requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grokdevops-vpa
  namespace: grokdevops
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grokdevops
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: grokdevops
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi

# View VPA recommendations
kubectl get vpa grokdevops-vpa -n grokdevops -o yaml | grep -A20 recommendation

Right-Sizing Process

  1. Deploy VPA in "Off" (recommendation-only) mode
  2. Collect 7 days of data
  3. Set requests to VPA's "target" recommendation
  4. Set limits to 2-3x the request (or remove limits for CPU)
  5. Monitor for OOMKills or throttling
  6. Repeat quarterly
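Step 3 above means reading the VPA's "target" out of its status. A jq sketch, run here against a hand-written sample that mimics the shape of `kubectl get vpa ... -o json` output (the field names match the VPA API; the values are made up):

```shell
# Sample of the status a VPA publishes after collecting data
cat <<'EOF' > /tmp/vpa-sample.json
{"status":{"recommendation":{"containerRecommendations":[
  {"containerName":"grokdevops","target":{"cpu":"137m","memory":"262144k"}}
]}}}
EOF

# One "container: cpu=... memory=..." line per container
jq -r '.status.recommendation.containerRecommendations[]
       | "\(.containerName): cpu=\(.target.cpu) memory=\(.target.memory)"' /tmp/vpa-sample.json
# -> grokdevops: cpu=137m memory=262144k
```

Copy the target values into your Deployment's `resources.requests`, then apply the 2-3x rule from step 4 for limits.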

Spot/Preemptible Nodes

Spot instances cost 60-90% less than on-demand, but can be reclaimed on short notice (2 minutes on AWS, as little as 30 seconds on GCP).

Safe for Spot

  • Stateless web apps behind a Deployment with multiple replicas
  • Batch jobs that can be retried
  • Dev/staging environments
  • CI/CD runners

Not Safe for Spot

  • Single-replica databases
  • Stateful workloads without graceful shutdown
  • Long-running jobs that can't checkpoint

Implementation

# Node affinity preferring spot capacity. Uses Karpenter's capacity-type
# label; on EKS managed node groups the label is eks.amazonaws.com/capacityType
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]

# Tolerate spot taints (the taint key is whatever your node pools apply;
# this one is illustrative)
tolerations:
  - key: "kubernetes.io/spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Pod Disruption Budgets

Limit how many pods can be drained at once — spot interruption handlers and the cluster autoscaler respect PDBs:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grokdevops-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: grokdevops

Cluster Autoscaler & Karpenter

Cluster Autoscaler

Scales node groups up/down based on pending pods:

# Key settings
--scale-down-utilization-threshold=0.5   # Scale down when <50% utilized
--scale-down-delay-after-add=10m         # Wait 10m after adding a node
--scale-down-unneeded-time=10m           # Node must be underutilized for 10m

Karpenter (AWS)

More flexible and faster than Cluster Autoscaler. Provisions individual nodes based on pod requirements:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6i.large"]
      expireAfter: 720h  # Replace nodes after 30 days
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 name (was WhenUnderutilized in v1beta1)
  limits:
    cpu: "100"
    memory: 200Gi

Storage Optimization

# Find PVCs that are not Bound (Pending or Lost)
kubectl get pvc -A --no-headers | grep -v Bound

# Find Bound PVCs not mounted by any pod
kubectl get pvc -A -o json | jq -r '.items[] | select(.status.phase=="Bound") | .metadata.namespace + "/" + .metadata.name' | while read -r pvc; do
  ns=${pvc%/*}
  name=${pvc#*/}
  if ! kubectl get pods -n "$ns" -o json \
       | jq -e --arg name "$name" 'any(.items[].spec.volumes[]?.persistentVolumeClaim.claimName; . == $name)' > /dev/null; then
    echo "Unused PVC: $pvc"
  fi
done

Storage Class Tiers

Tier                    | Use case                   | Cost
------------------------|----------------------------|-----
SSD (gp3/pd-ssd)        | Databases, low-latency     | $$$
HDD (sc1/pd-standard)   | Logs, backups, archives    | $
Object storage (S3/GCS) | Long-term storage, backups | $
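The tier choice is worth quantifying. Using ballpark us-east-1 list prices (assumptions — prices vary by region and change over time): gp3 at roughly $0.08/GB-month versus sc1 at roughly $0.015/GB-month:

```shell
# Monthly cost of a 500 GB volume on each tier (list prices assumed above)
awk 'BEGIN {
  gb = 500
  printf "gp3: $%.2f/month  sc1: $%.2f/month\n", gb * 0.08, gb * 0.015
}'
# -> gp3: $40.00/month  sc1: $7.50/month
```

A 5x difference per volume — moving cold data (logs, backups) off SSD tiers is usually a painless win.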

Network Cost Optimization

Cross-AZ Traffic

In AWS, cross-AZ traffic costs $0.01/GB each way. For high-traffic services, this adds up.
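A quick sizing of "adds up", assuming the $0.01/GB-each-way rate and a service shipping 5 TB/month across zones:

```shell
# 5 TB/month crossing AZs, billed $0.01/GB in each direction ($0.02/GB total)
awk 'BEGIN { gb = 5 * 1024; printf "$%.2f/month\n", gb * 0.02 }'
# -> $102.40/month
```

That is per service — chatty east-west traffic between replicated services multiplies it.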

# Topology-aware routing (prefer same-zone)
apiVersion: v1
kind: Service
metadata:
  name: grokdevops
  annotations:
    service.kubernetes.io/topology-mode: Auto

NAT Gateway Costs

NAT gateways charge per GB processed. Reduce by:

  1. Pulling images from ECR in the same region (via a VPC endpoint)
  2. Using VPC endpoints for other AWS services
  3. Caching DNS lookups

Cost Visibility Tools

Tool              | Type         | Best for
------------------|--------------|----------------------------------
Kubecost          | Open source  | Per-namespace/pod cost allocation
OpenCost          | CNCF project | Standardized cost monitoring
CloudHealth       | SaaS         | Multi-cloud, enterprise
AWS Cost Explorer | Native       | AWS-specific analysis
GCP Billing       | Native       | GCP-specific analysis

Kubecost Quick Setup

helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace \
  --set kubecostToken="your-token"
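Once the pods are Running, the UI can be reached with a port-forward. The deployment name below is what the chart creates by default — verify with `kubectl get deploy -n kubecost` if your release name differs:

```shell
# Forward the Kubecost UI to localhost (blocks until interrupted)
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090:9090
# then browse http://localhost:9090
```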

Quick Wins Checklist

  1. Right-size requests — Use VPA recommendations (biggest impact)
  2. Delete unused resources — Orphaned PVCs, idle load balancers, stopped pods
  3. Use spot for non-critical workloads — 60-90% savings
  4. Set resource quotas per namespace — Prevent teams from over-provisioning
  5. Enable cluster autoscaler — Scale down unused nodes
  6. Review log retention — Do you need 90 days of debug logs?
  7. Use topology-aware routing — Reduce cross-AZ traffic
  8. Schedule dev environments — Scale to 0 after hours
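Item 8 can be sketched with an in-cluster CronJob. The names, namespace, and image below are illustrative, and the job's ServiceAccount needs permission to scale deployments:

```shell
# Scale every deployment in the dev namespace to zero at 19:00, Mon-Fri
kubectl create cronjob scale-down-dev -n dev \
  --schedule="0 19 * * 1-5" \
  --image=bitnami/kubectl:latest \
  -- kubectl scale deployment --all --replicas=0 -n dev
```

A mirror-image job with `--replicas=1` (or your normal count) brings the environment back in the morning.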
