

Cost Optimization & FinOps - Primer

Why This Matters

Cloud bills are the new data center lease. An over-provisioned cluster costs thousands per month. An under-provisioned one causes outages. FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending. As an SRE/DevOps engineer, you're often the person who can identify and fix the biggest cost drivers.

The FinOps Framework

Inform -> Optimize -> Operate

Inform:    Visibility into who spends what and why
Optimize:  Right-size, use commitments, eliminate waste
Operate:   Continuous governance, budgets, alerts

Kubernetes Cost Anatomy

Where the Money Goes

Resource                | Cost driver                                   | Optimization lever
------------------------|-----------------------------------------------|-------------------------------------------
Compute (nodes)         | CPU + memory reserved                         | Right-size requests, use spot/preemptible
Storage (PVCs)          | Disk size + IOPS                              | Right-size volumes, delete unused PVCs
Network                 | Cross-AZ traffic, NAT gateway, load balancers | Topology-aware routing, minimize cross-AZ
Control plane           | Managed K8s fee (EKS/GKE/AKS)                 | Consolidate clusters
Observability           | Log/metric ingestion + storage                | Reduce cardinality, set retention

The Request/Limit Gap

[Actually Used: 100m CPU]  [Requested: 500m CPU]  [Limit: 1000m CPU]
     |----- waste ------|

The scheduler reserves node capacity based on REQUESTS, not actual usage — so requests are what you effectively pay for. Over-requesting = paying for idle resources.

This is the single biggest cost optimization opportunity in Kubernetes.
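The gap in the diagram above translates directly into money. A quick back-of-envelope, treating the diagram's figures as steady-state (a sketch; real usage fluctuates):

```shell
# 100m used out of 500m requested: most of the CPU you pay for sits idle
awk 'BEGIN {
  used = 100; requested = 500               # millicores, from the diagram
  printf "idle share of paid CPU: %d%%\n", (1 - used / requested) * 100
}'
# -> idle share of paid CPU: 80%
```

Multiply that 80% by your node bill and the size of the opportunity becomes obvious.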

Right-Sizing

Analyzing Resource Usage

# Current requests vs actual usage
kubectl top pods -n grokdevops
kubectl get pods -n grokdevops -o custom-columns=\
  'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Using Prometheus (more accurate, historical)
# CPU: actual vs requested
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="grokdevops",resource="cpu"}) by (pod)
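To turn those numbers into a worklist, a small awk filter can flag pods using well under their request. This is a sketch: the helper name and the 30% threshold are arbitrary choices, and in practice you would feed it the joined output of `kubectl top pods` and the custom-columns query above rather than the hard-coded sample used here.

```shell
# Columns: pod  cpu_used(millicores)  cpu_requested(millicores)
flag_overprovisioned() {
  awk '$2 / $3 < 0.3 { printf "%s: using %dm of %dm requested\n", $1, $2, $3 }'
}

# Hard-coded sample standing in for real cluster data
flag_overprovisioned <<'EOF'
api 40 500
worker 450 500
cron 10 250
EOF
# -> api: using 40m of 500m requested
# -> cron: using 10m of 250m requested
```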

VPA (Vertical Pod Autoscaler)

VPA analyzes actual usage and recommends (or auto-sets) resource requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grokdevops-vpa
  namespace: grokdevops
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grokdevops
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: grokdevops
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi

# View VPA recommendations
kubectl get vpa grokdevops-vpa -n grokdevops -o yaml | grep -A20 recommendation

Right-Sizing Process

  1. Deploy VPA in "Off" (recommendation-only) mode
  2. Collect 7 days of data
  3. Set requests to VPA's "target" recommendation
  4. Set limits to 2-3x the request (or remove limits for CPU)
  5. Monitor for OOMKills or throttling
  6. Repeat quarterly
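Step 3 above means reading the VPA's "target" out of its status. A jq sketch, run here against a hand-written sample that mimics the shape of `kubectl get vpa ... -o json` output (the field names match the VPA API; the values are made up):

```shell
# Sample of the status a VPA publishes after collecting data
cat <<'EOF' > /tmp/vpa-sample.json
{"status":{"recommendation":{"containerRecommendations":[
  {"containerName":"grokdevops","target":{"cpu":"137m","memory":"262144k"}}
]}}}
EOF

# One "container: cpu=... memory=..." line per container
jq -r '.status.recommendation.containerRecommendations[]
       | "\(.containerName): cpu=\(.target.cpu) memory=\(.target.memory)"' /tmp/vpa-sample.json
# -> grokdevops: cpu=137m memory=262144k
```

Copy the target values into your Deployment's `resources.requests`, then apply the 2-3x rule from step 4 for limits.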

Spot/Preemptible Nodes

Spot instances cost 60-90% less than on-demand, but can be reclaimed on short notice (2 minutes on AWS, as little as 30 seconds on GCP).

Safe for Spot

  • Stateless web apps behind a Deployment with multiple replicas
  • Batch jobs that can be retried
  • Dev/staging environments
  • CI/CD runners

Not Safe for Spot

  • Single-replica databases
  • Stateful workloads without graceful shutdown
  • Long-running jobs that can't checkpoint

Implementation

# Node affinity preferring spot capacity. Uses Karpenter's capacity-type
# label; on EKS managed node groups the label is eks.amazonaws.com/capacityType
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]

# Tolerate spot taints (the taint key is whatever your node pools apply;
# this one is illustrative)
tolerations:
  - key: "kubernetes.io/spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Pod Disruption Budgets

Limit how many pods can be drained at once — spot interruption handlers and the cluster autoscaler respect PDBs:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grokdevops-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: grokdevops

Cluster Autoscaler & Karpenter

Cluster Autoscaler

Scales node groups up/down based on pending pods:

# Key settings
--scale-down-utilization-threshold=0.5   # Scale down when <50% utilized
--scale-down-delay-after-add=10m         # Wait 10m after adding a node
--scale-down-unneeded-time=10m           # Node must be underutilized for 10m

Karpenter (AWS)

More flexible and faster than Cluster Autoscaler. Provisions individual nodes based on pod requirements:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6i.large"]
      expireAfter: 720h  # Replace nodes after 30 days
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 name (was WhenUnderutilized in v1beta1)
  limits:
    cpu: "100"
    memory: 200Gi

Storage Optimization

# Find PVCs that are not Bound (Pending or Lost)
kubectl get pvc -A --no-headers | grep -v Bound

# Find Bound PVCs not mounted by any pod
kubectl get pvc -A -o json | jq -r '.items[] | select(.status.phase=="Bound") | .metadata.namespace + "/" + .metadata.name' | while read -r pvc; do
  ns=${pvc%/*}
  name=${pvc#*/}
  if ! kubectl get pods -n "$ns" -o json \
       | jq -e --arg name "$name" 'any(.items[].spec.volumes[]?.persistentVolumeClaim.claimName; . == $name)' > /dev/null; then
    echo "Unused PVC: $pvc"
  fi
done

Storage Class Tiers

Tier                    | Use case                   | Cost
------------------------|----------------------------|-----
SSD (gp3/pd-ssd)        | Databases, low-latency     | $$$
HDD (sc1/pd-standard)   | Logs, backups, archives    | $
Object storage (S3/GCS) | Long-term storage, backups | $
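The tier choice is worth quantifying. Using ballpark us-east-1 list prices (assumptions — prices vary by region and change over time): gp3 at roughly $0.08/GB-month versus sc1 at roughly $0.015/GB-month:

```shell
# Monthly cost of a 500 GB volume on each tier (list prices assumed above)
awk 'BEGIN {
  gb = 500
  printf "gp3: $%.2f/month  sc1: $%.2f/month\n", gb * 0.08, gb * 0.015
}'
# -> gp3: $40.00/month  sc1: $7.50/month
```

A 5x difference per volume — moving cold data (logs, backups) off SSD tiers is usually a painless win.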

Network Cost Optimization

Cross-AZ Traffic

In AWS, cross-AZ traffic costs $0.01/GB each way. For high-traffic services, this adds up.
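A quick sizing of "adds up", assuming the $0.01/GB-each-way rate and a service shipping 5 TB/month across zones:

```shell
# 5 TB/month crossing AZs, billed $0.01/GB in each direction ($0.02/GB total)
awk 'BEGIN { gb = 5 * 1024; printf "$%.2f/month\n", gb * 0.02 }'
# -> $102.40/month
```

That is per service — chatty east-west traffic between replicated services multiplies it.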

# Topology-aware routing (prefer same-zone)
apiVersion: v1
kind: Service
metadata:
  name: grokdevops
  annotations:
    service.kubernetes.io/topology-mode: Auto

NAT Gateway Costs

NAT gateways charge per GB processed. Reduce by:

  1. Pulling images from ECR in the same region (via a VPC endpoint)
  2. Using VPC endpoints for other AWS services
  3. Caching DNS lookups

Cost Visibility Tools

Tool              | Type         | Best for
------------------|--------------|----------------------------------
Kubecost          | Open source  | Per-namespace/pod cost allocation
OpenCost          | CNCF project | Standardized cost monitoring
CloudHealth       | SaaS         | Multi-cloud, enterprise
AWS Cost Explorer | Native       | AWS-specific analysis
GCP Billing       | Native       | GCP-specific analysis

Kubecost Quick Setup

helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace \
  --set kubecostToken="your-token"
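Once the pods are Running, the UI can be reached with a port-forward. The deployment name below is what the chart creates by default — verify with `kubectl get deploy -n kubecost` if your release name differs:

```shell
# Forward the Kubecost UI to localhost (blocks until interrupted)
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090:9090
# then browse http://localhost:9090
```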

Quick Wins Checklist

  1. Right-size requests — Use VPA recommendations (biggest impact)
  2. Delete unused resources — Orphaned PVCs, idle load balancers, stopped pods
  3. Use spot for non-critical workloads — 60-90% savings
  4. Set resource quotas per namespace — Prevent teams from over-provisioning
  5. Enable cluster autoscaler — Scale down unused nodes
  6. Review log retention — Do you need 90 days of debug logs?
  7. Use topology-aware routing — Reduce cross-AZ traffic
  8. Schedule dev environments — Scale to 0 after hours
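Item 8 can be sketched with an in-cluster CronJob. The names, namespace, and image below are illustrative, and the job's ServiceAccount needs permission to scale deployments:

```shell
# Scale every deployment in the dev namespace to zero at 19:00, Mon-Fri
kubectl create cronjob scale-down-dev -n dev \
  --schedule="0 19 * * 1-5" \
  --image=bitnami/kubectl:latest \
  -- kubectl scale deployment --all --replicas=0 -n dev
```

A mirror-image job with `--replicas=1` (or your normal count) brings the environment back in the morning.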
