Portal | Level: L2: Operations | Topics: FinOps | Domain: DevOps & Tooling
Cost Optimization & FinOps - Primer¶
Why This Matters¶
Cloud bills are the new data center lease. An over-provisioned cluster costs thousands per month. An under-provisioned one causes outages. FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending. As an SRE/DevOps engineer, you're often the person who can identify and fix the biggest cost drivers.
The FinOps Framework¶
Inform -> Optimize -> Operate

- Inform: visibility into who spends what, and why
- Optimize: right-size, use commitments, eliminate waste
- Operate: continuous governance, budgets, alerts
Kubernetes Cost Anatomy¶
Where the Money Goes¶
| Resource | Cost driver | Optimization lever |
|---|---|---|
| Compute (nodes) | CPU + memory reserved | Right-size requests, use spot/preemptible |
| Storage (PVCs) | Disk size + IOPS | Right-size volumes, delete unused PVCs |
| Network | Cross-AZ traffic, NAT gateway, load balancers | Topology-aware routing, minimize cross-AZ |
| Control plane | Managed K8s fee (EKS/GKE/AKS) | Consolidate clusters |
| Observability | Log/metric ingestion + storage | Reduce cardinality, set retention |
The Request/Limit Gap¶
```
[Actually used: 100m CPU]   [Requested: 500m CPU]   [Limit: 1000m CPU]
|------------- waste -------------|
```
You pay for REQUESTS, not usage: the scheduler reserves node capacity based on what pods request, so every millicore requested but never used is billed idle capacity. Over-requesting = paying for idle resources. This is the single biggest cost optimization opportunity in Kubernetes.
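The arithmetic behind that diagram, using the same illustrative numbers:

```shell
# 500m requested vs 100m actually used (numbers from the diagram above)
awk 'BEGIN {
  requested = 500   # millicores requested
  used      = 100   # millicores actually used
  waste_pct = (requested - used) / requested * 100
  printf "waste: %.0f%% of the requested CPU is paid for but idle\n", waste_pct
}'
```

Run the same ratio against your real `kubectl top` numbers to see how much of your cluster is reserved but idle.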
Right-Sizing¶
Analyzing Resource Usage¶
```shell
# Current requests vs actual usage
kubectl top pods -n grokdevops

kubectl get pods -n grokdevops -o custom-columns=\
'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```

```promql
# Using Prometheus (more accurate, historical)
# CPU: actual usage vs requested, per pod
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="grokdevops",resource="cpu"}) by (pod)
```
VPA (Vertical Pod Autoscaler)¶
VPA analyzes actual usage and recommends (or auto-sets) resource requests:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grokdevops-vpa
  namespace: grokdevops
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grokdevops
  updatePolicy:
    updateMode: "Off"   # Start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: grokdevops
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```

```shell
# View VPA recommendations
kubectl get vpa grokdevops-vpa -n grokdevops -o yaml | grep -A20 recommendation
```
Right-Sizing Process¶
- Deploy VPA in "Off" (recommendation-only) mode
- Collect 7 days of data
- Set requests to VPA's "target" recommendation
- Set limits to 2-3x the request (or remove limits for CPU)
- Monitor for OOMKills or throttling
- Repeat quarterly
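To pull the "target" value out of the VPA status without grepping YAML, the JSON output can be queried with jq. The payload below is a made-up sample shaped like the `status.recommendation` block of `kubectl get vpa -o json`; the values are illustrative:

```shell
# Sample VPA status (hypothetical values, mimicking kubectl get vpa -o json)
cat <<'EOF' > /tmp/vpa.json
{
  "status": {
    "recommendation": {
      "containerRecommendations": [
        {
          "containerName": "grokdevops",
          "lowerBound": {"cpu": "80m",  "memory": "128Mi"},
          "target":     {"cpu": "120m", "memory": "180Mi"},
          "upperBound": {"cpu": "400m", "memory": "512Mi"}
        }
      ]
    }
  }
}
EOF

# Extract the "target" recommendation: the value to use as the new request
jq -r '.status.recommendation.containerRecommendations[]
       | "\(.containerName): cpu=\(.target.cpu) memory=\(.target.memory)"' /tmp/vpa.json
```

Pipe the live object through the same filter (`kubectl get vpa grokdevops-vpa -n grokdevops -o json | jq ...`) to get the values for step 3.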
Spot/Preemptible Nodes¶
Spot instances (AWS) and Spot/preemptible VMs (GCP) cost 60-90% less than on-demand, but can be reclaimed on short notice (two minutes on AWS, 30 seconds on GCP).
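Back-of-the-envelope savings for a small node pool. The on-demand rate (~$0.096/hr for m5.large in us-east-1) and the 70% spot discount are illustrative; check current pricing for your region and instance type:

```shell
# Monthly cost of a 10-node m5.large pool, on-demand vs spot (illustrative prices)
awk 'BEGIN {
  nodes = 10; hourly = 0.096; hours = 730; discount = 0.70
  on_demand = nodes * hourly * hours
  spot      = on_demand * (1 - discount)
  printf "on-demand: $%.2f/mo  spot: $%.2f/mo  saved: $%.2f/mo\n",
         on_demand, spot, on_demand - spot
}'
```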
Safe for Spot¶
- Stateless web apps behind a Deployment with multiple replicas
- Batch jobs that can be retried
- Dev/staging environments
- CI/CD runners
Not Safe for Spot¶
- Single-replica databases
- Stateful workloads without graceful shutdown
- Long-running jobs that can't checkpoint
Implementation¶
```yaml
# Prefer spot nodes. The label key depends on how nodes are provisioned:
# Karpenter sets karpenter.sh/capacity-type (shown here); EKS managed node
# groups set eks.amazonaws.com/capacityType: SPOT.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
# Tolerate spot taints (match the taint key your node pool actually applies)
tolerations:
  - key: "kubernetes.io/spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
Pod Disruption Budgets¶
Protect your app from too many spot evictions at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grokdevops-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: grokdevops
```
Cluster Autoscaler & Karpenter¶
Cluster Autoscaler¶
Scales node groups up/down based on pending pods:
```shell
# Key settings
--scale-down-utilization-threshold=0.5   # Scale down when <50% utilized
--scale-down-delay-after-add=10m         # Wait 10m after adding a node
--scale-down-unneeded-time=10m           # Node must be underutilized for 10m
```
Karpenter (AWS)¶
More flexible and faster than Cluster Autoscaler. Provisions individual nodes based on pod requirements:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6i.large"]
      expireAfter: 720h   # Replace nodes after 30 days
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "100"
    memory: 200Gi
```
Storage Optimization¶
```shell
# Find unbound PVCs (orphaned)
kubectl get pvc -A | grep -v Bound

# Find PVCs not mounted by any pod
kubectl get pvc -A -o json \
  | jq -r '.items[] | select(.status.phase=="Bound") | .metadata.namespace + "/" + .metadata.name' \
  | while read -r pvc; do
      ns=$(echo "$pvc" | cut -d/ -f1)
      name=$(echo "$pvc" | cut -d/ -f2)
      # any() checks every volume of every pod, not just the last one
      if ! kubectl get pods -n "$ns" -o json \
          | jq -e --arg name "$name" \
              'any(.items[].spec.volumes[]?; .persistentVolumeClaim.claimName == $name)' \
          > /dev/null; then
        echo "Unused PVC: $pvc"
      fi
    done
```
Storage Class Tiers¶
| Tier | Use case | Cost |
|---|---|---|
| SSD (gp3/pd-ssd) | Databases, low-latency | $$$ |
| HDD (sc1/pd-standard) | Logs, backups, archives | $ |
| Object storage (S3/GCS) | Long-term storage, backups | $ |
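Moving a volume down a tier is often the simplest win. A sketch of what a gp2-to-gp3 migration saves on a single volume, using approximate us-east-1 list prices ($0.10 vs $0.08 per GB-month; verify against current EBS pricing):

```shell
# 500 GB volume: gp2 at ~$0.10/GB-month vs gp3 at ~$0.08/GB-month (illustrative)
awk 'BEGIN {
  size = 500
  printf "gp2: $%.2f/mo  gp3: $%.2f/mo (%.0f%% cheaper)\n",
         size * 0.10, size * 0.08, (0.10 - 0.08) / 0.10 * 100
}'
```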
Network Cost Optimization¶
Cross-AZ Traffic¶
In AWS, cross-AZ traffic costs $0.01/GB each way. For high-traffic services, this adds up.
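A rough sense of scale, using AWS's published cross-AZ rate and a hypothetical traffic volume:

```shell
# 10 TB/month of cross-AZ traffic at $0.01/GB in each direction ($0.02/GB total)
awk 'BEGIN {
  gb   = 10 * 1000    # 10 TB/month, using 1 TB = 1000 GB as billed
  rate = 0.01 * 2     # charged on both sides of the transfer
  printf "cross-AZ bill: $%.2f/mo\n", gb * rate
}'
```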
```yaml
# Topology-aware routing (prefer same-zone endpoints)
apiVersion: v1
kind: Service
metadata:
  name: grokdevops
  annotations:
    service.kubernetes.io/topology-mode: Auto
```
NAT Gateway Costs¶
NAT gateways charge per GB processed. Reduce by:

1. Pulling images from ECR in the same region (via a VPC endpoint)
2. Using VPC endpoints for AWS services
3. Caching DNS lookups
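To see why this matters, here is a rough monthly bill for a single NAT gateway, using approximate us-east-1 rates ($0.045/GB processed plus $0.045/hr; both illustrative, check current pricing):

```shell
# 5 TB/month through one NAT gateway (illustrative us-east-1 rates)
awk 'BEGIN {
  data   = 5000 * 0.045   # per-GB processing charge
  hourly = 730 * 0.045    # per-hour charge for the gateway itself
  printf "NAT gateway: $%.2f/mo (data: $%.2f, hourly: $%.2f)\n",
         data + hourly, data, hourly
}'
```

Traffic that flows through a VPC endpoint instead skips the per-GB processing charge entirely.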
Cost Visibility Tools¶
| Tool | Type | Best for |
|---|---|---|
| Kubecost | Open source | Per-namespace/pod cost allocation |
| OpenCost | CNCF project | Standardized cost monitoring |
| CloudHealth | SaaS | Multi-cloud, enterprise |
| AWS Cost Explorer | Native | AWS-specific analysis |
| GCP Billing | Native | GCP-specific analysis |
Kubecost Quick Setup¶
```shell
helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace \
  --set kubecostToken="your-token"
```
Quick Wins Checklist¶
- Right-size requests — Use VPA recommendations (biggest impact)
- Delete unused resources — Orphaned PVCs, idle load balancers, stopped pods
- Use spot for non-critical workloads — 60-90% savings
- Set resource quotas per namespace — Prevent teams from over-provisioning
- Enable cluster autoscaler — Scale down unused nodes
- Review log retention — Do you need 90 days of debug logs?
- Use topology-aware routing — Reduce cross-AZ traffic
- Schedule dev environments — Scale to 0 after hours
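The last item is worth quantifying. A sketch of how much compute time a business-hours-only dev environment avoids paying for, assuming a 12-hour weekday schedule:

```shell
# Dev cluster running 24/7 vs business hours only (12h/day, 5 days/week)
awk 'BEGIN {
  always_on = 730                   # hours per month
  weekdays  = 12 * 5 * (730 / 168)  # ~4.35 weeks per month
  printf "hours cut: %.0f%%\n", (always_on - weekdays) / always_on * 100
}'
```

Roughly two thirds of the hours disappear, and with them (for autoscaled or scaled-to-zero environments) roughly two thirds of the compute bill.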
Wiki Navigation¶
Prerequisites¶
- Cloud Ops Basics (Topic Pack, L1)
Next Steps¶
- FinOps Drills (Drill, L2)
- Skillcheck: FinOps (Assessment, L2)
Related Content¶
- FinOps Drills (Drill, L2) — FinOps
- Finops Flashcards (CLI) (flashcard_deck, L1) — FinOps
- Interview: Cost Spike Investigation (Scenario, L2) — FinOps
- Skillcheck: FinOps (Assessment, L2) — FinOps
Pages that link here¶
- Anti-Primer: Finops
- Certification Prep: AWS SAA — Solutions Architect Associate
- Cloud Ops Basics
- FinOps & Cost Optimization
- FinOps & Cost Optimization Drills
- FinOps / Cost Optimization - Skill Check
- Level 7: SRE & Cloud Operations
- Master Curriculum: 40 Weeks
- Scenario: Cloud Cost Spike Investigation
- Track: Cloud & FinOps