FinOps & Cost Optimization Drills¶
Remember: The FinOps cycle: Inform (visibility — who spends what) -> Optimize (right-size, reserved instances, spot) -> Operate (governance, budgets, alerts). You cannot optimize what you cannot measure. Start with tagging and cost allocation before buying reserved instances.
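Tagging in Kubernetes usually means labeling namespaces so that cost tooling (Kubecost, OpenCost, cloud billing exports) can group spend. A sketch — the label keys and values below are illustrative, not a required schema:

```shell
# Label namespaces for cost allocation; keys/values here are examples only
kubectl label namespace production cost-center=engineering-platform --overwrite
kubectl label namespace team-alpha cost-center=team-alpha env=dev --overwrite

# Audit coverage: show the cost-center label (blank = untagged) per namespace
kubectl get ns -L cost-center
```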
Gotcha: Kubernetes resource requests determine scheduling and cost, not limits. A pod requesting 4 CPU but using 0.5 CPU wastes 3.5 cores of cluster capacity. Right-sizing requests (not limits) is where the real savings are. Use VPA recommendations, or compare metrics such as `container_cpu_usage_seconds_total` vs `kube_pod_container_resource_requests`, to find waste.
Drill 1: Identify Over-Provisioned Pods¶
Difficulty: Easy
Q: Write a kubectl command to find pods requesting more than 1 CPU or 2Gi memory in the production namespace.
Answer
# Containers requesting more than 1 CPU or 2Gi memory
kubectl get pods -n production -o json | jq -r '
  def cpu_m: if test("m$") then rtrimstr("m") | tonumber else tonumber * 1000 end;
  def mem_mi: if test("Gi$") then rtrimstr("Gi") | tonumber * 1024
              elif test("Mi$") then rtrimstr("Mi") | tonumber
              else 0 end;
  .items[] as $pod | $pod.spec.containers[] |
  select(((.resources.requests.cpu // "0") | cpu_m) > 1000
      or ((.resources.requests.memory // "0") | mem_mi) > 2048) |
  "\($pod.metadata.name)/\(.name): cpu=\(.resources.requests.cpu // "-") memory=\(.resources.requests.memory // "-")"'
# Simpler: use kubectl-resource-capacity plugin
kubectl resource-capacity -n production --sort cpu.request --pods
Drill 2: Set Up VPA Recommendations¶
Difficulty: Easy
Q: Create a VPA in recommendation-only mode for the api-server Deployment.
Answer
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommendation-only: never evicts or restarts pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
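Once the VPA has observed some real traffic, its recommendations can be read back. A sketch — the exact output shape depends on the VPA version installed:

```shell
# Human-readable recommendation (target, lower/upper bounds per container)
kubectl describe vpa api-server-vpa -n production

# Or just the container recommendations as JSON
kubectl get vpa api-server-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations}'
```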
Drill 3: ResourceQuota¶
Difficulty: Easy
Q: Create a ResourceQuota for team-alpha namespace limiting total requests to 10 CPU / 20Gi memory and max 30 pods.
Answer
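A minimal manifest matching the numbers in the question (the object name is arbitrary):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"      # total CPU requests across the namespace
    requests.memory: 20Gi   # total memory requests
    pods: "30"              # max pod count
```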
Drill 4: LimitRange Defaults¶
Difficulty: Easy
Q: Create a LimitRange that sets default requests (100m CPU, 128Mi) and limits (500m CPU, 512Mi) for any container that doesn't specify them.
Answer
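A minimal manifest with the values from the question (namespace assumed; LimitRange is a namespaced object):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-alpha   # assumed namespace
spec:
  limits:
  - type: Container
    default:              # applied as limits when none are specified
      cpu: 500m
      memory: 512Mi
    defaultRequest:       # applied as requests when none are specified
      cpu: 100m
      memory: 128Mi
```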
- `default` = applied as limits if none specified
- `defaultRequest` = applied as requests if none specified
- `max`/`min` = hard bounds even if the user specifies values

Drill 5: Spot Instance Workloads¶
Difficulty: Medium
Q: Configure a Deployment to prefer spot instances but tolerate being scheduled on on-demand if spot is unavailable. Ensure pods spread across zones.
Answer
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      tolerations:
      - key: karpenter.sh/capacity-type
        operator: Equal
        value: spot
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway   # prefer spread, don't block scheduling
        labelSelector:
          matchLabels:
            app: worker
      terminationGracePeriodSeconds: 30
      containers:
      - name: worker
        image: worker:latest   # placeholder image
        # Handle SIGTERM gracefully for spot interruptions
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]
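To verify where the replicas actually landed, the pod-to-node and node-to-zone/capacity mappings can be listed (label names here are the Karpenter and Kubernetes defaults):

```shell
# Which node each worker pod is on
kubectl get pods -l app=worker -o wide

# Each node's zone and capacity type (spot vs on-demand)
kubectl get nodes -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type
```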
Drill 6: Cost PromQL Queries¶
Difficulty: Medium
Q: Write PromQL queries to find: (a) total CPU waste, (b) most over-provisioned namespaces, (c) idle pods.
Answer
# (a) Total CPU waste: requested minus actually used
sum(kube_pod_container_resource_requests{resource="cpu", unit="core"})
-
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
# (b) Namespace CPU overprovisioning ratio
sort_desc(
sum by(namespace)(kube_pod_container_resource_requests{resource="cpu"})
/
sum by(namespace)(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
)
# Ratio > 3 means requesting 3x what's actually used
# (c) Pods with near-zero CPU usage (< 1m) over the last hour
sum by(namespace, pod)(rate(container_cpu_usage_seconds_total{container!=""}[1h])) < 0.001
# (d) Memory waste by namespace
sum by(namespace)(kube_pod_container_resource_requests{resource="memory"})
-
sum by(namespace)(container_memory_working_set_bytes{container!=""})
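The waste numbers land harder as dollars. A rough conversion, assuming an illustrative blended price of $0.04 per vCPU-hour (substitute your own rate):

```promql
# (e) Approximate monthly cost of idle requested CPU
# 0.04 $/core-hour * 730 hours/month -- the price is an assumption
(
  sum(kube_pod_container_resource_requests{resource="cpu"})
  -
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
) * 0.04 * 730
```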
Drill 7: Karpenter Consolidation¶
Difficulty: Medium
Q: Configure Karpenter to consolidate underutilized nodes and use multiple instance types for cost optimization.
Answer
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default   # references an EC2NodeClass (name assumed)
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - m5.large
        - m5.xlarge
        - m5a.large
        - m5a.xlarge
        - m6i.large
        - m6i.xlarge
        - c5.large
        - c5.xlarge
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  disruption:
    consolidationPolicy: WhenUnderutilized
    # Note: in v1beta1, consolidateAfter may only be combined with
    # WhenEmpty, so it is omitted here
  limits:
    cpu: "100"
    memory: "400Gi"
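A Karpenter v1beta1 NodePool also needs a companion EC2NodeClass, referenced via `spec.template.spec.nodeClassRef`. A minimal sketch — the IAM role name and discovery tags are assumptions for your environment:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster        # assumed IAM role name
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster    # assumed discovery tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
```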
Drill 8: Scheduled Scaling¶
Difficulty: Medium
Q: Scale down dev/staging environments outside business hours (6pm-8am and weekends) to save costs.
Answer
# A pair of CronJobs: scale down at 6pm, back up at 8am.
# Friday's scale-down covers the weekend until Monday's scale-up.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
  namespace: dev
spec:
  schedule: "0 18 * * 1-5"  # 6pm Mon-Fri (controller's timezone)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              for deploy in $(kubectl get deploy -n dev -o name); do
                kubectl scale $deploy -n dev --replicas=0
              done
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-dev
  namespace: dev
spec:
  schedule: "0 8 * * 1-5"   # 8am Mon-Fri
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              # Caveat: this restores everything to 1 replica; store the
              # original counts (e.g. in an annotation) before scaling
              # down if they matter
              for deploy in $(kubectl get deploy -n dev -o name); do
                kubectl scale $deploy -n dev --replicas=1
              done
          restartPolicy: OnFailure
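The Jobs above assume a `scaler` ServiceAccount that is allowed to scale Deployments in the namespace. A minimal RBAC sketch:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: scaler
  namespace: dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: dev
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scaler-binding
  namespace: dev
subjects:
- kind: ServiceAccount
  name: scaler
  namespace: dev
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io
```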
Drill 9: PVC Cost Audit¶
Difficulty: Easy
Q: Find unbound or unused PVCs that are costing money.
Answer
# Unbound PVCs (provisioned but not attached)
kubectl get pvc -A --no-headers | grep -v Bound
# Bound PVCs not mounted by any pod
kubectl get pvc -A -o json | jq -r '
  .items[] |
  select(.status.phase == "Bound") |
  "\(.metadata.namespace)/\(.metadata.name) \(.spec.resources.requests.storage)"' | \
while read -r pvc size; do
  ns=$(echo "$pvc" | cut -d/ -f1)
  name=$(echo "$pvc" | cut -d/ -f2)
  used=$(kubectl get pods -n "$ns" -o json | jq -r \
    ".items[].spec.volumes[]? | select(.persistentVolumeClaim.claimName == \"$name\")" 2>/dev/null)
  if [ -z "$used" ]; then
    echo "UNUSED: $pvc ($size)"
  fi
done
# Total PV storage provisioned (assumes every PV is sized in Gi)
kubectl get pv -o json | jq '[.items[].spec.capacity.storage | rtrimstr("Gi") | tonumber] | add'
Drill 10: Cost Allocation Tags¶
Difficulty: Easy
Q: Write a Kyverno policy that requires all Deployments to have a cost-center label for showback/chargeback reporting.
Answer
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-center
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cost-center
    match:
      any:
      - resources:
          kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    exclude:
      any:
      - resources:
          namespaces: ["kube-system", "kube-public", "monitoring"]
    validate:
      message: "A 'cost-center' label is required for cost allocation. Example: cost-center=engineering-platform"
      pattern:
        metadata:
          labels:
            cost-center: "?*"
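A quick way to sanity-check the policy once it is installed (deployment names and namespace are illustrative):

```shell
# Should be rejected: no cost-center label on the Deployment
kubectl create deployment unlabeled --image=nginx -n default

# Should be admitted: label added before applying
kubectl create deployment labeled --image=nginx -n default \
  --dry-run=client -o yaml | \
  kubectl label --local -f - cost-center=engineering-platform -o yaml | \
  kubectl apply -f -
```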
Wiki Navigation¶
Prerequisites¶
- FinOps & Cost Optimization (Topic Pack, L2)
Related Content¶
- FinOps & Cost Optimization (Topic Pack, L2) — FinOps
- Finops Flashcards (CLI) (flashcard_deck, L1) — FinOps
- Interview: Cost Spike Investigation (Scenario, L2) — FinOps
- Skillcheck: FinOps (Assessment, L2) — FinOps