Ops Archaeology: The Pods That Won't Schedule
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L3
Estimated time: 40 min
Domains: Kubernetes, Resource Quotas, HPA, Scheduling
Artifact 1: CLI Output
$ kubectl get pods -n marketplace
NAME                               READY   STATUS    RESTARTS   AGE
catalog-service-7b9f8d6c45-g2h4j   1/1     Running   0          2h
catalog-service-7b9f8d6c45-k5m7n   1/1     Running   0          2h
catalog-service-7b9f8d6c45-p8q1r   1/1     Running   0          2h

$ kubectl get hpa -n marketplace
NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
catalog-service   Deployment/catalog-service   87%/70%   3         10        3          45d
$ kubectl get events -n marketplace --sort-by='.lastTimestamp' | tail -5
LAST SEEN   TYPE      REASON              OBJECT                                    MESSAGE
2m          Warning   FailedCreate        replicaset/catalog-service-7b9f8d6c45     Error creating: pods "catalog-service-7b9f8d6c45-x9y2z" is forbidden: exceeded quota: compute-quota, requested: cpu=500m,memory=512Mi, used: cpu=1500m,memory=1536Mi, limited: cpu=2,memory=2Gi
45s         Warning   FailedCreate        replicaset/catalog-service-7b9f8d6c45     (combined from similar events): Error creating: pods "catalog-service-7b9f8d6c45-a3b4c" is forbidden: exceeded quota
2m          Normal    SuccessfulRescale   horizontalpodautoscaler/catalog-service   New size: 6; reason: cpu resource utilization (percentage of request) above target
$ kubectl top nodes
NAME                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-pool-a-4g7h2   1240m        31%    4891Mi          30%
node-pool-a-8k3m5   1380m        34%    5102Mi          31%
node-pool-a-q5r9t   890m         22%    3847Mi          24%
node-pool-b-d2f4g   1102m        27%    4293Mi          26%
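The TARGETS column and the rescale event both come from the HPA's replica calculation. As a rough cross-check, the sketch below applies the formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), to the snapshot values above. The function name `desired_replicas` is hypothetical, and the sketch deliberately ignores the controller's tolerance band, pod readiness handling, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_util: float, target_util: float) -> int:
    """Simplified HPA formula: ceil(current_replicas * current_util / target_util).

    Assumption: no tolerance band, no missing-metric or readiness adjustments.
    """
    return math.ceil(current_replicas * current_util / target_util)


# Snapshot values from `kubectl get hpa`: 3 replicas at 87% against a 70% target.
snapshot = desired_replicas(current_replicas=3, current_util=87, target_util=70)
print(f"desired replicas from the snapshot: {snapshot}")
```

With the snapshot values this yields 4 rather than the 6 in the rescale event. The event is two minutes older than the snapshot, so the controller may have acted on a different utilization reading.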
Artifact 2: Metrics
# HPA metrics
kube_horizontalpodautoscaler_status_current_replicas{hpa="catalog-service",namespace="marketplace"} 3
kube_horizontalpodautoscaler_status_desired_replicas{hpa="catalog-service",namespace="marketplace"} 6
kube_horizontalpodautoscaler_spec_max_replicas{hpa="catalog-service",namespace="marketplace"} 10
# Resource quota usage
kube_resourcequota{namespace="marketplace",resource="cpu",type="used"} 1.5
kube_resourcequota{namespace="marketplace",resource="cpu",type="hard"} 2
kube_resourcequota{namespace="marketplace",resource="memory",type="used"} 1610612736
kube_resourcequota{namespace="marketplace",resource="memory",type="hard"} 2147483648
kube_resourcequota{namespace="marketplace",resource="pods",type="used"} 3
kube_resourcequota{namespace="marketplace",resource="pods",type="hard"} 20
# Cluster-wide capacity
kube_node_status_allocatable_cpu_cores 16
kube_node_status_allocatable_memory_bytes 68719476736
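One way to read these numbers together is to ask how much quota the HPA's desired and maximum replica counts would consume. The sketch below copies the values from the metrics above into a dictionary (it does not query Prometheus or a live cluster) and assumes every replica requests the same amount, so usage scales linearly with replica count. The helper names are hypothetical.

```python
# Values copied by hand from Artifact 2.
metrics = {
    "current_replicas": 3,
    "desired_replicas": 6,
    "max_replicas": 10,
    "cpu_used": 1.5,           # cores
    "cpu_hard": 2.0,           # cores
    "mem_used": 1610612736,    # bytes
    "mem_hard": 2147483648,    # bytes
}


def requests_needed(used: float, current: int, replicas: int) -> float:
    """Scale current usage linearly, assuming identical per-replica requests."""
    return used / current * replicas


def fits(used: float, hard: float, current: int, replicas: int) -> bool:
    """True if the projected requests stay within the hard quota."""
    return requests_needed(used, current, replicas) <= hard


current = metrics["current_replicas"]
for label in ("desired_replicas", "max_replicas"):
    n = metrics[label]
    cpu_ok = fits(metrics["cpu_used"], metrics["cpu_hard"], current, n)
    mem_ok = fits(metrics["mem_used"], metrics["mem_hard"], current, n)
    cpu = requests_needed(metrics["cpu_used"], current, n)
    print(f"{label}={n}: needs {cpu} CPU of {metrics['cpu_hard']}; cpu_ok={cpu_ok}, mem_ok={mem_ok}")
```

Under these assumptions, 6 replicas would need 3 CPU of requests against a hard limit of 2, while the nodes themselves report roughly 16 allocatable cores.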
Artifact 3: Infrastructure Code
# From: k8s/namespaces/marketplace.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: marketplace
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"
---
# From: helm/catalog-service/values.yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
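The two files above can be compared directly: dividing each quota dimension by the per-pod value gives the largest replica count the namespace could ever admit. The following is a minimal sketch, assuming `catalog-service` pods are the only consumers of `compute-quota`. The parser handles only the suffixes that appear in these artifacts (`m`, `Mi`, `Gi`), not the full Kubernetes quantity grammar, and all function names are hypothetical.

```python
from decimal import Decimal

# Only the suffixes used in the artifacts; an assumption, not the full grammar.
SUFFIXES = {"m": Decimal("0.001"), "Mi": Decimal(2**20), "Gi": Decimal(2**30)}


def parse_quantity(text: str) -> Decimal:
    """Convert strings such as '500m', '512Mi', or '2' to a plain number."""
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


# Copied from k8s/namespaces/marketplace.yaml
quota = {
    "requests.cpu": "2",
    "requests.memory": "2Gi",
    "limits.cpu": "4",
    "limits.memory": "4Gi",
}
pods_hard = 20

# Copied from helm/catalog-service/values.yaml
pod = {
    "requests.cpu": "500m",
    "requests.memory": "512Mi",
    "limits.cpu": "1",
    "limits.memory": "1Gi",
}


def max_replicas_under_quota(quota: dict, pod: dict, pod_count_limit: int) -> int:
    """Largest replica count that satisfies every quota dimension at once."""
    per_resource = [
        int(parse_quantity(quota[key]) // parse_quantity(pod[key])) for key in quota
    ]
    return min(per_resource + [pod_count_limit])


ceiling = max_replicas_under_quota(quota, pod, pods_hard)
print(f"quota admits at most {ceiling} replicas; the HPA allows up to 10")
```

Every compute dimension in the quota caps out at the same replica count here, well below `maxReplicas`, so the `pods: "20"` limit never comes into play.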
Artifact 4: Log Lines
[2024-12-18T15:44:02Z] hpa-controller | Event(v1.ObjectReference{Kind:"HorizontalPodAutoscaler", Namespace:"marketplace", Name:"catalog-service"}): type: 'Normal' reason: 'SuccessfulRescale' New size: 6; reason: cpu resource utilization (percentage of request) above target
[2024-12-18T15:44:05Z] replicaset-ctrl | Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"marketplace", Name:"catalog-service-7b9f8d6c45"}): type: 'Warning' reason: 'FailedCreate' Error creating: exceeded quota: compute-quota
[2024-12-18T15:30:12Z] cluster-autoscaler | scale_up: no unschedulable pods found
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?