Ops Archaeology: The Pods That Won't Schedule

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L3
Estimated time: 40 min
Domains: Kubernetes, Resource Quotas, HPA, Scheduling

Artifact 1: CLI Output

$ kubectl get pods -n marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
catalog-service-7b9f8d6c45-g2h4j       1/1     Running   0          2h
catalog-service-7b9f8d6c45-k5m7n       1/1     Running   0          2h
catalog-service-7b9f8d6c45-p8q1r       1/1     Running   0          2h

$ kubectl get hpa -n marketplace
NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
catalog-service   Deployment/catalog-service   87%/70%         3         10        3          45d

$ kubectl get events -n marketplace --sort-by='.lastTimestamp' | tail -5
LAST SEEN   TYPE      REASON              OBJECT                                     MESSAGE
2m          Warning   FailedCreate        replicaset/catalog-service-7b9f8d6c45      Error creating: pods "catalog-service-7b9f8d6c45-x9y2z" is forbidden: exceeded quota: compute-quota, requested: cpu=500m,memory=512Mi, used: cpu=1500m,memory=1536Mi, limited: cpu=2,memory=2Gi
45s         Warning   FailedCreate        replicaset/catalog-service-7b9f8d6c45      (combined from similar events): Error creating: pods "catalog-service-7b9f8d6c45-a3b4c" is forbidden: exceeded quota
2m          Normal    SuccessfulRescale   horizontalpodautoscaler/catalog-service    New size: 6; reason: cpu resource utilization (percentage of request) above target

$ kubectl top nodes
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-pool-a-4g7h2             1240m        31%    4891Mi          30%
node-pool-a-8k3m5             1380m        34%    5102Mi          31%
node-pool-a-q5r9t             890m         22%    3847Mi          24%
node-pool-b-d2f4g             1102m        27%    4293Mi          26%
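
The TARGETS column in the HPA output above (87%/70%) can be turned into a replica count with the formula the Kubernetes documentation gives for the autoscaler: ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default) and clamped to the min/max bounds. The sketch below is a simplified model under those assumptions, not the controller's code: the function name `desired_replicas` is hypothetical, and it ignores unready pods, missing metrics, and stabilization windows.

```python
import math

def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count per the documented HPA formula, clamped to the min/max bounds."""
    ratio = current_util / target_util
    # Within tolerance of the target, the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Snapshot from `kubectl get hpa`: 3 replicas at 87% against a 70% target.
print(desired_replicas(3, 87, 70, 3, 10))  # prints 4
```

The snapshot yields 4, while the SuccessfulRescale event reports 6. The artifacts do not record the utilization the controller actually saw when it made that decision, so treat the difference as an open question rather than a contradiction. Either way, the desired count is above the 3 pods currently running.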

Artifact 2: Metrics

# HPA metrics
kube_horizontalpodautoscaler_status_current_replicas{hpa="catalog-service",namespace="marketplace"} 3
kube_horizontalpodautoscaler_status_desired_replicas{hpa="catalog-service",namespace="marketplace"} 6
kube_horizontalpodautoscaler_spec_max_replicas{hpa="catalog-service",namespace="marketplace"} 10

# Resource quota usage
kube_resourcequota{namespace="marketplace",resource="cpu",type="used"} 1.5
kube_resourcequota{namespace="marketplace",resource="cpu",type="hard"} 2
kube_resourcequota{namespace="marketplace",resource="memory",type="used"} 1610612736
kube_resourcequota{namespace="marketplace",resource="memory",type="hard"} 2147483648
kube_resourcequota{namespace="marketplace",resource="pods",type="used"} 3
kube_resourcequota{namespace="marketplace",resource="pods",type="hard"} 20

# Cluster-wide capacity
kube_node_status_allocatable_cpu_cores 16
kube_node_status_allocatable_memory_bytes 68719476736
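
The quota samples above are easier to read as headroom than as raw numbers, especially the memory values in bytes. Here is a minimal sketch that does the subtraction; the dictionary is typed in by hand from the metrics, and the `headroom` helper is a hypothetical name used only for illustration.

```python
# Values copied from the kube_resourcequota samples above.
quota = {
    "cpu":    {"used": 1.5,        "hard": 2},
    "memory": {"used": 1610612736, "hard": 2147483648},
    "pods":   {"used": 3,          "hard": 20},
}

def headroom(quota):
    """Return remaining capacity and percent used for each quota resource."""
    report = {}
    for resource, v in quota.items():
        report[resource] = {
            "remaining": v["hard"] - v["used"],
            "pct_used": round(100 * v["used"] / v["hard"], 1),
        }
    return report

report = headroom(quota)
for resource, r in report.items():
    print(resource, r)

# Memory is reported in bytes; convert the remainder to MiB for readability.
print(report["memory"]["remaining"] // 2**20, "MiB")
```

By these numbers the namespace has 0.5 CPU cores and 512 MiB of requests left, with 17 pod slots unused. Compare that with the cluster-wide allocatable figures and with the gap between current and desired replicas.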

Artifact 3: Infrastructure Code

# From: k8s/namespaces/marketplace.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: marketplace
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"
---
# From: helm/catalog-service/values.yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
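
One way to cross-check the two YAML documents against each other is to ask how many replicas the quota can hold at the chart's per-pod requests and limits. The sketch below assumes catalog-service is the only workload in the namespace (the pod listing shows nothing else) and that each pod has a single container. The quantities are converted by hand to millicores and MiB, and `replica_ceiling` is a hypothetical helper.

```python
# Hand-converted from the two YAML documents above: CPU in millicores, memory in MiB.
quota_hard = {"requests.cpu": 2000, "requests.memory": 2048,
              "limits.cpu": 4000, "limits.memory": 4096, "pods": 20}
per_pod = {"requests.cpu": 500, "requests.memory": 512,
           "limits.cpu": 1000, "limits.memory": 1024, "pods": 1}
max_replicas = 10  # autoscaling.maxReplicas from values.yaml

def replica_ceiling(quota_hard, per_pod):
    """Largest replica count that stays within every quota dimension."""
    per_resource = {k: quota_hard[k] // per_pod[k] for k in quota_hard}
    return min(per_resource.values()), per_resource

ceiling, per_resource = replica_ceiling(quota_hard, per_pod)
print(per_resource)
print("quota ceiling:", ceiling, "HPA maxReplicas:", max_replicas)
```

Under those assumptions every compute dimension of the quota tops out at 4 replicas, while the pod count alone would allow 20 and the HPA is configured to go as high as 10.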

Artifact 4: Log Lines

[2024-12-18T15:44:02Z] hpa-controller      | Event(v1.ObjectReference{Kind:"HorizontalPodAutoscaler", Namespace:"marketplace", Name:"catalog-service"}): type: 'Normal' reason: 'SuccessfulRescale' New size: 6; reason: cpu resource utilization (percentage of request) above target
[2024-12-18T15:44:05Z] replicaset-ctrl     | Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"marketplace", Name:"catalog-service-7b9f8d6c45"}): type: 'Warning' reason: 'FailedCreate' Error creating: exceeded quota: compute-quota
[2024-12-18T15:30:12Z] cluster-autoscaler  | scale_up: no unschedulable pods found

Your Mission

Reconstruct: What does this system do? What are its components and purpose?
Diagnose: What is currently broken or degraded, and why?
Propose: What would you do to fix it? What would you check first?