K8s Concept Chain — Street-Level Ops

Troubleshooting each layer of the concept chain in production.


Quick Diagnosis: Which Layer Is Broken?

Run these in order when something isn't working:

# 1. Are pods running?
kubectl get pods -l app=NAME

# 2. Why isn't a pod running?
kubectl describe pod NAME
kubectl logs NAME --previous    # logs from the last crash

# 3. Does the Service have endpoints?
kubectl get endpoints NAME

# 4. Can you reach the Service from inside the cluster?
kubectl run debug --rm -it --image=nicolaka/netshoot -- curl http://NAME:PORT

# 5. Is Ingress configured and the controller healthy?
kubectl get ingress
kubectl get pods -n ingress-nginx

# 6. Are resources the bottleneck?
kubectl top pods
kubectl describe node NODE_NAME | grep -A 5 "Allocated resources"

Layer-by-Layer Troubleshooting

Pod won't start

| Symptom | Likely cause | Check |
| --- | --- | --- |
| ImagePullBackOff | Wrong image name/tag, no pull secret | kubectl describe pod — Events section |
| CrashLoopBackOff | App crashes on startup | kubectl logs NAME --previous |
| Pending | No node with enough resources | kubectl describe pod — look for "Insufficient cpu/memory" |
| ContainerCreating stuck | Volume mount or secret missing | kubectl describe pod — Events section |
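To see all of these states at once across a namespace, a small jq one-liner (an assumption — not part of the guide, and it requires jq on your machine) prints each pod next to its container waiting reason:

```shell
# One line per container: waiting reason (or "Running") followed by pod name
kubectl get pods -o json | jq -r '.items[] |
  "\(.status.containerStatuses[]?.state.waiting.reason // "Running")\t\(.metadata.name)"' |
  sort
```

Sorting groups all the ImagePullBackOff / CrashLoopBackOff pods together, which is handy when a bad rollout breaks many pods at once.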

Service has no endpoints

# Check that selector labels match pod labels
kubectl get svc NAME -o yaml | grep -A 3 selector
kubectl get pods --show-labels | grep app=NAME

If labels don't match, the Service selector is wrong. Fix the Service or the pod template labels.
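One way to fix the Service side without re-applying the whole manifest is a merge patch — a sketch where NAME and the app=NAME label are placeholders for your own names:

```shell
# Print both sides of the match first
kubectl get svc NAME -o jsonpath='{.spec.selector}'; echo
kubectl get pods -l app=NAME -o jsonpath='{.items[0].metadata.labels}'; echo

# Patch the selector in place; endpoints repopulate immediately
kubectl patch svc NAME --type=merge -p '{"spec":{"selector":{"app":"NAME"}}}'
```

Patching the pod template labels instead triggers a full rollout, so fixing the Service is usually the faster option.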

Ingress returns 404 or 503

# Verify Ingress Controller is running
kubectl get pods -n ingress-nginx

# Check Ingress rules
kubectl describe ingress NAME

# Test the backend Service directly
kubectl run debug --rm -it --image=nicolaka/netshoot -- \
  curl http://BACKEND_SERVICE:PORT

Common causes: a wrong backend service name or port in the Ingress spec (spec.rules[].http.paths[].backend.service in networking.k8s.io/v1), or the backend Service itself has no endpoints.
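To check the first cause quickly, you can dump every path-to-backend mapping from the Ingress and compare it against your Services — a sketch that assumes jq is installed and a networking.k8s.io/v1 Ingress:

```shell
# One line per rule path: "<path> -> <service>:<port>"
kubectl get ingress NAME -o json | jq -r '.spec.rules[].http.paths[] |
  "\(.path // "/") -> \(.backend.service.name):\(.backend.service.port.number)"'

# Then confirm each backend actually has endpoints
kubectl get endpoints BACKEND_SERVICE
```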

HPA not scaling

# Check HPA status and current metrics
kubectl get hpa NAME

# If TARGETS shows <unknown>
# → metrics-server is not installed or not working
kubectl get pods -n kube-system | grep metrics-server
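If it turns out metrics-server is simply not installed, the upstream manifest is the usual fix — a sketch; check the metrics-server README for cluster-specific flags (local clusters like kind/minikube often need --kubelet-insecure-tls):

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n kube-system rollout status deployment/metrics-server

# Confirm the metrics API is actually serving before re-checking the HPA
kubectl get --raw /apis/metrics.k8s.io/v1beta1
```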

Pods stuck Pending (node capacity)

# Check node allocatable vs allocated
kubectl describe nodes | grep -A 8 "Allocated resources"

# Check for Karpenter/Cluster Autoscaler logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50
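The scheduler records its own explanation on each Pending pod, so you can pull every reason in one pass instead of describing pods one by one — a sketch that assumes jq:

```shell
# "<pod>: <scheduler message>" for every Pending pod
kubectl get pods --field-selector=status.phase=Pending -o json | jq -r '.items[] |
  "\(.metadata.name): \(.status.conditions[]? | select(.type=="PodScheduled") | .message // "no message")"'
```

Messages like "0/3 nodes are available: 3 Insufficient cpu" tell you directly whether to raise node capacity or lower pod requests.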

Operational Patterns

Config rollout without downtime

ConfigMap updates don't restart pods automatically. You need to trigger a rollout:

# Option 1: restart the Deployment
kubectl rollout restart deployment/NAME

# Option 2: use a hash annotation (Helm does this automatically)
# In the Deployment template metadata:
#   checksum/config: {{ include (print .Template.BasePath "/configmap.yaml") . | sha256sum }}
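Outside Helm, the same checksum trick can be scripted by hand — a sketch, where NAME stands in for your ConfigMap and Deployment names:

```shell
# Hash the live ConfigMap and stamp it into the pod template; whenever the
# hash changes, the template differs, which triggers a rolling update
HASH=$(kubectl get configmap NAME -o yaml | sha256sum | cut -d' ' -f1)
kubectl patch deployment NAME --type=merge -p \
  "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$HASH\"}}}}}"
kubectl rollout status deployment/NAME
```

Unlike rollout restart, this is idempotent: re-running it with an unchanged ConfigMap produces the same hash and no new rollout.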

Verifying Secret rotation

# Check the Secret's current version (resourceVersion changes on every update,
# but it is an opaque counter, not a timestamp)
kubectl get secret NAME -o jsonpath='{.metadata.resourceVersion}'

# Verify the pod picked up the new value (env var injection requires restart)
kubectl exec POD_NAME -- env | grep SECRET_KEY
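To confirm end to end, compare the value stored in the Secret with what the running pod sees — a sketch where SECRET_KEY and the deployment name are placeholders:

```shell
# Decode the stored value (Secret data is base64-encoded at rest)
kubectl get secret NAME -o jsonpath='{.data.SECRET_KEY}' | base64 -d; echo
kubectl exec POD_NAME -- printenv SECRET_KEY

# A mismatch means the pod predates the rotation; env vars are only read
# at container start, so force a restart
kubectl rollout restart deployment/NAME
```

Note that Secrets mounted as volumes (not via subPath) are refreshed by the kubelet without a restart, typically within about a minute — only env-var injection is frozen at startup.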

Debugging resource pressure

# Which pods use the most resources?
kubectl top pods --sort-by=memory

# Which node is under pressure?
kubectl top nodes
kubectl describe node NODE | grep -A 3 Conditions

# Find pods where any container lacks resource limits
# (any() keeps each pod name from printing once per container)
kubectl get pods -o json | jq -r '.items[] |
  select(any(.spec.containers[]; .resources.limits == null)) |
  .metadata.name'
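When memory is the bottleneck, it also helps to find pods whose containers were last killed by the OOM killer — a sketch that assumes jq:

```shell
# Pods with a container whose last termination reason was OOMKilled;
# these usually need a higher memory limit (or a leak fixed)
kubectl get pods -o json | jq -r '.items[] |
  select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled")) |
  .metadata.name'
```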

The "It Works Locally but Not in K8s" Checklist

  1. Image: is the image tag correct and pushed to the registry?
  2. Port: does containerPort match what the app listens on?
  3. Health check: do livenessProbe/readinessProbe paths return 200?
  4. Config: are env vars and ConfigMaps mounted correctly?
  5. DNS: can the pod resolve other services? (nslookup SVC.NS.svc.cluster.local)
  6. Resources: does the pod have enough CPU/memory (not being throttled/OOM-killed)?
  7. Permissions: does the ServiceAccount have the right RBAC?