K8s Concept Chain — Street-Level Ops¶
Troubleshooting each layer of the concept chain in production.
Quick Diagnosis: Which Layer Is Broken?¶
Run these in order when something isn't working:
# 1. Are pods running?
kubectl get pods -l app=NAME
# 2. Why isn't a pod running?
kubectl describe pod NAME
kubectl logs NAME --previous # logs from the last crash
# 3. Does the Service have endpoints?
kubectl get endpoints NAME
# 4. Can you reach the Service from inside the cluster?
kubectl run debug --rm -it --image=nicolaka/netshoot -- curl http://NAME:PORT
# 5. Is Ingress configured and the controller healthy?
kubectl get ingress
kubectl get pods -n ingress-nginx
# 6. Are resources the bottleneck?
kubectl top pods
kubectl describe node NODE_NAME | grep -A 5 "Allocated resources"
Layer-by-Layer Troubleshooting¶
Pod won't start¶
| Symptom | Likely cause | Check |
|---|---|---|
| ImagePullBackOff | Wrong image name/tag, no pull secret | kubectl describe pod — Events section |
| CrashLoopBackOff | App crashes on startup | kubectl logs NAME --previous |
| Pending | No node with enough resources | kubectl describe pod — look for "Insufficient cpu/memory" |
| ContainerCreating stuck | Volume mount or secret missing | kubectl describe pod — Events section |
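A minimal pod spec sketch showing the fields behind the first and third rows (all names — regcred, myapp, the registry host — are placeholders): a pull secret for a private registry, and resource requests the scheduler must be able to satisfy somewhere.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  imagePullSecrets:
    - name: regcred          # missing or wrong secret → ImagePullBackOff
  containers:
    - name: myapp
      image: registry.example.com/myapp:1.2.3   # wrong tag → ImagePullBackOff
      resources:
        requests:
          cpu: 250m          # no node with this much free → Pending
          memory: 256Mi
```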
Service has no endpoints¶
# Check that selector labels match pod labels
kubectl get svc NAME -o yaml | grep -A 3 selector
kubectl get pods --show-labels | grep app=NAME
If labels don't match, the Service selector is wrong. Fix the Service or the pod template labels.
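A sketch of a correctly matched pair (names are placeholders). Note that the Service selector must match the labels on the pod *template*, not the labels on the Deployment object itself.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # must equal the template labels below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp      # ← what the Service selector matches
    spec:
      containers:
        - name: myapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
```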
Ingress returns 404 or 503¶
# Verify Ingress Controller is running
kubectl get pods -n ingress-nginx
# Check Ingress rules
kubectl describe ingress NAME
# Test the backend Service directly
kubectl run debug --rm -it --image=nicolaka/netshoot -- \
curl http://BACKEND_SERVICE:PORT
Common causes: the Ingress backend points at the wrong Service name or port, or
the backend Service itself has no endpoints.
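A sketch of the networking.k8s.io/v1 backend fields to double-check (host, paths, and names are placeholders). In the v1 API these live under service.name and service.port; the older serviceName/servicePort spelling belongs to the deprecated v1beta1 API.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp      # must be an existing Service in the same namespace
                port:
                  number: 80     # must match a port on that Service
```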
HPA not scaling¶
# Check HPA status and current metrics
kubectl get hpa NAME
# If TARGETS shows <unknown>
# → metrics-server is not installed or not working
kubectl get pods -n kube-system | grep metrics-server
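A sketch of an autoscaling/v2 HPA (names and thresholds are placeholders). CPU utilization is computed as a percentage of the pods' CPU *requests*, so pods without requests also produce `<unknown>` targets even when metrics-server is healthy.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU request
```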
Pods stuck Pending (node capacity)¶
# Check node allocatable vs allocated
kubectl describe nodes | grep -A 8 "Allocated resources"
# Check for Karpenter/Cluster Autoscaler logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50
Operational Patterns¶
Config rollout without downtime¶
ConfigMap updates don't restart pods automatically. You need to trigger a rollout:
# Option 1: restart the Deployment
kubectl rollout restart deployment/NAME
# Option 2: use a hash annotation (Helm does this automatically)
# In the Deployment template metadata:
# checksum/config: {{ include (print .Template.BasePath "/configmap.yaml") . | sha256sum }}
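Option 2 in context, as a sketch of a Helm Deployment template (release and file names are placeholders): the rendered ConfigMap is hashed into a pod-template annotation, so any ConfigMap change changes the hash, which changes the pod template and triggers a rolling update.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  template:
    metadata:
      annotations:
        # hash changes whenever the rendered configmap.yaml changes
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
    spec:
      containers:
        - name: app
          envFrom:
            - configMapRef:
                name: {{ .Release.Name }}-config
```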
Verifying Secret rotation¶
# Check that the Secret object actually changed (resourceVersion bumps on
# every update — it's an opaque version counter, not a timestamp)
kubectl get secret NAME -o jsonpath='{.metadata.resourceVersion}'
# Verify the pod picked up the new value (env var injection requires restart)
kubectl exec POD_NAME -- env | grep SECRET_KEY
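The restart requirement depends on how the Secret is consumed. A sketch of both modes (names are placeholders): env vars are snapshotted at container start and need a rollout restart after rotation, while volume-mounted Secret files are refreshed in place by the kubelet after a sync delay — unless mounted via subPath, which never updates.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:1.0
      env:
        - name: API_KEY          # snapshot — restart required after rotation
          valueFrom:
            secretKeyRef:
              name: myapp-secret
              key: api-key
      volumeMounts:
        - name: creds            # file contents update in place
          mountPath: /etc/creds
          readOnly: true
  volumes:
    - name: creds
      secret:
        secretName: myapp-secret
```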
Debugging resource pressure¶
# Which pods use the most resources?
kubectl top pods --sort-by=memory
# Which node is under pressure?
kubectl top nodes
kubectl describe node NODE | grep -A 3 Conditions
# Find pods without resource limits
kubectl get pods -o json | jq '.items[] |
select(.spec.containers[].resources.limits == null) |
.metadata.name'
The "It Works Locally but Not in K8s" Checklist¶
- Image: is the image tag correct and pushed to the registry?
- Port: does containerPort match what the app listens on?
- Health check: do livenessProbe/readinessProbe paths return 200?
- Config: are env vars and ConfigMaps mounted correctly?
- DNS: can the pod resolve other services? (nslookup SVC.NS.svc.cluster.local)
- Resources: does the pod have enough CPU/memory (not being throttled/OOM-killed)?
- Permissions: does the ServiceAccount have the right RBAC?
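A sketch tying the port and health-check items together (paths, ports, and names are placeholders): containerPort should be the port the process actually binds, and the probe endpoints must return 2xx once the app is up, or Kubernetes will withhold traffic (readiness) or restart the container (liveness).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:1.0
      ports:
        - containerPort: 8080     # must match the app's listen port
      readinessProbe:             # gate traffic until the app answers
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:              # restart the container if it stops answering
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
```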