K8s Concept Chain — Street-Level Ops

Troubleshooting each layer of the concept chain in production.


Quick Diagnosis: Which Layer Is Broken?

Run these in order when something isn't working:

# 1. Are pods running?
kubectl get pods -l app=NAME

# 2. Why isn't a pod running?
kubectl describe pod NAME
kubectl logs NAME --previous    # logs from the last crash

# 3. Does the Service have endpoints?
kubectl get endpoints NAME

# 4. Can you reach the Service from inside the cluster?
kubectl run debug --rm -it --image=nicolaka/netshoot -- curl http://NAME:PORT

# 5. Is Ingress configured and the controller healthy?
kubectl get ingress
kubectl get pods -n ingress-nginx

# 6. Are resources the bottleneck?
kubectl top pods
kubectl describe node NODE_NAME | grep -A 5 "Allocated resources"

Layer-by-Layer Troubleshooting

Pod won't start

| Symptom | Likely cause | Check |
| --- | --- | --- |
| ImagePullBackOff | Wrong image name/tag, no pull secret | kubectl describe pod — Events section |
| CrashLoopBackOff | App crashes on startup | kubectl logs NAME --previous |
| Pending | No node with enough resources | kubectl describe pod — look for "Insufficient cpu/memory" |
| ContainerCreating stuck | Volume mount or secret missing | kubectl describe pod — Events section |
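To see all of these states at once across a namespace, a small jq one-liner (an assumption — not part of the guide, and it requires jq on your machine) prints each pod next to its container waiting reason:

```shell
# One line per container: waiting reason (or "Running") followed by pod name
kubectl get pods -o json | jq -r '.items[] |
  "\(.status.containerStatuses[]?.state.waiting.reason // "Running")\t\(.metadata.name)"' |
  sort
```

Sorting groups all the ImagePullBackOff / CrashLoopBackOff pods together, which is handy when a bad rollout breaks many pods at once.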

Service has no endpoints

# Check that selector labels match pod labels
kubectl get svc NAME -o yaml | grep -A 3 selector
kubectl get pods --show-labels | grep app=NAME

If labels don't match, the Service selector is wrong. Fix the Service or the pod template labels.
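One way to fix the Service side without re-applying the whole manifest is a merge patch — a sketch where NAME and the app=NAME label are placeholders for your own names:

```shell
# Print both sides of the match first
kubectl get svc NAME -o jsonpath='{.spec.selector}'; echo
kubectl get pods -l app=NAME -o jsonpath='{.items[0].metadata.labels}'; echo

# Patch the selector in place; endpoints repopulate immediately
kubectl patch svc NAME --type=merge -p '{"spec":{"selector":{"app":"NAME"}}}'
```

Patching the pod template labels instead triggers a full rollout, so fixing the Service is usually the faster option.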

Ingress returns 404 or 503

# Verify Ingress Controller is running
kubectl get pods -n ingress-nginx

# Check Ingress rules
kubectl describe ingress NAME

# Test the backend Service directly
kubectl run debug --rm -it --image=nicolaka/netshoot -- \
  curl http://BACKEND_SERVICE:PORT

Common causes: a wrong backend service name or port in the Ingress spec (spec.rules[].http.paths[].backend.service in networking.k8s.io/v1), or the backend Service itself has no endpoints.
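To check the first cause quickly, you can dump every path-to-backend mapping from the Ingress and compare it against your Services — a sketch that assumes jq is installed and a networking.k8s.io/v1 Ingress:

```shell
# One line per rule path: "<path> -> <service>:<port>"
kubectl get ingress NAME -o json | jq -r '.spec.rules[].http.paths[] |
  "\(.path // "/") -> \(.backend.service.name):\(.backend.service.port.number)"'

# Then confirm each backend actually has endpoints
kubectl get endpoints BACKEND_SERVICE
```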

HPA not scaling

# Check HPA status and current metrics
kubectl get hpa NAME

# If TARGETS shows <unknown>
# → metrics-server is not installed or not working
kubectl get pods -n kube-system | grep metrics-server
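If it turns out metrics-server is simply not installed, the upstream manifest is the usual fix — a sketch; check the metrics-server README for cluster-specific flags (local clusters like kind/minikube often need --kubelet-insecure-tls):

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n kube-system rollout status deployment/metrics-server

# Confirm the metrics API is actually serving before re-checking the HPA
kubectl get --raw /apis/metrics.k8s.io/v1beta1
```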

Pods stuck Pending (node capacity)

# Check node allocatable vs allocated
kubectl describe nodes | grep -A 8 "Allocated resources"

# Check for Karpenter/Cluster Autoscaler logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50
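The scheduler records its own explanation on each Pending pod, so you can pull every reason in one pass instead of describing pods one by one — a sketch that assumes jq:

```shell
# "<pod>: <scheduler message>" for every Pending pod
kubectl get pods --field-selector=status.phase=Pending -o json | jq -r '.items[] |
  "\(.metadata.name): \(.status.conditions[]? | select(.type=="PodScheduled") | .message // "no message")"'
```

Messages like "0/3 nodes are available: 3 Insufficient cpu" tell you directly whether to raise node capacity or lower pod requests.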

Operational Patterns

Config rollout without downtime

ConfigMap updates don't restart pods automatically. You need to trigger a rollout:

# Option 1: restart the Deployment
kubectl rollout restart deployment/NAME

# Option 2: use a hash annotation (Helm does this automatically)
# In the Deployment template metadata:
#   checksum/config: {{ include (print .Template.BasePath "/configmap.yaml") . | sha256sum }}
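Outside Helm, the same checksum trick can be scripted by hand — a sketch, where NAME stands in for your ConfigMap and Deployment names:

```shell
# Hash the live ConfigMap and stamp it into the pod template; whenever the
# hash changes, the template differs, which triggers a rolling update
HASH=$(kubectl get configmap NAME -o yaml | sha256sum | cut -d' ' -f1)
kubectl patch deployment NAME --type=merge -p \
  "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$HASH\"}}}}}"
kubectl rollout status deployment/NAME
```

Unlike rollout restart, this is idempotent: re-running it with an unchanged ConfigMap produces the same hash and no new rollout.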

Verifying Secret rotation

# Check the Secret's current version (resourceVersion changes on every update,
# but it is an opaque counter, not a timestamp)
kubectl get secret NAME -o jsonpath='{.metadata.resourceVersion}'

# Verify the pod picked up the new value (env var injection requires restart)
kubectl exec POD_NAME -- env | grep SECRET_KEY
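To confirm end to end, compare the value stored in the Secret with what the running pod sees — a sketch where SECRET_KEY and the deployment name are placeholders:

```shell
# Decode the stored value (Secret data is base64-encoded at rest)
kubectl get secret NAME -o jsonpath='{.data.SECRET_KEY}' | base64 -d; echo
kubectl exec POD_NAME -- printenv SECRET_KEY

# A mismatch means the pod predates the rotation; env vars are only read
# at container start, so force a restart
kubectl rollout restart deployment/NAME
```

Note that Secrets mounted as volumes (not via subPath) are refreshed by the kubelet without a restart, typically within about a minute — only env-var injection is frozen at startup.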

Debugging resource pressure

# Which pods use the most resources?
kubectl top pods --sort-by=memory

# Which node is under pressure?
kubectl top nodes
kubectl describe node NODE | grep -A 3 Conditions

# Find pods where any container lacks resource limits
# (any() keeps each pod name from printing once per container)
kubectl get pods -o json | jq -r '.items[] |
  select(any(.spec.containers[]; .resources.limits == null)) |
  .metadata.name'
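When memory is the bottleneck, it also helps to find pods whose containers were last killed by the OOM killer — a sketch that assumes jq:

```shell
# Pods with a container whose last termination reason was OOMKilled;
# these usually need a higher memory limit (or a leak fixed)
kubectl get pods -o json | jq -r '.items[] |
  select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled")) |
  .metadata.name'
```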

The "It Works Locally but Not in K8s" Checklist

  1. Image: is the image tag correct and pushed to the registry?
  2. Port: does containerPort match what the app listens on?
  3. Health check: do livenessProbe/readinessProbe paths return 200?
  4. Config: are env vars and ConfigMaps mounted correctly?
  5. DNS: can the pod resolve other services? (nslookup SVC.NS.svc.cluster.local)
  6. Resources: does the pod have enough CPU/memory (not being throttled/OOM-killed)?
  7. Permissions: does the ServiceAccount have the right RBAC?