Troubleshooting Flows Cheat Sheet¶
Quick decision trees for common DevOps problems. Follow the arrows.
Remember: Exit codes tell you how a container died. The critical ones: 0 = clean exit, 1 = application error, 137 = OOMKilled or SIGKILL (128+9), 139 = segfault (128+11), 143 = SIGTERM (128+15). Any exit code > 128 means the process was killed by a signal: subtract 128 to get the signal number.
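The subtract-128 rule can be sketched as a small shell helper (`decode_exit` is a hypothetical name, not a real tool):

```shell
# Decode a container exit code into a human-readable cause.
# Codes > 128 mean "killed by signal (code - 128)".
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by SIG$(kill -l $((code - 128)))"
  else
    echo "exited with code $code"
  fi
}

decode_exit 137   # killed by SIGKILL (OOMKilled)
decode_exit 139   # killed by SIGSEGV (segfault)
decode_exit 143   # killed by SIGTERM (graceful shutdown)
```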
Pod Not Starting¶
Pod status?
├── Pending
│ ├── "Insufficient cpu/memory" → Scale nodes or reduce requests
│ ├── "No nodes match" → Check nodeSelector, tolerations, affinity
│ ├── "PVC not bound" → Check StorageClass, PV availability
│ └── "Unschedulable" → Check taints, cordoned nodes
├── ContainerCreating
│ ├── Stuck mounting → Check PVC, CSI driver, node storage
│ └── Stuck pulling → Check image name, registry auth, network
├── CrashLoopBackOff
│ ├── Exit code 1 → App error: kubectl logs --previous
│ ├── Exit code 137 → OOMKilled: increase memory limit
│ ├── Exit code 139 → Segfault: debug binary/dependencies
│ └── Immediate crash → Wrong CMD, missing config/env
├── ImagePullBackOff
│ ├── 401 Unauthorized → Check imagePullSecrets
│ ├── Not found → Check image:tag spelling
│ └── Timeout → Check network/registry availability
└── Running but not Ready
└── Readiness probe failing → Check probe path, port, app health
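For the last branch: a readiness probe only passes when its path and port match what the app actually serves. A minimal sketch (path, port, and timings here are assumptions — adjust to your app):

```yaml
readinessProbe:
  httpGet:
    path: /healthz      # must be an endpoint the app really serves
    port: 8080          # must match the port the container listens on
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3   # pod leaves the Service endpoints after 3 consecutive failures
```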
Service Not Reachable¶
kubectl get endpoints <svc>
├── Endpoints empty?
│ ├── Yes → Labels don't match: compare svc selector with pod labels
│ └── No → Endpoints exist, continue...
│
kubectl exec debug-pod -- curl <svc>:<port>
├── Connection refused?
│ └── App not listening on that port: kubectl exec <pod> -- ss -tlnp
├── Connection timed out?
│ ├── NetworkPolicy blocking? → kubectl get netpol -n <ns>
│ └── Wrong port in Service spec? → Check targetPort
├── DNS not resolving?
│ ├── CoreDNS running? → kubectl get pods -n kube-system -l k8s-app=kube-dns
│ └── Check resolv.conf → kubectl exec <pod> -- cat /etc/resolv.conf
└── Works internally but not externally?
├── Ingress misconfigured → Check rules, host, path
├── LoadBalancer pending → Check cloud provider, annotations
└── NodePort → Check security groups allow the port
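Two of the branches above (empty endpoints, wrong targetPort) come down to the Service spec itself. A minimal sketch with hypothetical names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc          # hypothetical
spec:
  selector:
    app: my-app         # must match the pods' labels exactly, or endpoints stay empty
  ports:
  - port: 80            # port clients dial via the Service
    targetPort: 8080    # port the container actually listens on (verify with ss -tlnp)
```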
High Error Rate¶
Where are errors?
├── Application level (5xx)
│ ├── Check app logs: kubectl logs <pod> --tail=100
│ ├── Recent deployment? → kubectl rollout history deploy/<name>
│ ├── Resource pressure? → kubectl top pods
│ └── Dependency down? → Check downstream services
├── Infrastructure level
│ ├── Node not ready → kubectl describe node <name>
│ ├── Disk pressure → df -h on node
│ ├── Memory pressure → free -h on node
│ └── Network issues → Check CNI, kube-proxy
└── Ingress/Load Balancer
├── 502/503 → Backend pods not ready or restarting
├── 504 → Timeout: upstream too slow, increase timeout annotation
└── 429 → Rate limiting: check ingress rate-limit annotations
Debug clue: When facing 502/503 errors at an Ingress/Load Balancer, always check kubectl get pods -w. If pods are rapidly restarting (CrashLoopBackOff), the LB sees healthy → unhealthy → healthy cycling, which produces intermittent 502s. The fix is in the application, not the load balancer.
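The Ingress/LB branch of the tree can be condensed into a tiny lookup (`lb_hint` is a hypothetical helper, not a kubectl command):

```shell
# Map a load-balancer status code to the most likely cause from the tree above.
lb_hint() {
  case "$1" in
    502|503) echo "backend pods not ready or restarting" ;;
    504)     echo "timeout: upstream too slow, raise the timeout annotation" ;;
    429)     echo "rate limiting: check ingress rate-limit annotations" ;;
    *)       echo "check application logs" ;;
  esac
}

lb_hint 503   # backend pods not ready or restarting
```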
High Latency¶
Where is latency?
├── At the app?
│ ├── CPU throttled? → Check limits vs usage (kubectl top)
│ ├── GC pauses? → Check app-specific metrics
│ ├── DB queries slow? → Check connection pool, query patterns
│ └── External API slow? → Check dependency latency metrics
├── At the network?
│ ├── Cross-AZ traffic? → Check pod/service topology
│ ├── DNS slow? → Time DNS lookups, check ndots setting
│ ├── Service mesh overhead? → Check Envoy sidecar metrics
│ └── kube-proxy / iptables? → Check rule count, IPVS mode
└── At the infrastructure?
├── Disk I/O saturated? → iostat, check PV performance
├── Node overcommitted? → kubectl describe node (allocated %)
└── etcd slow? → Check etcd latency metrics
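For the CPU-throttling branch: the kernel reports nr_periods and nr_throttled in the container's cpu.stat (cgroup v2: cpu.stat in the pod's cgroup directory). The throttled fraction can be computed like this — a sketch with the two counters supplied by hand:

```shell
# Percentage of scheduler periods in which the cgroup was CPU-throttled,
# given nr_periods and nr_throttled read from the container's cpu.stat.
throttle_pct() {
  periods=$1
  throttled=$2
  echo $(( 100 * throttled / periods ))
}

throttle_pct 1000 250   # 25 (%); a sustained high value suggests the CPU limit is too low
```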
Deployment Rollback Decision¶
Deployment failing?
├── New pods not starting?
│ ├── Image issue → Fix image, redeploy
│ └── Config issue → Fix configmap/secret, redeploy
├── New pods starting but unhealthy?
│ ├── Canary/progressive? → Route back to stable
│ └── Rolling update? → kubectl rollout undo deploy/<name>
├── Partial rollout stuck?
│ ├── maxUnavailable reached → Old pods can't terminate
│ └── PDB blocking → Check PodDisruptionBudget
└── Helm release failed?
├── helm rollback <release> <revision>
├── Check: helm history <release>
└── Stuck pending? → kubectl delete secret -l status=pending-install
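On the "PDB blocking" branch: a PodDisruptionBudget whose minAvailable equals the replica count leaves zero eviction budget, which stalls rolling updates and node drains. A hypothetical example of the failure mode:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb      # hypothetical
spec:
  minAvailable: 1       # with replicas: 1 this blocks every voluntary eviction
  selector:
    matchLabels:
      app: my-app
```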
Certificate Issues¶
TLS error?
├── Certificate expired
│ ├── Check expiry: openssl s_client -connect host:443 -servername host </dev/null | openssl x509 -noout -dates
│ ├── cert-manager? → Check Certificate and CertificateRequest resources
│ └── Manual? → Renew and redeploy secret
├── Chain incomplete
│ ├── Missing intermediate cert → Add to tls.crt in correct order
│ └── openssl verify -CAfile ca.pem -untrusted intermediate.pem cert.pem
├── Name mismatch
│ ├── CN/SAN doesn't match hostname → Reissue with correct names
│ └── Check: openssl x509 -in cert.pem -noout -text | grep -A1 "Subject Alternative Name"
└── Connection reset/timeout
└── TLS not enabled on backend → Check service port name (must include "https")
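The openssl checks above can be rehearsed end-to-end against a throwaway self-signed cert (filenames here are arbitrary):

```shell
# Generate a short-lived self-signed cert to practice the checks on.
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem \
  -days 30 -subj "/CN=example.local" 2>/dev/null

# Validity window (the same check works on a cert fetched with s_client).
openssl x509 -in cert.pem -noout -dates

# Exit 0 only if the cert is still valid 7 days (604800 s) from now.
openssl x509 -in cert.pem -noout -checkend 604800 && echo "cert ok for 7 days"
```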
Cost Spike Investigation¶
Bill increased?
├── Which service?
│ ├── Compute → Check instance count, types, idle resources
│ ├── Data transfer → NAT Gateway, cross-AZ, egress
│ ├── Storage → EBS snapshots, S3 growth, orphaned volumes
│ └── Database → Oversized RDS, read replicas, backup retention
├── When did it start?
│ ├── Correlate with deployments/scaling events
│ └── Check auto-scaling groups — did max increase?
└── Quick wins
├── Delete orphaned resources (unattached EBS, unused EIPs)
├── Right-size underutilized instances
├── Schedule dev/test shutdowns
└── Enable S3 lifecycle policies
Gotcha: Kubernetes DNS uses ndots: 5 by default, which means any name with fewer than 5 dots gets the search domains appended first. A lookup for api.example.com (2 dots) generates 4-5 DNS queries before the final FQDN query succeeds. For latency-sensitive services, add dnsConfig.options: [{name: ndots, value: "2"}] to the pod spec, or use FQDNs with a trailing dot (api.example.com.).
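The ndots fix, as a pod-spec fragment:

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"   # names with 2+ dots (e.g. api.example.com) now resolve as absolute first
```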
Node Issues¶
Node NotReady?
├── Check conditions: kubectl describe node <name>
│ ├── MemoryPressure → Evict pods, add memory
│ ├── DiskPressure → Clean up images/logs
│ ├── PIDPressure → Find process leak
│ └── NetworkUnavailable → Check CNI plugin
├── kubelet not running?
│ ├── systemctl status kubelet
│ ├── journalctl -u kubelet --since "5 min ago"
│ └── Check certificates: ls -la /var/lib/kubelet/pki/
├── Node unreachable?
│ ├── Instance running? (cloud console)
│ ├── Network: security groups, NACLs
│ └── BMC/IPMI: check hardware health
└── Recovery
├── Drain: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
├── Fix issue on node
└── Uncordon: kubectl uncordon <node>