Decision Tree: Service Returning 5xx Errors
Category: Incident Triage
Starting Question: "My service is returning 5xx errors — where do I start?"
Estimated traversal: 2-5 minutes
Domains: kubernetes, networking, observability, linux-performance
The Tree
My service is returning 5xx errors — where do I start?
│
├── What is the error code?
│ │
│ ├── 502 / 503 / 504 (Gateway errors — proxy/ingress/LB is failing)
│ │ │
│ │ ├── Is it ALL traffic or only specific paths?
│ │ │ │
│ │ │ ├── All traffic
│ │ │ │ │
│ │ │ │ ├── Check pod count: `kubectl get pods -l app=<svc>`
│ │ │ │ │ │
│ │ │ │ │ ├── 0 running pods → ✅ ACTION: Fix Deployment (scale up / fix CrashLoop)
│ │ │ │ │ │
│ │ │ │ │ └── Pods exist but not Ready
│ │ │ │ │ │
│ │ │ │ │ ├── Check readiness probe: `kubectl describe pod <pod>`
│ │ │ │ │ │ └── Probe failing → ✅ ACTION: Fix Readiness Probe or App Startup
│ │ │ │ │ │
│ │ │ │ │ └── Service selector mismatch?
│ │ │ │ │ `kubectl get endpoints <svc>`
│ │ │ │ │ └── Empty ENDPOINTS → ✅ ACTION: Fix Service Selector
│ │ │ │ │
│ │ │ │ └── Is this a 504 specifically? (timeout)
│ │ │ │ │
│ │ │ │ ├── Check upstream latency vs configured timeout
│ │ │ │ │ `kubectl get ingress <name> -o yaml | grep timeout`
│ │ │ │ │ └── Latency > timeout → ✅ ACTION: Increase Timeout or Fix Upstream Latency
│ │ │ │ │
│ │ │ │ └── Check downstream dependency (DB / cache)
│ │ │ │ `kubectl exec -it <pod> -- nc -zv db-service 5432` (curl speaks HTTP, so it gives confusing output against a database port; nc reports plain connect success/failure)
│ │ │ │ └── Dependency unreachable → go to dependency branch
│ │ │ │
│ │ │ └── Specific paths only
│ │ │ │
│ │ │ ├── Check ingress routing rules
│ │ │ │ `kubectl get ingress <name> -o yaml`
│ │ │ │ └── Path not defined or wrong serviceName → ✅ ACTION: Fix Ingress Rule
│ │ │ │
│ │ │ └── Check if path-specific service exists and has endpoints
│ │ │ `kubectl get svc,endpoints -n <namespace>`
│ │ │ └── Missing service → ✅ ACTION: Deploy Missing Service
│ │ │
│ │ └── Check ingress controller logs
│ │ `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100`
│ │ └── Upstream connection refused / reset → upstream pods are crashing
│ │ → See 500 branch below
│ │
│ └── 500 (Application error — the app itself is failing)
│ │
│ ├── Check application logs for exceptions
│ │ `kubectl logs <pod> --tail=200 | grep -i "error\|exception\|panic\|fatal"`
│ │ │
│ │ ├── Exception found with stack trace → ✅ ACTION: Fix Application Bug / Roll Back
│ │ │
│ │ ├── No logs at all → check if pod is restarting
│ │ │ `kubectl get pod <pod> -w`
│ │ │ └── Restart count rising → see pod-wont-start.md
│ │ │
│ │ └── "Connection refused" / timeout to dependency
│ │ │
│ │ ├── Is it DB? → Check DB health
│ │ │ `kubectl exec -it <pod> -- psql -h $DB_HOST -U $DB_USER -c '\l'`
│ │ │ └── Cannot connect → ✅ ACTION: Fix DB Connectivity / Connection Pool
│ │ │
│ │ └── Is it Redis/cache? → Check cache health
│ │ `kubectl exec -it <pod> -- redis-cli -h $REDIS_HOST ping`
│ │ └── No PONG → ✅ ACTION: Fix Cache Connectivity
│ │
│ ├── Was there a recent deployment?
│ │ `kubectl rollout history deployment/<name>`
│ │ │
│ │ ├── Yes, deployed in last 60 min → ✅ ACTION: Roll Back Deployment
│ │ │
│ │ └── No recent deploy → check config / secret changes
│ │ `kubectl describe deployment <name> | grep -A5 "Environment"`
│ │ └── Wrong env var / missing secret → ✅ ACTION: Fix ConfigMap / Secret
│ │
│ └── Are errors correlated with high load?
│ Check CPU/memory: `kubectl top pod -l app=<svc>`
│ │
│ ├── CPU throttling or memory near limit
│ │ └── ✅ ACTION: Scale Horizontally or Increase Limits
│ │
│ └── Resources look fine
│ └── ⚠️ ESCALATION: Engage App Owner for deeper profiling
Node Details
Check 1: Identify the error code
Command: kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200 | grep -E " 5[0-9]{2} " or check your APM/metrics dashboard for status code breakdown.
What you're looking for: Whether errors are 502/503/504 (proxy layer, upstream unreachable) vs 500/5xx (application layer, upstream reachable but returning error).
Common pitfall: A 503 from nginx-ingress means nginx cannot reach any backend pod — not that the app returned 503. Look at the ingress controller logs, not just the app logs.
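The ingress-log grep above can be extended into a quick tally of which 5xx codes dominate. A minimal sketch, assuming the default nginx-ingress access-log format where the status code is the 9th whitespace-separated field; the `printf` lines are stand-ins for real `kubectl logs` output:

```shell
# Tally 5xx responses per status code. In practice, pipe
# `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=2000`
# into tally_5xx instead of the sample lines below.
tally_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ {count[$9]++} END {for (c in count) print c, count[c]}' | sort
}

printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 502 0 "-" "curl"' \
  '10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 502 0 "-" "curl"' \
  '10.0.0.1 - - [01/Jan/2025:00:00:02 +0000] "GET /api HTTP/1.1" 504 1 "-" "curl"' \
  '10.0.0.1 - - [01/Jan/2025:00:00:03 +0000] "GET /ok HTTP/1.1" 200 5 "-" "curl"' \
  | tally_5xx
# 502 2
# 504 1
```

A tally dominated by 504 points at the timeout branch; mostly 502/503 points at missing or unready endpoints.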
Check 2: Pod count and readiness
Command: kubectl get pods -l app=<service-name> -n <namespace> then kubectl get endpoints <service-name> -n <namespace>
What you're looking for: ENDPOINTS should list at least one ip:port. An empty endpoints list means the service has no ready pods to route to.
Common pitfall: Pods may show Running but not Ready (1/1 vs 0/1 in the READY column). Check kubectl describe pod for the readiness probe failure message.
Check 3: Service selector mismatch
Command: kubectl get svc <name> -o jsonpath='{.spec.selector}' then kubectl get pods -l <key>=<value>
What you're looking for: The selector labels on the Service must match labels on the pods. A single typo means zero endpoints.
Common pitfall: A deployment rollout can change pod labels (e.g., version label) without updating the service selector.
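To compare the two sides quickly, the selector map can be flattened into the `-l` syntax. A minimal sketch: recent kubectl prints the selector as JSON (older versions print `map[app:web]`, which needs a different transform), the sed handles only flat selectors with simple values, and `web`/`frontend` are hypothetical label values:

```shell
# Turn a Service selector map, as printed by
#   kubectl get svc <name> -o jsonpath='{.spec.selector}'
# into the label-selector string that `kubectl get pods -l` expects.
to_label_selector() {
  sed -e 's/[{}" ]//g' -e 's/:/=/g'
}

echo '{"app":"web","tier":"frontend"}' | to_label_selector
# app=web,tier=frontend
```

Then `kubectl get pods -l "$(kubectl get svc <name> -o jsonpath='{.spec.selector}' | to_label_selector)"` should return exactly the pods the Service routes to; an empty result reproduces the empty-endpoints symptom.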
Check 4: Application logs
Command: kubectl logs <pod-name> --previous --tail=300 (use --previous if the current container just started after a crash)
What you're looking for: Stack traces, "connection refused", "timeout", "out of memory", "nil pointer dereference", or authentication errors to dependencies.
Common pitfall: Multi-container pods — without -c <container>, kubectl either errors or follows the kubectl.kubernetes.io/default-container annotation, which may point at a sidecar rather than your app.
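A plain grep drops the stack-trace lines that follow the matching line; a trailing-context flag keeps them. A minimal sketch on hypothetical log lines (pipe real `kubectl logs` output into the function instead):

```shell
# Keep 5 lines of context after each match so stack traces survive the filter.
error_with_trace() {
  grep -i -A 5 -E 'error|exception|panic|fatal'
}

printf '%s\n' \
  'INFO  request served in 12ms' \
  'ERROR db query failed: timeout' \
  '  at store.Query (store.go:42)' \
  '  at api.Handler (handler.go:17)' \
  | error_with_trace
# ERROR db query failed: timeout
#   at store.Query (store.go:42)
#   at api.Handler (handler.go:17)
```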
Check 5: Recent deployment
Command: kubectl rollout history deployment/<name> and kubectl get rs -l app=<name> (the ReplicaSet AGE column shows when each revision was created; rollout history output has no timestamps)
What you're looking for: Timestamp of most recent rollout vs onset of 5xx errors.
Common pitfall: Canary or partial rollouts — some pods may run old code and some new. Compare logs from old vs new pods by checking their creationTimestamp.
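Correlating rollout time with error onset is easier with a quick timestamp diff. A minimal sketch assuming GNU date (BSD/macOS date uses different flags); the timestamps are hypothetical:

```shell
# Minutes between two RFC3339 timestamps, e.g. a pod's
# .metadata.creationTimestamp vs the 5xx onset time from your dashboard.
minutes_between() {
  echo $(( ($(date -d "$2" +%s) - $(date -d "$1" +%s)) / 60 ))
}

minutes_between '2025-01-01T10:00:00Z' '2025-01-01T10:45:00Z'
# 45
```

List pod ages for the comparison with kubectl get pods -l app=<svc> --sort-by=.metadata.creationTimestamp.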
Check 6: DB/cache connectivity
Command: kubectl exec -it <app-pod> -- sh -c 'nc -zv $DB_HOST 5432; echo exit=$?'
What you're looking for: exit=0 means network connectivity is fine; non-zero means the DB is unreachable from this pod.
Common pitfall: NetworkPolicy rules may be blocking the new pod's IP if it was recently rescheduled to a different node. See networkpolicy_block.md.
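When the app image ships neither nc nor curl, bash's built-in /dev/tcp redirection gives the same reachability signal. A minimal sketch; db-service and redis-service are hypothetical hostnames, and the checks are meant to run inside the pod via kubectl exec:

```shell
# TCP reachability probe without nc, using bash's /dev/tcp redirection.
# Requires bash and coreutils `timeout` in the container image.
check_tcp() {  # usage: check_tcp <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

check_tcp db-service 5432
check_tcp redis-service 6379
```

Run it in-pod with `kubectl exec -it <pod> -- bash`, then paste the function and the checks.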
Terminal Actions
Action: Fix Service Selector
Do:
1. kubectl get svc <name> -o yaml > /tmp/svc-backup.yaml (save backup)
2. Identify correct pod labels: kubectl get pods -l app=<name> --show-labels
3. kubectl patch svc <name> -p '{"spec":{"selector":{"app":"<correct-value>"}}}' (merge patch only updates the keys you list; remove a stale selector key by setting it to null, or use kubectl edit)
4. Confirm: kubectl get endpoints <name>
Verify: Endpoints list shows pod IPs. Run a test request: curl -v http://<service-ip>/healthz
Runbook: ingress_404.md
Action: Roll Back Deployment
Do:
1. kubectl rollout undo deployment/<name>
2. Monitor: kubectl rollout status deployment/<name>
3. Confirm error rate drops in metrics dashboard
Verify: kubectl get pods -l app=<name> shows new pods all Ready; 5xx rate returns to baseline.
Runbook: helm_upgrade_failed.md
Action: Fix Readiness Probe or App Startup
Do:
1. kubectl describe pod <pod> — find the readiness probe definition and last failure message
2. Test the probe manually: kubectl exec -it <pod> -- curl -v http://localhost:<port><path>
3. If probe path is wrong, edit deployment: kubectl edit deployment <name>
4. If app is slow to start, increase initialDelaySeconds in the probe spec
Verify: Pod transitions to Ready (1/1). kubectl get pods -l app=<name>
Runbook: readiness_probe_failed.md
Action: Increase Timeout or Fix Upstream Latency
Do:
1. Identify current timeout: kubectl get ingress <name> -o yaml | grep -i timeout
2. Annotate ingress: kubectl annotate ingress <name> nginx.ingress.kubernetes.io/proxy-read-timeout="120" --overwrite (value is in seconds; --overwrite is required if the annotation already exists)
3. Investigate why upstream is slow — check latency metrics and slow query logs
Verify: 504 rate drops. Upstream p99 latency is within timeout.
Runbook: hpa_not_scaling.md
Action: Fix DB Connectivity / Connection Pool
Do:
1. Verify DB pod/service is up: kubectl get pods -n <db-namespace>
2. Check connection pool exhaustion in app logs: look for "too many connections" or "pool timeout"
3. If pool exhausted: restart app pods to drain idle connections, then tune max_pool_size
4. If DB is down: see PostgreSQL runbook
Verify: App logs show successful DB connections. 500 error rate drops.
Action: Scale Horizontally or Increase Limits
Do:
1. Check current HPA: kubectl get hpa
2. If no HPA, manual scale: kubectl scale deployment <name> --replicas=<N>
3. If CPU-throttled, increase limits: kubectl set resources deployment <name> --limits=cpu=500m,memory=512Mi
Verify: kubectl top pods -l app=<name> shows CPU/memory below 80% of limits.
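If step 2 keeps recurring, a declarative autoscaler avoids repeated manual scaling. A minimal sketch assuming the autoscaling/v2 API and a hypothetical deployment named web; the replica counts and CPU threshold are illustrative, not recommendations:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Apply with kubectl apply -f hpa.yaml; the CPU metric requires metrics-server in the cluster.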
Escalation: Engage App Owner for Deeper Profiling
When: 500 errors persist after rollback, resources are healthy, and logs show no clear root cause.
Who: Application development team / service owner
Include in page: Error rate (req/s), onset time, pod names currently affected, last 50 lines of app logs, recent deployment history
Edge Cases
- Intermittent 5xx on only 1 of N pods: One pod may have a corrupted local state or a different code version. Cordon and delete the specific pod rather than rolling back the whole deployment.
- 5xx only during peak traffic: Connection pool or file descriptor exhaustion. Check `ulimit -n` inside the pod and the DB's `max_connections`.
- 5xx after cert rotation: If your app validates TLS to a dependency, a new cert may not be trusted. Check cert_renewal_failed.md.
- 5xx from a Helm-deployed service after upgrade: Helm may have partially applied — check `helm status <release>` for failed hooks. See helm_upgrade_failed.md.
- 503 from service mesh (Istio): Envoy sidecar returning 503 is distinct from the app returning 503. Check `istioctl proxy-status` and `kubectl logs <pod> -c istio-proxy`.
Cross-References
- Topic Packs: k8s-services-and-ingress, k8s-ops (Probes), observability-deep-dive, networking
- Runbooks: ingress_404.md, helm_upgrade_failed.md, crashloopbackoff.md, networkpolicy_block.md, cert_renewal_failed.md