
Decision Tree: Service Returning 5xx Errors

Category: Incident Triage
Starting Question: "My service is returning 5xx errors - where do I start?"
Estimated traversal: 2-5 minutes
Domains: kubernetes, networking, observability, linux-performance


The Tree

```
My service is returning 5xx errors - where do I start?
└── What is the error code?
    ├── 502 / 503 / 504 (gateway errors - proxy/ingress/LB is failing)
    │   ├── Is it ALL traffic or only specific paths?
    │   │   ├── All traffic
    │   │   │   ├── Check pod count: `kubectl get pods -l app=<svc>`
    │   │   │   │   ├── 0 running pods → ACTION: Fix Deployment (scale up / fix CrashLoop)
    │   │   │   │   └── Pods exist but not Ready
    │   │   │   │       ├── Check readiness probe: `kubectl describe pod <pod>`
    │   │   │   │       │   └── Probe failing → ACTION: Fix Readiness Probe or App Startup
    │   │   │   │       └── Service selector mismatch? `kubectl get endpoints <svc>`
    │   │   │   │           └── Empty ENDPOINTS → ACTION: Fix Service Selector
    │   │   │   └── Is this a 504 specifically? (timeout)
    │   │   │       ├── Check upstream latency vs configured timeout
    │   │   │       │   `kubectl get ingress <name> -o yaml | grep timeout`
    │   │   │       │   └── Latency > timeout → ACTION: Increase Timeout or Fix Upstream Latency
    │   │   │       └── Check downstream dependency (DB / cache)
    │   │   │           `kubectl exec -it <pod> -- curl -v http://db-service:5432`
    │   │   │           └── Dependency unreachable → go to dependency branch
    │   │   └── Specific paths only
    │   │       ├── Check ingress routing rules: `kubectl get ingress <name> -o yaml`
    │   │       │   └── Path not defined or wrong serviceName → ACTION: Fix Ingress Rule
    │   │       └── Check if path-specific service exists and has endpoints
    │   │           `kubectl get svc,endpoints -n <namespace>`
    │   │           └── Missing service → ACTION: Deploy Missing Service
    │   └── Check ingress controller logs
    │       `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100`
    │       └── Upstream connection refused / reset → upstream pods are crashing
    │           → see 500 branch below
    └── 500 (application error - the app itself is failing)
        ├── Check application logs for exceptions
        │   `kubectl logs <pod> --tail=200 | grep -i "error\|exception\|panic\|fatal"`
        │   ├── Exception found with stack trace → ACTION: Fix Application Bug / Roll Back
        │   ├── No logs at all → check if pod is restarting: `kubectl get pod <pod> -w`
        │   │   └── Restart count rising → see pod-wont-start.md
        │   └── "Connection refused" / timeout to a dependency
        │       ├── Is it the DB? `kubectl exec -it <pod> -- psql -h $DB_HOST -U $DB_USER -c '\l'`
        │       │   └── Cannot connect → ACTION: Fix DB Connectivity / Connection Pool
        │       └── Is it Redis/cache? `kubectl exec -it <pod> -- redis-cli -h $REDIS_HOST ping`
        │           └── No PONG → ACTION: Fix Cache Connectivity
        ├── Was there a recent deployment? `kubectl rollout history deployment/<name>`
        │   ├── Yes, deployed in the last 60 min → ACTION: Roll Back Deployment
        │   └── No recent deploy → check config / secret changes
        │       `kubectl describe deployment <name> | grep -A5 "Environment"`
        │       └── Wrong env var / missing secret → ACTION: Fix ConfigMap / Secret
        └── Are errors correlated with high load?
            Check CPU/memory: `kubectl top pod -l app=<svc>`
            ├── CPU throttling or memory near limit → ACTION: Scale Horizontally or Increase Limits
            └── Resources look fine → ⚠️ ESCALATION: Engage App Owner for deeper profiling
```

Node Details

Check 1: Identify the error code

Command: `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200 | grep -E " 5[0-9]{2} "`, or check your APM/metrics dashboard for a status-code breakdown.
What you're looking for: whether errors are 502/503/504 (proxy layer, upstream unreachable) vs 500 (application layer, upstream reachable but returning an error).
Common pitfall: a 503 from nginx-ingress means nginx cannot reach any backend pod - not that the app returned 503. Look at the ingress controller logs, not just the app logs.
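
To split the error mix quickly from raw access logs, a small awk tally works. A sketch - the sample log lines below are hypothetical stand-ins for the output of the `kubectl logs` command above:

```shell
# Hypothetical ingress-nginx access-log lines; in practice, pipe in
# `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200`.
logs='10.0.0.1 - - [01/Jan/2024:12:00:00 +0000] "GET /api HTTP/1.1" 502 0
10.0.0.2 - - [01/Jan/2024:12:00:01 +0000] "GET /api HTTP/1.1" 500 0
10.0.0.3 - - [01/Jan/2024:12:00:02 +0000] "GET /api HTTP/1.1" 502 0'

# Count each 5xx status (field 9 in the combined log format): a pile of
# 502/503/504 points at the proxy layer, plain 500s at the app itself.
printf '%s\n' "$logs" |
  awk '$9 ~ /^5[0-9][0-9]$/ { count[$9]++ }
       END { for (c in count) print c, count[c] }' | sort
```

A skewed tally here decides which branch of the tree you take before you read a single stack trace.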

Check 2: Pod count and readiness

Command: `kubectl get pods -l app=<service-name> -n <namespace>`, then `kubectl get endpoints <service-name> -n <namespace>`.
What you're looking for: ENDPOINTS should list at least one `ip:port`. An empty endpoints list means the Service has no ready pods to route to.
Common pitfall: pods may show Running but not Ready (`1/1` vs `0/1` in the READY column). Check `kubectl describe pod` for the readiness probe failure message.
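
A quick way to surface the Running-but-not-Ready pitfall is to parse the READY column. A sketch - the `kubectl get pods` output below is a hypothetical sample:

```shell
# Hypothetical `kubectl get pods -l app=<service-name>` output.
pods='NAME        READY   STATUS    RESTARTS   AGE
web-abc123  1/1     Running   0          5m
web-def456  0/1     Running   3          5m'

# Print pods whose READY column (e.g. 0/1) shows fewer ready containers
# than total - these receive no Service traffic despite STATUS=Running.
printf '%s\n' "$pods" |
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
```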

Check 3: Service selector mismatch

Command: `kubectl get svc <name> -o jsonpath='{.spec.selector}'`, then `kubectl get pods -l <key>=<value>`.
What you're looking for: the selector labels on the Service must match labels on the pods. A single typo means zero endpoints.
Common pitfall: a deployment rollout can change pod labels (e.g., a version label) without updating the Service selector.
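
The comparison itself can be scripted. A sketch using hypothetical label sets written in simplified `key=value` form (kubectl prints them as JSON; the jsonpath commands above are the real source):

```shell
# Hypothetical, simplified label sets for a Service selector and one pod.
selector='app=payments,version=v2'
pod_labels='app=payments,version=v1'   # note the version drift

# Flag selector entries the pod does not carry - any hit means the
# Service matches zero pods and therefore has empty endpoints.
for kv in $(printf '%s' "$selector" | tr ',' ' '); do
  case ",$pod_labels," in
    (*",$kv,"*) : ;;                 # label present on the pod
    (*) echo "MISMATCH: $kv" ;;
  esac
done
```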

Check 4: Application logs

Command: `kubectl logs <pod-name> --previous --tail=300` (use `--previous` if the current container just started after a crash).
What you're looking for: stack traces, "connection refused", "timeout", "out of memory", "nil pointer dereference", or authentication errors to dependencies.
Common pitfall: multi-container pods - specify `-c <container>` or you get the first container's logs by default, which may be a sidecar.
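
The grep pattern from the tree in action, against a hypothetical log excerpt:

```shell
# Hypothetical application log lines.
log='INFO  request handled in 12ms
ERROR nil pointer dereference at handler.go:42
WARN  retrying upstream call
FATAL out of memory allocating buffer'

# Case-insensitive match for the usual crash keywords; on multi-container
# pods, add -c <container> to the kubectl logs command feeding this.
printf '%s\n' "$log" | grep -iE 'error|exception|panic|fatal'
```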

Check 5: Recent deployment

Command: `kubectl rollout history deployment/<name>` and `kubectl describe deployment <name> | grep "last-applied"`.
What you're looking for: the timestamp of the most recent rollout vs the onset of 5xx errors.
Common pitfall: canary or partial rollouts - some pods may run old code and some new. Compare logs from old vs new pods by checking their `creationTimestamp`.
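
Because `creationTimestamp` values are ISO-8601 in UTC, they sort correctly as plain strings, which makes the old-vs-new pod check a one-liner. A sketch with hypothetical timestamps:

```shell
# Hypothetical timestamps; pull real ones with
#   kubectl get pods -l app=<name> \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}'
rollout_time='2024-01-01T12:00:00Z'
pod_created='2024-01-01T11:07:30Z'

# ISO-8601 UTC strings compare lexicographically, so a string compare
# separates old-code pods (created before the rollout) from new ones.
if [[ "$pod_created" < "$rollout_time" ]]; then
  echo "old-code pod"
else
  echo "new-code pod"
fi
```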

Check 6: DB/cache connectivity

Command: `kubectl exec -it <app-pod> -- sh -c 'nc -zv $DB_HOST 5432; echo exit=$?'`
What you're looking for: `exit=0` means network connectivity is fine; non-zero means the DB is unreachable from this pod.
Common pitfall: NetworkPolicy rules may be blocking the new pod's IP if it was recently rescheduled to a different node. See networkpolicy_block.md.
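
A transient blip and a hard outage look the same on a single `nc` attempt; a short retry loop separates them. In this sketch, `probe` is a hypothetical stand-in - substitute the real `nc -zv "$DB_HOST" 5432` when running inside the pod:

```shell
# Stand-in for the real connectivity check (always fails here, so the
# loop's failure path is visible).
probe() { return 1; }

attempts=0
until probe || [ "$attempts" -ge 3 ]; do
  attempts=$((attempts + 1))
  echo "attempt $attempts failed"
  sleep 0   # use a real back-off (e.g. sleep 2) in practice
done
[ "$attempts" -ge 3 ] && echo "dependency unreachable after $attempts attempts"
```

Three consecutive failures points at the NetworkPolicy/DB branch; an intermittent pass/fail mix suggests pool exhaustion or flaky DNS instead.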


Terminal Actions

Action: Fix Service Selector

Do:
1. `kubectl get svc <name> -o yaml > /tmp/svc-backup.yaml` (save a backup)
2. Identify the correct pod labels: `kubectl get pods -l app=<name> --show-labels`
3. `kubectl patch svc <name> -p '{"spec":{"selector":{"app":"<correct-value>"}}}'`
4. Confirm: `kubectl get endpoints <name>`
Verify: the endpoints list shows pod IPs. Run a test request: `curl -v http://<service-ip>/healthz`
Runbook: ingress_404.md

Action: Roll Back Deployment

Do:
1. `kubectl rollout undo deployment/<name>`
2. Monitor: `kubectl rollout status deployment/<name>`
3. Confirm the error rate drops in the metrics dashboard
Verify: `kubectl get pods -l app=<name>` shows new pods all Ready; the 5xx rate returns to baseline.
Runbook: helm_upgrade_failed.md

Action: Fix Readiness Probe or App Startup

Do:
1. `kubectl describe pod <pod>` - find the readiness probe definition and the last failure message
2. Test the probe manually: `kubectl exec -it <pod> -- curl -v http://localhost:<port><path>`
3. If the probe path is wrong, edit the deployment: `kubectl edit deployment <name>`
4. If the app is slow to start, increase `initialDelaySeconds` in the probe spec
Verify: the pod transitions to Ready (`1/1`): `kubectl get pods -l app=<name>`
Runbook: readiness_probe_failed.md

Action: Increase Timeout or Fix Upstream Latency

Do:
1. Identify the current timeout: `kubectl get ingress <name> -o yaml | grep -i timeout`
2. Annotate the ingress: `kubectl annotate ingress <name> nginx.ingress.kubernetes.io/proxy-read-timeout="120"`
3. Investigate why the upstream is slow - check latency metrics and slow-query logs
Verify: the 504 rate drops; upstream p99 latency is within the timeout.
Runbook: hpa_not_scaling.md
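
The decision between raising the timeout and fixing the upstream comes down to comparing upstream p99 latency against the configured proxy-read-timeout. A sketch with hypothetical numbers:

```shell
# Hypothetical values: p99 from your metrics system (in ms), timeout from
# the ingress annotation (proxy-read-timeout is in seconds; normalize).
p99_ms=95000
timeout_s=60
timeout_ms=$((timeout_s * 1000))

if [ "$p99_ms" -gt "$timeout_ms" ]; then
  echo "504s expected: p99 ${p99_ms}ms > timeout ${timeout_ms}ms"
else
  echo "timeout is not the bottleneck"
fi
```

If p99 is only marginally over the timeout, raising the annotation buys time; if it is several multiples over, the upstream latency is the real problem.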

Action: Fix DB Connectivity / Connection Pool

Do:
1. Verify the DB pod/service is up: `kubectl get pods -n <db-namespace>`
2. Check for connection-pool exhaustion in app logs: look for "too many connections" or "pool timeout"
3. If the pool is exhausted: restart app pods to drain idle connections, then tune `max_pool_size`
4. If the DB is down: see the PostgreSQL runbook
Verify: app logs show successful DB connections; the 500 error rate drops.

Action: Scale Horizontally or Increase Limits

Do:
1. Check the current HPA: `kubectl get hpa`
2. If there is no HPA, scale manually: `kubectl scale deployment <name> --replicas=<N>`
3. If CPU-throttled, increase limits: `kubectl set resources deployment <name> --limits=cpu=500m,memory=512Mi`
Verify: `kubectl top pods -l app=<name>` shows CPU/memory below 80% of limits.
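
The replica count for a manual scale can be estimated with the same ratio the HPA uses, `ceil(current_replicas * current_utilization / target_utilization)`. A sketch with hypothetical inputs:

```shell
# Hypothetical readings, aggregated from `kubectl top pod -l app=<name>`.
current_replicas=3
current_cpu=92   # average % of the CPU request in use
target_cpu=60    # desired steady-state utilization

# Ceiling division in pure shell arithmetic.
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "scale to $desired replicas"
# then: kubectl scale deployment <name> --replicas=$desired
```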

Escalation: Engage App Owner for Deeper Profiling

When: 500 errors persist after rollback, resources are healthy, and logs show no clear root cause.
Who: application development team / service owner.
Include in page: error rate (req/s), onset time, pod names currently affected, last 50 lines of app logs, recent deployment history.


Edge Cases

  • Intermittent 5xx on only 1 of N pods: One pod may have corrupted local state or a different code version. Delete that specific pod (and cordon its node if the node itself looks unhealthy) rather than rolling back the whole deployment.
  • 5xx only during peak traffic: Connection-pool or file-descriptor exhaustion. Check `ulimit -n` inside the pod and the DB's `max_connections`.
  • 5xx after cert rotation: If your app validates TLS to a dependency, a new cert may not be trusted. Check cert_renewal_failed.md.
  • 5xx from a Helm-deployed service after upgrade: Helm may have applied only partially - check `helm status <release>` for failed hooks. See helm_upgrade_failed.md.
  • 503 from a service mesh (Istio): An Envoy sidecar returning 503 is distinct from the app returning 503. Check `istioctl proxy-status` and `kubectl logs <pod> -c istio-proxy`.
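
For the peak-traffic case, a quick in-pod sketch of file-descriptor headroom (run it via `kubectl exec` in the suspect container; `/proc` assumes Linux):

```shell
# Soft fd limit for this shell vs descriptors open in a child process
# (which inherits the shell's fds, so it is a reasonable proxy).
soft_limit=$(ulimit -n)
open_fds=$(ls /proc/self/fd | wc -l)
echo "open fds: $open_fds of $soft_limit"
```

A count creeping toward the soft limit during peak load is the exhaustion signature; the fix is raising the limit or closing leaked connections, not scaling.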

Cross-References