Runbook: Ingress 502 Bad Gateway¶
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | nginx_ingress_controller_requests{status="502"} > threshold or user reports |
| Severity | P1 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)¶
If output shows multiple ingresses affected: the ingress controller itself may be down — jump to Step 2 immediately.
If output shows a single ingress returning 502: continue with the steps below to narrow down to a backend or ingress config problem.
Step 1: Confirm 502 Scope — All Paths or Specific¶
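The scope check can be scripted against recent controller logs. A minimal sketch, assuming the standard NGINX error-log format shown in Step 2 — the `hosts_with_502` helper name and the sample lines are illustrative, not part of any tool; more than one distinct host suggests a controller-wide problem:

```shell
# Sketch: list the distinct hosts appearing in controller error lines.
# Pipe in real logs with:
#   kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=200 | hosts_with_502
hosts_with_502() {
  grep -o 'server: [^,]*' | sort -u
}

# Sample lines mirroring the controller's error-log format (illustrative only):
hosts_with_502 <<'EOF'
2026/03/19 10:00:00 [error] 29#29: *1234 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.0.1, server: api.example.com, request: "GET /orders HTTP/1.1", upstream: "http://10.0.1.8:8080/orders"
2026/03/19 10:00:01 [error] 29#29: *1235 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.0.2, server: shop.example.com, request: "GET /cart HTTP/1.1", upstream: "http://10.0.2.4:8080/cart"
EOF
```

The sample prints one `server: …` line per affected host (two here), which maps directly onto the decision above: two or more hosts → Step 2, one host → Step 1.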
Why: A 502 on one path means one backend service is broken. A 502 on all paths of an ingress usually means the ingress controller cannot reach any backend, which is a different problem.
# Test from inside the cluster to isolate whether it is the ingress or backend
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl -v http://<INGRESS_HOST>/<PROBLEM_PATH>
# Check the ingress rules to see which services back each path
kubectl describe ingress <INGRESS_NAME> -n <NAMESPACE>
Rules:
Host Path Backends
---- ---- --------
api.example.com
/users user-service:8080 (10.0.1.5:8080)
/orders order-service:8080 (No Endpoints!) <-- problem here
No Endpoints: The backend service has no ready pods — go to Step 4.
If all paths are affected: Go to Step 2 to check the ingress controller itself.
Step 2: Check Ingress Controller Logs¶
Why: The NGINX ingress controller logs the exact reason for each 502, including upstream connection refusals, timeouts, and connect errors. This is the fastest path to root cause.
# Find the ingress controller pod
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Tail the logs and filter for 502s and errors
kubectl logs -n ingress-nginx <INGRESS_CONTROLLER_POD> --tail=100 | grep -E "502|upstream|error|connect"
# If there are multiple controller pods, check all of them
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50 | grep -E "502|upstream|error"
2026/03/19 10:00:00 [error] 29#29: *1234 connect() failed (111: Connection refused) while connecting to upstream,
client: 10.0.0.1, server: api.example.com, request: "GET /orders HTTP/1.1", upstream: "http://10.0.1.8:8080/orders"
connect() failed (111: Connection refused): The backend pod is up but the port is wrong, or the pod is not listening on that port — continue to Step 3.
If logs show upstream timed out: The backend is reachable but slow — check application performance and see Step 6 for timeout tuning.
If the ingress controller pod is not running: Restart it with kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx.
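When the logs contain many error lines, it helps to aggregate them by failing upstream. A small sketch, again assuming the error-log format shown above — the `summarize_upstreams` helper name and the sample lines are ours:

```shell
# Sketch: count connection failures per upstream address in controller error logs.
# Pipe in real logs with:
#   kubectl logs -n ingress-nginx <INGRESS_CONTROLLER_POD> --tail=500 | summarize_upstreams
summarize_upstreams() {
  grep -o 'upstream: "[^"]*"' | sort | uniq -c | sort -rn
}

summarize_upstreams <<'EOF'
2026/03/19 10:00:00 [error] connect() failed while connecting to upstream, upstream: "http://10.0.1.8:8080/orders"
2026/03/19 10:00:01 [error] connect() failed while connecting to upstream, upstream: "http://10.0.1.8:8080/orders"
2026/03/19 10:00:02 [error] connect() failed while connecting to upstream, upstream: "http://10.0.1.9:8080/users"
EOF
```

The most frequent upstream address at the top of the output is usually the pod/port to investigate first in Step 3.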
Step 3: Check Backend Service Endpoints¶
Why: A Kubernetes Service with no Endpoints means no ready pods are backing it. The ingress controller returns 502 when it cannot forward to any endpoint.
# Check if the Service has endpoints
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
# More detail on the service
kubectl describe service <SERVICE_NAME> -n <NAMESPACE>
<none>: The Service's selector does not match any running pods — continue to Step 4.
If endpoints exist but 502 still occurs: The pods are registered but not accepting connections — check pod readiness (Step 4) and port mappings.
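For scripting this check (e.g., in a pre-escalation triage script), the `<none>` case can be detected from `--no-headers` output. A minimal sketch — the `has_endpoints` helper is illustrative, not a kubectl feature:

```shell
# Sketch: succeed (exit 0) if the Service has at least one endpoint, fail otherwise.
# Usage: kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE> --no-headers | has_endpoints
has_endpoints() {
  awk '{ exit ($2 == "<none>") ? 1 : 0 }'
}

# Sample `kubectl get endpoints --no-headers` lines (illustrative):
echo 'order-service   <none>            5m' | has_endpoints || echo "no endpoints"
echo 'user-service    10.0.1.5:8080     5m' | has_endpoints && echo "endpoints present"
```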
Step 4: Check Pod Readiness¶
Why: Kubernetes removes pods from Service endpoints when their readiness probe fails. If all pods fail readiness, the Service has no endpoints and all traffic returns 502. This is distinct from pods being down — they may be running but not ready.
# Check pod readiness status
kubectl get pods -n <NAMESPACE> -l <SERVICE_SELECTOR_LABEL>=<SERVICE_SELECTOR_VALUE> -o wide
# Check readiness probe configuration
kubectl describe deployment <DEPLOYMENT_NAME> -n <NAMESPACE> | grep -A 10 "Readiness"
# Check recent events for readiness failures
kubectl get events -n <NAMESPACE> --field-selector reason=Unhealthy --sort-by='.lastTimestamp' | tail -20
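With many pods it is easier to filter for the not-ready ones directly. A sketch that parses the standard `kubectl get pods` columns — the `not_ready` helper name and sample output are illustrative:

```shell
# Sketch: print pods whose READY column shows fewer ready containers than expected
# (e.g. 0/1). Usage: kubectl get pods -n <NAMESPACE> -l <SELECTOR> | not_ready
not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1, $2 }'
}

not_ready <<'EOF'
NAME                    READY   STATUS    RESTARTS   AGE
order-service-7d9f-x2   0/1     Running   0          5m
user-service-5c8b-k9    1/1     Running   0          2d
EOF
```

The sample prints only `order-service-7d9f-x2 0/1` — exactly the pods that have been excluded from the Service's endpoints.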
Step 5: Check Upstream Service Health¶
Why: The backend pods may be healthy, but if they depend on a database, cache, or third-party API that is down, they will return errors that manifest as 502s.
# Test the backend service directly, bypassing the ingress
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl -v http://<SERVICE_NAME>.<NAMESPACE>.svc.cluster.local:<PORT>/<PATH>
# Check if the service's own dependencies are up
kubectl get pods -n <DEPENDENCY_NAMESPACE> | grep <DEPENDENCY_APP>
# Check logs for dependency connection errors
kubectl logs <POD_NAME> -n <NAMESPACE> --tail=100 | grep -iE "connect|refused|timeout|database|redis|unavailable"
Step 6: Check Proxy Timeouts¶
Why: If the backend service is slow (e.g., due to a slow database query or a large payload), the default NGINX ingress proxy timeouts (5s connect, 60s read, 60s send) may be too short, causing NGINX to abort the request and return an error (often 504, sometimes surfaced as 502) before the backend responds.
# Check current ingress annotations for timeout settings
kubectl get ingress <INGRESS_NAME> -n <NAMESPACE> -o yaml | grep -i timeout
# Common timeout annotations for nginx ingress:
# nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
# nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
# nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
# Patch if timeouts are too short (example: extend read timeout to 120s)
kubectl annotate ingress <INGRESS_NAME> -n <NAMESPACE> \
nginx.ingress.kubernetes.io/proxy-read-timeout="120" --overwrite
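To confirm a timeout theory before patching, compare the backend's measured response time against the configured proxy-read-timeout. A minimal sketch — the `exceeds_timeout` helper and the example numbers are ours, not part of kubectl or NGINX:

```shell
# Sketch: compare an observed response time (seconds) against a timeout value.
# Measure the real time with curl's write-out variable, e.g.:
#   kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
#     curl -s -o /dev/null -w '%{time_total}' \
#     http://<SERVICE_NAME>.<NAMESPACE>.svc.cluster.local:<PORT>/<PATH>
exceeds_timeout() {
  awk -v t="$1" -v limit="$2" 'BEGIN { print (t + 0 > limit + 0) ? "yes" : "no" }'
}

exceeds_timeout 73.2 60   # prints "yes": backend is slower than a 60s read timeout
exceeds_timeout 4.1 60    # prints "no": the timeout is not the problem
```

If the answer is "yes", extending the timeout (as above) buys time, but the underlying slowness should still go through Step 5.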
Verification¶
# Confirm the issue is resolved
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl -v http://<INGRESS_HOST>/<PROBLEM_PATH>
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | SRE on-call | "Kubernetes Ingress 502 in <NAMESPACE>, steps tried so far, current hypothesis" |
| Data loss suspected | Platform Lead | "Data loss risk: in-flight requests to <SERVICE_NAME> are failing with 502" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: ingress controller returning 502 for all hosts, controller may be down" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add readiness probe health check for the failing dependency if that was the root cause
- Review timeout settings to ensure they match actual backend SLAs
Common Mistakes¶
- Checking ingress config when the problem is the backend pods: The ingress configuration (annotations, rules) almost never changes on its own. When a previously working ingress starts returning 502, the ingress config is rarely the problem. Go straight to checking Service endpoints (Step 3) and pod readiness (Step 4) before spending time reviewing ingress YAML.
- Ignoring readiness probe failures: A pod that is `Running` but not `Ready` has been removed from Service endpoints by design. Engineers see `Running` in `kubectl get pods` and conclude the pod is fine, then waste time debugging the ingress or network. Always check the `READY` column — `0/1` means the pod is explicitly excluded from the load balancer.
- Not testing the backend service directly: Before concluding the ingress is the problem, always test the backend service directly from inside the cluster (using `kubectl run`). This confirms whether the 502 is at the ingress layer or the backend layer, cutting diagnosis time in half.
Cross-References¶
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: pod-crashloop.md — if backend pods are crashing
- Related Runbook: deploy-stuck.md — if a recent deployment broke the backend
- Related Runbook: oom-kill.md — if backend pods are being OOMKilled under load
Wiki Navigation¶
Related Content¶
- Kubernetes Services & Ingress (Topic Pack, L1) — Kubernetes Networking, Kubernetes Services & Ingress
- API Gateways & Ingress (Topic Pack, L2) — Kubernetes Networking
- Case Study: CNI Broken After Restart (Case Study, L2) — Kubernetes Networking
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Networking
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2) — Kubernetes Networking
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Kubernetes Networking
- Case Study: Service Mesh 503s — Envoy Misconfigured, RBAC Policy (Case Study, L2) — Kubernetes Networking
- Case Study: Service No Endpoints (Case Study, L1) — Kubernetes Networking
- Cilium & eBPF Networking (Topic Pack, L2) — Kubernetes Networking
- Deep Dive: Kubernetes Networking (deep_dive, L2) — Kubernetes Networking