Portal | Level: L2: Operations | Topics: Kubernetes Networking | Domain: Kubernetes
Scenario: Ingress Returns 404 Intermittently
The Prompt
"Users report getting 404 errors intermittently when accessing our app through the ingress. Some requests work, some don't. The app pods are all running. What's going on?"
Initial Report
Customer support escalation: "Multiple customers are reporting that the app loads sometimes but gives a 'Not Found' page other times. It seems random. Started about 30 minutes ago."
Constraints
- Time pressure: You have 15 minutes before the next escalation. Customer-facing errors are actively occurring.
- Limited access: You have read access to the ingress and application namespace. Modifying the ingress controller (Traefik in kube-system) requires platform team involvement.
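Because the per-pod test later in this scenario creates temporary pods, it can help to confirm up front what your credentials actually allow. A minimal sketch using `kubectl auth can-i`; the wrapper name `can_i` is illustrative and not part of any tooling mentioned here:

```shell
# Returns 0 if the current user may perform <verb> on <resource> in <namespace>.
# `kubectl auth can-i` prints "yes" or "no"; only the printed answer is checked.
can_i() {
  [ "$(kubectl auth can-i "$1" "$2" -n "$3" 2>/dev/null)" = "yes" ]
}

# Typical use against a live cluster:
#   can_i get ingresses grokdevops || echo "cannot read ingresses"
#   can_i create pods grokdevops   || echo "cannot create test pods; ask the platform team"
```

If pod creation is not allowed, the direct pod test needs either a different vantage point or platform team help.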
Observable Evidence
- Dashboard: HTTP 404 rate is at ~30% of total requests. 200 responses still account for ~70%. Error rate correlates with specific backend pod IPs.
- Ingress controller logs: Traefik logs show `404 page not found` responses originating from one specific upstream pod IP.
- Endpoints: `kubectl get endpoints grokdevops` shows 3 pod IPs, but one pod was recently restarted and may be serving stale routes.
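As a quick sanity check on these numbers: if requests are spread evenly and exactly one of the three endpoints returns 404 for every request, roughly a third of traffic would fail, which is close to the observed ~30%. A minimal sketch of that arithmetic, assuming even load balancing (the function name `expected_error_pct` is illustrative):

```shell
# Expected error percentage if `bad` of `total` endpoints fail every request
# and traffic is spread evenly. This is an assumption: weighted or sticky
# load balancing would change the result.
expected_error_pct() {
  bad=$1
  total=$2
  # Integer percentage, rounded down.
  echo $(( bad * 100 / total ))
}

expected_error_pct 1 3   # one bad pod out of three; prints 33
```

A measured rate near this value supports the "one bad pod" hypothesis; a rate far from it points elsewhere.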
Expected Investigation Path
```shell
# 1. Check ingress config
kubectl get ingress -n grokdevops -o yaml
# 2. Check endpoints: are all pods in the endpoint list?
kubectl get endpoints grokdevops -n grokdevops
kubectl get pods -n grokdevops -o wide
# 3. Check if some pods are not Ready
kubectl get pods -n grokdevops -o custom-columns='NAME:.metadata.name,READY:.status.containerStatuses[0].ready,IP:.status.podIP'
# 4. Check ingress controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik --tail=50
# 5. Test individual pods (prints the response body, then the HTTP status code)
for ip in $(kubectl get endpoints grokdevops -n grokdevops -o jsonpath='{.subsets[0].addresses[*].ip}'); do
  echo "Testing $ip:"
  kubectl run "curl-test-$RANDOM" -n grokdevops --rm -i --restart=Never --image=curlimages/curl -- \
    curl -s -w '\n%{http_code}\n' "http://$ip:8000/health"
done
```
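One further check that fits alongside these steps is whether the pods are all running the same image, since mixed versions during a rolling update are a common cause of pods serving different routes. A sketch under the assumption that the Deployment is also named `grokdevops` (the source only confirms the Service and namespace names); the helper names are illustrative:

```shell
# Assumption: the Deployment is named "grokdevops" in namespace "grokdevops".

# Print "<pod-name><TAB><image>" for every pod in a namespace
# (first container only).
pod_images() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
}

# Count distinct images; more than 1 suggests mixed versions are serving.
distinct_image_count() {
  pod_images "$1" | cut -f2 | sort -u | wc -l | tr -d ' '
}

# Typical use against a live cluster:
#   kubectl rollout status deployment/grokdevops -n grokdevops --timeout=10s
#   distinct_image_count grokdevops
```

A count above 1, or a rollout that has not finished, is a strong hint to compare the routes exposed by the old and new versions.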
Strong Answer
"Intermittent 404s through the ingress suggest the ingress controller is load-balancing across pods and some of them are returning 404. This could mean: (1) some pods are running a different version with different routes, so check whether a rolling update is in progress; (2) the readiness probe is too lenient, so pods are marked Ready before routes are registered; (3) one pod has a corrupted or misconfigured state. I'd check the endpoint list to see which pod IPs are included, then test each pod directly. If only some pods return 404, I'd compare their logs and configs. If all pods work individually, the issue might be in the ingress path configuration: specifically, `pathType: Prefix` vs `Exact` can cause unexpected behavior, and trailing slashes can matter depending on the ingress controller."
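The "test each pod directly" step in this answer is easier to act on when the results are tallied per pod. One option is to record a `<pod-ip> <http-status>` pair per request (for example with curl's `-w '%{http_code}'`) and summarize them. A minimal sketch; the input format and the function name `tally_404_by_ip` are assumptions for illustration, and the sample IPs are made up rather than real cluster output:

```shell
# Reads lines of "<pod-ip> <http-status>" on stdin and prints, per IP,
# the number of requests seen and how many of them were 404s.
tally_404_by_ip() {
  awk '
    { total[$1]++; if ($2 == 404) notfound[$1]++ }
    END {
      for (ip in total)
        printf "%s total=%d 404=%d\n", ip, total[ip], notfound[ip] + 0
    }
  ' | sort
}

# Example with made-up sample data:
printf '%s\n' \
  '10.0.0.1 200' \
  '10.0.0.2 404' \
  '10.0.0.3 200' \
  '10.0.0.2 404' \
  '10.0.0.1 200' | tally_404_by_ip
```

If every 404 lands on a single IP, the problem is that pod rather than the ingress rules.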
Common Traps
- Blaming DNS: DNS failures cause resolution or connection errors, not HTTP 404s. A 404 means a server received the request and answered.
- Not testing individual pods: you need to isolate which pod(s) are misbehaving.
- Ignoring rolling updates: during an update, old and new pods coexist and may serve different routes.
- Not checking `pathType`: `Prefix` vs `Exact` is a subtle but common source of 404s.
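To make the `pathType` trap concrete, the following is a simplified model of how `Exact` and `Prefix` matching behave according to the Kubernetes Ingress documentation. It is a sketch only, not any controller's actual code, and Traefik or other controllers may differ in edge cases; the function name `path_matches` is illustrative:

```shell
# Simplified model of Ingress path matching.
# Returns 0 on match, 1 on no match, 2 for a pathType this sketch ignores.
path_matches() {
  path_type=$1   # "Exact" or "Prefix"
  rule=$2        # path from the Ingress rule
  req=$3         # path from the incoming request
  case $path_type in
    Exact)
      # Exact is a literal, case-sensitive comparison: /foo != /foo/
      [ "$req" = "$rule" ]
      ;;
    Prefix)
      # Prefix matches whole path segments and ignores one trailing slash.
      rule=${rule%/}
      [ "${req%/}" = "$rule" ] && return 0
      case $req in
        "$rule"/*) return 0 ;;
      esac
      return 1
      ;;
    *)
      return 2
      ;;
  esac
}

path_matches Exact  /app /app/    || echo "Exact /app does not match /app/"
path_matches Prefix /app /app/v1  && echo "Prefix /app matches /app/v1"
path_matches Prefix /app /apple   || echo "Prefix /app does not match /apple"
```

Note that a path rule mismatch usually produces consistent 404s for a given URL, so it is a more likely culprit when every pod responds correctly on its own.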
Practice and Links
- Runbook: `training/library/runbooks/kubernetes/ingress_404.md`
- Drills: `training/library/drills/kubectl_drills.md`: Drill 7 (endpoints), Drill 23 (ingress rules)
- Quest: `training/interactive/exercises/levels/level-24/k8s-ingress/`