Envoy Proxy Footguns
Mistakes that cause outages, cascading failures, or silent degradation with Envoy.
1. Circuit Breaker Defaults Too Low (503 UO Under Load)
Envoy's default circuit breaker thresholds are max_connections: 1024, max_pending_requests: 1024, max_requests: 1024. At face value these seem generous. In a mesh where dozens of sidecars fan out to the same upstream service, the aggregate concurrent requests from all callers easily exceeds these per-sidecar defaults during load tests or traffic spikes. Every overflow returns 503 with response flag UO (upstream overflow).
The failure is silent: the application code never sees it (Envoy rejects the request before it reaches the upstream socket), the upstream service looks healthy, and the only signal is upstream_rq_pending_overflow incrementing in Envoy stats and UO in access logs.
Fix: Profile actual traffic to establish P99 active connection and request counts, then set thresholds at observed P99 * 2. Monitor upstream_rq_pending_overflow as a dashboard metric. Alert on any non-zero value in production.
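As a sketch, a cluster-level circuit breaker block sized this way might look like the following (the cluster name and the threshold numbers are illustrative, assuming an observed P99 of roughly 2048 concurrent requests):

```yaml
# Fragment of a cluster definition; "backend" is a hypothetical cluster name.
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 4096        # observed P99 (~2048) * 2
    max_pending_requests: 4096   # sized the same way, not left at 1024
    max_requests: 4096
    max_retries: 3               # keep retry concurrency tightly bounded
```

Remember these thresholds are per-sidecar: the aggregate limit across the mesh is this number times the caller count.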
Debug clue: In Envoy access logs, the `%RESPONSE_FLAGS%` field is your fastest diagnostic. `UO` = upstream overflow (circuit breaker), `UF` = upstream connection failure, `UT` = upstream timeout, `URX` = retry limit exceeded. If you see `UO` in production logs, the circuit breaker is tripping — check `upstream_rq_pending_overflow` immediately.
2. Not Setting Connect and Request Timeouts (Hanging Connections)
Envoy's defaults offer little protection here: idle timeouts are disabled unless configured, and the route-level request timeout (15 seconds by default) is frequently disabled outright (timeout: 0) by operators or control planes. A misconfigured upstream that hangs after accepting a connection, or a route whose timeout has been disabled, causes Envoy workers to accumulate connections that never complete. Worker threads are not blocked (Envoy is async), but file descriptors exhaust and active-request counters climb until circuit breakers trigger — long after the root cause appeared.
Fix: Always configure three timeouts explicitly:
- Cluster connect_timeout: 5s (TCP handshake deadline)
- Route timeout: 30s (total request budget including retries)
- Route retry_policy.per_try_timeout: 10s (per-attempt budget)
Never leave timeouts unset in a production cluster configuration.
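The three timeouts above might look like this in a route entry (the cluster name is hypothetical; connect_timeout lives on the cluster definition, not the route):

```yaml
# Route fragment: total and per-attempt budgets.
# The matching cluster must also set: connect_timeout: 5s
routes:
- match:
    prefix: /
  route:
    cluster: backend
    timeout: 30s             # total request budget, including all retries
    retry_policy:
      retry_on: 5xx
      num_retries: 3
      per_try_timeout: 10s   # per-attempt budget, well under the route timeout
```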
3. Outlier Detection Ejecting Too Aggressively
With consecutive_5xx: 5 (the default) and max_ejection_percent raised to 100 (Envoy's own default is 10%, but mesh configurations commonly override it), a single host returning a burst of 5 errors is ejected. If the upstream is a 3-replica StatefulSet and two replicas emit brief 5xx during a rolling restart, Envoy ejects both. Now two thirds of the pool is gone. Requests pile onto the one remaining replica, which saturates and begins returning 5xx itself, gets ejected, and the entire pool is empty. Load balancing routes to nothing and every request returns 503.
Fix: Set max_ejection_percent: 50 as the minimum safe value — never eject more than half the pool. Raise consecutive_5xx to at least 10. Set base_ejection_time: 30s with max_ejection_time: 300s so ejected hosts get a chance to recover. During initial deployment, set enforcing_consecutive_5xx: 0 to observe without acting.
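A conservative outlier_detection block reflecting those recommendations might look like this sketch (the enforcement percentage is shown at its normal value; drop it to 0 during initial rollout to observe without acting):

```yaml
# Cluster fragment: conservative outlier detection.
outlier_detection:
  consecutive_5xx: 10            # raised from the default of 5
  interval: 10s                  # how often hosts are evaluated
  base_ejection_time: 30s        # first ejection lasts 30s
  max_ejection_time: 300s        # repeated offenders capped at 5 minutes
  max_ejection_percent: 50       # never eject more than half the pool
  enforcing_consecutive_5xx: 100 # set to 0 initially to log-only
```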
4. Misconfigured Retry Policy Causing Retry Storms
A route with retry_on: 5xx and num_retries: 3, but no per_try_timeout, interacts badly with a slow upstream. Each retry attempt inherits whatever remains of the route timeout as its deadline, so a slow attempt starves its own retries of budget — and if the route timeout is also unset, a single slow request can hold upstream capacity for minutes. When the upstream degrades and starts returning 5xx, retries amplify offered load by up to 4x (the original attempt plus three retries), which slows it further, which triggers more retries — a retry storm.
Fix: Always pair num_retries with per_try_timeout that is well under the route-level timeout. A safe default: per_try_timeout = route_timeout / (num_retries + 1). Additionally, set retry_policy.retry_host_predicate: envoy.retry_host_predicates.previous_hosts to avoid retrying against the same host that just failed.
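Applying that formula to a 30-second route timeout gives a retry_policy along these lines (the numbers are illustrative):

```yaml
# Route fragment: retry policy paired with per-attempt budget.
retry_policy:
  retry_on: 5xx
  num_retries: 3
  per_try_timeout: 7s      # ~ route timeout 30s / (num_retries + 1)
  retry_host_predicate:    # skip hosts that already failed this request
  - name: envoy.retry_host_predicates.previous_hosts
```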
Remember: Retry amplification is multiplicative across service hops. If Service A retries 3x to Service B, which retries 3x to Service C, a single failing request can generate up to 16 attempts at C (4 attempts at A's hop times 4 at B's). In a deep call chain (A -> B -> C -> D), retry budgets must decrease at each hop, or use a shared retry budget (Envoy supports retry_budget on circuit breaker thresholds) to cap retries as a fraction of active requests.
5. Not Draining Connections During Hot Restart or Pod Shutdown
When Kubernetes sends SIGTERM to an Envoy container, the default behavior (without a preStop hook) is that Envoy begins its shutdown sequence immediately. In-flight requests at the moment of SIGTERM receive TCP RSTs. Callers see connection resets, not clean HTTP errors, which are harder to handle gracefully and bypass application-level retry logic.
Fix: Add a preStop lifecycle hook that sleeps for 5–10 seconds (enough for the endpoint removal to propagate to kube-proxy and upstream callers so new connections stop arriving), then calls curl -X POST "localhost:15000/drain_listeners?inbound_only" to begin draining. Set terminationGracePeriodSeconds to at least 60s (longer than your longest expected request duration plus the preStop sleep).
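A pod spec fragment along these lines is one way to wire that up, assuming an Istio-style sidecar with the admin interface on port 15000 and curl available in the image:

```yaml
# Pod spec fragment (container and pod fields shown together).
terminationGracePeriodSeconds: 60   # > longest request + preStop sleep
containers:
- name: envoy
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        # Wait for endpoint removal to propagate, then drain inbound
        # listeners gracefully via the admin interface.
        - >
          sleep 7 &&
          curl -sf -X POST
          "http://localhost:15000/drain_listeners?inbound_only&graceful"
```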
6. TLS Certificate Rotation Causing Brief Downtime
When Envoy is configured with static TLS certificates (file-based), rotating the certificate requires reloading the configuration or restarting Envoy. If the new certificate is written to disk while Envoy is mid-handshake with a client using the old certificate, the handshake fails. In manual rotation workflows, there is a brief window where the old cert is removed before Envoy has reloaded the new one, causing TLS handshake failures.
Fix: Use SDS (Secret Discovery Service) for certificate delivery. With SDS, the management plane pushes new certificates to Envoy over the existing gRPC stream. Envoy uses the new certificate for new connections while completing existing TLS sessions with the old certificate. Rotation is atomic and zero-downtime. In Kubernetes, integrate with cert-manager's Istio support or use Istio's built-in SDS-based rotation.
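For reference, an SDS-backed certificate config is wired roughly as follows (a sketch: the secret name and the SDS cluster name are hypothetical, and the SDS server itself must be defined as a cluster):

```yaml
# Listener fragment: certificates delivered over SDS instead of files.
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
      - name: server-cert                 # secret pushed by the control plane
        sds_config:
          resource_api_version: V3
          api_config_source:
            api_type: GRPC
            transport_api_version: V3
            grpc_services:
            - envoy_grpc:
                cluster_name: sds_cluster # cluster pointing at the SDS server
```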
7. Route Ordering Wrong (First Match Wins)
Envoy evaluates routes in the order they are defined — the first matching route wins. A common mistake is placing a broad prefix-match route (e.g., /api/) before a more specific exact-match or longer-prefix route (e.g., /api/admin/). The broad route captures all traffic and the specific route is never reached.
This footgun is especially painful when routes are generated programmatically or via Helm templates that do not enforce ordering. The failure is silent: no error is emitted, traffic simply routes to the wrong cluster.
Fix: Always place more-specific routes before less-specific routes. Exact matches before prefix matches. Longer prefixes before shorter prefixes. Validate route ordering in staging by sending targeted test requests to each route and checking upstream_cluster in access logs. In Istio, VirtualService routes are evaluated in order as well — the same rule applies.
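The ordering rule looks like this in practice (cluster names are hypothetical):

```yaml
# Route fragment: most specific first, broadest last.
routes:
- match: { path: /api/admin/health }   # exact match before any prefix
  route: { cluster: admin-health }
- match: { prefix: /api/admin/ }       # longer prefix before shorter
  route: { cluster: admin }
- match: { prefix: /api/ }             # broad catch-all goes last
  route: { cluster: api }
```

Reversing the last two entries would silently send all /api/admin/ traffic to the api cluster.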
8. Not Monitoring the Stats Endpoint (Missing Early Warnings)
Envoy exposes hundreds of actionable metrics at /stats that most teams never instrument. The result: circuit breaker trips, outlier ejections, upstream timeouts, and retry exhaustion accumulate silently for minutes or hours before the symptom (elevated error rate) appears in application-level dashboards.
Key metrics that signal trouble before SLOs are breached:
- upstream_rq_pending_overflow — circuit breaker trips
- upstream_rq_retry — retry volume (should be near zero under normal conditions)
- outlier_detection.ejections_active — current ejection count
- upstream_cx_connect_fail — failed TCP connections to upstream
- upstream_rq_timeout — request timeouts
Fix: Scrape Envoy stats in Prometheus format (/stats?format=prometheus) and add dashboards for at minimum: pending overflow, retry rate, ejections active, and timeout rate per cluster. Alert at 1% of upstream_rq_total for retries and at any non-zero value for pending_overflow.
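As a starting point, Prometheus alerting rules for the two alert conditions above might look like this sketch (metric names assume Envoy's Prometheus output, where cluster stats are exported as envoy_cluster_* with an envoy_cluster_name label — verify against your own /stats?format=prometheus output):

```yaml
# Prometheus rules file fragment.
groups:
- name: envoy-early-warning
  rules:
  - alert: EnvoyCircuitBreakerTripping
    # Any pending overflow at all means callers are being rejected.
    expr: increase(envoy_cluster_upstream_rq_pending_overflow[5m]) > 0
    for: 1m
    labels:
      severity: page
  - alert: EnvoyRetryRateHigh
    # Retries exceeding 1% of total requests per cluster.
    expr: |
      sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_retry[5m]))
        > 0.01 *
      sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m]))
    for: 5m
    labels:
      severity: warn
```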
9. WASM Filter Panics Crashing the Proxy
A poorly written WASM filter that panics (null pointer dereference, out-of-bounds array access, stack overflow) will cause the WASM VM to abort. Depending on fail_open vs fail_close configuration, this either passes the request through without filtering (fail open) or returns a 500 to the caller (fail close). Under high concurrency, a WASM filter that panics on a specific request pattern can degrade a large fraction of traffic before it is detected.
Unlike C++ extension points, WASM isolation prevents the panic from crashing the entire Envoy process — but the filter is effectively disabled for that VM instance until the VM resets.
Fix: Always configure WASM filters with fail_open: true in non-security-critical paths so a VM crash degrades gracefully rather than hard-failing all traffic. Monitor wasm.remote_load_fetch_successes and wasm.compile_successes at startup to confirm the filter loaded. Add integration tests that exercise error paths. Do not deploy untested WASM filters directly to production sidecars.
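Setting the failure mode lives on the Wasm plugin config; a minimal HTTP filter sketch (the filter name and file path are hypothetical) looks like:

```yaml
# HTTP filter chain fragment: fail-open Wasm filter.
http_filters:
- name: envoy.filters.http.wasm
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
    config:
      name: my_filter            # hypothetical plugin name
      fail_open: true            # VM crash: pass requests through unfiltered
      vm_config:
        runtime: envoy.wasm.runtime.v8
        code:
          local:
            filename: /etc/envoy/filter.wasm
```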
10. Cluster DNS Refresh Not Matching TTL
Envoy caches DNS resolutions for upstream cluster endpoints. The default DNS refresh rate is 5 seconds. If your upstream service uses DNS-based service discovery with a TTL shorter than 5 seconds (common in AWS ECS/Fargate or when using CoreDNS with aggressive TTLs), Envoy may hold stale IP addresses after a deployment. Requests go to old pod IPs that no longer exist, causing connection failures (UF response flag) for up to 5 seconds after each deployment.
The inverse is also a footgun: setting the DNS refresh rate too aggressively (sub-second) causes excessive DNS load on CoreDNS, which can itself become a bottleneck.
Fix: Align dns_refresh_rate with the TTL advertised by your service discovery system. For Kubernetes services (cluster-local DNS), 5–10s is appropriate. For external services with short TTLs, match the TTL exactly. Check cluster.<name>.upstream_cx_connect_fail after deployments to detect stale-DNS-induced failures.
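One way to express that alignment is to let Envoy honor the advertised TTL directly and keep dns_refresh_rate as the fallback (a sketch; the cluster name is hypothetical):

```yaml
# Cluster fragment: TTL-aware DNS resolution.
clusters:
- name: external-api
  type: STRICT_DNS
  connect_timeout: 5s
  respect_dns_ttl: true    # re-resolve when the record's TTL expires
  dns_refresh_rate: 5s     # fallback interval when no usable TTL is present
```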