Envoy

30 cards: 🟢 10 easy | 🟡 14 medium | 🔴 6 hard

🟢 Easy (10)

1. What is an Envoy listener?

A listener defines the address and port where Envoy accepts incoming connections. Each listener has one or more filter chains that process the connection.

Example: listener on 0.0.0.0:8080 with HTTP connection manager routes to clusters based on host/path rules.

2. What is an Envoy cluster?

A cluster is a logical group of upstream hosts (endpoints). It holds load-balancing policy, circuit breaker thresholds, health check configuration, and TLS settings for reaching a set of upstream services.

Example: cluster "backend-v2" with round-robin LB, 5s timeout, outlier detection ejecting after 5 consecutive 5xx.

3. What does "first match wins" mean in Envoy routing?

Envoy evaluates route rules in order and selects the first rule that matches the request. More-specific routes (exact match, longer prefix) must be placed before broader routes, or they will never be reached.

Gotcha: a catch-all prefix: '/' route placed first will match everything — more specific routes below it are dead code. Order matters.

Analogy: like a switch-case without fall-through — the first match wins, and everything after it is ignored for that request.
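
The ordering rule can be illustrated with a minimal sketch. This is not Envoy's matcher, only a model of prefix routes evaluated top to bottom; the route table, cluster names, and `pick_cluster` helper are hypothetical.

```python
from typing import Optional

# Hypothetical route table: (prefix, cluster) pairs, evaluated top to bottom.
ROUTES = [
    ("/api/v2/", "backend-v2"),
    ("/api/", "backend-v1"),
    ("/", "default"),
]

def pick_cluster(path: str, routes=ROUTES) -> Optional[str]:
    """Return the cluster of the first route whose prefix matches, else None."""
    for prefix, cluster in routes:
        if path.startswith(prefix):
            return cluster
    return None  # models the "no route" outcome

# Same routes, but with the catch-all moved to the top: it shadows the rest.
SHADOWED = [("/", "default")] + ROUTES[:2]
```

With `ROUTES`, `/api/v2/users` reaches `backend-v2`; with `SHADOWED`, the same path lands on `default` because the catch-all is evaluated first.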

4. What does xDS stand for in the context of Envoy?

xDS stands for "x Discovery Service" — a family of APIs (LDS, RDS, CDS, EDS, SDS, ADS) that a management plane uses to dynamically push configuration to Envoy without restarts.

Remember: LRCESA — Listeners, Routes, Clusters, Endpoints, Secrets, Aggregated. Six xDS APIs for dynamic config.

5. What is the Envoy admin interface and what port does it use by default?

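The admin interface is a local HTTP endpoint that exposes config_dump, stats, logging controls, and cluster health. Envoy has no built-in default port: the address is set in the bootstrap config, conventionally 9901 in the upstream examples and 15000 in Istio sidecars. It must never be exposed externally.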

Gotcha: admin port exposes config_dump (secrets visible!), log controls, drain commands. Never expose beyond localhost.

6. What does the Envoy response flag UF mean?

UF means "upstream connection failure" — Envoy could not establish or maintain a TCP connection to the upstream host. Common causes: upstream pod crashed, network policy blocking traffic, or wrong port.

Remember: UF = Upstream Failure. Check: pod running? Port correct? Network policy blocking?

7. What does the Envoy response flag UO mean?

UO means "upstream overflow" — the circuit breaker threshold was exceeded (max_connections, max_pending_requests, or max_requests) and Envoy returned 503 rather than queue another request.

Remember: UO = Upstream Overflow. Circuit breaker tripped. Check upstream_rq_pending_overflow counter.

8. What does the Envoy response flag NR mean?

NR means "no route" — Envoy received a request but found no matching route in the route table. Common causes: missing route configuration, wrong Host header, or VirtualService misconfiguration in Istio.

Remember: NR = No Route. Check Host header, VirtualService hosts, route prefixes.
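
As a quick triage aid, the three flags covered so far can be kept in a small lookup. The sketch below only restates the meanings and checks from these cards; the `RESPONSE_FLAGS` table and `triage` helper are hypothetical names, not part of Envoy.

```python
# Lookup built only from the flags described in the cards above.
RESPONSE_FLAGS = {
    "UF": ("upstream connection failure",
           ["pod running?", "port correct?", "network policy blocking?"]),
    "UO": ("upstream overflow (circuit breaker)",
           ["upstream_rq_pending_overflow counter"]),
    "NR": ("no route",
           ["Host header", "VirtualService hosts", "route prefixes"]),
}

def triage(flag: str) -> str:
    """Return a one-line reminder of what a response flag means and what to check."""
    meaning, checks = RESPONSE_FLAGS.get(flag, ("unknown flag", []))
    return f"{flag}: {meaning}; check: {', '.join(checks) or 'n/a'}"
```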

9. What load-balancing policy should you use when you want session stickiness based on a request header?

Ring hash (or Maglev) load balancing. Both use consistent hashing on a specified header (e.g., a session cookie or user ID) to route the same caller consistently to the same upstream host.

Example: hash on x-user-id header to route same user to same backend. Good for caching and websockets.
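
To see why consistent hashing gives stickiness, here is a simplified ring in Python. It is an illustration of the general technique, not Envoy's ring hash or Maglev implementation; the MD5-based hash, the `vnodes` count, and the host names are all assumptions made for the example.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Any stable hash works for the illustration; MD5 is an arbitrary choice.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    """Toy consistent-hash ring with virtual nodes per host."""

    def __init__(self, hosts, vnodes=100):
        self._points = sorted(
            (_hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def pick(self, header_value: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect(self._keys, _hash(header_value)) % len(self._keys)
        return self._points[idx][1]
```

The same header value always maps to the same host, and removing one host only remaps the keys that host owned; keys owned by the remaining hosts stay where they were.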

10. What is Envoy hot restart?

Hot restart allows a new Envoy process to take over the listening sockets from the old process without dropping active connections. The old process drains in-flight requests while the new process handles new connections.

Under the hood: the old Envoy passes file descriptors to the new process via Unix domain sockets using SCM_RIGHTS. The kernel allows two processes to share the same listening socket.

Remember: hot restart enables config changes and binary upgrades without connection drops — essential for service mesh proxies handling thousands of active connections.

🟡 Medium (14)

1. Why is ADS (Aggregated Discovery Service) safer than using separate xDS streams for CDS, EDS, and LDS?

With separate streams, CDS, EDS, and LDS updates can arrive out of order: a new cluster might appear before its endpoints, causing brief 503s. ADS delivers all resource types on a single ordered stream, so the management server can sequence updates (for example, clusters and endpoints before the listeners and routes that reference them) and Envoy always sees a consistent view.

2. Which Envoy stat should you alert on to detect circuit breaker trips in production?

upstream_rq_pending_overflow — it increments every time a request is rejected because max_pending_requests was exceeded. Any non-zero value in production indicates the circuit breaker is actively shedding load.

Gotcha: A zero value is normal at startup. Any non-zero in production means you are actively shedding user traffic.

Debug clue: Correlate with upstream_rq_active to see if backends are saturated or if thresholds are too low.

3. What are Envoy's default circuit breaker thresholds and why are they often wrong?

Defaults are max_connections: 1024, max_pending_requests: 1024, max_requests: 1024 (plus max_retries: 3). They are often wrong because they are enforced per Envoy instance and per upstream cluster, not mesh-wide. A busy gateway or high-fan-out caller can exceed 1024 and produce spurious 503 UO errors, while many sidecars that each stay under the limit can still overload the same service in aggregate.
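
A rough model of the admission decision may help when reasoning about these thresholds. It is a deliberate simplification (real Envoy tracks these limits per cluster and priority, and which limit applies depends on the protocol); the `Thresholds` class and `admit` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Thresholds:
    # Default values as stated in the card above.
    max_connections: int = 1024
    max_pending_requests: int = 1024
    max_requests: int = 1024

def admit(active_requests: int, pending_requests: int,
          limits: Optional[Thresholds] = None) -> str:
    """Simplified check: reject with 503 UO once a request limit is reached."""
    limits = limits or Thresholds()
    if active_requests >= limits.max_requests:
        return "503 UO"
    if pending_requests >= limits.max_pending_requests:
        return "503 UO"
    return "forward"
```

Because each proxy evaluates this independently, fifty callers that each stay below 1024 active requests can still send tens of thousands of concurrent requests to one service.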

4. What is the danger of setting num_retries without also setting per_try_timeout?

Without per_try_timeout, a single attempt can consume the entire route timeout, so slow attempts are never cut short. Under a slow or failing upstream, requests pile up for the full timeout while failed attempts are retried, multiplying upstream load by up to num_retries + 1 and causing retry storms.

5. What does Envoy outlier detection do and how does it differ from active health checks?

Outlier detection passively monitors upstream hosts for failure patterns (consecutive 5xx, gateway errors, low success rate) and ejects misbehaving hosts from the load-balancing pool. Active health checks probe endpoints on a schedule. Outlier detection reacts to real traffic; health checks detect failures even with no traffic.

6. What is the Proxy-WASM ABI and why does it matter?

Proxy-WASM is a vendor-neutral WebAssembly ABI specification for proxy plugins. A WASM filter written to the Proxy-WASM ABI can run on Envoy, NGINX (via ngx_wasm_module), and other compatible proxies — enabling portable filter code across implementations.

7. How does Envoy handle gRPC-Web requests from browsers that cannot use HTTP/2 trailers?

Envoy's gRPC-Web filter receives an HTTP/1.1 (or trailerless HTTP/2) gRPC-Web request from the browser, translates it to a standard gRPC request to the upstream, and translates the response (including trailers encoding the gRPC status) back to gRPC-Web format.

8. Why is SDS (Secret Discovery Service) preferred over file-based TLS certificates in Envoy?

SDS allows a management plane to push new certificates to Envoy over an existing gRPC stream. Envoy uses the new cert for new TLS sessions while completing existing sessions with the old cert — zero-downtime rotation. File-based rotation requires a config reload or restart, creating a window where TLS handshakes can fail.

9. What does the Envoy response flag URX mean?

URX means "upstream retry exhausted" — Envoy attempted retries up to the configured num_retries limit and all attempts failed. The final response returned to the client is the last upstream failure.

Remember: URX = Upstream Retry eXhausted. All retries failed. Check num_retries and per_try_timeout.

10. What happens if Envoy's DNS refresh rate is longer than your upstream service's DNS TTL?

Envoy caches stale DNS entries and routes requests to old IP addresses after deployments. Connections to the stale IPs fail (UF response flag). The failure lasts until the DNS cache expires at the configured refresh interval.

11. What does zone-aware load balancing do in Envoy?

Zone-aware routing biases traffic toward upstream endpoints in the same availability zone as the Envoy instance, reducing cross-AZ latency and data transfer costs. It falls back to cross-zone routing when local zone capacity is insufficient to serve the load.

12. How does Envoy implement traffic splitting (e.g., 90% to v1, 10% to v2)?

Routes support weighted cluster assignments. The route config maps a single route match to multiple clusters each with a weight. Envoy distributes traffic proportionally — no DNS change or additional load balancer required.
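
The proportional split can be sketched as a cumulative-weight lookup. This is a generic illustration, not Envoy's code; the cluster names, the `pick_weighted` helper, and the use of an integer draw `r` are assumptions for the example.

```python
import random

def pick_weighted(clusters, r):
    """clusters: list of (name, weight); r: integer in [0, total weight)."""
    upper = 0
    for name, weight in clusters:
        upper += weight
        if r < upper:
            return name
    raise ValueError("r must be below the total weight")

def route(clusters, rng=random):
    """Draw a random point in the total weight and map it to a cluster."""
    total = sum(weight for _, weight in clusters)
    return pick_weighted(clusters, rng.randrange(total))

SPLIT = [("v1", 90), ("v2", 10)]  # the 90/10 example from the question
```

Draws 0 through 89 map to `v1` and 90 through 99 map to `v2`, so over many requests about one in ten reaches `v2`.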

13. Why do Envoy containers in Kubernetes need a preStop lifecycle hook?

Without a preStop hook, Kubernetes sends SIGTERM as soon as the pod starts terminating, and Envoy exits on SIGTERM without draining. In-flight requests receive TCP RSTs instead of clean HTTP responses. A preStop sleep gives kube-proxy time to remove the pod from its iptables rules and lets Envoy stop accepting new connections gracefully before shutdown.

14. What Envoy stat tracks how many upstream requests are currently in flight to a cluster?

cluster.<name>.upstream_rq_active is a gauge showing the current number of active (in-flight) requests to that cluster. Correlate it with the max_requests circuit breaker threshold to detect saturation.

Debug clue: If upstream_rq_active approaches max_requests, circuit breaker trips are imminent. Alert at 80% of threshold.

Example: `curl localhost:15000/stats | grep upstream_rq_active` on a sidecar to check live values.
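
The 80% rule above can be checked with a few lines of Python over the text output of the stats endpoint. The sample text, the cluster name, and the helper names are made up for illustration; the sketch assumes the plain `name: value` line format and skips anything whose value is not an integer.

```python
def parse_stats(text: str) -> dict:
    """Parse 'name: value' lines, keeping only integer-valued stats."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats

def saturation(stats: dict, cluster: str, max_requests: int = 1024) -> float:
    """Fraction of the max_requests threshold currently in use."""
    active = stats.get(f"cluster.{cluster}.upstream_rq_active", 0)
    return active / max_requests

# Hypothetical sample, shaped like the output of the curl command above.
SAMPLE = """cluster.backend-v2.upstream_rq_active: 900
cluster.backend-v2.upstream_rq_pending_overflow: 3
cluster.backend-v2.upstream_rq_time: P0(nan,0) P50(nan,1)
"""
```

With the sample above, `saturation(parse_stats(SAMPLE), "backend-v2")` is above 0.8, which would trip the suggested alert.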

🔴 Hard (6)

1. What ordering problem does ADS solve that per-resource xDS streams cannot?

With separate CDS, EDS, and LDS streams, a race condition exists: LDS may deliver a listener referencing a new cluster before CDS delivers that cluster's definition, or CDS may deliver a cluster before EDS delivers its endpoints. Envoy processes updates as they arrive; without ordering guarantees, intermediate states can produce NR or UF errors. ADS delivers all resource types on one ordered stream and Envoy applies them as a batch, eliminating the race.

2. How can outlier detection accidentally eject an entire upstream cluster and what prevents this?

If max_ejection_percent is set to 100 (Envoy's own default is 10), every host in a cluster can be ejected. During a rolling restart when multiple hosts briefly return 5xx, consecutive_5xx ejections cascade: ejected hosts remove capacity from the pool, remaining hosts saturate and also return 5xx, and are ejected in turn. The entire pool empties. Prevention: set max_ejection_percent to 50 (never eject more than half), raise consecutive_5xx to at least 10, and set enforcing_consecutive_5xx to 0 during initial rollout.
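
The effect of the ejection cap can be modeled in a few lines. This is a simplified model, assuming the cap is a percentage of cluster hosts and that at least one ejection is always allowed regardless of the percentage; `can_eject` is a hypothetical helper, not an Envoy API.

```python
def can_eject(total_hosts: int, already_ejected: int, max_ejection_percent: int) -> bool:
    """Return True if one more host may be ejected under the cap."""
    # Assumption: at least one host can be ejected whatever the percentage is.
    allowed = max(1, total_hosts * max_ejection_percent // 100)
    return already_ejected < allowed

def survivors(total_hosts: int, max_ejection_percent: int) -> int:
    """Hosts left in rotation if every host misbehaves at once."""
    ejected = 0
    while ejected < total_hosts and can_eject(total_hosts, ejected, max_ejection_percent):
        ejected += 1
    return total_hosts - ejected
```

With 10 hosts, a cap of 100 leaves zero survivors in the worst case, while a cap of 50 always keeps five hosts in rotation.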

3. What happens to traffic when a WASM filter VM crashes inside Envoy, and how does fail_open vs fail_close affect this?

When a WASM VM crashes, the filter is aborted for that request. With fail_open: true, Envoy passes the request through without filtering (degraded but functional). With fail_open: false (the default, that is, fail closed), Envoy returns a 500 to the caller. The Envoy process itself does not crash, because WASM sandboxing isolates the VM fault. Monitor wasm.runtime_errors to detect crash loops.

4. What is the mechanism by which Envoy hot restart passes listening sockets from the old to the new process?

The old Envoy process acts as the "hot restart parent." The new process connects to the parent via a Unix domain socket and sends a request for each listening socket's file descriptor. The kernel passes the open file descriptors via SCM_RIGHTS ancillary data in a sendmsg call. The new process then reuses those already-bound fds (no new bind is needed) and begins accepting connections, while the parent enters draining mode and closes its accept loop.
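
The fd-passing step can be demonstrated with Python's standard library, which wraps sendmsg and SCM_RIGHTS in `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only). This is a single-process sketch of the mechanism, not Envoy's hot restart protocol; the `hand_off_listener` function and the socketpair standing in for the parent/child channel are assumptions for the demo.

```python
import socket

def hand_off_listener():
    # "Old process": owns a listening socket on an ephemeral loopback port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]

    # Unix domain socket pair standing in for the parent/child channel.
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # send_fds wraps sendmsg() with SCM_RIGHTS ancillary data.
        socket.send_fds(parent, [b"listener"], [listener.fileno()])
        msg, fds, _flags, _addr = socket.recv_fds(child, 1024, 1)

        # "New process": wrap the received fd; no bind() or listen() needed.
        inherited = socket.socket(fileno=fds[0])
        listener.close()  # the old side stops accepting

        # A client connecting to the original port is served by the new fd.
        client = socket.create_connection(("127.0.0.1", port), timeout=2)
        conn, _peer = inherited.accept()
        same_port = inherited.getsockname()[1] == port
        conn.close()
        client.close()
        inherited.close()
        return msg, same_port
    finally:
        parent.close()
        child.close()
```

Closing the original `listener` does not close the port, because the received descriptor refers to the same underlying kernel socket.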

5. What is a practical strategy for extracting a specific cluster's configuration from a large Envoy config_dump without loading the entire blob into memory?

Filter the config_dump through a pipeline instead of paging through the whole blob: `curl -s localhost:15000/config_dump | python3 -c "import sys,json; [print(json.dumps(c,indent=2)) for c in json.load(sys.stdin)['configs'] if 'ClustersConfigDump' in c.get('@type','')]" | grep -A20 '"name": "my-cluster"'`. Note that json.load still parses the full dump inside the Python process; the pipeline only limits what is printed. Alternatively, use the /clusters?format=json endpoint, which returns only cluster state and is much smaller than the full config_dump.
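
A slightly more precise alternative to `grep -A20` is to select the cluster by name from the parsed dump. The sketch below assumes the clusters section lists entries under `static_clusters` and `dynamic_active_clusters`, each wrapping a `cluster` object; the sample data and the `find_cluster` helper are hypothetical, and like the one-liner it still parses the whole dump.

```python
from typing import Optional

def find_cluster(dump: dict, name: str) -> Optional[dict]:
    """Return the named cluster's config from a parsed config_dump, or None."""
    for section in dump.get("configs", []):
        if "ClustersConfigDump" not in section.get("@type", ""):
            continue
        for key in ("static_clusters", "dynamic_active_clusters"):
            for entry in section.get(key, []):
                cluster = entry.get("cluster", {})
                if cluster.get("name") == name:
                    return cluster
    return None

# Tiny hypothetical dump with the assumed shape.
SAMPLE_DUMP = {
    "configs": [
        {"@type": "type.googleapis.com/envoy.admin.v3.ListenersConfigDump"},
        {
            "@type": "type.googleapis.com/envoy.admin.v3.ClustersConfigDump",
            "dynamic_active_clusters": [
                {"cluster": {"name": "my-cluster", "connect_timeout": "5s"}},
                {"cluster": {"name": "other", "connect_timeout": "1s"}},
            ],
        },
    ]
}
```

In practice the dict would come from `json.load(sys.stdin)` at the end of the same curl pipeline.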

6. What combination of Envoy features prevents retry storms while still providing retry protection?

Three controls together: (1) per_try_timeout set to route_timeout / (num_retries + 1) to bound retry duration; (2) retry_priority or retry_host_predicate: previous_hosts to avoid retrying the same failed host; (3) max_retries circuit breaker threshold to cap total concurrent retries across all requests to a cluster. Without (3), or with it set too high, retry concurrency is effectively unbounded: thousands of requests each retrying 3x can quadruple upstream load.
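
The arithmetic behind control (1) and the load amplification is easy to check. This is a back-of-the-envelope sketch; the function names are hypothetical, and it assumes every request uses all of its retries in the worst case.

```python
def per_try_timeout(route_timeout_s: float, num_retries: int) -> float:
    """Split the route timeout evenly across the first attempt and all retries."""
    return route_timeout_s / (num_retries + 1)

def worst_case_attempts(requests: int, num_retries: int) -> int:
    """Upstream attempts if every request exhausts its retries."""
    return requests * (num_retries + 1)
```

For a 15 s route timeout and 2 retries, each attempt gets 5 s. For 1000 requests with 3 retries each, the upstream can see up to 4000 attempts, which is the amplification the max_retries threshold is meant to cap.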