Envoy Proxy — Street-Level Ops¶
Admin Interface¶
Envoy exposes a local admin interface (default port 9901, or 15000 in Istio sidecars). Never expose this externally — it allows runtime config changes and log-level overrides.
Default trap: Istio sidecars use port 15000 for admin, not Envoy's default 9901. If you're debugging an Istio mesh and `curl localhost:9901` returns nothing, try 15000.
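To keep the admin interface local in the first place, bind it to loopback in the bootstrap config. A minimal sketch (port and log path are illustrative):

```yaml
admin:
  access_log_path: /dev/null   # illustrative; point at a real file to audit admin hits
  address:
    socket_address:
      address: 127.0.0.1       # loopback only — never 0.0.0.0
      port_value: 9901
```

In Kubernetes, pair this with the absence of any Service or containerPort exposing the admin port.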
# Dump the full running configuration (listeners, clusters, routes, endpoints)
curl -s localhost:15000/config_dump | python3 -m json.tool | less
# Dump only clusters with their current endpoint health status
curl -s "localhost:15000/clusters?format=json" | python3 -m json.tool
# List all active listeners
curl -s localhost:15000/listeners
# Dump all stats (counters, gauges, histograms)
curl -s localhost:15000/stats
# Filter stats to circuit breaker state
curl -s localhost:15000/stats | grep circuit_breaker
# Filter stats to upstream retry counters
curl -s localhost:15000/stats | grep upstream_rq_retry
# Filter stats to 5xx rates
curl -s localhost:15000/stats | grep upstream_rq_5xx
# Stats in Prometheus exposition format (counters are cumulative; nothing resets per request)
curl -s "localhost:15000/stats?format=prometheus"
# Check current log levels
curl -s localhost:15000/logging
# Set connection manager log level to debug (resets on restart)
curl -X POST "localhost:15000/logging?connection=debug"
# Reset all log levels to warning
curl -X POST "localhost:15000/logging?level=warning"
# Healthcheck endpoint (useful in init containers)
curl -s localhost:15000/ready
Reading config_dump¶
/config_dump returns a large JSON blob. Key sections:
# Extract the listeners section of the config dump (static and dynamic)
curl -s localhost:15000/config_dump \
| python3 -c "
import sys, json
d = json.load(sys.stdin)
for c in d['configs']:
    if c['@type'].endswith('ListenersConfigDump'):
        print(json.dumps(c, indent=2))
"
# Cluster names with their endpoint counts
curl -s "localhost:15000/clusters?format=json" \
| python3 -c "
import sys, json
d = json.load(sys.stdin)
for c in d.get('cluster_statuses', []):
    print(c['name'], '—', len(c.get('host_statuses', [])), 'hosts')
"
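The same pattern works offline: once you have a dump saved, a small helper can pull out just the listener names. A sketch, assuming the `ListenersConfigDump` shape (static listeners nest the listener object, dynamic listeners carry a top-level `name`); the sample dump below is a hypothetical miniature, not real output:

```python
import json

def listener_names(config_dump: dict) -> list:
    """Collect listener names from a /config_dump payload."""
    names = []
    for c in config_dump.get("configs", []):
        if not c.get("@type", "").endswith("ListenersConfigDump"):
            continue
        for s in c.get("static_listeners", []):
            names.append(s.get("listener", {}).get("name", "<unnamed>"))
        for d in c.get("dynamic_listeners", []):
            names.append(d.get("name", "<unnamed>"))
    return names

# Hypothetical miniature dump for illustration
dump = {"configs": [{
    "@type": "type.googleapis.com/envoy.admin.v3.ListenersConfigDump",
    "static_listeners": [{"listener": {"name": "ingress_http"}}],
    "dynamic_listeners": [{"name": "outbound_0.0.0.0_8080"}],
}]}
print(listener_names(dump))  # → ['ingress_http', 'outbound_0.0.0.0_8080']
```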
The config_dump endpoint can return tens of MB in large meshes. Pipe it through `python3 -m json.tool | grep -A5 '"name"'` to locate a specific cluster or listener without paging through the whole blob.
Diagnosing 503s with Response Flags¶
Access log response flags are the fastest path to root cause:
| Flag | Meaning | Common cause |
|---|---|---|
| UF | Upstream connection failure | Upstream pod crashed, network policy, wrong port |
| UO | Upstream overflow (circuit breaker) | Circuit breaker thresholds too low, traffic spike |
| NR | No route match | Missing route, wrong Host header, VirtualService misconfiguration |
| URX | Retry exhausted | Upstream returning 5xx, retry budget exceeded |
| UT | Upstream request timeout | Upstream too slow, timeout too tight |
| RL | Rate limited | Rate limit policy triggered |
| DC | Downstream connection terminated | Client closed before response (usually not an Envoy bug) |
| LH | Local service health check failed | Envoy health check misconfigured |
Debug clue: Response flags are the single fastest path to root-causing Envoy 503s. Skip the application logs and start with `grep " 503 " access.log | awk '{print $NF}' | sort | uniq -c | sort -rn`. If UO dominates, raise circuit breaker thresholds. If NR dominates, check your route config.
# Count 503s by response flag in an Envoy access log
grep " 503 " /var/log/envoy/access.log \
| awk '{print $NF}' \
| sort | uniq -c | sort -rn
# Istio sidecar access log (JSON format)
kubectl logs <pod> -c istio-proxy \
| python3 -c "
import sys, json
for line in sys.stdin:
    try:
        r = json.loads(line)
    except json.JSONDecodeError:
        continue
    if str(r.get('response_code')) == '503':  # code may be an int or a string
        print(r.get('response_flags'), r.get('upstream_cluster'), r.get('path'))
"
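To go one step further and tally flags per upstream cluster, a small helper works on the same lines. A sketch — the field names assume Istio's default JSON access-log keys as in the snippet above, and the sample log lines are fabricated for illustration:

```python
import json
from collections import Counter

def tally_503_flags(lines):
    """Count (response_flags, upstream_cluster) pairs across 503 responses."""
    counts = Counter()
    for line in lines:
        try:
            r = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (startup logs, partial writes)
        if str(r.get("response_code")) == "503":  # code may be int or string
            counts[(r.get("response_flags"), r.get("upstream_cluster"))] += 1
    return counts

# Fabricated sample lines for illustration
logs = [
    '{"response_code": 503, "response_flags": "UO", "upstream_cluster": "outbound|80||my-svc"}',
    '{"response_code": 503, "response_flags": "UO", "upstream_cluster": "outbound|80||my-svc"}',
    '{"response_code": 200, "response_flags": "-", "upstream_cluster": "outbound|80||my-svc"}',
]
for (flags, cluster), n in tally_503_flags(logs).most_common():
    print(n, flags, cluster)  # → 2 UO outbound|80||my-svc
```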
Checking Circuit Breaker State¶
# Is the circuit breaker open right now?
curl -s localhost:15000/stats | grep "circuit_breakers\|cx_open\|rq_open\|rq_pending_open"
# Upstream overflow counter (increments each time UO is returned)
curl -s localhost:15000/stats | grep upstream_rq_pending_overflow
# Active connections to a specific cluster
curl -s localhost:15000/stats | grep "cluster.my-service.upstream_cx_active"
# Active requests to a specific cluster
curl -s localhost:15000/stats | grep "cluster.my-service.upstream_rq_active"
If upstream_rq_pending_overflow is incrementing rapidly, max_pending_requests is too low for your traffic volume.
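A quick way to see whether the counter is actually moving: sample /stats twice a few seconds apart and diff. A sketch that parses Envoy's plain-text stats format (`name: value`, one per line); the two snapshots here are hypothetical:

```python
import re

def stat_value(stats_text, name):
    """Pull a single counter out of Envoy's plain-text /stats output."""
    m = re.search(rf"^{re.escape(name)}: (\d+)$", stats_text, re.MULTILINE)
    return int(m.group(1)) if m else None

# Two hypothetical snapshots of /stats taken a few seconds apart
before = "cluster.my-service.upstream_rq_pending_overflow: 120\n"
after = "cluster.my-service.upstream_rq_pending_overflow: 180\n"

name = "cluster.my-service.upstream_rq_pending_overflow"
delta = stat_value(after, name) - stat_value(before, name)
print(f"{name} grew by {delta}")  # → ...upstream_rq_pending_overflow grew by 60
```

A nonzero delta during normal traffic means requests are being shed right now, not just that overflow happened at some point in the past.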
Under the hood: Envoy circuit breakers are per-cluster, not per-route. If two routes share the same upstream cluster, they share the same circuit breaker budget. A traffic spike on one route can trip the breaker and starve the other route. Split critical routes into separate clusters if they need independent protection.
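One way to give a critical route its own budget is to define a second cluster pointing at the same backend. A hedged sketch — cluster names and thresholds are illustrative, and each cluster still needs its own load_assignment or EDS config (elided here):

```yaml
clusters:
- name: my-service            # bulk traffic
  circuit_breakers:
    thresholds:
    - max_pending_requests: 64
- name: my-service-critical   # same endpoints, independent breaker budget
  circuit_breakers:
    thresholds:
    - max_pending_requests: 256
```

Routes then reference `my-service-critical` for the traffic that must survive a spike on the bulk path.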
Cluster Health and Endpoint Status¶
# Show all endpoints and their health status
curl -s "localhost:15000/clusters?format=json" | python3 -m json.tool \
| grep -A10 '"address"'
# Count healthy vs unhealthy endpoints per cluster
# (treats eds_health_status == HEALTHY as healthy)
curl -s "localhost:15000/clusters?format=json" \
| python3 -c "
import sys, json
d = json.load(sys.stdin)
for c in d.get('cluster_statuses', []):
    hosts = c.get('host_statuses', [])
    healthy = sum(1 for h in hosts
                  if h.get('health_status', {}).get('eds_health_status') == 'HEALTHY')
    print(f\"{c['name']}: {healthy}/{len(hosts)} healthy\")
"
Access Log Format Patterns¶
Envoy's default text access log format:
[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%"
%RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT%
%DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%
"%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%"
"%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%"
For structured JSON logging (recommended for log aggregation):
typed_config:
  "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
  log_format:
    json_format:
      start_time: "%START_TIME%"
      method: "%REQ(:METHOD)%"
      path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
      response_code: "%RESPONSE_CODE%"
      response_flags: "%RESPONSE_FLAGS%"
      duration_ms: "%DURATION%"
      upstream_cluster: "%UPSTREAM_CLUSTER%"
      upstream_host: "%UPSTREAM_HOST%"
      request_id: "%REQ(X-REQUEST-ID)%"
Header-Based Routing Debug¶
# Send request with specific header to test routing rules
curl -H "x-env: canary" http://my-service/api/v1/health
# Force this request to be trace-sampled, then inspect the trace for routing decisions
curl -H "x-envoy-force-trace: true" http://my-service/api/v1/health
# Check which cluster Envoy routed to by inspecting response headers
curl -v -H "x-debug: 1" http://my-service/api/ 2>&1 | grep -i "x-envoy\|server\|via"
# Test timeout behavior: request that takes longer than route timeout
curl --max-time 30 http://my-service/slow-endpoint -v
Circuit Breaker Tuning¶
Start from real traffic metrics before setting thresholds. The stats below are point-in-time gauges — sample them repeatedly (or scrape them into Prometheus) to estimate a P99:
# P99 active connections (use this as your max_connections ceiling with headroom)
curl -s localhost:15000/stats | grep "upstream_cx_active"
# P99 pending requests (use for max_pending_requests)
curl -s localhost:15000/stats | grep "upstream_rq_pending_active"
# Max concurrent requests observed
curl -s localhost:15000/stats | grep "upstream_rq_active"
Recommended tuning formula:
- max_connections = observed P99 active connections * 2
- max_pending_requests = observed P99 pending * 1.5 (intentionally tight to shed early)
- max_requests = observed P99 concurrent requests * 2
- max_retries = max_requests * retry rate (usually 0.1–0.2)
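Applied to hypothetical observed values, the formula above works out like this (the P99 numbers and retry rate are illustrative inputs, not recommendations):

```python
# Hypothetical P99 values sampled from /stats over a representative window
p99_connections = 120
p99_pending = 8
p99_requests = 60
retry_rate = 0.15  # assumed fraction of requests that get retried

thresholds = {
    "max_connections": p99_connections * 2,
    "max_pending_requests": int(p99_pending * 1.5),   # intentionally tight
    "max_requests": p99_requests * 2,
}
thresholds["max_retries"] = max(1, round(thresholds["max_requests"] * retry_rate))
print(thresholds)
# → {'max_connections': 240, 'max_pending_requests': 12, 'max_requests': 120, 'max_retries': 18}
```

The resulting numbers drop straight into a cluster's `circuit_breakers.thresholds` block.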
Outlier Detection Tuning¶
Default outlier detection settings eject an endpoint after 5 consecutive 5xx errors, with a 30-second base ejection time — often too aggressive for flapping services:
outlier_detection:
  consecutive_5xx: 10              # raise from default 5
  interval: 30s                    # evaluation window
  base_ejection_time: 30s          # start with 30s ejection
  max_ejection_percent: 50         # never eject more than half the pool
  consecutive_gateway_failure: 5
  enforcing_consecutive_5xx: 100   # 100% enforcement (vs 0 = detection only)
Set enforcing_consecutive_5xx: 0 during initial rollout to observe ejections without acting on them.
Scale note: With `max_ejection_percent: 50`, if you have only 2 endpoints, one 5xx burst ejects half your backend. Set `max_ejection_percent` proportional to your fleet size, and never let it go above 50% on small pools.
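The arithmetic behind that warning, as a sketch — this models only the percentage cap with integer floor, not Envoy's full version-dependent ejection accounting:

```python
def cap_ejectable(pool_size: int, max_ejection_percent: int) -> int:
    """Hosts that fit under the max_ejection_percent cap (integer floor)."""
    return pool_size * max_ejection_percent // 100

for n in (2, 3, 10, 50):
    print(f"pool={n:>2}  ejectable at 50% = {cap_ejectable(n, 50)}")
# pool=2 allows 1 ejection: one bad burst removes half your backend
```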