Portal | Level: L2 | Domain: Kubernetes

Envoy Proxy — Street-Level Ops

Admin Interface

Envoy exposes a local admin interface (default port 9901, or 15000 in Istio sidecars). Never expose this externally — it allows runtime config changes and log-level overrides.

Default trap: Istio sidecars use port 15000 for admin, not Envoy's default 9901. If you're debugging an Istio mesh and curl localhost:9901 returns nothing, try 15000.
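Probing both ports by hand gets old. A small helper, a sketch of my own (not part of Envoy or Istio), that tries the candidate admin ports' /ready endpoint in order:

```python
import urllib.request
import urllib.error

def find_admin_port(candidates=(9901, 15000), probe=None, timeout=1.0):
    """Return the first candidate port whose /ready endpoint answers 200.

    `probe` is injectable for testing; the default does a real HTTP GET
    against localhost, so only call this next to a running Envoy.
    """
    if probe is None:
        def probe(port):
            try:
                url = f"http://localhost:{port}/ready"
                with urllib.request.urlopen(url, timeout=timeout) as r:
                    return r.status == 200
            except (urllib.error.URLError, OSError):
                return False
    for port in candidates:
        if probe(port):
            return port
    return None
```

Run it from inside the pod (or via kubectl exec) so localhost is the proxy's network namespace.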

# Dump the full running configuration (listeners, clusters, routes, endpoints)
curl -s localhost:15000/config_dump | python3 -m json.tool | less

# Dump only clusters with their current endpoint health status
curl -s "localhost:15000/clusters?format=json" | python3 -m json.tool

# List all active listeners
curl -s localhost:15000/listeners

# Dump all stats (counters, gauges, histograms)
curl -s localhost:15000/stats

# Filter stats to circuit breaker state
curl -s localhost:15000/stats | grep circuit_breaker

# Filter stats to upstream retry counters
curl -s localhost:15000/stats | grep upstream_rq_retry

# Filter stats to 5xx rates
curl -s localhost:15000/stats | grep upstream_rq_5xx

# Stats in Prometheus exposition format (for scraping)
curl -s "localhost:15000/stats?format=prometheus"

# Check current log levels
curl -s localhost:15000/logging

# Set the "connection" component logger to debug (reverts on restart)
curl -X POST "localhost:15000/logging?connection=debug"

# Reset all log levels to warning
curl -X POST "localhost:15000/logging?level=warning"

# Healthcheck endpoint (useful in init containers)
curl -s localhost:15000/ready

Reading config_dump

/config_dump returns a large JSON blob. Key sections:

# Extract only static listeners
curl -s localhost:15000/config_dump \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
for c in d['configs']:
    if c['@type'].endswith('ListenersConfigDump'):
        print(json.dumps(c, indent=2))
"

# Extract cluster names and their load assignment
curl -s "localhost:15000/clusters?format=json" \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
for c in d.get('cluster_statuses', []):
    print(c['name'], '—', len(c.get('host_statuses', [])), 'hosts')
"

The config_dump endpoint can return tens of MB in large meshes. Pipe through python3 -m json.tool | grep -A5 '"name"' to locate a specific cluster or listener without paging through the whole blob.
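Before grepping blindly, it can help to see which sections the dump even contains. A sketch of my own (function name is mine) that maps each section's @type suffix to how many top-level keys it carries:

```python
import json

def summarize_config_dump(dump):
    """Map each config_dump section (by its @type suffix) to a key count.

    Gives a cheap overview of a multi-MB dump before diving into any
    one section. Expects the dict shape returned by /config_dump.
    """
    summary = {}
    for section in dump.get("configs", []):
        kind = section.get("@type", "").rsplit(".", 1)[-1]
        # Count top-level entries in the section besides @type itself.
        summary[kind] = sum(1 for k in section if k != "@type")
    return summary
```

Feed it with json.load(sys.stdin) from a curl pipe, the same way as the extraction one-liners below.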


Diagnosing 503s with Response Flags

Access log response flags are the fastest path to root cause:

Flag | Meaning                             | Common cause
UF   | Upstream connection failure         | Upstream pod crashed, network policy, wrong port
UO   | Upstream overflow (circuit breaker) | Circuit breaker thresholds too low, traffic spike
NR   | No route match                      | Missing route, wrong Host header, VirtualService misconfiguration
URX  | Retry exhausted                     | Upstream returning 5xx, retry budget exceeded
UT   | Upstream request timeout            | Upstream too slow, timeout too tight
RL   | Rate limited                        | Rate limit policy triggered
DC   | Downstream connection terminated    | Client closed before response (usually not an Envoy bug)
LH   | Local service health check failed   | Envoy health check misconfigured

Debug clue: Response flags are the single fastest path to root-causing Envoy 503s. Skip the application logs and start with grep " 503 " access.log | awk '{print $6}' | sort | uniq -c | sort -rn — in the default text format, the response flags are the sixth whitespace-separated field, right after the response code. If UO dominates, raise circuit breaker thresholds. If NR dominates, check your route config.

# Count 503s by response flag in an Envoy access log
# (default text format: flags are field 6, after the response code)
grep " 503 " /var/log/envoy/access.log \
  | awk '{print $6}' \
  | sort | uniq -c | sort -rn

# Istio sidecar access log (JSON format)
kubectl logs <pod> -c istio-proxy \
  | python3 -c "
import sys, json
for line in sys.stdin:
    try:
        r = json.loads(line)
        # response_code may be a number or a string depending on log config
        if str(r.get('response_code')) == '503':
            print(r.get('response_flags'), r.get('upstream_cluster'), r.get('path'))
    except json.JSONDecodeError:
        pass
"

Checking Circuit Breaker State

# Is the circuit breaker open right now?
curl -s localhost:15000/stats | grep "circuit_breakers\|cx_open\|rq_open\|rq_pending_open"

# Upstream overflow counter (increments each time UO is returned)
curl -s localhost:15000/stats | grep upstream_rq_pending_overflow

# Active connections to a specific cluster
curl -s localhost:15000/stats | grep "cluster.my-service.upstream_cx_active"

# Active requests to a specific cluster
curl -s localhost:15000/stats | grep "cluster.my-service.upstream_rq_active"

If upstream_rq_pending_overflow is incrementing rapidly, max_pending_requests is too low for your traffic volume.
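To put a number on "incrementing rapidly", take two /stats snapshots a few seconds apart and compute the rate. A sketch with helper names of my own (the `cluster.<name>.<stat>: <value>` line format is what /stats actually emits):

```python
def stat_value(stats_text, name):
    """Extract one counter/gauge from the plain-text /stats output."""
    for line in stats_text.splitlines():
        if line.startswith(name + ":"):
            return int(line.split(":", 1)[1].strip())
    return None

def overflow_rate(sample_a, sample_b, cluster, interval_s):
    """Overflow increments per second between two /stats snapshots."""
    name = f"cluster.{cluster}.upstream_rq_pending_overflow"
    a, b = stat_value(sample_a, name), stat_value(sample_b, name)
    if a is None or b is None:
        return None
    return (b - a) / interval_s
```

A sustained nonzero rate means requests are being shed right now, not just that overflow happened at some point since the proxy started.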

Under the hood: Envoy circuit breakers are per-cluster, not per-route. If two routes share the same upstream cluster, they share the same circuit breaker budget. A traffic spike on one route can trip the breaker and starve the other route. Split critical routes into separate clusters if they need independent protection.
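To find which clusters are shared, count routes per upstream cluster in a route config. A sketch of my own that walks the `virtual_hosts` -> `routes` -> `route.cluster` shape found in a RoutesConfigDump entry (weighted-cluster routes are skipped for simplicity):

```python
from collections import defaultdict

def routes_per_cluster(route_config):
    """Count how many routes point at each upstream cluster.

    Any cluster with count > 1 is shared: those routes draw on a
    single circuit-breaker budget.
    """
    counts = defaultdict(int)
    for vh in route_config.get("virtual_hosts", []):
        for r in vh.get("routes", []):
            cluster = r.get("route", {}).get("cluster")
            if cluster:
                counts[cluster] += 1
    return dict(counts)
```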


Cluster Health and Endpoint Status

# Show all endpoints and their health status
curl -s "localhost:15000/clusters?format=json" | python3 -m json.tool \
  | grep -A10 '"address"'

# Count healthy vs unhealthy endpoints per cluster
curl -s "localhost:15000/clusters?format=json" \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
for c in d.get('cluster_statuses', []):
    hosts = c.get('host_statuses', [])
    healthy = sum(1 for h in hosts
                  if h.get('health_status', {}).get('eds_health_status') == 'HEALTHY'
                  and not h.get('health_status', {}).get('failed_active_health_check'))
    print(f\"{c['name']}: {healthy}/{len(hosts)} healthy\")
"

Access Log Format Patterns

Envoy's default text access log format:

[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%"
%RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT%
%DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%
"%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%"
"%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%"

For structured JSON logging (recommended for log aggregation):

typed_config:
  "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
  log_format:
    json_format:
      start_time: "%START_TIME%"
      method: "%REQ(:METHOD)%"
      path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
      response_code: "%RESPONSE_CODE%"
      response_flags: "%RESPONSE_FLAGS%"
      duration_ms: "%DURATION%"
      upstream_cluster: "%UPSTREAM_CLUSTER%"
      upstream_host: "%UPSTREAM_HOST%"
      request_id: "%REQ(X-REQUEST-ID)%"

Header-Based Routing Debug

# Send request with specific header to test routing rules
curl -H "x-env: canary" http://my-service/api/v1/health

# Force trace sampling for this request (requires tracing to be configured)
curl -H "x-envoy-force-trace: true" http://my-service/api/v1/health

# Check which cluster Envoy routed to by inspecting response headers
curl -v -H "x-debug: 1" http://my-service/api/ 2>&1 | grep -i "x-envoy\|server\|via"

# Test timeout behavior: request that takes longer than route timeout
curl --max-time 30 http://my-service/slow-endpoint -v

Circuit Breaker Tuning

Start with real traffic metrics before setting thresholds:

# Active connections right now; these stats are instantaneous gauges,
# so sample repeatedly to estimate P99 for max_connections
curl -s localhost:15000/stats | grep "upstream_cx_active"

# Pending requests right now (sample over time for max_pending_requests)
curl -s localhost:15000/stats | grep "upstream_rq_pending_active"

# Concurrent requests right now (sample over time for max_requests)
curl -s localhost:15000/stats | grep "upstream_rq_active"

Recommended tuning formula:
- max_connections = observed P99 active connections * 2
- max_pending_requests = observed P99 pending * 1.5 (intentionally tight to shed early)
- max_requests = observed P99 concurrent requests * 2
- max_retries = max_requests * retry rate (usually 0.1–0.2)
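Those rules of thumb are easy to wrap in a small calculator; a sketch of my own (function and parameter names are mine, and the multipliers are the heuristics above, not Envoy defaults):

```python
import math

def circuit_breaker_thresholds(p99_cx, p99_pending, p99_rq, retry_rate=0.1):
    """Derive circuit breaker thresholds from observed P99 traffic values.

    Rounds up so a fractional observation never produces a threshold
    below the observed load.
    """
    max_requests = math.ceil(p99_rq * 2)
    return {
        "max_connections": math.ceil(p99_cx * 2),
        "max_pending_requests": math.ceil(p99_pending * 1.5),
        "max_requests": max_requests,
        "max_retries": max(1, math.ceil(max_requests * retry_rate)),
    }
```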


Outlier Detection Tuning

Default outlier detection settings eject an endpoint after 5 consecutive 5xx errors, with a 30-second base ejection time. That is often too aggressive for flapping services:

outlier_detection:
  consecutive_5xx: 10           # raise from default 5
  interval: 30s                 # evaluation window
  base_ejection_time: 30s       # start with 30s ejection
  max_ejection_percent: 50      # never eject more than half the pool
  consecutive_gateway_failure: 5
  enforcing_consecutive_5xx: 100  # 100% enforcement (vs 0 = detection only)

Set enforcing_consecutive_5xx: 0 during initial rollout to observe ejections without acting on them.

Scale note: With max_ejection_percent: 50, if you have only 2 endpoints, one 5xx burst ejects half your backend. Set max_ejection_percent proportional to your fleet size, and never let it go above 50% on small pools.
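The small-pool math is just a floor. An illustrative helper of my own (note that recent Envoy versions may still eject at least one host even when the floor works out to zero, so treat this as the configured ceiling, not a hard guarantee):

```python
import math

def max_ejectable(pool_size, max_ejection_percent):
    """Configured ceiling on simultaneously ejected endpoints.

    floor(pool_size * percent / 100): on a 2-host pool with 50%,
    a single ejection already removes half the backend.
    """
    return math.floor(pool_size * max_ejection_percent / 100)
```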