HAProxy & Nginx - Street-Level Ops¶
What experienced ops engineers know about running load balancers in production.
Quick Diagnosis Commands¶
# HAProxy stats via socket
echo "show stat" | socat stdio /var/run/haproxy.sock | column -t -s ','
echo "show servers state" | socat stdio /var/run/haproxy.sock
echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'Cur|Max|Idle'
# HAProxy config test
haproxy -c -f /etc/haproxy/haproxy.cfg
# Nginx config test
nginx -t
nginx -T # Show effective config (all includes resolved)
# Nginx active connections
curl -s http://localhost/nginx_status
# Or check the stub_status module output
# Check what's listening
ss -tlnp | grep -E ':(80|443|8080|8404)\b'
# Check backend connectivity from the LB
curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://10.0.1.10:8080/health
# HAProxy error rates from logs
grep -cE ' 5[0-9]{2} ' /var/log/haproxy.log   # status code is its own field in httplog format
awk '/ 5[0-9][0-9] /{count++} END{print count+0}' /var/log/haproxy.log
# Nginx error log
tail -50 /var/log/nginx/error.log
grep "upstream" /var/log/nginx/error.log | tail -20
# Connection states
ss -s
ss -tn state time-wait | wc -l
ss -tn state established dst 10.0.1.10 | wc -l
# Check TLS cert served by LB
echo | openssl s_client -connect localhost:443 -servername app.example.com 2>/dev/null | \
openssl x509 -noout -subject -dates
Gotcha: 502 Bad Gateway Spikes During Deploys¶
You restart a backend server. For 2-5 seconds, HAProxy/Nginx returns 502 errors because the backend is down between the old process stopping and the new one starting.
Fix:
# HAProxy: drain before restart
echo "set server app_servers/app1 state drain" | socat stdio /var/run/haproxy.sock
sleep 30 # Wait for active connections to finish
# Now restart the backend
systemctl restart myapp
# Wait for health check to pass
sleep 10
echo "set server app_servers/app1 state ready" | socat stdio /var/run/haproxy.sock
# Nginx: use upstream health checks with retry
location / {
    proxy_pass http://app_backend;
    proxy_next_upstream error timeout http_502 http_503;
    proxy_next_upstream_tries 2;
    proxy_next_upstream_timeout 5s;
}
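The drain/restart/ready dance above generalizes to a rolling restart across every backend. A sketch of a wrapper (server names, backend, and service name are hypothetical; it defaults to dry-run, set DRY_RUN=0 to actually drive the admin socket):

```shell
#!/bin/bash
# rolling-restart.sh - drain each backend, restart it, then re-enable it.
# Hypothetical names throughout: adjust backend, servers, and service.
SOCKET="${SOCKET:-/var/run/haproxy.sock}"
DRY_RUN="${DRY_RUN:-1}"   # defaults to dry-run; set DRY_RUN=0 to apply

lb_cmd() {  # send one command to the HAProxy admin socket
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD SEND: $1"
    else
        echo "$1" | socat stdio "$SOCKET"
    fi
}

for srv in app1 app2; do
    lb_cmd "set server app_servers/$srv state drain"
    if [ "$DRY_RUN" != "1" ]; then
        sleep 30                   # let in-flight requests finish
        systemctl restart myapp    # restart the drained backend
        sleep 10                   # give the health check time to pass
    fi
    lb_cmd "set server app_servers/$srv state ready"
done
```

Restarting one server at a time keeps capacity loss to 1/N; the dry-run default exists because a typo in a drain loop can empty an entire backend.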
Gotcha: Nginx Resolves Upstream DNS Once at Start¶
You define proxy_pass http://api.internal.example.com:8080; in Nginx. The DNS resolves at startup. The IP changes (container restarted, failover). Nginx keeps sending traffic to the old IP.
Fix:
# Use a variable to force runtime DNS resolution
resolver 10.0.1.10 valid=30s;
server {
    location /api/ {
        set $backend "http://api.internal.example.com:8080";
        proxy_pass $backend;
    }
}
Without the resolver directive and variable, Nginx caches DNS at config load time and never re-resolves.
Under the hood: When Nginx sees proxy_pass http://backend:8080; as a literal string, it resolves DNS once during config parsing and bakes the IP into the upstream. The set $backend variable trick works because Nginx must evaluate variables at request time, which forces runtime DNS resolution. This is a deliberate Nginx design choice for performance, but it bites you in dynamic environments.
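The same trick matters most in containerized environments, pointed at the platform resolver. Docker's embedded DNS server, for instance, listens on 127.0.0.11 (the service name `api` here is an assumption):

```nginx
# Docker: the embedded DNS server tracks container restarts
resolver 127.0.0.11 valid=10s ipv6=off;

server {
    location /api/ {
        set $backend "http://api:8080";   # re-resolved at request time
        proxy_pass $backend;
    }
}
```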
Gotcha: HAProxy Maxconn Exhaustion¶
Connections queue but never fail. Response times climb from milliseconds to seconds. The stats page shows qcur (queued connections) increasing.
Fix:
# Check current connections vs maxconn
echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'CurrConns|MaxConn'
# Per-server maxconn (prevent one backend from hogging all connections)
server app1 10.0.1.10:8080 check maxconn 200
server app2 10.0.1.11:8080 check maxconn 200
# Global maxconn must accommodate all backends + overhead
global
    maxconn 50000
# OS-level: increase file descriptor limits
# /etc/security/limits.d/haproxy.conf
# haproxy soft nofile 65535
# haproxy hard nofile 65535
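To see how close you are to the ceiling, compare CurrConns against Maxconn from "show info". A sketch, using a canned here-string in place of the live socket command (the numbers are made up):

```shell
# Stand-in for: echo "show info" | socat stdio /var/run/haproxy.sock
info='Maxconn: 50000
CurrConns: 41234'

cur=$(echo "$info" | awk '/^CurrConns:/{print $2}')
max=$(echo "$info" | awk '/^Maxconn:/{print $2}')
pct=$(( cur * 100 / max ))
echo "connections: $cur/$max (${pct}%)"
[ "$pct" -ge 80 ] && echo "WARNING: above 80% of maxconn" || true
```

Alerting somewhere around 80% gives you time to raise maxconn (and the fd limits above) before connections start queuing.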
Gotcha: Nginx worker_connections Too Low¶
Under load, you see "worker_connections are not enough" in the error log. Nginx stops accepting new connections.
Fix:
# Each connection uses 2 file descriptors (client + upstream)
# For 4000 concurrent connections, you need 8000+ file descriptors per worker
worker_processes auto;
worker_rlimit_nofile 65535;
events {
    worker_connections 8192;
    multi_accept on;
}
The formula: max_connections = worker_processes * worker_connections. With 4 workers and 8192 connections each, that is ~32,000 concurrent connections; since each proxied client consumes two of them (client side + upstream side), that supports roughly 16,000 concurrent clients.
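The capacity arithmetic, spelled out (assuming every request is proxied, so each client consumes two connection slots):

```shell
workers=4
worker_connections=8192

# Raw connection slots across all workers
slots=$(( workers * worker_connections ))
# A proxied client consumes two slots: client side + upstream side
max_clients=$(( slots / 2 ))
echo "$slots connection slots, ~$max_clients proxied clients"
```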
Gotcha: Lost Client IP Behind the Proxy¶
Backend access logs show 10.0.1.1 (the LB) for every request instead of the real client IP. Rate limiting and geolocation are broken.
Fix:
# Nginx: set headers
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Backend application must read X-Forwarded-For header
# AND trust the proxy IP (otherwise anyone can spoof the header)
# HAProxy: use the PROXY protocol in TCP mode
frontend tcp_front
    mode tcp
    bind *:443
    default_backend app_servers

backend app_servers
    mode tcp
    server app1 10.0.1.10:8080 send-proxy-v2

# Backend must support the PROXY protocol
# Nginx backend:
#   listen 8080 proxy_protocol;
#   set_real_ip_from 10.0.1.1;
#   real_ip_header proxy_protocol;
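What the backend actually sees in X-Forwarded-For: each proxy appends the address it observed, so everything left of the entry appended by your own LB is client-controlled. A sketch of safe extraction (the addresses are made up):

```shell
# X-Forwarded-For as the backend receives it: the client can pre-populate
# the header, so the left-hand entries are trivially spoofed
xff="1.2.3.4, 203.0.113.7, 198.51.100.2"

# 198.51.100.2 is what the trusted LB observed and appended:
# with one trusted proxy, trust only the rightmost entry
client=$(echo "$xff" | awk -F', *' '{print $NF}')
echo "real client: $client"
```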
Pattern: Zero-Downtime HAProxy Reload¶
# HAProxy supports hitless reloads (no dropped connections)
# The old process finishes existing connections while the new one accepts new ones
# Test config first
haproxy -c -f /etc/haproxy/haproxy.cfg
# Reload (not restart)
systemctl reload haproxy
# Verify
echo "show info" | socat stdio /var/run/haproxy.sock | grep Pid
# Compare with: pgrep haproxy
# You may see two PIDs briefly during handoff
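Older HAProxy versions could still drop the occasional SYN during the socket handoff. On HAProxy 1.8+ you can make the reload seamless by letting the old process pass its listening sockets to the new one over the stats socket (a sketch; socket path as used above):

```
# haproxy.cfg, global section: enable seamless reload (HAProxy 1.8+)
global
    stats socket /var/run/haproxy.sock mode 600 level admin expose-fd listeners
```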
Pattern: Gradual Traffic Shift for Canary Deploys¶
#!/bin/bash
# canary-shift.sh - gradually shift traffic from blue to green
SOCKET="/var/run/haproxy.sock"

# hrsp_5xx is field 44 in the classic "show stat" CSV; verify the index
# against the header line for your HAProxy version
count_5xx() {
    echo "show stat" | socat stdio "$SOCKET" | awk -F, '$2=="green"{print $44}'
}

for pct in 10 25 50 75 100; do
    echo "Shifting $pct% to green..."
    echo "set weight app_servers/green $pct" | socat stdio "$SOCKET"
    echo "set weight app_servers/blue $((100 - pct))" | socat stdio "$SOCKET"

    # Watch the 5xx delta for 60 seconds (the counter is cumulative)
    BEFORE=$(count_5xx)
    sleep 60
    AFTER=$(count_5xx)
    if [ $((AFTER - BEFORE)) -gt 10 ]; then
        echo "Error rate too high, rolling back"
        echo "set weight app_servers/green 0" | socat stdio "$SOCKET"
        echo "set weight app_servers/blue 100" | socat stdio "$SOCKET"
        exit 1
    fi
done
echo "Canary complete: 100% on green"
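The 5xx column index shifts between HAProxy versions, so looking it up from the CSV header is more durable than hardcoding a field number. A sketch, with canned stats output standing in for the socket call:

```shell
# Stand-in for: echo "show stat" | socat stdio "$SOCKET"
stats='# pxname,svname,hrsp_4xx,hrsp_5xx
app_servers,green,3,12
app_servers,blue,1,0'

# Find the hrsp_5xx column by header name instead of by position
errors=$(echo "$stats" | awk -F, '
    NR==1 { for (i=1; i<=NF; i++) if ($i=="hrsp_5xx") col=i; next }
    $2=="green" { print $col }')
echo "green 5xx total: $errors"
```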
Pattern: Nginx as a Simple TCP/UDP Load Balancer¶
# Load balance non-HTTP protocols (database, Redis, DNS)
stream {
    upstream postgres_backend {
        least_conn;
        server 10.0.1.10:5432 max_fails=2 fail_timeout=30s;
        server 10.0.1.11:5432 max_fails=2 fail_timeout=30s;
    }

    upstream dns_backend {
        server 10.0.1.20:53;
        server 10.0.1.21:53;
    }

    server {
        listen 5432;
        proxy_pass postgres_backend;
        proxy_timeout 300s;
        proxy_connect_timeout 5s;
    }

    server {
        listen 53 udp;
        proxy_pass dns_backend;
        proxy_timeout 5s;
    }
}
War story: An Nginx upstream had three backends. One backend started returning 502 errors due to a bad deploy. Nginx's default behavior is to mark the backend as "unavailable" after max_fails consecutive failures, but the default max_fails is 1 and fail_timeout is 10 seconds. So Nginx would send one request to the bad backend, get a 502, mark it down for 10 seconds, then try again. Users saw a 502 every 30 seconds. Adding proxy_next_upstream error timeout http_502 fixed the user-facing impact immediately by retrying on the next backend.
Emergency: All Backends Down¶
# 1. Check backend status
echo "show servers state" | socat stdio /var/run/haproxy.sock
# or
curl -s "http://localhost:8404/stats;csv" | grep -E "UP|DOWN"
# 2. Test backend connectivity directly
for srv in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo -n "$srv: "
curl -s -o /dev/null -w "%{http_code}" "http://$srv:8080/health"
echo
done
# 3. If backends are up but health checks fail:
# Check the health check endpoint specifically
curl -v http://10.0.1.10:8080/health
# 4. Emergency: force a server UP (bypass health checks)
echo "set server app_servers/app1 state ready" | socat stdio /var/run/haproxy.sock
# WARNING: This overrides health checks — only use in emergencies
# 5. Serve a maintenance page
# HAProxy: add to frontend
# errorfile 503 /etc/haproxy/errors/maintenance.http
Gotcha: Health Checks Can Add Load¶
HAProxy's option httpchk GET /health sends a full HTTP request to backends for health checking. If the health endpoint is slow (e.g., it queries the database), the health check itself can contribute to backend overload during a degradation event. Use a lightweight health endpoint that checks only local state (process is alive, can allocate memory); never one that depends on downstream services.
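A sketch of a lightweight check configuration (the /healthz endpoint name and the timings are assumptions; the point is that the endpoint touches only local state):

```
backend app_servers
    # /healthz returns 200 from local state only - no DB or downstream calls
    option httpchk GET /healthz
    http-check expect status 200
    # check every 2s; mark down after 3 failures, back up after 2 successes
    default-server inter 2s fall 3 rise 2
    server app1 10.0.1.10:8080 check
    server app2 10.0.1.11:8080 check
```

The fall/rise hysteresis keeps a flapping backend from oscillating in and out of rotation on every single check.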
Emergency: LB Itself Is Overloaded¶
# 1. Check HAProxy resource usage
echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'Cur|Max|Ulimit'
top -p $(pgrep haproxy | head -1)
# 2. Check for file descriptor exhaustion
ls /proc/$(pgrep haproxy | head -1)/fd | wc -l
cat /proc/$(pgrep haproxy | head -1)/limits | grep "open files"
# 3. Check for SYN flood / connection flood
ss -s
netstat -an | awk '/tcp/{print $6}' | sort | uniq -c | sort -rn
# 4. Emergency: enable connection rate limiting in HAProxy
# Add to frontend:
# stick-table type ip size 100k expire 30s store conn_rate(10s)
# tcp-request connection track-sc0 src
# tcp-request connection reject if { sc_conn_rate(0) gt 100 }
# 5. Nginx: check worker status
nginx -T 2>/dev/null | grep worker
grep -c "worker_connections are not enough" /var/log/nginx/error.log