
HAProxy & Nginx for Ops - Street-Level Ops

What experienced ops engineers know about running load balancers in production.

Quick Diagnosis Commands

# HAProxy stats via socket
echo "show stat" | socat stdio /var/run/haproxy.sock | column -t -s ','
echo "show servers state" | socat stdio /var/run/haproxy.sock
echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'Cur|Max|Idle'

# HAProxy config test
haproxy -c -f /etc/haproxy/haproxy.cfg

# Nginx config test
nginx -t
nginx -T  # Show effective config (all includes resolved)

# Nginx active connections
curl -s http://localhost/nginx_status
# Or check the stub_status module output

# Check what's listening
ss -tlnp | grep -E ':(80|443|8080|8404)\b'

# Check backend connectivity from the LB
curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://10.0.1.10:8080/health

# HAProxy 5xx responses from logs (field 11 is the status code with the
# default rsyslog prefix; adjust if your log format differs)
awk '$11 ~ /^5[0-9][0-9]$/ {count++} END {print count+0}' /var/log/haproxy.log
# Quick and dirty: count lines with any 5xx-looking token (may overcount)
grep -cE ' 5[0-9]{2} ' /var/log/haproxy.log

# Nginx error log
tail -50 /var/log/nginx/error.log
grep "upstream" /var/log/nginx/error.log | tail -20

# Connection states
ss -s
ss -tn state time-wait | wc -l
ss -tn state established dst 10.0.1.10 | wc -l

# Check TLS cert served by LB
echo | openssl s_client -connect localhost:443 -servername app.example.com 2>/dev/null | \
    openssl x509 -noout -subject -dates
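Typing the echo-pipe-socat pipeline gets old fast; a small wrapper helps (a sketch, assuming the socket lives at /var/run/haproxy.sock as in the examples above):

```shell
# hasock: send one runtime-API command to HAProxy's admin socket
hasock() {
    local sock="${HAPROXY_SOCK:-/var/run/haproxy.sock}"
    echo "$1" | socat stdio "$sock"
}

# Usage:
#   hasock "show info" | grep CurrConns
#   hasock "show stat" | column -t -s ','
```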

Gotcha: 502 Bad Gateway Spikes During Deploys

You restart a backend server. For 2-5 seconds, HAProxy/Nginx returns 502 errors because the backend is down between the old process stopping and the new one starting.

Fix:

# HAProxy: drain before restart
echo "set server app_servers/app1 state drain" | socat stdio /var/run/haproxy.sock
sleep 30  # Wait for active connections to finish
# Now restart the backend
systemctl restart myapp
# Wait for health check to pass
sleep 10
echo "set server app_servers/app1 state ready" | socat stdio /var/run/haproxy.sock
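The fixed sleep 10 is a guess; a sketch that instead polls "show servers state" until HAProxy actually sees the server running (srv_op_state 2 means running; backend/server names and socket path are taken from the example above):

```shell
# wait_until_up backend/server [socket] -- poll until HAProxy reports UP
wait_until_up() {
    local srv="$1" sock="${2:-/var/run/haproxy.sock}" tries=30 state
    while [ "$tries" -gt 0 ]; do
        # "show servers state": field 2 is be_name, 4 is srv_name, 6 is srv_op_state
        state=$(echo "show servers state" | socat stdio "$sock" | \
            awk -v s="$srv" '$2 "/" $4 == s {print $6}')
        [ "$state" = "2" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# wait_until_up app_servers/app1 && \
#     echo "set server app_servers/app1 state ready" | socat stdio /var/run/haproxy.sock
```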

# Nginx: use upstream health checks with retry
location / {
    proxy_pass http://app_backend;
    proxy_next_upstream error timeout http_502 http_503;
    proxy_next_upstream_tries 2;
    proxy_next_upstream_timeout 5s;
}

Gotcha: Nginx Resolves Upstream DNS Once at Start

You define proxy_pass http://api.internal.example.com:8080; in Nginx. The DNS resolves at startup. The IP changes (container restarted, failover). Nginx keeps sending traffic to the old IP.

Fix:

# Use a variable to force runtime DNS resolution
resolver 10.0.1.10 valid=30s;

server {
    location /api/ {
        set $backend "http://api.internal.example.com:8080";
        # NB: with a variable proxy_pass, Nginx forwards the original
        # request URI unchanged; rewrite first if you must strip /api/
        proxy_pass $backend;
    }
}

Without the resolver directive and variable, Nginx caches DNS at config load time and never re-resolves.

Under the hood: When Nginx sees proxy_pass http://backend:8080; as a literal string, it resolves DNS once during config parsing and bakes the IP into the upstream. The set $backend variable trick works because Nginx must evaluate variables at request time, which forces runtime DNS resolution. This is a deliberate Nginx design choice for performance — but it bites you in dynamic environments.
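One way to check whether a running Nginx is still pointed at stale IPs (a sketch; the upstream port 8080 and hostname come from the example above, and it assumes dig is installed):

```shell
# list the unique peer IPs of nginx's outbound connections on a given port
current_upstream_ips() {
    ss -tnp 2>/dev/null | awk -v port=":$1" \
        'index($0, "nginx") && index($5, port) {split($5, a, ":"); print a[1]}' | sort -u
}

# Compare against what DNS answers right now:
#   diff <(dig +short api.internal.example.com | sort) <(current_upstream_ips 8080)
```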

Gotcha: HAProxy Maxconn Exhaustion

Connections queue but never fail. Response times climb from milliseconds to seconds. The stats page shows qcur (queued connections) increasing.

Fix:

# Check current connections vs maxconn
echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'CurrConns|Maxconn'

# Per-server maxconn (prevent one backend from hogging all connections)
server app1 10.0.1.10:8080 check maxconn 200
server app2 10.0.1.11:8080 check maxconn 200

# Global maxconn must accommodate all backends + overhead
global
    maxconn 50000

# OS-level: increase file descriptor limits
# /etc/security/limits.d/haproxy.conf
# haproxy soft nofile 65535
# haproxy hard nofile 65535
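To see where connections are actually queueing, qcur and qmax are the 3rd and 4th CSV fields of "show stat" (a sketch; socket path as in the examples above):

```shell
# print proxy/server pairs that currently have queued connections
show_queues() {
    awk -F, 'NR == 1 || $3 > 0 {print $1, $2, "qcur=" $3, "qmax=" $4}'
}

# echo "show stat" | socat stdio /var/run/haproxy.sock | show_queues
```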

Gotcha: Nginx worker_connections Too Low

Under load, you see worker_connections are not enough in the error log. Nginx stops accepting new connections.

Fix:

# Each proxied client consumes 2 connections/file descriptors in the worker
# (one client-side, one upstream-side)
# For 4000 concurrent clients, you need 8000+ file descriptors per worker

worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 8192;
    multi_accept on;
}

The formula for a proxying Nginx: max_clients = worker_processes * worker_connections / 2, because each proxied client holds both a client-side and an upstream-side connection. With 4 workers and 8192 connections each, you handle ~16,000 concurrent clients.
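The arithmetic for a pure proxy, as a quick sanity check (a sketch; 4 workers and 8192 connections assumed):

```shell
workers=4
conns=8192
# each proxied client holds 2 of the worker's connections (client + upstream)
max_clients=$(( workers * conns / 2 ))
echo "max concurrent proxied clients: $max_clients"   # prints 16384
```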

Gotcha: Lost Client IP Behind the Proxy

Backend access logs show 10.0.1.1 (the LB) for every request instead of the real client IP. Rate limiting and geolocation are broken.

Fix:

# Nginx: set headers
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

# Backend application must read X-Forwarded-For header
# AND trust the proxy IP (otherwise anyone can spoof the header)

# HAProxy: use PROXY protocol for TCP mode
frontend tcp_front
    mode tcp
    bind *:443
    default_backend app_servers

backend app_servers
    mode tcp
    server app1 10.0.1.10:8080 send-proxy-v2

# Backend must support PROXY protocol
# Nginx backend:
# listen 8080 proxy_protocol;
# set_real_ip_from 10.0.1.1;
# real_ip_header proxy_protocol;
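Why trusting the proxy matters: anyone can send an X-Forwarded-For header, but a proxy using $proxy_add_x_forwarded_for appends the address it actually saw, so with exactly one trusted proxy the real client IP is the last entry in the chain. A minimal parse (hypothetical helper, not part of any config above):

```shell
# print the last entry of a comma-separated X-Forwarded-For value
xff_client_ip() {
    echo "$1" | awk -F', *' '{print $NF}'
}

# attacker-supplied entry first, trusted-proxy-appended entry last:
xff_client_ip '6.6.6.6, 203.0.113.7'   # prints 203.0.113.7
```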

Pattern: Zero-Downtime HAProxy Reload

# HAProxy supports hitless reloads (no dropped connections)
# The old process finishes existing connections while the new one accepts new ones

# Test config first
haproxy -c -f /etc/haproxy/haproxy.cfg

# Reload (not restart)
systemctl reload haproxy

# Verify
echo "show info" | socat stdio /var/run/haproxy.sock | grep Pid
# Compare with: pgrep haproxy
# You may see two PIDs briefly during handoff
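The steps above can be wrapped into one guarded helper (a sketch; unit name, config path, and socket path as in the examples):

```shell
safe_reload() {
    # refuse to reload if the new config does not parse
    haproxy -c -f /etc/haproxy/haproxy.cfg || {
        echo "config invalid, not reloading" >&2
        return 1
    }
    systemctl reload haproxy
    # confirm the new process picked up
    echo "show info" | socat stdio /var/run/haproxy.sock | grep Pid
}
```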

Pattern: Gradual Traffic Shift for Canary Deploys

#!/bin/bash
# canary-shift.sh - gradually shift traffic from blue to green
SOCKET="/var/run/haproxy.sock"

for pct in 10 25 50 75 100; do
    echo "Shifting $pct% to green..."
    echo "set weight app_servers/green $pct" | socat stdio "$SOCKET"
    echo "set weight app_servers/blue $((100 - pct))" | socat stdio "$SOCKET"

    # Check error rate for 60 seconds; hrsp_5xx (CSV field 44) is a
    # cumulative counter, so diff it across the window
    BEFORE=$(echo "show stat" | socat stdio "$SOCKET" | \
        awk -F, '$2 == "green" {print $44}')
    sleep 60
    AFTER=$(echo "show stat" | socat stdio "$SOCKET" | \
        awk -F, '$2 == "green" {print $44}')

    if [ "$((AFTER - BEFORE))" -gt 10 ]; then
        echo "Error rate too high, rolling back"
        echo "set weight app_servers/green 0" | socat stdio "$SOCKET"
        echo "set weight app_servers/blue 100" | socat stdio "$SOCKET"
        exit 1
    fi
done

echo "Canary complete: 100% on green"

Pattern: Nginx as a Simple TCP/UDP Load Balancer

# Load balance non-HTTP protocols (database, Redis, DNS)
stream {
    upstream postgres_backend {
        least_conn;
        server 10.0.1.10:5432 max_fails=2 fail_timeout=30s;
        server 10.0.1.11:5432 max_fails=2 fail_timeout=30s;
    }

    upstream dns_backend {
        server 10.0.1.20:53;
        server 10.0.1.21:53;
    }

    server {
        listen 5432;
        proxy_pass postgres_backend;
        proxy_timeout 300s;
        proxy_connect_timeout 5s;
    }

    server {
        listen 53 udp;
        proxy_pass dns_backend;
        proxy_timeout 5s;
    }
}
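A quick end-to-end probe of those listeners (a sketch using bash's /dev/tcp; real client tools like pg_isready or dig give a more meaningful answer):

```shell
# check_tcp host port -- true if something accepts a TCP connection within 2s
check_tcp() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

check_tcp 127.0.0.1 5432 && echo "postgres listener up" || echo "postgres listener DOWN"
```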

War story: An Nginx upstream had three backends. One backend started returning 502 errors due to a bad deploy. Nginx's default behavior is to mark the backend as "unavailable" after max_fails consecutive failures — but the default max_fails is 1, and fail_timeout is 10 seconds. So Nginx would send one request to the bad backend, get a 502, mark it down for 10 seconds, then try again. Users saw a 502 every 30 seconds. Adding proxy_next_upstream error timeout http_502 fixed the user-facing impact immediately by retrying on the next backend.
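A sketch of the upstream tuning from that incident (server names and thresholds are illustrative):

```nginx
upstream app_backend {
    # tolerate a couple of failures before marking a server down,
    # and keep it out of rotation longer once it is down
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://app_backend;
        # retry the next server instead of surfacing the 502
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }
}
```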

Emergency: All Backends Down

# 1. Check backend status
echo "show servers state" | socat stdio /var/run/haproxy.sock
# or, via the stats CSV endpoint (field 18 is the status column)
curl -s "http://localhost:8404/stats;csv" | awk -F, '{print $1, $2, $18}'

# 2. Test backend connectivity directly
for srv in 10.0.1.10 10.0.1.11 10.0.1.12; do
    echo -n "$srv: "
    curl -s -o /dev/null -w "%{http_code}" "http://$srv:8080/health"
    echo
done

# 3. If backends are up but health checks fail:
# Check the health check endpoint specifically
curl -v http://10.0.1.10:8080/health

# 4. Emergency: force a server UP (bypass health checks)
echo "set server app_servers/app1 state ready" | socat stdio /var/run/haproxy.sock
# WARNING: This overrides health checks — only use in emergencies

# 5. Serve a maintenance page
# HAProxy: add to frontend
# errorfile 503 /etc/haproxy/errors/maintenance.http
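The errorfile target must be a complete, raw HTTP response stored on disk; a minimal sketch for the path in the comment above:

```http
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

<html><body><h1>Down for maintenance</h1><p>Back shortly.</p></body></html>
```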

Gotcha: Heavyweight Health Check Endpoints

HAProxy's option httpchk GET /health sends a full HTTP request to backends for health checking. If the health endpoint is slow (e.g., it queries the database), the health check itself can contribute to backend overload during a degradation event. Use a lightweight health endpoint that checks only local state (process is alive, can allocate memory) — never one that depends on downstream services.
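A sketch of a lightweight check along those lines (the /healthz endpoint name and the timings are assumptions):

```haproxy
backend app_servers
    # /healthz should answer 200 from local state only -- no DB calls
    option httpchk GET /healthz
    http-check expect status 200
    # probe every 2s; mark down after 3 misses, up after 2 passes
    default-server inter 2s fall 3 rise 2
    server app1 10.0.1.10:8080 check
```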

Emergency: LB Itself Is Overloaded

# 1. Check HAProxy resource usage
echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'Cur|Max|Ulimit'
top -p $(pgrep haproxy | head -1)

# 2. Check for file descriptor exhaustion
ls /proc/$(pgrep haproxy | head -1)/fd | wc -l
cat /proc/$(pgrep haproxy | head -1)/limits | grep "open files"

# 3. Check for SYN flood / connection flood
ss -s
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

# 4. Emergency: enable connection rate limiting in HAProxy
# Add to frontend:
# stick-table type ip size 100k expire 30s store conn_rate(10s)
# tcp-request connection track-sc0 src
# tcp-request connection reject if { sc_conn_rate(0) gt 100 }

# 5. Nginx: check worker status
nginx -T 2>/dev/null | grep worker
grep -c "worker_connections are not enough" /var/log/nginx/error.log