
Linux Kernel Tuning - Street-Level Ops

What experienced Linux operators know from years of tuning production servers under load.

Quick Diagnostic Commands

# Show all current sysctl values
sysctl -a 2>/dev/null

# Check a specific value
sysctl net.core.somaxconn

# Check via /proc/sys directly
cat /proc/sys/net/core/somaxconn

# See which sysctl.d files are active (systemd)
systemd-analyze cat-config sysctl.d

# Show current file descriptor usage vs limit
cat /proc/sys/fs/file-nr
# 12416    0    2097152
# (allocated  free  maximum)

# Per-process file descriptor count
ls /proc/$(pidof -s nginx)/fd | wc -l
# (pidof -s returns a single PID — nginx runs a master plus workers)

# Check ulimit for a running process
cat /proc/$(pidof -s nginx)/limits | grep "open files"
# Max open files    1048576    1048576    files

# Check current I/O scheduler
cat /sys/block/sda/queue/scheduler

# NUMA topology
numactl --hardware

# THP status
cat /sys/kernel/mm/transparent_hugepage/enabled

# Current kernel command line
cat /proc/cmdline

# Kernel version (behavior varies by version)
uname -r

Tuning for High-Connection Servers (nginx / HAProxy)

You deployed nginx and it works fine with 100 concurrent connections. At 10,000, you start seeing connection drops. At 50,000, the server is unresponsive even though CPU and memory are fine.

Diagnosis:

# Check for SYN/accept queue overflow
nstat -az | grep -i listen
# TcpExtListenOverflows    12847    0.0
# TcpExtListenDrops        12847    0.0

# Check accept queue overflow
ss -ltn | head
# State    Recv-Q  Send-Q  Local Address:Port
# LISTEN   129     128     0.0.0.0:80
#          ^^^     ^^^
#          current queue (Recv-Q) vs backlog limit (Send-Q): 129 > 128, overflowing

# Check for "too many open files" in logs
journalctl -u nginx --since "1 hour ago" | grep -i "open files"
dmesg | grep "Too many open files"

Fix:

# 1. Kernel network tuning
cat > /etc/sysctl.d/99-highconn.conf <<EOF
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.ip_local_port_range = 1024 65535
EOF
sysctl --system

# 2. File descriptor limits for the nginx systemd unit
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=1048576
EOF
systemctl daemon-reload
systemctl restart nginx

# 3. nginx.conf: increase worker_connections and backlog
#    (the effective accept queue is min(listen backlog, net.core.somaxconn),
#    so both must be raised)
# worker_connections 65535;
# listen 80 backlog=65535;

# 4. Verify
ss -ltn 'sport = :80'
# Send-Q should now show 65535

Tuning for Database Workloads

Databases need consistent latency, not raw throughput. The key controls are swappiness, dirty page flushing, huge pages, and overcommit.

PostgreSQL tuning:

# 1. Prevent swapping database buffers
sysctl -w vm.swappiness=1

# 2. Flush dirty pages more frequently (reduce checkpoint spikes)
sysctl -w vm.dirty_background_ratio=3
sysctl -w vm.dirty_ratio=10
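
To see what those percentages mean in absolute terms, a quick back-of-the-envelope check (the 64 GB figure is just an example — substitute your own RAM size):

```shell
# Back-of-the-envelope: dirty thresholds on a hypothetical 64 GB host
ram_kb=$((64 * 1024 * 1024))
dirty_background_kb=$((ram_kb * 3 / 100))    # background flusher starts here
dirty_kb=$((ram_kb * 10 / 100))              # writers are throttled above this
echo "background flush above $((dirty_background_kb / 1024)) MB"
echo "write throttling above $((dirty_kb / 1024)) MB"
```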

# 3. Disable THP (causes latency spikes during compaction)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# 4. Explicit huge pages for shared_buffers
# PostgreSQL shared_buffers = 8GB → need 8192MB / 2MB = 4096 huge pages
sysctl -w vm.nr_hugepages=4096

# Verify huge pages are allocated
grep HugePages /proc/meminfo
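
The same arithmetic for other shared_buffers sizes, as a tiny helper that reads the huge page size from /proc/meminfo (the function name is illustrative; it assumes 2 MB pages when meminfo is unreadable):

```shell
# Huge pages needed for a given shared_buffers size (illustrative helper)
hugepages_for() {                 # $1 = shared_buffers in MB
    page_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo 2>/dev/null)
    page_mb=$(( ${page_kb:-2048} / 1024 ))
    echo $(( ($1 + page_mb - 1) / page_mb ))    # round up
}
hugepages_for 8192    # 8 GB with 2 MB pages -> 4096
```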

# 5. PostgreSQL: enable huge pages in postgresql.conf
# huge_pages = try

# 6. Strict overcommit (optional, prevents OOM surprises)
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80

Elasticsearch tuning:

# Elasticsearch needs high max_map_count for mmap
sysctl -w vm.max_map_count=262144

# Disable THP (Elasticsearch docs explicitly require this)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Elasticsearch manages its own heap — do not use huge pages for JVM
# Set swappiness low
sysctl -w vm.swappiness=1

# File descriptor limit (Elasticsearch opens thousands of files)
# In /etc/systemd/system/elasticsearch.service.d/override.conf:
# [Service]
# LimitNOFILE=1048576
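
These settings can be persisted the same way as the network tuning earlier (the filename is a suggestion; any high-numbered file in /etc/sysctl.d/ works):

```shell
# Persist the Elasticsearch kernel settings (suggested filename)
cat > /etc/sysctl.d/99-elasticsearch.conf <<EOF
vm.max_map_count = 262144
vm.swappiness = 1
EOF
sysctl --system
```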

Tuning for Container Hosts

Kubernetes nodes need specific sysctls for networking, inotify, and conntrack.

# /etc/sysctl.d/99-k8s-node.conf

# Required for iptables-based kube-proxy
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1

# inotify limits (kubelet, containerd, and every pod watch files)
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288

# Connection tracking (conntrack table full = packets dropped silently)
net.netfilter.nf_conntrack_max = 1048576

# File descriptors (node + all containers share the kernel limit)
fs.file-max = 2097152

# Pods may need higher max_map_count (Elasticsearch sidecars, etc.)
vm.max_map_count = 262144

# High-connection tuning (load balancer pods, ingress controllers)
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Apply and verify
sysctl --system
sysctl net.bridge.bridge-nf-call-iptables
# net.bridge.bridge-nf-call-iptables = 1

# If br_netfilter is not loaded, the net.bridge.* keys do not exist;
# sysctl --system logs an error for them but continues, so the values
# never actually apply
modprobe br_netfilter
echo br_netfilter > /etc/modules-load.d/br_netfilter.conf  # persist across reboots
sysctl --system

Note: Kubernetes pod-level sysctls (safe and unsafe) are set via the pod's securityContext.sysctls field, not on the host. Host-level sysctls affect all pods on the node.
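
For contrast, a pod-level sysctl request looks like this. Note that net.core.somaxconn is namespaced but classed as unsafe, so the kubelet must permit it via --allowed-unsafe-sysctls (pod and image names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-demo                # illustrative name
spec:
  securityContext:
    sysctls:
      - name: net.core.somaxconn   # namespaced; needs --allowed-unsafe-sysctls
        value: "65535"
  containers:
    - name: app
      image: nginx                 # illustrative image
```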

Debugging "Too Many Open Files"

This is the single most common kernel tuning issue you will encounter.

# 1. Who is hitting the limit?
# Check system-wide usage
cat /proc/sys/fs/file-nr
# 98304    0    100000    ← 98% usage, about to hit the wall

# 2. Which process is the culprit?
for pid in /proc/[0-9]*/fd; do
    count=$(ls "$pid" 2>/dev/null | wc -l)
    [ "$count" -gt 1000 ] && echo "$pid: $count fds"
done

# More efficient: use lsof
lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
#   45821 12345     ← PID 12345 has 45821 open fds
#    8204 12346

# 3. What is it opening?
ls -la /proc/12345/fd | tail -20
# Many sockets? File handles? Pipes?

lsof -p 12345 | head -20

# 4. Check the per-process limit
cat /proc/12345/limits | grep "open files"
# Limit          Soft     Hard     Units
# Max open files 1024     1024     files     ← way too low

# 5. Fix: increase at every level
sysctl -w fs.file-max=2097152                          # kernel
sysctl -w fs.nr_open=2097152                           # per-process ceiling

# For systemd services: add LimitNOFILE to the unit
# For login sessions: /etc/security/limits.conf
# For the running process: restart required to pick up new limits
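
For the login-session path mentioned above, the limits.conf syntax looks like this (values mirror the systemd example; adjust to taste):

```shell
# /etc/security/limits.conf (or a file under /etc/security/limits.d/)
# <domain>  <type>  <item>   <value>
*           soft    nofile   65536
*           hard    nofile   1048576
# PAM applies this to new login sessions only; existing sessions keep
# their old limits until the user logs in again
```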

Connection Timeouts Under Load

Users report intermittent connection timeouts. The server is not CPU- or memory-bound.

# 1. Check for listen queue overflows
nstat -az | grep -E "ListenOverflows|ListenDrops"
# TcpExtListenOverflows    5847
# TcpExtListenDrops        5847

# 2. Check current backlog vs limit
ss -ltn
# LISTEN  129  128  0.0.0.0:80
#         ^^^  ^^^
#         current queue size / max allowed

# 3. Check for SYN floods or excessive half-open connections
ss -s
# TCP:   85432 (estab 45200, closed 12000, orphaned 340, timewait 28000)

ss -tn state syn-recv | wc -l
netstat -s | grep -i "SYNs to LISTEN"

# 4. Check for port exhaustion on outbound connections
ss -s | grep "timewait"
ss -tn state time-wait | wc -l
sysctl net.ipv4.ip_local_port_range
# 32768    60999    ← only ~28000 ports available
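
The arithmetic behind that note: each TIME_WAIT connection holds its ephemeral port for roughly 60 seconds, which caps the sustained outbound connection rate to any single destination ip:port:

```shell
# Default ephemeral range from the output above
low=32768; high=60999
ports=$((high - low + 1))
echo "$ports ephemeral ports"               # 28232
# With TIME_WAIT holding each port ~60s, the sustained ceiling for new
# outbound connections to one destination ip:port is roughly:
echo "$((ports / 60)) connections/sec"      # ~470
```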

# 5. Fix
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Also fix the application's listen backlog

Applying sysctl Changes Correctly

Gotcha: Files in /etc/sysctl.d/ are loaded in lexicographic order. A value set in 99-custom.conf overrides the same value in 50-defaults.conf. If your tuning is not taking effect, check whether a higher-numbered file (or /etc/sysctl.conf, which loads last on some distros) is clobbering your setting. Use sysctl --system 2>&1 | grep <param> to see which file wins.
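
A quick way to convince yourself of the ordering, using throwaway files rather than your real config:

```shell
# Simulate sysctl.d ordering: lexicographically later files win
dir=$(mktemp -d)
echo 'net.core.somaxconn = 128'   > "$dir/50-defaults.conf"
echo 'net.core.somaxconn = 65535' > "$dir/99-custom.conf"
# Emulate the loader: files in sorted order, last assignment wins
winner=$(cat $(ls "$dir"/*.conf | sort) | awk -F'= ' '/somaxconn/ {v=$2} END {print v}')
echo "effective value: $winner"   # 65535 (from 99-custom.conf)
rm -rf "$dir"
```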

# Apply a single file
sysctl -p /etc/sysctl.d/99-network-tuning.conf

# Apply ALL config files in the right order
# (sysctl.conf + sysctl.d/ + systemd overrides)
sysctl --system

# Verify a specific value took effect
sysctl net.core.somaxconn

# If a value did not stick, check:
# 1. Typo in the parameter name (errors land in the journal, easy to miss)
# 2. Module not loaded (bridge params need br_netfilter)
# 3. Another config file overriding yours (alphabetical order matters)
# 4. A systemd unit resetting the value after boot

# List all sysctl.d files in load order
ls -la /etc/sysctl.d/ /usr/lib/sysctl.d/ /run/sysctl.d/ 2>/dev/null

Checking Current Kernel Tunable Values

# All values, filterable
sysctl -a 2>/dev/null | grep vm.swappiness
sysctl -a 2>/dev/null | grep net.core

# Specific subsystem via /proc/sys
ls /proc/sys/net/ipv4/

# Compare running config against persistent config
# (find values that differ from what is on disk)
diff <(sysctl -a 2>/dev/null | sort) \
     <(cat /etc/sysctl.d/*.conf /etc/sysctl.conf 2>/dev/null | \
       grep -v '^#' | grep -v '^$' | sort)
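
A more targeted version of that diff: walk the keys in one config file and compare each against the live value under /proc/sys (the helper names are illustrative, not standard tools):

```shell
# Map a sysctl key to its /proc/sys path (illustrative helper)
sysctl_path() { echo "/proc/sys/$(echo "$1" | tr . /)"; }

# Report keys in a sysctl conf file whose live value differs
check_drift() {                  # $1 = path to a sysctl .conf file
    grep -Ev '^[[:space:]]*(#|$)' "$1" | while IFS='=' read -r key want; do
        key=$(echo $key); want=$(echo $want)    # trim/normalize whitespace
        have=$(echo $(cat "$(sysctl_path "$key")" 2>/dev/null))
        [ "$have" != "$want" ] && echo "DRIFT: $key live='$have' file='$want'"
    done
}
# Example: check_drift /etc/sysctl.d/99-highconn.conf
```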

Tuning I/O Scheduler for SSD vs HDD

# Identify disk type
cat /sys/block/sda/queue/rotational
# 0 = SSD/NVMe, 1 = HDD

# For NVMe: use none (noop) — the device handles scheduling
echo none > /sys/block/nvme0n1/queue/scheduler

# For SSD (SATA): use mq-deadline or none
echo mq-deadline > /sys/block/sda/queue/scheduler

# For HDD: use mq-deadline (prevents starvation)
echo mq-deadline > /sys/block/sdb/queue/scheduler

# Persistent via udev rule
cat > /etc/udev/rules.d/60-ioscheduler.rules <<'EOF'
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
EOF

# Verify after applying
cat /sys/block/sda/queue/scheduler
# mq-deadline [none] kyber bfq
#             ^^^^^^ active scheduler shown in brackets
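
To survey every block device at once, the bracketed token can be parsed out of the sysfs format shown above:

```shell
# List the active scheduler for every block device
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue
    dev=${f#/sys/block/}; dev=${dev%%/*}
    # Active scheduler is the bracketed token, e.g. "[none]"
    active=$(grep -o '\[[^]]*\]' "$f" | tr -d '[]')
    echo "$dev: ${active:-n/a}"
done
```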

Monitoring Kernel Tunable Effects

Tuning without measuring is guessing. Always establish a baseline before changing anything.

# Network metrics before and after
nstat -az | grep -E "Tcp|Listen|Retrans" > /tmp/before.txt
# ... apply changes, wait for traffic ...
nstat -az | grep -E "Tcp|Listen|Retrans" > /tmp/after.txt
diff /tmp/before.txt /tmp/after.txt

# Memory behavior
vmstat 1 10
# Watch si/so columns (swap in/out) — should be near zero after tuning

# Dirty page flushing
grep -E "Dirty|Writeback" /proc/meminfo
# Dirty:           1245632 kB
# Writeback:            0 kB

# I/O latency (requires ioping or fio)
ioping -c 10 /dev/sda
# 4 us — SSD with none scheduler
# vs 850 us — same SSD with bfq scheduler under load

# File descriptor pressure
watch -n 1 'cat /proc/sys/fs/file-nr'

# TCP connection states (watch for TIME_WAIT buildup)
ss -s

# Node-exporter (Prometheus) exposes most of these as metrics:
# node_filefd_allocated, node_filefd_maximum
# node_netstat_Tcp_*, node_vmstat_*
# node_memory_SwapFree_bytes, node_memory_Dirty_bytes

Pattern: Pre-Tuning Checklist

Before changing any kernel parameter in production:

  1. Document the current value: sysctl <param> and save the output
  2. Understand the parameter: read Documentation/admin-guide/sysctl/ in the kernel source or man sysctl.conf
  3. Test in staging first: apply the change, run your load test, compare metrics
  4. Apply at runtime first: sysctl -w to test; only persist after validation
  5. Monitor for regressions: watch for OOM kills, latency changes, error rate changes
  6. Persist and deploy: add to /etc/sysctl.d/ via configuration management (Ansible, Chef)
  7. Verify on next reboot: confirm the value survives a reboot
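
Steps 1 and 4 can be wrapped in a small helper so the rollback value is never lost (the function name and log path are illustrative, not a standard tool):

```shell
# Record the old value, then apply the new one at runtime only
# (persist to /etc/sysctl.d/ later, after validation)
tune() {                      # usage: tune <param> <new-value>
    local param=$1 new=$2 old log=/var/tmp/sysctl-changes.log
    old=$(sysctl -n "$param") || return 1
    echo "$(date -Is) $param old=$old new=$new" >> "$log"
    sysctl -w "$param=$new"
}
# Roll back by replaying the old= value recorded in the log
# Example (as root): tune net.core.somaxconn 65535
```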