
Portal | Level: L2: Operations | Topics: Linux Kernel Tuning, Linux Performance Tuning | Domain: Linux

Linux Kernel Tuning - Primer

Why This Matters

Every production Linux server ships with kernel defaults tuned for a generic desktop workload. Those defaults are safe and broadly compatible, but they are rarely optimal for a web server handling 50,000 concurrent connections, a database managing 200GB of working set, or a container host running 80 pods. Kernel tuning is how you close the gap between "works" and "works well under load."

The difference is not academic. I have seen a single net.core.somaxconn change take an nginx server from dropping connections under load to handling 3x the traffic. I have watched a misconfigured vm.swappiness turn a perfectly sized database server into a thrashing mess during a traffic spike. These are not exotic scenarios. They are Tuesday.

Kernel tuning is also one of the most dangerous areas of system administration. A bad value in /proc/sys/vm/overcommit_memory can cause the OOM killer to never fire — until the box locks up completely. Understanding what each knob does, why the default is what it is, and what changes when you turn it is essential before you touch anything in production.

Core Concepts

1. The sysctl Interface

The kernel exposes tunable parameters through the /proc/sys/ virtual filesystem. Every file under /proc/sys/ corresponds to a kernel parameter. The sysctl command is a convenience wrapper for reading and writing these files.

# Read a parameter via /proc/sys
cat /proc/sys/net/core/somaxconn
# 4096

# Read the same parameter via sysctl
sysctl net.core.somaxconn
# net.core.somaxconn = 4096

# Set a parameter at runtime (non-persistent)
sysctl -w net.core.somaxconn=65535

# Same thing via /proc/sys
echo 65535 > /proc/sys/net/core/somaxconn

The mapping is straightforward: dots in the sysctl name become slashes in the /proc/sys/ path. net.core.somaxconn maps to /proc/sys/net/core/somaxconn.

Remember the mnemonic for the sysctl-to-path translation: "Dots become slashes, root is /proc/sys/." If you forget the sysctl command, you can always cat or echo the /proc/sys/ path directly — the two interfaces are equivalent.
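The translation is mechanical enough to script. A minimal sketch (the parameter name is just an example; note that a handful of sysctl keys contain literal dots, such as per-interface VLAN entries, where this naive substitution breaks down):

```shell
# Convert a sysctl name to its /proc/sys path: dots become slashes
param="net.core.somaxconn"
path="/proc/sys/$(echo "$param" | tr '.' '/')"
echo "$path"
# /proc/sys/net/core/somaxconn
```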

Persistent Configuration

Runtime changes via sysctl -w are lost on reboot. To make them permanent:

# Option 1: /etc/sysctl.conf (traditional, single file)
echo "net.core.somaxconn = 65535" >> /etc/sysctl.conf

# Option 2: /etc/sysctl.d/ (modern, modular — preferred)
cat > /etc/sysctl.d/99-network-tuning.conf <<EOF
# High-connection server tuning
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
EOF

# Apply all persistent config files
sysctl --system

Files in /etc/sysctl.d/ are processed in alphabetical order. Higher-numbered prefixes override lower ones. Use 99- for application-specific tuning to ensure it runs last.
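A quick way to see the "last file wins" behavior without touching real config — this sketch emulates the ordered parse in a temp directory (the file names and vm.swappiness values are made up for the demo):

```shell
dir=$(mktemp -d)
echo "vm.swappiness = 60" > "$dir/50-defaults.conf"
echo "vm.swappiness = 10" > "$dir/99-database.conf"

# sysctl --system reads files in sorted order; for a duplicated key,
# the value parsed last is the one that sticks
winner=$(cat "$dir"/*.conf | awk -F' = ' '/^vm.swappiness/ {v=$2} END {print v}')
echo "$winner"
# 10
rm -r "$dir"
```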

# Verify what is currently loaded
sysctl -a | grep somaxconn

# Check which file set a value (systemd systems)
systemd-analyze cat-config sysctl.d

2. Network Tuning

Network parameters are the most commonly tuned category because network-heavy workloads (web servers, load balancers, API gateways) hit default limits first.

Connection Backlog

When a SYN packet arrives, the kernel places the half-open connection in the SYN backlog. Once the three-way handshake completes, the connection moves to the accept queue. The application calls accept() to pull connections off the accept queue.

Client SYN  →  [ SYN backlog ]  →  handshake completes  →  [ accept queue ]  →  accept()
              tcp_max_syn_backlog                           somaxconn

# /etc/sysctl.d/99-network-tuning.conf

# Max length of the accept queue (per socket)
# Default: 4096 (128 before kernel 5.4) — too low for high-traffic servers
net.core.somaxconn = 65535

# Max half-open connections (SYN received, ACK not yet received)
# Default: 1024 — too low under SYN pressure
net.ipv4.tcp_max_syn_backlog = 65535

# Max packets queued on the INPUT side when the interface
# receives packets faster than the kernel can process them
# Default: 1000 — increase for 10GbE+ interfaces
net.core.netdev_max_backlog = 65535

Important: The application must also request a large backlog. In nginx, this is the backlog parameter on the listen directive; in HAProxy, it is the backlog setting (which defaults to the value of maxconn). The kernel limit caps the application's request — if the application asks for 128 but the kernel allows 65535, you get 128.
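The capping rule is simply a minimum of the two values. A sketch with sample numbers (the somaxconn value here is assumed, not read from a live system):

```shell
app_backlog=128    # what the application passes to listen()
somaxconn=4096     # sample kernel cap (cat /proc/sys/net/core/somaxconn)

# the kernel silently truncates the application's request to somaxconn
if [ "$app_backlog" -lt "$somaxconn" ]; then
  effective=$app_backlog
else
  effective=$somaxconn
fi
echo "effective backlog: $effective"
# effective backlog: 128
```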

TCP Timers and Reuse

# Reuse TIME_WAIT sockets for new outbound connections
# Safe for client-side connections; no effect on server accept()
net.ipv4.tcp_tw_reuse = 1

# How long to wait in FIN_WAIT_2 before killing the socket
# Default: 60 — reduce for servers making many short-lived outbound connections
net.ipv4.tcp_fin_timeout = 15

# TCP keepalive: how long before sending the first keepalive probe
# Default: 7200 (2 hours) — far too long for most applications
net.ipv4.tcp_keepalive_time = 300

# Interval between keepalive probes after the first
net.ipv4.tcp_keepalive_intvl = 30

# Number of unacknowledged probes before declaring the connection dead
net.ipv4.tcp_keepalive_probes = 5

# Ephemeral port range for outbound connections
# Default: 32768-60999 — expand if you make many outbound connections
net.ipv4.ip_local_port_range = 1024 65535
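The port range bounds how many concurrent outbound connections you can hold to a single destination, since each needs a distinct local port for the same (source IP, destination IP, destination port) triple. A quick back-of-envelope with the expanded range above:

```shell
# usable local ports per (source IP, destination IP, destination port)
low=1024; high=65535
awk -v lo="$low" -v hi="$high" 'BEGIN { print hi - lo + 1, "ephemeral ports" }'
# 64512 ephemeral ports
```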

Socket Buffer Sizes

Socket buffers determine how much data can be in-flight for a single connection. For high-bandwidth, high-latency links (cross-region, transcontinental), the default buffers are too small.

# Maximum receive/send socket buffer sizes (bytes)
# These set the ceiling for SO_RCVBUF/SO_SNDBUF
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# TCP auto-tuning buffer sizes: min, default, max (bytes)
# The kernel auto-tunes between min and max per connection
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

The bandwidth-delay product (BDP) determines the optimal buffer size: BDP = bandwidth (bytes/sec) * RTT (seconds). A 1 Gbps link with 50ms RTT needs at least 125000000 * 0.05 = 6.25 MB of buffer to fully utilize the link.
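The arithmetic above, as a small calculator (the link speed and RTT are the example's values):

```shell
# BDP = bandwidth (bytes/sec) * RTT (sec)
# 1 Gbps = 125,000,000 bytes/sec; RTT = 50 ms
awk 'BEGIN { bdp = 125000000 * 0.05; printf "%.0f bytes (%.2f MB)\n", bdp, bdp / 1000000 }'
# 6250000 bytes (6.25 MB)
```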

3. Memory Tuning

Swappiness

vm.swappiness controls how aggressively the kernel swaps anonymous memory (heap, stack) versus dropping file cache pages.

# Range: 0-200 (kernels >= 5.8), 0-100 (older kernels)
# Default: 60

# Database servers: prefer keeping data in memory, tolerate less file cache
vm.swappiness = 10

# Containers/general: default is usually fine
vm.swappiness = 60

# Never swap (almost): the kernel will still swap under extreme pressure
vm.swappiness = 1

Setting vm.swappiness = 0 does not disable swap entirely. On modern kernels (>= 3.5), it tells the kernel to avoid swapping until the system is critically low on memory. On older kernels, it had a different meaning. See the footguns section.

Dirty Page Ratios

When applications write to files, the data goes to page cache first (dirty pages). The kernel flushes dirty pages to disk in the background. These parameters control when flushing starts and when writes block.

# Start background writeback when dirty pages exceed this % of available memory
# Default: 10
vm.dirty_background_ratio = 5

# Block processes writing when dirty pages exceed this % of available memory
# Default: 20
vm.dirty_ratio = 10

Lower values mean more frequent, smaller flushes — better for databases that need consistent write latency. Higher values allow more buffering — better for throughput-heavy batch workloads.
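To see what the percentages mean in bytes, here is a worked example on a hypothetical 64 GB machine (the ratios actually apply to available memory, so total RAM is an upper-bound approximation):

```shell
ram_gb=64   # hypothetical machine size
awk -v ram="$ram_gb" 'BEGIN {
  printf "background flush starts at %.1f GB dirty\n", ram * 0.05  # dirty_background_ratio = 5
  printf "writers block at %.1f GB dirty\n",           ram * 0.10  # dirty_ratio = 10
}'
# background flush starts at 3.2 GB dirty
# writers block at 6.4 GB dirty
```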

Overcommit

Linux memory overcommit controls whether the kernel allows processes to allocate more virtual memory than is physically available.

# 0 = heuristic overcommit (default): kernel guesses whether to allow
# 1 = always overcommit: never fail malloc() — dangerous, see footguns
# 2 = strict overcommit: limit to swap + (overcommit_ratio% * physical RAM)
vm.overcommit_memory = 0

# Only relevant when overcommit_memory = 2
vm.overcommit_ratio = 50

Redis requires vm.overcommit_memory = 1 for background persistence (fork + copy-on-write). It will log warnings if this is not set.
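Under strict overcommit (mode 2) the commit limit is computable in advance. A worked example with assumed sizes (64 GB RAM, 8 GB swap, the default ratio of 50):

```shell
awk -v ram=64 -v swap=8 -v ratio=50 'BEGIN {
  # CommitLimit = swap + (overcommit_ratio / 100) * physical RAM
  printf "CommitLimit = %.0f GB\n", swap + (ratio / 100) * ram
}'
# CommitLimit = 40 GB
```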

Other Memory Parameters

# Minimum free memory the kernel reserves (KB)
# Too low: under pressure, the kernel cannot allocate for itself
# Too high: wastes memory that applications could use
# Default: auto-calculated, usually 64MB
vm.min_free_kbytes = 131072

# Maximum number of memory map areas per process
# Default: 65536 — too low for Elasticsearch, MongoDB, large JVMs
# Elasticsearch requires at least 262144
vm.max_map_count = 262144

4. File Descriptor Limits

Every open file, socket, pipe, and device uses a file descriptor. The kernel has a system-wide limit and a per-process limit. Both must be increased for high-connection servers.

# System-wide maximum file descriptors
# Default: ~100000 (varies by RAM)
# Check current: cat /proc/sys/fs/file-max
fs.file-max = 2097152

# Maximum file descriptors a single process can open
# This is the ceiling for ulimit -n
fs.nr_open = 2097152

The per-process limit is controlled by ulimit and PAM limits:

# Check current per-process limits
ulimit -n        # soft limit
ulimit -Hn       # hard limit

# /etc/security/limits.conf (or /etc/security/limits.d/99-nofile.conf)
*    soft    nofile    1048576
*    hard    nofile    1048576
root soft    nofile    1048576
root hard    nofile    1048576

For systemd services, the limits come from the unit file, not limits.conf:

# /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=1048576

The chain: fs.file-max (kernel) >= fs.nr_open (per-process ceiling) >= hard limit (PAM/systemd) >= soft limit (ulimit). If any link is too low, you hit "too many open files."
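A sanity check for the chain, using the values from this section (the numbers are the example's, not probed from a live system):

```shell
file_max=2097152   # fs.file-max (system-wide)
nr_open=2097152    # fs.nr_open (per-process ceiling)
hard=1048576       # hard nofile limit (PAM/systemd)
soft=1048576       # soft nofile limit (ulimit -n)

# every link must be >= the next, or processes hit "too many open files" early
if [ "$file_max" -ge "$nr_open" ] && [ "$nr_open" -ge "$hard" ] && [ "$hard" -ge "$soft" ]; then
  echo "limit chain consistent"
else
  echo "limit chain broken"
fi
# limit chain consistent
```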

Default trap: On systemd-managed services, LimitNOFILE defaults to 1024 (soft) and 524288 (hard) — regardless of what you set in /etc/security/limits.conf. The limits.conf file only affects PAM-based logins (SSH, console). For services, you must set LimitNOFILE in the systemd unit file or a drop-in override.

5. I/O Scheduler Selection

The I/O scheduler determines the order in which block I/O requests are sent to the disk. The right choice depends on the storage hardware.

# Check current scheduler for a device
cat /sys/block/sda/queue/scheduler
# [mq-deadline] none kyber bfq

# Change at runtime
echo mq-deadline > /sys/block/sda/queue/scheduler

# Persistent: udev rule
# /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"

Scheduler     Best For                                Why
none (noop)   NVMe, SSDs, VMs                         Hardware or hypervisor handles scheduling; kernel reordering adds latency
mq-deadline   SSDs, general purpose, databases        Ensures no request starves; good latency guarantees
bfq           Interactive desktops, mixed workloads   Fair bandwidth allocation per process; higher CPU cost
kyber         Fast SSDs, low-latency targets          Lightweight; targets latency goals for read/write separately

For NVMe devices, the kernel uses none by default because the device has its own sophisticated I/O scheduler. Adding kernel-level scheduling on top just adds latency.
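The bracketed entry in the scheduler file marks the active choice; extracting it is a one-liner. The sample string below mirrors the cat output shown earlier, so this runs without touching /sys:

```shell
line="[mq-deadline] none kyber bfq"   # sample contents of /sys/block/sda/queue/scheduler
active=$(echo "$line" | grep -o '\[[^]]*\]' | tr -d '[]')
echo "$active"
# mq-deadline
```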

6. Transparent Huge Pages (THP)

Standard memory pages are 4KB. Huge pages are 2MB (or 1GB). Transparent huge pages let the kernel automatically promote 4KB pages to 2MB pages when it detects contiguous allocations.
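The TLB win is easy to quantify: one 2 MB huge page replaces a fixed number of 4 KB base pages, so a TLB of the same size covers that many times more memory:

```shell
# base pages replaced by one huge page = huge page size / base page size
awk 'BEGIN { print (2 * 1024) / 4, "x fewer TLB entries per 2MB region" }'
# 512 x fewer TLB entries per 2MB region
```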

# Check current THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# Disable THP (recommended for databases)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Persistent: kernel command line (GRUB) or systemd tmpfile
# /etc/tmpfiles.d/thp.conf
w /sys/kernel/mm/transparent_hugepage/enabled - - - - never
w /sys/kernel/mm/transparent_hugepage/defrag - - - - never

THP sounds great in theory — fewer TLB misses — but causes latency spikes in practice for workloads with sparse memory access patterns. Redis, MongoDB, and PostgreSQL all recommend disabling THP.

War story: A common post-mortem pattern: periodic latency spikes on database servers that correlate with khugepaged CPU usage in top. The kernel's THP compaction daemon runs in the background, defragmenting memory to create 2 MB pages. During compaction, it can stall allocations for milliseconds — enough to cause p99 latency spikes that show up as user-visible slowdowns.

For workloads that genuinely benefit from huge pages (large JVMs, scientific computing), use explicit huge pages via hugetlbfs instead of THP. This gives you the performance benefit without the unpredictable compaction pauses.

# Allocate explicit huge pages (2MB each)
# Reserve 4096 huge pages = 8GB
sysctl -w vm.nr_hugepages=4096

# Verify
grep HugePages /proc/meminfo
# HugePages_Total:    4096
# HugePages_Free:     4096
# HugePages_Rsvd:        0
# Hugepagesize:       2048 kB
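Sizing the reservation is straight division — to back a given amount of memory (here the 8 GB from above) with 2 MB pages:

```shell
# pages needed = target memory (MB) / huge page size (MB)
awk -v gb=8 'BEGIN { print (gb * 1024) / 2, "huge pages" }'
# 4096 huge pages
```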

7. NUMA Balancing

Non-Uniform Memory Access (NUMA) means that memory access time depends on which CPU socket is accessing which memory bank. Local memory is fast; remote memory is slower.

# Check NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB

# Automatic NUMA balancing (kernel migrates pages to local node)
sysctl kernel.numa_balancing
# kernel.numa_balancing = 1

# Disable if you manage NUMA placement manually (e.g., numactl, cgroups)
sysctl -w kernel.numa_balancing=0

For database workloads, disabling automatic NUMA balancing and pinning the process to a single NUMA node often gives better, more predictable performance than letting the kernel migrate pages.

# Pin a process to NUMA node 0
numactl --cpunodebind=0 --membind=0 /usr/sbin/mysqld

# Check NUMA stats
numastat -c
numastat -p $(pidof mysqld)

8. Kernel Module Parameters

Some tuning happens at the kernel module level rather than through sysctl.

# List loaded modules
lsmod

# See parameters for a module
modinfo -p e1000e

# Set module parameter at load time
# /etc/modprobe.d/e1000e.conf
options e1000e InterruptThrottleRate=3

# Set parameter for already-loaded module (if writable)
echo 3 > /sys/module/e1000e/parameters/InterruptThrottleRate

9. Kernel Command Line (GRUB)

Some parameters can only be set at boot time via the kernel command line.

# View current kernel command line
cat /proc/cmdline

# Edit GRUB defaults
# /etc/default/grub
GRUB_CMDLINE_LINUX="transparent_hugepage=never numa=off isolcpus=0-3 nohz_full=4-15"

# Regenerate GRUB config
grub2-mkconfig -o /boot/grub2/grub.cfg    # RHEL/CentOS
update-grub                                  # Debian/Ubuntu

Common kernel command line parameters for tuning:

Parameter                    Purpose
transparent_hugepage=never   Disable THP at boot
numa=off                     Disable NUMA awareness (rare, testing only)
isolcpus=0-3                 Isolate CPUs 0-3 from the general scheduler
nohz_full=4-15               Reduce timer interrupts on CPUs 4-15
intel_idle.max_cstate=1      Prevent deep CPU sleep states (lower latency, more power)
processor.max_cstate=1       Same, via ACPI-level control
mitigations=off              Disable CPU vulnerability mitigations (performance gain, security risk)

10. Scheduler Tunables

The process scheduler has its own set of tunables for CPU-bound workload optimization. A version caveat: on kernels 5.13 and later these knobs moved out of sysctl into debugfs (/sys/kernel/debug/sched/), and kernel 6.6 replaced CFS with EEVDF, which drops some of them entirely — check your kernel version before relying on the names below.

# Minimum time a task runs before being preempted (nanoseconds)
# Lower = more responsive, higher = better throughput
sysctl kernel.sched_min_granularity_ns
# Default: 3000000 (3ms)

# Target scheduling latency (nanoseconds)
# The scheduler tries to run each task at least once within this window
sysctl kernel.sched_latency_ns
# Default: 24000000 (24ms)

# Wake-to-run latency — how quickly a woken task gets CPU
sysctl kernel.sched_wakeup_granularity_ns
# Default: 4000000 (4ms)

# For latency-sensitive workloads (reduce preemption delay)
sysctl -w kernel.sched_min_granularity_ns=1000000
sysctl -w kernel.sched_wakeup_granularity_ns=1500000

For real-time or ultra-low-latency workloads, consider the SCHED_FIFO or SCHED_RR scheduling policies via chrt, or look into the PREEMPT_RT kernel patch set. These are niche requirements — most production workloads do fine with CFS (Completely Fair Scheduler) tuning.

Summary: Common Tuning Profiles

High-Connection Web Server (nginx, HAProxy)

# /etc/sysctl.d/99-web-server.conf
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.ip_local_port_range = 1024 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
fs.file-max = 2097152

Database Server (PostgreSQL, MySQL)

# /etc/sysctl.d/99-database.conf
vm.swappiness = 1
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.max_map_count = 262144
# Disable THP via tmpfiles.d or kernel cmdline

Container Host (Kubernetes Node)

# /etc/sysctl.d/99-k8s-node.conf
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
vm.max_map_count = 262144
fs.file-max = 2097152

Note: the net.bridge.* keys exist only after the br_netfilter module is loaded (modprobe br_netfilter); until then, sysctl --system reports them as unknown keys.
