Linux Kernel Tuning Footguns¶
Mistakes that silently degrade performance, cause outages under load, or create problems that only appear months later.
1. sysctl changes not persisting across reboot¶
You tune net.core.somaxconn = 65535 at runtime with sysctl -w. The server reboots for a kernel update. The value reverts to the default 4096. Your high-connection service starts dropping connections under peak load. Nobody connects the reboot to the connection drops because the symptoms do not appear until the next traffic spike.
Fix: Always persist changes in /etc/sysctl.d/. After writing the file, load it with sysctl --system and verify with sysctl net.core.somaxconn. Schedule a reboot test in staging to confirm persistence. Use configuration management (Ansible) to enforce the values across fleet reboots.
Gotcha: Even with a file in /etc/sysctl.d/, the value may be overridden by a later file (alphabetical order) or by a service that sets sysctls at startup (Docker, Kubernetes kubelet, Elasticsearch). After reboot, always verify the actual runtime value with sysctl <key>, not just that your config file exists.
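A minimal persistence setup might look like the fragment below (the filename 90-somaxconn.conf is an arbitrary choice; pick a prefix that sorts after any distribution-provided files):

```ini
# /etc/sysctl.d/90-somaxconn.conf
# Survives reboots; apply immediately with: sysctl --system
net.core.somaxconn = 65535
```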
2. vm.swappiness=0 meaning different things on different kernels¶
On kernels before 3.5, vm.swappiness=0 meant "prefer not to swap but still will." On kernels 3.5+, it means "do not swap until the system is about to OOM." You set it to 0 on a modern kernel thinking it is a gentle preference, but the system does not swap at all until it is critically low on memory — then the OOM killer fires and takes out your database process instead of gracefully paging out idle memory.
Fix: Use vm.swappiness=1 instead of 0 for "almost never swap." It preserves the escape hatch of swapping under extreme pressure without disabling it entirely. Check your kernel version with uname -r before copying sysctl recipes from the internet.
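Before copying a recipe, it is worth confirming what the running system is actually doing. A read-only sketch (standard procfs paths, no root required):

```shell
#!/bin/sh
# Read the running kernel version and the current swappiness value.
kernel="$(uname -r)"
swappiness="$(cat /proc/sys/vm/swappiness)"
echo "kernel=${kernel} vm.swappiness=${swappiness}"

# On kernels >= 3.5, swappiness=0 means "no swap until near-OOM",
# so flag the aggressive meaning if it is in effect.
if [ "$swappiness" -eq 0 ]; then
    echo "WARNING: swappiness=0 disables swap until near-OOM on modern kernels; consider 1"
fi
```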
3. Transparent huge pages causing latency spikes¶
THP is enabled by default (always) on most distributions. For databases and latency-sensitive applications, this causes periodic latency spikes during memory compaction — the kernel pauses to defragment physical memory into contiguous 2MB chunks. These pauses can be 10-100ms and appear as unexplained P99 latency spikes in your application metrics.
Redis, MongoDB, PostgreSQL, and Elasticsearch all document this issue and recommend disabling THP. The symptoms are insidious: normal P50 latency is fine, but P99 and P999 have random spikes that do not correlate with traffic.
Fix: Disable THP for database workloads.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Persist the setting across reboots via /etc/tmpfiles.d/thp.conf or the kernel command line parameter transparent_hugepage=never. If you need huge pages, use explicit hugetlbfs allocation.
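A tmpfiles.d entry is one way to make the echo commands above survive reboots (a sketch; the filename thp.conf is a convention, not a requirement):

```ini
# /etc/tmpfiles.d/thp.conf
# Type 'w' writes the argument into the target file at boot.
w /sys/kernel/mm/transparent_hugepage/enabled - - - - never
w /sys/kernel/mm/transparent_hugepage/defrag  - - - - never
```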
4. tcp_tw_recycle breaking NAT¶
net.ipv4.tcp_tw_recycle was a parameter that aggressively recycled TIME_WAIT sockets using TCP timestamps. It worked correctly for direct client-server connections but broke catastrophically when clients were behind NAT (load balancers, corporate firewalls, carrier-grade NAT). Different clients behind the same NAT IP had different TCP timestamps, and the kernel would drop connections from clients whose timestamps appeared to "go backward."
The symptom was intermittent: some clients could connect and others got silent drops, with no errors in the server logs.
Fix: Never use tcp_tw_recycle. It was removed entirely from the kernel in version 4.12. Use tcp_tw_reuse = 1 instead — it only affects outbound connections from the server and is safe behind NAT. If you find tcp_tw_recycle in old configuration management recipes, remove it.
Debug clue: If users behind NAT (corporate office, mobile carrier) report intermittent connection failures while direct-connected users work fine, check for tcp_tw_recycle=1. The symptom is that connections from some clients behind the same NAT IP succeed while others are silently dropped — no RST, no error, just a timeout. ss -s will show a normal number of TIME_WAIT sockets, making this hard to spot without knowing the parameter exists.
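When cleaning up inherited recipes, a safe replacement fragment might look like this (a sketch; note that tcp_tw_recycle no longer exists at all on kernels 4.12+, so setting it there simply fails):

```ini
# /etc/sysctl.d/90-tcp.conf
# Safe: only affects outbound connections initiated by this host.
net.ipv4.tcp_tw_reuse = 1
# Do NOT set net.ipv4.tcp_tw_recycle: NAT-breaking before 4.12, removed since.
```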
5. Setting limits too high without understanding resource impact¶
You set fs.file-max = 2147483647 and LimitNOFILE=infinity because "bigger is better." A buggy application with a file descriptor leak now opens millions of file descriptors before anything notices. The kernel allocates per-fd structures that consume memory. At a million open fds, you are using ~1GB of kernel memory for file descriptor metadata alone. At 10 million, the system may become unresponsive.
Fix: Set limits high enough for your workload with reasonable headroom, not to the maximum possible value. fs.file-max = 2097152 (2M) and LimitNOFILE = 1048576 (1M) are generous for even the most demanding services. Monitor cat /proc/sys/fs/file-nr to see actual usage and set alerts when usage exceeds 80% of the limit.
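A monitoring check against that 80% threshold can be sketched from /proc/sys/fs/file-nr, whose three fields are allocated handles, allocated-but-unused handles, and the system-wide maximum (fs.file-max):

```shell
#!/bin/sh
# Fields: <allocated> <unused-but-allocated> <fs.file-max>
read -r allocated _unused max < /proc/sys/fs/file-nr

# Percent of the system-wide limit currently in use.
pct=$((allocated * 100 / max))
echo "file handles: ${allocated}/${max} (${pct}%)"

# Alert threshold from the text above: 80% of the limit.
if [ "$pct" -ge 80 ]; then
    echo "ALERT: file handle usage above 80%"
fi
```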
6. Kernel tuning without benchmarking¶
You read a blog post about kernel tuning for high-performance servers. You apply all the recommended sysctls to your servers. Performance does not improve — or it actively regresses. You changed 15 parameters at once and have no idea which one helped, which one hurt, or whether the net effect is positive.
Fix: Change one parameter at a time. Measure before and after with your actual workload (not a synthetic benchmark). Use nstat, vmstat, ss, and your application metrics as the source of truth. If you cannot measure the impact, you cannot justify the change. Keep a log of what you changed, when, and what the measured effect was.
Remember: The kernel defaults are the result of decades of real-world testing across millions of systems. They are optimized for the general case. Changing a default without measuring means you're asserting you know better than the kernel developers for your specific workload — which you might, but only if you can prove it with data.
7. overcommit_memory=1 masking real OOM issues¶
Redis documentation recommends vm.overcommit_memory = 1 to allow background persistence via fork(). You set it globally. Now malloc() never fails for any process on the system. A memory-leaking application allocates 100GB of virtual memory on a 32GB box without getting an error. When those pages are actually touched, the OOM killer fires and kills something — often not the leaking process, but the most memory-hungry legitimate process (your database).
Fix: If you must use overcommit_memory = 1 for Redis, understand the trade-off. Monitor RSS and virtual memory usage across all processes. Set OOM score adjustments (oom_score_adj) to protect critical processes:
# Protect the database from OOM killer
echo -1000 > /proc/$(pidof postgres)/oom_score_adj
# Make Redis more likely to be killed than the DB
echo 200 > /proc/$(pidof redis-server)/oom_score_adj
Consider overcommit_memory = 2 with overcommit_ratio tuned for your workload as a more predictable alternative.
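For services managed by systemd, the OOM score can be set declaratively instead of echoing into /proc after every restart (a sketch; the drop-in path assumes a unit named postgresql.service):

```ini
# /etc/systemd/system/postgresql.service.d/oom.conf
[Service]
# Strongly protect the database from the OOM killer.
OOMScoreAdjust=-1000
```

Unlike the echo approach, this survives process restarts because systemd applies it at service start.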
8. Forgetting to increase both ulimit and fs.file-max¶
You set fs.file-max = 2097152 in sysctl but forget to update /etc/security/limits.conf or the systemd unit's LimitNOFILE. The per-process limit is still 1024 (the PAM default on many distributions). Your application still hits "too many open files" because the per-process limit is the binding constraint, not the system-wide limit.
The chain is: fs.file-max (system) >= fs.nr_open (per-process ceiling) >= hard limit (PAM/systemd) >= soft limit (ulimit). Every link must be wide enough.
Fix: Set all four levels. Verify with cat /proc/<pid>/limits for the running process. For systemd services, set LimitNOFILE in the unit file — systemd does not read limits.conf.
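A quick audit of every link in the chain, as seen by the current shell (read-only; for a running service, read /proc/<pid>/limits instead of ulimit):

```shell
#!/bin/sh
# Inspect each level of the file-descriptor limit chain.
file_max="$(cat /proc/sys/fs/file-max)"   # system-wide total
nr_open="$(cat /proc/sys/fs/nr_open)"     # per-process ceiling
hard="$(ulimit -Hn)"                      # hard limit (PAM/systemd)
soft="$(ulimit -Sn)"                      # soft limit: the binding constraint
echo "file-max=${file_max} nr_open=${nr_open} hard=${hard} soft=${soft}"
```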
9. sysctl.d file ordering causing unexpected overrides¶
You create /etc/sysctl.d/99-tuning.conf with net.core.somaxconn = 65535. But another package installed /etc/sysctl.d/99-sysctl.conf (some distributions create this) that sets net.core.somaxconn = 4096. Files are processed alphabetically: 99-sysctl.conf runs after 99-tuning.conf, overwriting your value.
Fix: Use distinctive prefixes. Check what files exist with ls -la /etc/sysctl.d/ /usr/lib/sysctl.d/ /run/sysctl.d/. After running sysctl --system, verify the actual value. On systemd systems, systemd-analyze cat-config sysctl.d shows the merged configuration with file sources.
Under the hood: systemd collects sysctl.d files from three directories: /usr/lib/sysctl.d/ (vendor), /run/sysctl.d/ (runtime), and /etc/sysctl.d/ (admin). All files are sorted together alphabetically by filename, regardless of which directory they live in; a file in /etc/ masks the same filename in /run/ or /usr/lib/. Within the sorted order, 99-z.conf beats 99-a.conf. The last write wins for any given key.
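A concrete collision, with hypothetical filenames: because files are sorted by name and the last write wins, the later-sorting file's value is the one that sticks.

```ini
# /etc/sysctl.d/90-tuning.conf   (sorts first - its value is overwritten)
net.core.somaxconn = 65535

# /etc/sysctl.d/99-sysctl.conf   (sorts last - its value wins)
net.core.somaxconn = 4096
```

Renaming your file so it sorts after the distribution's (e.g. 99-z-tuning.conf) resolves the conflict without touching the vendor file.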