The Case of the Missing Packets

Category: The Mystery · Domains: networking, firewalls · Read time: ~5 min


Setting the Scene

We ran a microservices platform on bare-metal Linux servers behind a pair of HA firewalls. About 200 services, all talking over TCP. One Monday, tickets started rolling in: "Connection to service X times out sometimes." Not always -- maybe 1 in 50 connections would hang for 30 seconds and then fail. No pattern to which services were affected. No error messages. Just silence.

I was the network engineer on rotation, and I spent the first hour convinced it was an application problem.

What Happened

The first thing I did was run curl -v against an affected service a hundred times from a known-good host. Ninety-six successes, four hangs. The hangs weren't slow -- they were completely stuck. The TCP SYN went out, and nothing came back. No RST, no ICMP unreachable, just a black hole.

I ran tcpdump -i eth0 host 10.20.3.45 and port 8080 on both ends simultaneously. On the client side, I could see the SYN being sent. On the server side: nothing. The packet vanished somewhere in between.

Naturally, I blamed the firewall. I logged into the Palo Alto and checked the traffic logs. The successful connections were there. The failed ones simply weren't -- no deny, no drop, no log entry at all. As if the packets never arrived. I started suspecting a hardware issue -- maybe a bad GBIC, a flaky cable, a dying switch port.

I spent two days swapping cables, running ethtool -S to check for interface errors, and doing mtr traces. Everything looked clean. No packet loss on any individual hop.

Then I remembered something from a training course years ago: connection tracking. I SSH'd into the firewall and ran the equivalent of conntrack -C -- the conntrack table had 262,144 entries, which was exactly the default maximum (nf_conntrack_max). It was full.
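The check itself is two commands and a division. A minimal sketch, with the count and max hardcoded to the values from the incident so the arithmetic is visible; on a real host you would substitute `count=$(conntrack -C)` and `max=$(sysctl -n net.netfilter.nf_conntrack_max)`:

```shell
# Live values would come from: conntrack -C and sysctl -n net.netfilter.nf_conntrack_max
count=262144
max=262144

# Integer percentage of table utilization -- anything near 100 means new
# connections are being silently dropped.
pct=$(( count * 100 / max ))
echo "conntrack: ${count}/${max} (${pct}%)"
```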

I checked dmesg on the Linux hosts acting as internal routers: nf_conntrack: table full, dropping packet. There it was, buried in the kernel log. When the conntrack table fills up, new connections are silently dropped. No RST, no ICMP, no log in the firewall -- just gone.
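Confirming the drop is a one-line grep of the kernel ring buffer. Sketched here against a canned dmesg excerpt so the match is reproducible; on a real host you would pipe `dmesg` itself:

```shell
# Canned dmesg excerpt standing in for the real ring buffer.
log='[1234.5] nf_conntrack: table full, dropping packet
[1234.6] nf_conntrack: table full, dropping packet
[1235.0] eth0: link up'

# Count the table-full drops; on a live host: dmesg | grep -c 'nf_conntrack: table full'
printf '%s\n' "$log" | grep -c 'nf_conntrack: table full'   # -> 2
```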

The Moment of Truth

I ran sysctl net.netfilter.nf_conntrack_count and got back 262144 -- pegged at the max. A quick conntrack -L | awk '{print $4}' | sort | uniq -c | sort -rn | head showed that 80% of the entries were in TIME_WAIT state from a chatty service that opened thousands of short-lived connections per second without connection pooling.
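The state-breakdown pipeline works because, for TCP entries, `conntrack -L` prints the state in the fourth column (protocol name, protocol number, timeout, state). A sketch against a canned four-line excerpt with invented addresses, standing in for the real table:

```shell
# Canned `conntrack -L` excerpt (TCP lines: proto, protonum, timeout, state, ...).
sample='tcp      6 117 TIME_WAIT src=10.20.3.7 dst=10.20.3.45 sport=51324 dport=8080
tcp      6 112 TIME_WAIT src=10.20.3.7 dst=10.20.3.45 sport=51330 dport=8080
tcp      6 108 TIME_WAIT src=10.20.3.7 dst=10.20.3.45 sport=51338 dport=8080
tcp      6 431999 ESTABLISHED src=10.20.3.9 dst=10.20.3.45 sport=40112 dport=8080'

# Tally entries by state, most common first -- same pipeline as on the live box.
printf '%s\n' "$sample" | awk '{print $4}' | sort | uniq -c | sort -rn
```

On the incident host, the top line of this tally was TIME_WAIT by a huge margin, which pointed straight at the chatty service.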

We bumped nf_conntrack_max to 1,048,576 as an immediate fix, reduced nf_conntrack_tcp_timeout_time_wait from 120 to 30 seconds, and filed a ticket to add connection pooling to the offending service.
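The immediate fix was `sysctl -w`, but a runtime sysctl doesn't survive a reboot. A sketch of the persistent version as a drop-in file (the filename is my choice, not from the incident):

```
# /etc/sysctl.d/99-conntrack.conf -- hypothetical drop-in persisting the fix
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
```

One caveat worth knowing: raising nf_conntrack_max without also raising the hash bucket count means longer lookup chains, so on kernels where the bucket count doesn't scale automatically it may need tuning alongside the max.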

The Aftermath

Packet loss dropped to zero within minutes of the sysctl changes. We added Prometheus monitoring for nf_conntrack_count with an alert at 80% capacity. The chatty service got connection pooling the following sprint, which reduced its conntrack footprint by 95%. I also discovered that tcp_tw_reuse had been enabled on some hosts as a "performance optimization" months ago, which was actually contributing to conntrack entry churn. We disabled it.
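A sketch of what that alert can look like, assuming node_exporter's conntrack collector (which exposes the entry count and limit as gauges on Linux hosts); the rule name and threshold are illustrative:

```
groups:
  - name: conntrack
    rules:
      - alert: ConntrackNearCapacity
        # Fires when the conntrack table is above 80% of nf_conntrack_max.
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "conntrack table above 80% on {{ $labels.instance }}"
```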

The Lessons

  1. Monitor conntrack usage: nf_conntrack_count vs. nf_conntrack_max should be on every Linux host dashboard. A full table means silent packet drops with zero useful errors.
  2. Kernel defaults aren't production-ready: The default nf_conntrack_max of 262,144 is fine for a workstation, not for a server handling thousands of connections per second. Tune it during provisioning.
  3. tcp_tw_reuse has side effects: It can cause conntrack confusion and break connection tracking in stateful firewalls. Don't enable it without understanding the full implications.

What I'd Do Differently

Add conntrack monitoring to the base server provisioning playbook. Set nf_conntrack_max based on expected connection rate during capacity planning. And always check dmesg early -- the kernel was screaming the answer, and I just didn't look for two days.

The Quote

"The packets weren't being dropped by the firewall. They were being dropped by a counter nobody knew existed hitting a limit nobody had changed."
