Solution: TCP Connections Reset After Idle Period
Triage
- Confirm the symptom pattern -- connections die after a consistent idle interval.
- Capture traffic on the server to see the RST. Wait for a reset event, then analyze.
- Check the firewall conntrack timeout.
- Check OS-level TCP keepalive on the application servers.
- Verify the application enables TCP keepalive on its sockets.
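The triage steps above can be sketched as commands, assuming Linux application servers and a netfilter-based firewall; `eth0` and port `5432` are placeholders for your environment:

```shell
# On an application server: capture only RST packets on the suspect port
tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0 and port 5432' -w resets.pcap

# On the firewall (if Linux/netfilter): idle timeout for established TCP flows
sysctl net.netfilter.nf_conntrack_tcp_timeout_established

# On the application servers: current OS-level keepalive defaults
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
```

The capture filter matches any segment with the RST flag set; open `resets.pcap` after a reset event to see which side of the firewall originated it.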
Root Cause
The stateful firewall maintains a connection tracking (conntrack) table. Each TCP connection has an idle timeout -- in this case, 3600 seconds (1 hour). When a connection is idle longer than this timeout, the firewall removes the conntrack entry.
When the application later sends data on this connection, the firewall sees a packet for an unknown session and sends a TCP RST back to the sender (or simply drops the packet, causing the other side's retransmit to eventually trigger a reset).
The application either does not enable TCP keepalive at all, or uses the Linux default of 7200 seconds (2 hours), which is longer than the firewall's 3600-second timeout. The keepalive never fires before the firewall evicts the session.
Fix
Option A -- Enable/Reduce TCP Keepalive (Preferred):
Set OS-level keepalive below the firewall timeout:
sysctl -w net.ipv4.tcp_keepalive_time=1800
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
Persist these settings in /etc/sysctl.d/99-keepalive.conf so they survive reboots.
Ensure the application enables SO_KEEPALIVE on its sockets.
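The drop-in file might look like this (a sketch; the values mirror the sysctl commands above, keeping keepalive well under the firewall's 3600-second timeout):

```ini
# /etc/sysctl.d/99-keepalive.conf
net.ipv4.tcp_keepalive_time = 1800
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
```

Apply without a reboot via `sysctl --system`.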
Option B -- Increase Firewall Timeout:
Raise the firewall's conntrack idle timeout above the application's longest expected idle period. Note: increasing the timeout uses more firewall memory for conntrack entries.
Option C -- Application-Level Heartbeat:
Configure the application to send a periodic heartbeat/ping on idle connections (e.g., a database connection pool validation query every 15 minutes).
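If the firewall in Option B is a Linux netfilter box, the established-flow timeout is a sysctl (a sketch; other firewall vendors expose an equivalent knob in their own CLI, and 28800 is an example value, not a recommendation):

```shell
# Raise the idle timeout for established TCP flows to 8 hours
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=28800
```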
Rollback / Safety
- Lowering tcp_keepalive_time generates more keepalive traffic; ensure this is acceptable on high-connection-count servers.
- Increasing the firewall conntrack timeout may exhaust the conntrack table on busy firewalls; monitor with conntrack -C.
- Test with a single server before rolling out fleet-wide.
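Conntrack pressure can be watched by comparing the live entry count against the table limit (Linux/netfilter firewall assumed):

```shell
conntrack -C                            # current number of tracked flows
sysctl net.netfilter.nf_conntrack_max   # table capacity
```

If the count approaches the maximum, new connections will be dropped, so leave ample headroom before raising timeouts.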
Common Traps
- Assuming the RST comes from the server application -- it actually comes from the firewall (or is triggered by the firewall dropping the stale session).
- Setting TCP keepalive at the OS level but the application not enabling SO_KEEPALIVE on its sockets -- OS settings are only defaults for sockets that opt in.
- Confusing TCP keepalive with HTTP keep-alive -- they are completely different mechanisms.
- Only fixing one side -- if both client and server traverse the firewall, both should enable keepalive.
- Not persisting sysctl changes, causing regression after reboot.
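One way to confirm that an application socket actually opted in to keepalive (the second trap above) is the timer column of ss on Linux; sockets with SO_KEEPALIVE enabled show a keepalive timer:

```shell
# Established TCP sockets with their timers; look for "timer:(keepalive,...)"
ss -tno state established
```

A socket showing no keepalive timer will sit silent through the firewall's idle window no matter what the OS defaults say.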