Solution: TCP Connections Reset After Idle Period
Triage
- Confirm the symptom pattern -- connections die after a consistent idle interval.
- Capture traffic on the server to see the RST. Wait for a reset event, then analyze.
- Check the firewall conntrack timeout.
- Check OS-level TCP keepalive on the application servers.
- Verify the application enables TCP keepalive on its sockets.
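The triage steps above can be sketched as commands, assuming Linux application servers and a netfilter-based firewall; `eth0` and port `5432` are placeholders for your environment:

```shell
# On an application server: capture only RST packets on the suspect port
tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0 and port 5432' -w resets.pcap

# On the firewall (if Linux/netfilter): idle timeout for established TCP flows
sysctl net.netfilter.nf_conntrack_tcp_timeout_established

# On the application servers: current OS-level keepalive defaults
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
```

The capture filter matches any segment with the RST flag set; open `resets.pcap` after a reset event to see which side of the firewall originated it.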
Root Cause
The stateful firewall maintains a connection tracking (conntrack) table. Each TCP connection has an idle timeout -- in this case, 3600 seconds (1 hour). When a connection is idle longer than this timeout, the firewall removes the conntrack entry.
When the application later sends data on this connection, the firewall sees a packet for an unknown session and sends a TCP RST back to the sender (or simply drops the packet, causing the other side's retransmit to eventually trigger a reset).
The application either does not enable TCP keepalive at all, or uses the Linux default of 7200 seconds (2 hours), which is longer than the firewall's 3600-second timeout. The keepalive never fires before the firewall evicts the session.
Fix
Option A -- Enable/Reduce TCP Keepalive (Preferred):
Set OS-level keepalive below the firewall timeout:
sysctl -w net.ipv4.tcp_keepalive_time=1800
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
Persist these settings in /etc/sysctl.d/99-keepalive.conf so they survive reboots.
Ensure the application enables SO_KEEPALIVE on its sockets.
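The drop-in file might look like this (a sketch; the values mirror the sysctl commands above, keeping keepalive well under the firewall's 3600-second timeout):

```ini
# /etc/sysctl.d/99-keepalive.conf
net.ipv4.tcp_keepalive_time = 1800
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
```

Apply without a reboot via `sysctl --system`.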
Option B -- Increase Firewall Timeout:
Raise the firewall's conntrack idle timeout above the application's longest expected idle period. Note: increasing the timeout uses more firewall memory for conntrack entries.
Option C -- Application-Level Heartbeat:
Configure the application to send a periodic heartbeat/ping on idle connections (e.g., a database connection pool validation query every 15 minutes).
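If the firewall in Option B is a Linux netfilter box, the established-flow timeout is a sysctl (a sketch; other firewall vendors expose an equivalent knob in their own CLI, and 28800 is an example value, not a recommendation):

```shell
# Raise the idle timeout for established TCP flows to 8 hours
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=28800
```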
Rollback / Safety
- Lowering tcp_keepalive_time generates more keepalive traffic; ensure this is acceptable on high-connection-count servers.
- Increasing the firewall conntrack timeout may exhaust the conntrack table on busy firewalls; monitor with conntrack -C.
- Test with a single server before rolling out fleet-wide.
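Conntrack pressure can be watched by comparing the live entry count against the table limit (Linux/netfilter firewall assumed):

```shell
conntrack -C                            # current number of tracked flows
sysctl net.netfilter.nf_conntrack_max   # table capacity
```

If the count approaches the maximum, new connections will be dropped, so leave ample headroom before raising timeouts.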
Common Traps
- Assuming the RST comes from the server application -- it actually comes from the firewall (or is triggered by the firewall dropping the stale session).
- Setting TCP keepalive at the OS level but the application not enabling SO_KEEPALIVE on its sockets -- OS settings are only defaults for sockets that opt in.
- Confusing TCP keepalive with HTTP keep-alive -- they are completely different mechanisms.
- Only fixing one side -- if both client and server traverse the firewall, both should enable keepalive.
- Not persisting sysctl changes, causing regression after reboot.
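One way to confirm that an application socket actually opted in to keepalive (the second trap above) is the timer column of ss on Linux; sockets with SO_KEEPALIVE enabled show a keepalive timer:

```shell
# Established TCP sockets with their timers; look for "timer:(keepalive,...)"
ss -tno state established
```

A socket showing no keepalive timer will sit silent through the firewall's idle window no matter what the OS defaults say.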