Solution: TCP Connections Reset After Idle Period

Triage

  1. Confirm the symptom pattern -- connections die after a consistent idle interval:

    grep "Connection reset\|Broken pipe" /var/log/app/application.log | tail -20
    

  2. Capture traffic on the server to see the RST:

    tcpdump -i eth0 -nn 'tcp[tcpflags] & (tcp-rst) != 0' -w /tmp/rst.pcap
    
    Wait for a reset event, then analyze. Comparing the TTL of the RST against packets known to come from the peer can reveal that a middlebox, not the endpoint, generated the reset.

  3. Check the firewall conntrack timeout:

    # Linux-based firewall
    sysctl net.netfilter.nf_conntrack_tcp_timeout_established
    # Or on commercial firewall
    show firewall session timeout
    

  4. Check OS-level TCP keepalive on the application servers:

    sysctl net.ipv4.tcp_keepalive_time
    sysctl net.ipv4.tcp_keepalive_intvl
    sysctl net.ipv4.tcp_keepalive_probes
    

  5. Verify the application enables TCP keepalive on its sockets -- with keepalive active, ss -tno shows a timer:(keepalive,...) field for the connection.

Root Cause

The stateful firewall maintains a connection tracking (conntrack) table. Each TCP connection has an idle timeout -- in this case, 3600 seconds (1 hour). When a connection is idle longer than this timeout, the firewall removes the conntrack entry.

When the application later sends data on this connection, the firewall sees a packet for an unknown session and sends a TCP RST back to the sender (or simply drops the packet, causing the other side's retransmit to eventually trigger a reset).

The application either does not enable TCP keepalive at all, or uses the Linux default of 7200 seconds (2 hours), which is longer than the firewall's 3600-second timeout. The keepalive never fires before the firewall evicts the session.
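The race can be stated as simple arithmetic -- a connection survives idleness only if the first keepalive probe fires before the firewall evicts the session. A small Python sketch of this check, using the values from this incident:

```python
FIREWALL_IDLE_TIMEOUT = 3600   # firewall evicts the conntrack entry after 1 h idle
DEFAULT_KEEPALIVE_TIME = 7200  # Linux default net.ipv4.tcp_keepalive_time (2 h)
PROPOSED_KEEPALIVE_TIME = 1800 # proposed setting from the fix below

def connection_survives(keepalive_time, firewall_timeout):
    """The connection survives an idle period only if the first keepalive
    probe is sent before the firewall drops the session state."""
    return keepalive_time < firewall_timeout

print(connection_survives(DEFAULT_KEEPALIVE_TIME, FIREWALL_IDLE_TIMEOUT))   # False
print(connection_survives(PROPOSED_KEEPALIVE_TIME, FIREWALL_IDLE_TIMEOUT))  # True
```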

Fix

Option A -- Enable/Reduce TCP Keepalive (Preferred):

Set OS-level keepalive below the firewall timeout:

sysctl -w net.ipv4.tcp_keepalive_time=1800
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
Persist in /etc/sysctl.d/99-keepalive.conf.
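The persisted file might look like this (filename taken from above; `sysctl --system` reloads all such files):

```
# /etc/sysctl.d/99-keepalive.conf
net.ipv4.tcp_keepalive_time = 1800
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
```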

Ensure the application enables SO_KEEPALIVE on its sockets.
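Where the application code is under your control, the same values can also be set per socket, overriding the OS-wide defaults for that socket only. A minimal Python sketch (Linux-specific socket option names; the values mirror the sysctls above):

```python
import socket

def enable_keepalive(sock, idle=1800, interval=30, probes=5):
    """Enable TCP keepalive on a socket and override the OS-wide sysctl
    defaults for this socket only (Linux option names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # like tcp_keepalive_time
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # like tcp_keepalive_intvl
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # like tcp_keepalive_probes
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
sock.close()
```

Libraries and connection pools often expose these as configuration options instead; the important part is that SO_KEEPALIVE is actually set, since the sysctls are only defaults for sockets that opt in.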

Option B -- Increase Firewall Timeout:

sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
Note: increasing timeout uses more firewall memory for conntrack entries.

Option C -- Application-Level Heartbeat: Configure the application to send periodic heartbeat/ping on idle connections (e.g., database connection pool validation query every 15 minutes).
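A minimal sketch of the heartbeat decision, assuming a hypothetical send_ping callback (real connection pools expose this as a validation or keepalive setting rather than hand-rolled code):

```python
HEARTBEAT_INTERVAL = 900  # 15 min -- must stay below the firewall's idle timeout

def maybe_heartbeat(last_activity, now, send_ping, interval=HEARTBEAT_INTERVAL):
    """Send an application-level ping if the connection has been idle
    longer than the heartbeat interval; return the new last-activity time."""
    if now - last_activity >= interval:
        send_ping()  # e.g. a pool validation query such as SELECT 1
        return now
    return last_activity

sent = []
last = maybe_heartbeat(last_activity=0, now=900, send_ping=lambda: sent.append("SELECT 1"))
print(last, sent)  # 900 ['SELECT 1']
```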

Rollback / Safety

  • Lowering tcp_keepalive_time generates more keepalive traffic; ensure this is acceptable on high-connection-count servers.
  • Increasing firewall conntrack timeout may exhaust conntrack table on busy firewalls; monitor with conntrack -C.
  • Test with a single server before rolling out fleet-wide.
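Before raising the firewall timeout (Option B), check conntrack headroom. A Python sketch, assuming the standard Linux /proc locations for the counters (reading them requires the nf_conntrack module to be loaded):

```python
def conntrack_utilization(count, maximum):
    """Fraction of the conntrack table in use; alert well below 1.0."""
    return count / maximum

def read_proc_int(path):
    with open(path) as f:
        return int(f.read())

# On a live Linux firewall:
# count = read_proc_int("/proc/sys/net/netfilter/nf_conntrack_count")
# limit = read_proc_int("/proc/sys/net/netfilter/nf_conntrack_max")
# print(f"conntrack usage: {conntrack_utilization(count, limit):.0%}")
print(conntrack_utilization(131072, 262144))  # 0.5
```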

Common Traps

  • Assuming the RST comes from the server application -- it actually comes from the firewall (or is triggered by the firewall dropping the stale session).
  • Setting TCP keepalive at the OS level but the application not enabling SO_KEEPALIVE on its sockets -- OS settings are only defaults for sockets that opt in.
  • Confusing TCP keepalive with HTTP keep-alive -- they are completely different mechanisms.
  • Only fixing one side -- if both client and server traverse the firewall, both should enable keepalive.
  • Not persisting sysctl changes, causing regression after reboot.