Incident Replay: TCP RST After Idle¶
Setup¶
- System context: Long-lived database connections between application servers and a database cluster. Connections drop with 'Connection reset by peer' after exactly 15 minutes of idle time.
- Time: Wednesday 07:00 UTC
- Your role: On-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Application logs show 'Connection reset by peer' for database connections that have been idle for exactly 900 seconds. Connection pools are being exhausted."]
What you see: The database server is not resetting the connections; it still shows them as ESTABLISHED. A firewall between the app servers and the database has a session timeout of 900 seconds (15 minutes). The firewall drops its conntrack entry after that idle period, and the next packet from either side is treated as out-of-state and answered with an RST.
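One way to confirm a path-level idle timeout is to probe it directly. A minimal sketch, assuming a reachable TCP endpoint (host and port here are placeholders, not this incident's hosts): hold a connection idle, then send, and see whether the path resets it. Bisecting idle_s around 900 seconds would pin down the firewall's timeout.

```python
import socket
import time

def survives_idle(host: str, port: int, idle_s: float) -> bool:
    """Connect, stay idle for idle_s seconds, then send; False if the path reset us."""
    with socket.create_connection((host, port), timeout=5) as s:
        time.sleep(idle_s)
        try:
            s.sendall(b"x")   # first packet after the idle window
            s.recv(1)         # an RST typically surfaces on this read
            return True
        except (ConnectionResetError, BrokenPipeError, socket.timeout):
            return False
```

Running this against the database port with idle_s just below and just above 900 would distinguish a firewall timeout from, say, a server-side kill.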
Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:¶
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):¶
[Result: Systematic investigation reveals the root cause. Firewall conntrack idle timeout (900s) is shorter than the application's connection pool idle timeout (3600s). The firewall drops the session state, and subsequent packets on the connection are treated as invalid and get RST responses. Proceed to Round 2.]
If you chose C:¶
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:¶
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data¶
[Pressure cue: "Root cause identified. Apply the fix."]
What you see: Firewall conntrack idle timeout (900s) is shorter than the application's connection pool idle timeout (3600s). The firewall drops the session state, and subsequent packets on the connection are treated as invalid and get RST responses.
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):¶
[Result: Either increase the firewall conntrack timeout to exceed the app's idle timeout (3600s), or enable TCP keepalives on the database connections with an interval shorter than 900 seconds, e.g. sysctl -w net.ipv4.tcp_keepalive_time=300 (persist it under /etc/sysctl.d/ so it survives reboots). Service restored and verified. Proceed to Round 3.]
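The sysctl is system-wide; a narrower alternative is enabling keepalives per connection. A sketch of that application-side fix, assuming Linux socket options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) and illustrative values chosen to fire well inside the firewall's 900 s window:

```python
import socket

def enable_keepalive(sock: socket.socket, idle_s: int = 300,
                     interval_s: int = 30, count: int = 3) -> None:
    """Send the first keepalive probe after idle_s of silence, then every
    interval_s, declaring the peer dead after count unanswered probes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

In practice, prefer the driver's own knobs where they exist (for example, libpq's keepalives_idle connection parameter for PostgreSQL) over patching raw sockets.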
If you chose A:¶
[Result: Fix applied but not verified. May not be complete.]
If you chose C:¶
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:¶
[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
Round 3: Root Cause Identification¶
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see: Root cause confirmed: Firewall conntrack idle timeout (900s) is shorter than the application's connection pool idle timeout (3600s). The firewall drops the session state, and subsequent packets on the connection are treated as invalid and get RST responses.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):¶
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
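The monitoring piece can be a simple configuration invariant. A hypothetical check (hop names and data source are illustrative; the timeout values are this incident's numbers): alert whenever any hop on the path has an idle timeout shorter than the application's pool idle timeout, which is exactly the mismatch behind this incident.

```python
def find_timeout_conflicts(path_timeouts_s: dict, pool_idle_s: int) -> list:
    """Return hops whose idle timeout would silently kill pooled connections."""
    return sorted(name for name, t in path_timeouts_s.items() if t < pool_idle_s)

# Incident numbers: firewall 900 s vs. pool idle 3600 s.
print(find_timeout_conflicts({"edge-firewall": 900, "load-balancer": 3600}, 3600))
# → ['edge-firewall']
```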
If you chose A:¶
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:¶
[Result: Monitoring detects faster but does not prevent.]
If you chose C:¶
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
Round 4: Remediation¶
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
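If the keepalive sysctl was the fix applied, step 1 can include confirming the live kernel setting. A verification sketch assuming a Linux app server (the 900 s constant is this incident's firewall timeout; the /proc path is the standard location for net.ipv4.tcp_keepalive_time):

```python
from pathlib import Path

FIREWALL_IDLE_TIMEOUT_S = 900  # this incident's firewall conntrack timeout

def keepalive_fires_in_time(
    setting_path: str = "/proc/sys/net/ipv4/tcp_keepalive_time",
) -> bool:
    """True if the kernel's keepalive timer fires before the firewall idles out."""
    return int(Path(setting_path).read_text()) < FIREWALL_IDLE_TIMEOUT_S
```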
Damage Report¶
- Total downtime: Varies based on path chosen
- Blast radius: Affected services and dependent systems
- Optimal resolution time: 12 minutes
- If every wrong choice was made: 60 minutes + additional damage
Cross-References¶
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking