Incident Replay: NAT Exhaustion - Intermittent Connectivity
Setup
- System context: Production environment with 500+ servers behind a NAT gateway for outbound internet access. Intermittent connection failures to external APIs during peak hours.
- Time: Wednesday 12:00 UTC
- Your role: Network engineer / on-call SRE
Round 1: Alert Fires
[Pressure cue: "Multiple applications reporting intermittent 'Connection timed out' for external HTTPS calls. Failures spike at peak hours and resolve during off-peak."]
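What you see: NAT gateway logs show 'port allocation failure' events. The single NAT IP has run out of source ports for outbound translation. One IP offers at most 65,535 ports per protocol, and the usable range is typically smaller because low-numbered ports are reserved. At peak, 60,000+ concurrent connections are active.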
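The port budget described above can be checked with a few lines of arithmetic. This is a rough sketch, assuming the NAT device allocates source ports from 1024 to 65535 on each public IP; the real range depends on the platform, and some NAT implementations apply the limit per destination rather than globally. The function names are illustrative.

```python
# Assumption: source ports are allocated from 1024-65535 per NAT IP.
FIRST_PORT = 1024
LAST_PORT = 65535


def usable_ports(nat_ips: int = 1) -> int:
    """Total source ports available across the given number of NAT IPs."""
    return nat_ips * (LAST_PORT - FIRST_PORT + 1)


def utilization(active_connections: int, nat_ips: int = 1) -> float:
    """Fraction of the port pool consumed by active connections."""
    return active_connections / usable_ports(nat_ips)


print(usable_ports())                 # 64512
print(f"{utilization(60_000):.1%}")   # 93.0%
```

Under that assumption, the 60,000 concurrent connections seen at peak already consume about 93% of a single IP's pool, which leaves little headroom for ports still held by recently closed connections.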
Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):
[Result: Systematic investigation reveals the root cause. A growing number of microservices each make external API calls, and none of them use connection pooling, so every request opens a new TCP connection. The single NAT IP cannot support more than about 65K concurrent connections. Proceed to Round 2.]
If you chose C:
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data
[Pressure cue: "Root cause identified. Apply the fix."]
What you see: A growing number of microservices each make external API calls. None of them use connection pooling, so every request opens a new TCP connection. The single NAT IP cannot support more than about 65K concurrent connections.
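To make the pooling gap concrete, here is a minimal sketch of the idea, not the applications' actual client code. `ConnectionPool`, `max_idle`, and the `object` factory standing in for a TCP connection are hypothetical names used only for illustration. It counts how many connections get opened when idle ones are reused.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Reuses idle connections instead of opening a new one per request."""

    def __init__(self, factory, max_idle=10):
        self._factory = factory          # callable that opens a new connection
        self._idle = queue.LifoQueue(maxsize=max_idle)
        self.opened = 0                  # connections opened over the pool's life

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get_nowait()   # reuse an idle connection if any
        except queue.Empty:
            conn = self._factory()           # otherwise open a new one
            self.opened += 1
        try:
            yield conn
        finally:
            try:
                self._idle.put_nowait(conn)  # hand it back for the next request
            except queue.Full:
                pass  # surplus connection; a real pool would close it here


pool = ConnectionPool(factory=object, max_idle=10)
for _ in range(1000):
    with pool.connection():
        pass  # a real caller would send the HTTPS request here

print(pool.opened)  # 1
```

With one connection per request, the same loop would open 1,000 connections, each consuming a NAT source port. In practice an HTTP client library's built-in pooling (for example a shared `requests.Session` in Python) plays this role; the scenario does not say which clients the affected services use.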
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):
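[Result: Add NAT IPs to expand the port pool. Implement connection pooling in the applications so that requests reuse established connections. Where the NAT platform allows it, shorten how long closed connections in TCP TIME_WAIT hold their port mappings. Monitor NAT port utilization. Service restored and verified. Proceed to Round 3.]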
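The "add NAT IPs" step raises the question of how many. The sketch below is one way to size the pool, assuming 64,512 usable ports per IP (ports 1024 to 65535) and a hypothetical 70% target utilization; both figures are illustrative and should be replaced with the platform's documented limits.

```python
import math

PORTS_PER_NAT_IP = 64_512  # assumption: ports 1024-65535 usable per IP


def nat_ips_needed(peak_connections: int, target_utilization: float = 0.7) -> int:
    """Smallest number of NAT IPs that keeps peak usage at or below the target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(peak_connections / (PORTS_PER_NAT_IP * target_utilization))


print(nat_ips_needed(60_000))   # 2
print(nat_ips_needed(5_000))    # 1
```

The second call illustrates why pooling matters as much as extra IPs: if connection reuse brought the peak down to a few thousand concurrent connections (a hypothetical figure), a single IP would again be enough.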
If you chose A:
[Result: Fix applied but not verified. May not be complete.]
If you chose C:
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:
[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
Round 3: Root Cause Identification
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see: Root cause confirmed. A growing number of microservices each make external API calls. None of them use connection pooling, so every request opens a new TCP connection. The single NAT IP cannot support more than about 65K concurrent connections.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
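For the monitoring part, a port-utilization alert could be evaluated roughly as follows. This is a sketch only: `PortAlertPolicy`, the 70% and 90% thresholds, and the 64,512 ports-per-IP figure are assumptions for illustration, and the connection count would come from whatever metrics the NAT gateway exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortAlertPolicy:
    ports_per_ip: int = 64_512   # assumption: ports 1024-65535 usable per IP
    warn_at: float = 0.70        # hypothetical warning threshold
    critical_at: float = 0.90    # hypothetical paging threshold

    def evaluate(self, active_connections: int, nat_ips: int) -> str:
        """Classify current NAT port usage as ok, warning, or critical."""
        utilization = active_connections / (self.ports_per_ip * nat_ips)
        if utilization >= self.critical_at:
            return "critical"
        if utilization >= self.warn_at:
            return "warning"
        return "ok"


policy = PortAlertPolicy()
print(policy.evaluate(60_000, nat_ips=1))  # critical
print(policy.evaluate(60_000, nat_ips=2))  # ok
```

Alerting well below 100% is the point: the incident's 60,000 connections on one IP would have crossed a warning threshold before port allocation started to fail.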
If you chose A:
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:
[Result: Monitoring detects faster but does not prevent.]
If you chose C:
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
Round 4: Remediation
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
Damage Report
- Total downtime: Varies based on path chosen
- Blast radius: Every service that makes outbound calls through the NAT gateway, plus the systems that depend on those services
- Optimal resolution time: 10 minutes
- If every wrong choice was made: 60 minutes + additional damage
Cross-References
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking