Incident Replay: NAT Exhaustion — Intermittent Connectivity¶

Setup¶

System context: Production environment with 500+ servers behind a NAT gateway for outbound internet access. Intermittent connection failures to external APIs during peak hours.
Time: Wednesday 12:00 UTC
Your role: Network engineer / on-call SRE

Round 1: Alert Fires¶

[Pressure cue: "Multiple applications reporting intermittent 'Connection timed out' for external HTTPS calls. Failures spike at peak hours and resolve during off-peak."]

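What you see: NAT gateway logs show 'port allocation failure' events. The single NAT IP has exhausted its source-port range. A port number is 16 bits, so one IP offers at most 65,535 ports per protocol, and the range a NAT device actually allocates from is usually smaller. At peak, 60,000+ concurrent connections are active.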
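The headroom implied by those numbers can be checked with quick arithmetic. This is a minimal sketch, assuming the NAT device allocates source ports from 1024 to 65535 for a single protocol; that range and the function name are illustrative assumptions, not values taken from the gateway in this scenario.

```python
def port_utilization(active_connections: int,
                     port_min: int = 1024,
                     port_max: int = 65535) -> float:
    """Return the fraction of the usable source-port range in use.

    port_min and port_max are assumed defaults; check the actual
    allocation range of your NAT device.
    """
    usable = port_max - port_min + 1  # inclusive range: 64,512 ports
    return active_connections / usable


peak = 60_000  # peak concurrent connections from the scenario
print(f"{port_utilization(peak):.1%} of usable ports in use at peak")
# prints: 93.0% of usable ports in use at peak
```

At that level a small burst is enough to trigger allocation failures, which is consistent with failures appearing only at peak hours.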

Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue

If you chose A:¶

[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]

If you chose B (recommended):¶

[Result: Systematic investigation reveals the root cause. A growing number of microservices each make external API calls without connection pooling, so every request opens a new TCP connection. A single NAT IP cannot support more than about 65K concurrent connections. Proceed to Round 2.]
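One way to confirm the "new connection per request" pattern is to count sockets by TCP state on an application host. The sketch below parses the text of Linux's `/proc/net/tcp`; the function name is hypothetical, and it covers IPv4 sockets on that host only, not the NAT gateway's own translation table.

```python
from collections import Counter

# Hexadecimal state codes from the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str) -> Counter:
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # fields: slot, local address, remote address, state, ...
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            print(count_tcp_states(f.read()).most_common())
    except OSError:
        print("/proc/net/tcp is not readable on this platform")
```

A large TIME_WAIT count relative to ESTABLISHED on a client host suggests many short-lived connections rather than reused ones.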

If you chose C:¶

[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]

If you chose D:¶

[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]

Round 2: First Triage Data¶

[Pressure cue: "Root cause identified. Apply the fix."]

What you see: A growing number of microservices each make external API calls. Connection pooling is not used, so each request opens a new TCP connection. The single NAT IP cannot support more than about 65K concurrent connections.
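To illustrate why pooling reduces NAT port consumption, here is a minimal sketch of a fixed-size connection pool using only the standard library. The class name, the `factory` parameter, and the eager creation of connections are all assumptions made for illustration. A production pool would also create connections lazily and replace broken ones.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that lends out reusable connections."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # open `size` connections up front

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return for reuse instead of closing


# Usage with a stand-in factory that records how many connections open.
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=4)
for _ in range(1000):
    with pool.connection() as conn:
        pass  # a real caller would send its request over conn here

print(len(opened))  # prints 4: 1000 requests shared 4 connections
```

Without the pool, the same 1000 requests would each open a connection and hold a NAT port for the connection's lifetime plus any post-close hold time.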

Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window

If you chose B (recommended):¶

[Result: Add NAT IPs to expand the port pool. Implement connection pooling in applications. Where the NAT device allows it, shorten the timeout it applies to closed connections in TIME_WAIT so ports are released sooner. Monitor NAT port utilization. Service restored and verified. Proceed to Round 3.]
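How many NAT IPs to add can be estimated from the peak connection count. This sketch assumes 64,512 usable ports per IP (ports 1024 to 65535) and a 70% target utilization; both figures and the function name are illustrative assumptions, not measurements from this incident.

```python
import math


def nat_ips_needed(peak_connections: int,
                   usable_ports_per_ip: int = 64_512,
                   target_utilization: float = 0.70) -> int:
    """Estimate the NAT IP count that keeps peak port usage under target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    budget_per_ip = usable_ports_per_ip * target_utilization
    return math.ceil(peak_connections / budget_per_ip)


print(nat_ips_needed(60_000))   # prints 2
print(nat_ips_needed(100_000))  # prints 3
```

Sizing against a target below 100% leaves room for bursts and for ports still held by recently closed connections.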

If you chose A:¶

[Result: Fix applied but not verified. May not be complete.]

If you chose C:¶

[Result: Broader fix is correct long-term but takes longer to implement during an incident.]

If you chose D:¶

[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]

Round 3: Root Cause Identification¶

[Pressure cue: "Service restored. Document and prevent recurrence."]

What you see: Root cause confirmed. A growing number of microservices each make external API calls. Connection pooling is not used, so each request opens a new TCP connection. The single NAT IP cannot support more than about 65K concurrent connections.

Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above

If you chose D (recommended):¶

[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
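As one possible shape for the monitoring piece, the sketch below flags NAT port utilization only when it stays above a threshold for several consecutive samples, which avoids paging on a single spike. The 80% threshold, the three-sample window, and the function name are assumptions to tune for your environment.

```python
from typing import Iterable


def sustained_breach(samples: Iterable[float],
                     threshold: float = 0.80,
                     consecutive: int = 3) -> bool:
    """Return True if `consecutive` samples in a row are >= threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0  # reset on any dip
        if run >= consecutive:
            return True
    return False


# Utilization samples (fraction of usable ports), oldest first.
print(sustained_breach([0.55, 0.82, 0.88, 0.93]))  # prints True
print(sustained_breach([0.85, 0.60, 0.90, 0.85]))  # prints False
```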

If you chose A:¶

[Result: Documentation helps but relies on humans remembering to check it.]

If you chose B:¶

[Result: Monitoring detects faster but does not prevent.]

If you chose C:¶

[Result: Automation prevents recurrence but needs monitoring for edge cases.]

Round 4: Remediation¶

[Pressure cue: "Verify everything and close the incident."]

Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure

Damage Report¶

Total downtime: Varies based on path chosen
Blast radius: Affected services and dependent systems
Optimal resolution time: 10 minutes
If every wrong choice was made: 60 minutes + additional damage
