Incident Replay: Asymmetric Routing — Traffic Works One Direction Only

Setup

  • System context: Multi-path network with two WAN links. Traffic to an external partner API works outbound but responses never arrive. The partner confirms they send responses.
  • Time: Tuesday 09:30 UTC
  • Your role: Network engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "Partner API integration is broken. Our requests reach them (they see them in their logs) but we never get responses. Contract SLA breach in 2 hours."]

What you see: tcpdump -i eth0 host partner.api.com shows SYN packets leaving but no SYN-ACK returning. traceroute to the partner goes via WAN link A. The partner says they see our requests, and that their replies to our source IP are routed toward WAN link B.

Choose your action:

  • A) Check if the partner is sending responses to the wrong IP
  • B) Check the routing table for asymmetric paths
  • C) Check if a firewall is dropping the return traffic
  • D) Contact the partner's network team

[Result: Outbound traffic to the partner exits via WAN link A (source IP: 203.0.113.1). But the partner's return traffic enters via WAN link B because their ISP routes to 203.0.113.1 via a different path. The firewall on WAN link B drops the return SYN-ACK because it has no matching session (the session was initiated on WAN link A). Classic asymmetric routing. Proceed to Round 2.]

If you chose A:

[Result: Partner is sending to the correct source IP. The issue is the return path, not the destination address.]

If you chose C:

[Result: The firewall IS dropping traffic, but for a legitimate reason — no session state for traffic arriving on a different interface. You need to find out why traffic is asymmetric.]

If you chose D:

[Result: Partner's network team is not the issue — they are routing correctly based on their BGP tables.]
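
The drop described in this round can be modelled in a few lines. This is a minimal sketch, not the behaviour of any particular firewall product: each WAN link gets its own session table, and an inbound packet is accepted only if the reverse flow was recorded on that same link. The class names, the ports, and the partner address 198.51.100.10 are assumptions made for illustration; only 203.0.113.1 comes from the scenario.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    """A TCP flow identified by its address/port 4-tuple."""
    src: str
    sport: int
    dst: str
    dport: int

    def reversed(self) -> "Flow":
        # The return direction swaps source and destination.
        return Flow(self.dst, self.dport, self.src, self.sport)


@dataclass
class StatefulFirewall:
    """Toy stateful firewall: one independent session table per WAN link."""
    link: str
    sessions: set = field(default_factory=set)

    def outbound(self, flow: Flow) -> None:
        # An outbound SYN creates session state on this link only.
        self.sessions.add(flow)

    def inbound(self, flow: Flow) -> str:
        # Return traffic is accepted only if its reverse flow is known here.
        return "ACCEPT" if flow.reversed() in self.sessions else "DROP"


fw_a = StatefulFirewall("WAN-A")
fw_b = StatefulFirewall("WAN-B")

# Hypothetical flow: our source IP (from the scenario) to an assumed partner IP.
syn = Flow("203.0.113.1", 40000, "198.51.100.10", 443)
fw_a.outbound(syn)            # SYN leaves via link A
syn_ack = syn.reversed()      # partner's reply

verdict_b = fw_b.inbound(syn_ack)   # reply arrives on link B: no session there
verdict_a = fw_a.inbound(syn_ack)   # the same reply on link A would match
print(f"link B: {verdict_b}, link A: {verdict_a}")
```

The same packet gets opposite verdicts depending only on the link it arrives on, which is why the firewall is behaving correctly and the routing is what needs fixing.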

Round 2: First Triage Data

[Pressure cue: "Asymmetric routing confirmed. Our outbound goes via link A but returns via link B where the firewall drops it."]

What you see: The network has two WAN links, each with a default route. Outbound traffic uses link A (lower metric). Our /24 is announced via both links, and the partner's ISP prefers the BGP path through link B, so return traffic arrives there. The stateful firewall on each link only accepts return traffic matching its own session table.

Choose your action:

  • A) Add a policy route to force partner traffic to always use link A (both directions)
  • B) Enable stateful session sync between the two firewalls
  • C) Disable stateful inspection for traffic to/from the partner
  • D) Ask the partner to pin their return route via link A

[Result: Add a policy-based route: traffic to/from the partner's IP range always uses WAN link A. Source NAT ensures outbound uses link A's IP. Return traffic from the partner arrives on link A where the firewall session exists. Connectivity restored. Proceed to Round 3.]

If you chose B:

[Result: Session sync between firewalls is complex and may introduce latency. Correct for HA failover but overkill for a single partner route.]

If you chose C:

[Result: Disabling stateful inspection weakens security. Not acceptable for production traffic.]

If you chose D:

[Result: You cannot control the partner's routing. Their ISP chooses the path based on BGP.]
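
The egress decision behind the policy route in this round can be sketched as follows. This only models outbound link selection on our side; it is an illustration under stated assumptions, not a router configuration. The partner prefix 198.51.100.0/24, the link names, and the metrics are hypothetical. The metrics are deliberately set so that the default route prefers link B, to show that the policy route pins partner traffic to link A regardless.

```python
import ipaddress
from typing import Dict, List, Optional, Tuple

# Hypothetical partner prefix (documentation range); not from the scenario.
PARTNER_PREFIX = ipaddress.ip_network("198.51.100.0/24")

# Policy routes are consulted before the default routes.
POLICY_ROUTES = [(PARTNER_PREFIX, "WAN-A")]


def pick_egress(
    dst: str,
    defaults: List[Tuple[str, int]],
    link_up: Dict[str, bool],
) -> Optional[str]:
    """Return the egress link for dst, or None if no link is usable."""
    addr = ipaddress.ip_address(dst)
    # 1. A matching policy route wins while its pinned link is up.
    for prefix, link in POLICY_ROUTES:
        if addr in prefix and link_up.get(link, False):
            return link
    # 2. Otherwise use the lowest-metric default route whose link is up.
    for link, _metric in sorted(defaults, key=lambda route: route[1]):
        if link_up.get(link, False):
            return link
    return None


# Assumed metrics: here the default route prefers link B.
defaults = [("WAN-A", 200), ("WAN-B", 100)]
all_up = {"WAN-A": True, "WAN-B": True}

partner_link = pick_egress("198.51.100.10", defaults, all_up)
other_link = pick_egress("192.0.2.50", defaults, all_up)
failover_link = pick_egress(
    "198.51.100.10", defaults, {"WAN-A": False, "WAN-B": True}
)
print(partner_link, other_link, failover_link)
```

Falling back to the default routes when the pinned link is down is a design choice in this sketch; it keeps the partner reachable during a link A outage instead of blackholing the traffic.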

Round 3: Root Cause Identification

[Pressure cue: "Partner connectivity restored. Document and prevent."]

What you see: Root cause: Dual WAN setup with stateful firewalls requires policy routing for critical partners to ensure symmetric paths. The network was designed for outbound load balancing but did not account for return path asymmetry with specific destinations.

Choose your action:

  • A) Add policy routes for all critical external partners
  • B) Implement source-based routing so return traffic always uses the same link as outbound
  • C) Deploy firewall session sync as a general solution
  • D) A + B for immediate fix and architectural improvement

[Result: Policy routes for known critical partners (immediate). Source-based routing using ip rule for general symmetric routing (architectural). Proceed to Round 4.]

If you chose A:

[Result: Fixes known partners but new integrations could hit the same issue.]

If you chose B:

[Result: General solution but requires careful testing to avoid breaking other flows.]

If you chose C:

[Result: Session sync is expensive and complex. Policy routing is simpler for this problem.]
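
As a rough illustration of the source-based routing mentioned in this round, the sketch below generates iproute2 commands that give each WAN link its own routing table and select that table by source address. Everything except 203.0.113.1 is an assumption: the device names, gateways, table numbers, and link B's address are hypothetical, and the output is meant to be reviewed by hand, not piped into a shell.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WanLink:
    name: str
    dev: str        # interface name (assumed)
    source_ip: str  # address that traffic on this link is sourced from
    gateway: str    # next hop on this link (assumed)
    table: int      # routing table number reserved for this link (assumed)


def source_routing_commands(links: List[WanLink]) -> List[str]:
    """Build one per-link default route plus one source-match rule per link."""
    commands = []
    for link in links:
        # Per-link table with its own default route.
        commands.append(
            f"ip route add default via {link.gateway} "
            f"dev {link.dev} table {link.table}"
        )
        # Packets sourced from this link's address use this link's table.
        commands.append(
            f"ip rule add from {link.source_ip} lookup {link.table}"
        )
    return commands


links = [
    WanLink("WAN-A", "eth0", "203.0.113.1", "203.0.113.254", 100),
    WanLink("WAN-B", "eth1", "192.0.2.1", "192.0.2.254", 200),  # hypothetical
]
commands = source_routing_commands(links)
for command in commands:
    print(command)
```

Rules like these control which link a packet leaves on based on its source address. Whether the reply comes back on the same link still depends on that source address being reachable from the outside only through the matching link, which is why careful testing is needed before rolling this out.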

Round 4: Remediation

[Pressure cue: "Routing fixed. Verify and close."]

Actions:

  1. Verify partner API connectivity: curl https://partner.api.com/health
  2. Verify return path is via link A: tcpdump confirms SYN-ACK on same interface as SYN
  3. Test failover: verify traffic moves to link B cleanly if link A fails
  4. Document policy routing requirements for dual-WAN setup
  5. Add connectivity monitoring for all critical external partners
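
The symmetry check in step 2 can be expressed as a small function over capture records. This is a sketch under assumptions: the records are hand-written stand-ins for what a capture on both WAN interfaces might show (tcpdump prints a SYN as flags "S" and a SYN-ACK as "S."), and the interface names and partner address are hypothetical.

```python
from typing import List, NamedTuple


class Pkt(NamedTuple):
    iface: str   # interface the packet was seen on
    src: str
    dst: str
    flags: str   # "S" for SYN, "S." for SYN-ACK


def handshake_is_symmetric(
    packets: List[Pkt], local_ip: str, partner_ip: str
) -> bool:
    """True if SYNs and their SYN-ACKs were seen on the same interface(s)."""
    syn_ifaces = {
        p.iface for p in packets
        if p.flags == "S" and p.src == local_ip and p.dst == partner_ip
    }
    synack_ifaces = {
        p.iface for p in packets
        if p.flags == "S." and p.src == partner_ip and p.dst == local_ip
    }
    # No SYN at all means there is nothing to verify, so report False.
    return bool(syn_ifaces) and syn_ifaces == synack_ifaces


LOCAL = "203.0.113.1"
PARTNER = "198.51.100.10"  # hypothetical partner address

before_fix = [
    Pkt("eth0", LOCAL, PARTNER, "S"),
    Pkt("eth1", PARTNER, LOCAL, "S."),   # reply arrives on the other link
]
after_fix = [
    Pkt("eth0", LOCAL, PARTNER, "S"),
    Pkt("eth0", PARTNER, LOCAL, "S."),   # reply arrives on the same link
]

symmetric_before = handshake_is_symmetric(before_fix, LOCAL, PARTNER)
symmetric_after = handshake_is_symmetric(after_fix, LOCAL, PARTNER)
print(symmetric_before, symmetric_after)
```

A check like this could also back the monitoring in step 5, flagging a partner whose handshakes stop being symmetric before an integration fails outright.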

Damage Report

  • Total downtime: 0 (other services unaffected; only partner API broken)
  • Blast radius: Partner API integration down for 4 hours before detection
  • Optimal resolution time: 15 minutes (identify asymmetry -> add policy route -> verify)
  • If every wrong choice was made: 4+ hours with firewall changes and partner coordination

Cross-References