Level: L2: Operations | Topics: Routing, Linux Networking Tools | Domain: Networking
Scenario: Asymmetric Routing Through a Stateful Firewall
Situation
At 14:22 UTC, the monitoring team reports that a subset of application servers in subnet 10.20.30.0/24 can no longer reach the payment gateway at 203.0.113.50. The issue appeared suddenly after a network maintenance window that added a second upstream link for redundancy. Connections time out after 30 seconds, and the on-call engineer confirms that "ping works fine but TCP connections never complete."
What You Know
- The affected servers have two default gateways available via ECMP or a recently added static route
- A stateful firewall (iptables with conntrack) sits in front of the primary gateway
- Ping (ICMP) works to the destination, but TCP connections to port 443 hang
- The problem started after a second upstream link was added during the maintenance window
- Non-affected servers in a different subnet use only the primary gateway
Investigation Steps
1. Check the routing table for multiple paths
Command(s):
What to look for: Multiple default routes or routes to the destination via different next hops. If you see two "default via" entries with equal metrics, or one default route followed by several "nexthop via" lines, ECMP is active and traffic may leave via one gateway but return via another.
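One way to spot multiple next hops without reading the table by eye is sketched below. It assumes iproute2's usual ip route show text format, in which ECMP appears either as several "default via" lines or as one "default" route followed by indented "nexthop via" lines; count_default_nexthops is a hypothetical helper name, not an iproute2 command.

```shell
# Hypothetical helper: count distinct default-route next hops in
# "ip route show" output read from stdin. Handles both forms iproute2 prints:
#   default via 10.20.30.1 dev eth0 metric 100      (one line per route)
#   default proto static metric 100                 (multipath route, followed by)
#       nexthop via 10.20.30.1 dev eth0 weight 1    (indented nexthop lines)
count_default_nexthops() {
  awk '
    # Record the address that follows the "via" keyword on the current line.
    function add_via(   i) {
      for (i = 1; i < NF; i++) if ($i == "via") hops[$(i + 1)] = 1
    }
    /^default via /                  { add_via(); in_multi = 0; next }
    /^default/                       { in_multi = 1; next }
    in_multi && /^[ \t]+nexthop via/ { add_via(); next }
                                     { in_multi = 0 }
    END { n = 0; for (h in hops) n++; print n }
  '
}
```

For example, ip route show | count_default_nexthops printing 2 on an affected server means two default next hops are installed; ip route get 203.0.113.50 then shows which one the kernel selects for the payment gateway.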
2. Capture traffic to confirm the asymmetry
Command(s):
# On the application server: watch outbound SYNs and look for missing SYN-ACKs
tcpdump -nn -i eth0 host 203.0.113.50 and port 443
# On the firewall/primary gateway: check whether the return traffic arrives here
# (-i any captures on every interface, since the upstream NIC may not be eth0)
tcpdump -nn -i any host 203.0.113.50 and port 443
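To compare the two captures without scrolling through them, a sketch like the following counts handshake packets in tcpdump -nn text output (for example, a capture saved with tcpdump ... > capture.txt). It assumes tcpdump's usual flag notation, where [S] marks a bare SYN and [S.] a SYN-ACK; syn_summary is a hypothetical helper.

```shell
# Hypothetical helper: summarise tcpdump -nn text output read from stdin by
# counting bare SYNs ("Flags [S],") and SYN-ACKs ("Flags [S.],").
# On an affected app server, expect SYN > 0 and SYN-ACK = 0; on the firewall
# the replies come back through, expect SYN-ACKs with no matching SYNs.
syn_summary() {
  awk '
    index($0, "Flags [S],")  { syn++ }
    index($0, "Flags [S.],") { synack++ }
    END { printf "SYN=%d SYN-ACK=%d\n", syn, synack }
  '
}
```

Plain substring matching is used instead of a regex so the brackets and dot in the flag notation need no escaping.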
3. Trace both directions to confirm the path difference
Command(s):
# From the app server to the destination
mtr -n -r -c 10 203.0.113.50
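# On the firewall: list tracked flows to the destination; entries for the
# affected servers are missing because their SYNs never crossed this firewall
conntrack -L -d 203.0.113.50
# Per-CPU conntrack statistics; the invalid= counter is the one to watch
conntrack -S
# Check firewall logs for dropped packets. INVALID packets are not logged by
# default: enable net.netfilter.nf_conntrack_log_invalid or add a LOG rule
dmesg | grep -iE "nf_conntrack|nf_ct" | tail -20
iptables -L -v -n | grep -i drop
What to look for: the invalid= counter in conntrack -S output keeps increasing while TCP tests run, and so does the packet counter on the rule that drops INVALID traffic. The drop= counter covers packets conntrack itself discards (for example when the table is full) and need not move here.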
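Because conntrack -S prints one line per CPU, its counters are easier to compare as a single total. This is a sketch assuming the "cpu=N found=... invalid=... drop=..." key=value layout that conntrack-tools normally prints; sum_ct_counter is a hypothetical helper, and the counter name is passed as an argument.

```shell
# Hypothetical helper: total one counter (e.g. invalid or drop) across the
# per-CPU lines of "conntrack -S" output read from stdin.
# Assumes each line is a run of key=value fields.
sum_ct_counter() {
  awk -v key="$1" '
    {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) total += kv[2]
      }
    }
    END { print total + 0 }
  '
}
```

For example, run conntrack -S | sum_ct_counter invalid on gateway A twice, a few seconds apart, while repeating a TCP test; a total that grows between samples fits the asymmetric-return diagnosis. Exact key matching keeps early_drop from being counted as drop.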
4. Verify with a forced host route
Command(s):
# Temporarily force traffic to the destination through the primary gateway only
ip route add 203.0.113.50/32 via 10.20.30.1 dev eth0
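# Test connectivity: a "Connected to" line in the verbose output shows the TCP
# handshake completed, even if TLS verification of the bare IP fails afterwards
curl -v --connect-timeout 5 https://203.0.113.50/health
# Remove the temporary route after the test
ip route del 203.0.113.50/32 via 10.20.30.1 dev eth0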
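Before trusting the curl result, it is worth confirming that the kernel really selects gateway A for 203.0.113.50 while the temporary route is in place. A minimal sketch, assuming the usual ip route get output where the next hop follows the "via" keyword; route_get_nexthop is a hypothetical helper.

```shell
# Hypothetical helper: print the next hop ("via" address) from
# "ip route get <dst>" output read from stdin. Prints nothing when the
# destination is directly connected (no "via" keyword in the output).
route_get_nexthop() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "via") { print $(i + 1); exit } }'
}
```

Usage: [ "$(ip route get 203.0.113.50 | route_get_nexthop)" = "10.20.30.1" ] && echo "pinned to gateway A".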
Root Cause
The maintenance window added a second default gateway with equal cost. ECMP hashing sent some TCP SYN packets out via gateway B (the new link), but the remote server's SYN-ACK returned via gateway A (the original link) because the remote network's routing preferred that path. The stateful firewall on gateway A had no conntrack entry for this flow (the SYN never passed through it), so conntrack classified the SYN-ACK as INVALID and the firewall dropped it. Ping most likely kept working for one of two reasons: the firewall ruleset accepts ICMP echo replies without requiring an ESTABLISHED or RELATED conntrack state, or the ICMP traffic happened to hash onto the gateway A path, so both the echo request and the reply traversed the firewall.
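The state decision that breaks the handshake can be illustrated with a toy model. The sketch below is a simplification written for this scenario, not the kernel's conntrack code: it replays the packets of one TCP flow as a single firewall sees them and prints the state an iptables ctstate match would report. ct_classify and its "direction flags" input format are hypothetical.

```shell
# Toy model, NOT the kernel's conntrack implementation: label the packets of
# ONE TCP flow as a stateful firewall's ctstate match would see them.
# Input lines (stdin): "<direction> <flags>", direction = orig|reply,
# flags = SYN, SYN-ACK or ACK. Simplification: bare ACKs on an untracked flow
# are labelled INVALID here, although with nf_conntrack_tcp_loose=1 the
# kernel can pick them up as a new mid-stream connection.
ct_classify() {
  awk '
    {
      dir = $1; flags = $2
      if (!tracked) {
        if (dir == "orig" && flags == "SYN") { tracked = 1; print "NEW" }
        else print "INVALID"   # no entry, e.g. a SYN-ACK whose SYN left via another gateway
      } else if (dir == "reply") {
        replied = 1; print "ESTABLISHED"
      } else {
        print (replied ? "ESTABLISHED" : "NEW")   # originator packets stay NEW until a reply is seen
      }
    }
  '
}
```

Feeding it "reply SYN-ACK" alone (what gateway A sees when the SYN left via gateway B) yields INVALID, while the symmetric sequence "orig SYN", "reply SYN-ACK", "orig ACK" yields NEW, ESTABLISHED, ESTABLISHED.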
Fix
Immediate:
# Remove the equal-cost route causing the split, or pin critical traffic to one gateway
ip route del default via 10.20.30.2
# Or add a specific route for the payment gateway through the firewall path
ip route add 203.0.113.50/32 via 10.20.30.1
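Note that ip route add fails with "RTNETLINK answers: File exists" if the route is already present, which makes a fix script awkward to re-run. A hedged sketch of an idempotent variant uses ip route replace, which creates the route or overwrites an existing one; pin_route and the DRY_RUN variable are hypothetical.

```shell
# Hypothetical wrapper: pin a destination /32 to one gateway idempotently.
# "ip route replace" creates the route or overwrites an existing one, so a
# second run does not fail the way "ip route add" does.
# Set DRY_RUN to any non-empty value to print the command instead of running it.
pin_route() {
  dst=$1 gw=$2 dev=$3
  ${DRY_RUN:+echo} ip route replace "$dst/32" via "$gw" dev "$dev"
}
```

DRY_RUN=1 pin_route 203.0.113.50 10.20.30.1 eth0 prints the command for review; running it without DRY_RUN applies the route (root required).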
Preventive:
- Implement policy-based routing so that return traffic always leaves via the same gateway it arrived on (reverse path symmetry). Use ip rule and separate routing tables:
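# Register each table name once; re-running must not append duplicate entries
grep -qw isp1 /etc/iproute2/rt_tables || echo "100 isp1" >> /etc/iproute2/rt_tables
grep -qw isp2 /etc/iproute2/rt_tables || echo "200 isp2" >> /etc/iproute2/rt_tables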
ip rule add from 10.20.30.0/24 lookup isp1
ip route add default via 10.20.30.1 table isp1
- Enable reverse path filtering (rp_filter) with martian logging (net.ipv4.conf.all.log_martians=1) to detect asymmetry early rather than silently dropping.
- Add synthetic TCP health checks (not just ICMP) to monitoring for critical destinations.
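Because rp_filter has both an "all" and a per-interface setting, it helps to check what is actually enforced before relying on it. The sketch below follows the kernel's ip-sysctl documentation, which says the larger of conf/all/rp_filter and conf/<iface>/rp_filter applies (0 = off, 1 = strict, 2 = loose); rp_filter_mode is a hypothetical helper that takes the two values as arguments so it can run without touching /proc.

```shell
# Hypothetical helper: report the effective reverse-path-filter mode for an
# interface, given the conf/all and conf/<iface> rp_filter values.
# The kernel applies the larger of the two: 0 = off, 1 = strict, 2 = loose.
rp_filter_mode() {
  all=$1 iface=$2
  if [ "$all" -gt "$iface" ]; then eff=$all; else eff=$iface; fi
  case $eff in
    0) echo off ;;
    1) echo strict ;;
    2) echo loose ;;
    *) echo "unknown ($eff)" ;;
  esac
}
```

On a gateway, rp_filter_mode "$(cat /proc/sys/net/ipv4/conf/all/rp_filter)" "$(cat /proc/sys/net/ipv4/conf/eth1/rp_filter)" reports the mode for eth1 (interface name assumed). Strict mode rejects packets that arrive on an interface other than the one the kernel would use to reach their source, which is exactly the asymmetric case.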
Common Mistakes
- Assuming "ping works so the network is fine." Ping is ICMP and often gets different treatment than TCP through stateful firewalls.
- Blaming the remote server or the application. The SYN-ACK is being sent; it is just getting dropped before it reaches the client.
- Adding firewall rules to allow INVALID packets as a workaround. This masks the problem and opens security holes.
- Not checking both the outbound and return path. You must trace in both directions to identify asymmetry.
Interview Angle
Q: A server can ping an external host but TCP connections time out. What do you investigate?
Good answer shape: Start with the distinction between ICMP and TCP behavior through stateful devices. Mention that ping success with TCP failure often points to a stateful firewall or NAT device that lacks the connection tracking entry for return traffic. Describe capturing with tcpdump to confirm SYNs leave but SYN-ACKs never arrive, then checking routing tables for multiple paths that could cause asymmetry. Explain that the fix involves ensuring traffic symmetry through policy routing or source-based routing rules, not by weakening firewall rules.
Wiki Navigation
Prerequisites
- Networking Deep Dive (Topic Pack, L1)
Related Content
- Case Study: Source Routing Policy Miss (Case Study, L2) — Linux Networking Tools, Routing
- Deep Dive: AWS VPC Internals (deep_dive, L2) — Linux Networking Tools, Routing
- Networking Deep Dive (Topic Pack, L1) — Linux Networking Tools, Routing
- Case Study: API Latency Spike — BGP Route Leak, Fix Is Network ACL (Case Study, L2) — Linux Networking Tools
- Case Study: ARP Flux Duplicate IP (Case Study, L2) — Linux Networking Tools
- Case Study: Asymmetric Routing One Direction (Case Study, L2) — Routing
- Case Study: BGP Peer Flapping (Case Study, L2) — Routing
- Case Study: DHCP Relay Broken (Case Study, L1) — Linux Networking Tools
- Case Study: Duplex Mismatch Symptoms (Case Study, L1) — Linux Networking Tools
- Case Study: IPTables Blocking Unexpected (Case Study, L2) — Linux Networking Tools