Routing Footguns¶
Mistakes that blackhole traffic, cause outages, or create hard-to-diagnose connectivity issues.
1. Deleting the default route on a remote host¶
You run ip route del default while SSH'd into the server. Your session stays alive briefly, then dies. The host can only communicate with directly-connected networks. You need console access to restore the route.
Fix: Before modifying the default route, add the new one first: ip route add default via <new-gw>. Or schedule a rollback: echo "ip route add default via 10.0.0.1" | at now + 5 minutes.
Gotcha: The at rollback trick only works if atd is running. A safer alternative is sleep 300 && ip route add default via 10.0.0.1 & run in the background. If your change works, kill the background job; if it fails, the route is restored in 5 minutes. Always have out-of-band access (IPMI/BMC, serial console, cloud console) before touching routes on remote hosts.
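The change-with-rollback pattern above can be sketched end to end. This is a minimal sketch: the gateway addresses are assumptions, and the commands require root on the target host.

```shell
#!/bin/bash
# Sketch: change the default route with an automatic rollback.
# OLD_GW and NEW_GW are assumed addresses -- substitute your own.
OLD_GW=10.0.0.1
NEW_GW=10.0.0.254

# Schedule the rollback *before* touching anything.
( sleep 300 && ip route replace default via "$OLD_GW" ) &
ROLLBACK_PID=$!

# Make the change. "replace" swaps the route atomically, so there is
# no window in which the host has no default route at all.
ip route replace default via "$NEW_GW"

# If your SSH session survives, the change worked -- cancel the rollback:
#   kill "$ROLLBACK_PID"
```

Using ip route replace instead of a delete-then-add pair avoids the brief blackhole between the two commands.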
2. Adding a more-specific route that hijacks traffic¶
You add ip route add 10.0.0.0/8 via 10.0.99.1 to reach a remote network. This route is more specific than the default but less specific than existing /24 routes. It captures all 10.x.x.x traffic not covered by more specific routes — including traffic that should go through VPN tunnels or container networks.
Fix: Use the most specific prefix possible. Add 10.200.0.0/16 instead of 10.0.0.0/8. Check ip route show for overlapping routes before adding.
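You can check a candidate prefix against existing routes numerically before adding it. A minimal sketch in pure shell arithmetic; cidr_to_range and overlaps are hypothetical helpers written for illustration, not standard tools:

```shell
# Convert "a.b.c.d/len" to "first last" as 32-bit integers.
cidr_to_range() {
  ip=${1%/*}; len=${1#*/}
  oldIFS=$IFS; IFS=.
  set -- $ip
  IFS=$oldIFS
  n=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  if [ "$len" -eq 0 ]; then mask=0
  else mask=$(( (0xffffffff << (32 - len)) & 0xffffffff )); fi
  echo "$(( n & mask )) $(( (n & mask) | (~mask & 0xffffffff) ))"
}

# overlaps A B -> exit 0 if CIDRs A and B share any address.
overlaps() {
  set -- $(cidr_to_range "$1") $(cidr_to_range "$2")
  # Ranges [a1,a2] and [b1,b2] overlap iff a1 <= b2 and b1 <= a2.
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

overlaps 10.0.0.0/8 10.200.0.0/16 && echo conflict   # prints "conflict"
```

Run overlaps against each prefix in ip route show before committing a broad route like a /8.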
3. Forgetting that routes are not persistent¶
You fix a routing issue with ip route add. The host reboots. The route is gone. The problem returns. This cycle repeats until someone persists the route in the network configuration.
Fix: Always persist route changes. nmcli: nmcli con mod eth0 +ipv4.routes "10.200.0.0/16 10.0.0.254". Netplan: add routes: to the interface config. Verify routes survive a reboot.
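On Netplan systems, persisting the route looks roughly like this. The file name, interface, and addresses are assumptions; adjust them to your environment.

```shell
# Sketch: persist a static route with Netplan (requires root).
cat <<'EOF' > /etc/netplan/60-static-routes.yaml
network:
  version: 2
  ethernets:
    eth0:
      routes:
        - to: 10.200.0.0/16
          via: 10.0.0.254
EOF
netplan try    # applies the config and auto-reverts if you don't confirm
```

netplan try is the safer apply variant for remote hosts: it rolls the change back automatically if you lose connectivity and cannot confirm it.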
4. Not checking ip rule show for policy-based routing¶
Traffic is not following the expected route. You stare at ip route show and everything looks correct. But an ip rule is redirecting traffic from certain source addresses to a different routing table. Policy routes are invisible in the main table.
Fix: Always check ip rule show and ip route show table all when debugging routing issues. PBR rules evaluate before the main table.
Debug clue: ip route get <destination> shows you which route the kernel will actually use for a specific destination, including policy routing table lookups. This is the single most useful command for routing debugging: it tells you the answer, not just the configuration.
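A PBR-aware debugging sequence might look like the following; the addresses are examples, not values from a real setup.

```shell
# Sketch: debug routing with policy rules in mind.
ip rule show                    # policy rules, evaluated top-down before main
ip route show table all         # routes in every table, not just the main one
ip route get 10.200.5.1         # the route the kernel will actually pick
ip route get 10.200.5.1 from 10.0.0.50   # same lookup for a given source IP
```

The from variant matters because policy rules often match on source address, so the same destination can route differently per source.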
5. Asymmetric routing through a stateful firewall¶
Forward traffic goes through the firewall; return traffic takes a different path. The firewall never sees the return packet and drops it as invalid. TCP connections reset randomly. The issue is intermittent because only some flows are affected.
Fix: Ensure stateful devices (firewalls, NAT) see both directions of a flow. Check the forward and return paths with ip route get <destination> from <source>, run on each endpoint. Consider rp_filter settings.
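Checking both directions of a flow can be sketched as follows; 10.0.0.50 (client) and 10.200.5.1 (server) are example addresses.

```shell
# Sketch: verify the forward and return paths of one flow.
# On the client (10.0.0.50): which path do packets to the server take?
ip route get 10.200.5.1 from 10.0.0.50
# On the server (10.200.5.1): which path do replies to the client take?
ip route get 10.0.0.50 from 10.200.5.1
# If the two paths traverse different firewalls, state tracking breaks.
# Reverse-path filtering mode: 1 = strict (drops asymmetric), 2 = loose.
sysctl net.ipv4.conf.all.rp_filter
```

If the two outputs show different intermediate networks, the stateful device in the forward path will never see the return packets.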
6. Docker and Kubernetes injecting surprise routes¶
You add a Docker bridge or deploy a Kubernetes cluster. Suddenly 172.17.0.0/16 and 10.244.0.0/16 routes appear. These can conflict with existing corporate subnets. Traffic to your 172.17.x.x office network now goes to a Docker bridge instead.
Fix: Configure Docker and Kubernetes with non-conflicting CIDRs before deployment. Docker: --bip and --default-address-pools. Kubernetes: --pod-network-cidr and --service-cidr.
Default trap: Docker defaults to 172.17.0.0/16 for the bridge and 172.18.0.0/16 through 172.31.0.0/16 for user-defined networks. Kubernetes Flannel defaults to 10.244.0.0/16, Calico to 192.168.0.0/16. If your corporate network uses any of these RFC 1918 ranges, you will have a conflict. Check before deploying; changing CIDRs after deployment is painful.
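For Docker, the pools can be pinned in daemon.json before the first container is started. The 10.250.x/10.251.x ranges below are assumptions; pick ranges that are unused in your network.

```shell
# Sketch: set non-conflicting Docker address pools (requires root;
# best done before any networks or containers exist).
cat <<'EOF' > /etc/docker/daemon.json
{
  "bip": "10.250.0.1/24",
  "default-address-pools": [
    { "base": "10.251.0.0/16", "size": 24 }
  ]
}
EOF
systemctl restart docker
```

bip controls the default docker0 bridge; default-address-pools controls what user-defined networks are carved from (here, /24 subnets out of 10.251.0.0/16).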
7. Using device names instead of gateway IPs in routes¶
You add ip route add 10.200.0.0/16 dev eth0. This creates an on-link route: the kernel assumes every destination in the prefix is directly reachable on eth0 and sends an ARP request for each destination IP instead of forwarding to a gateway. Unless a router on the segment answers with proxy ARP, those requests go unanswered and the traffic blackholes.
Fix: Always specify via <gateway-ip> unless the destination is truly on the same L2 segment. Use dev only for connected subnets.
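The distinction can be sketched side by side; 10.0.0.254 is an example gateway on eth0's subnet.

```shell
# Sketch: gateway route vs. on-link route (requires root).
# WRONG for a remote prefix: the kernel ARPs for every destination IP.
ip route add 10.200.0.0/16 dev eth0
# RIGHT: hand packets to a gateway that knows how to reach the prefix.
ip route add 10.200.0.0/16 via 10.0.0.254
# "dev" alone is only correct for the directly connected subnet itself:
ip route add 10.0.0.0/24 dev eth0
```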
8. Route flapping from unstable links¶
A link goes up and down repeatedly. The routing daemon adds and removes routes each time. Every topology change triggers reconvergence. Downstream hosts see intermittent connectivity. The root cause is a bad cable or flaky SFP, but the symptom looks like a routing problem.
Fix: Monitor ip monitor route for rapid changes. Check the underlying link: ethtool -S ethX | grep error. Set dampening on dynamic routing protocols to suppress flapping routes.
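Correlating route churn with physical-layer errors might look like this; eth0 is an example interface.

```shell
# Sketch: is the "routing problem" actually a flapping link?
ip monitor route &          # stream route add/del events as they happen
MON_PID=$!
# Count of link up/down transitions since boot -- a rising number
# during the route churn points at the physical layer.
cat /sys/class/net/eth0/carrier_changes
# CRC/symbol/error counters implicate the cable or SFP:
ethtool -S eth0 | grep -i err
kill "$MON_PID"
```

If carrier_changes climbs while routes churn, fix the cable or optic first; no amount of routing-protocol tuning will stabilize a flapping link.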
9. Conflicting metrics on multiple default routes¶
You have two default routes with the same metric — DHCP on eth0 and a VPN on wg0. The kernel uses the first one added, but after a DHCP renewal, the order may change. Traffic randomly switches between paths depending on timing.
Fix: Set explicit metrics. Primary: ip route add default via 10.0.0.1 metric 100. Secondary: ip route add default via 10.0.1.1 metric 200. Lower metric wins.
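Pinning the metrics explicitly can be sketched as follows (gateways and interfaces are the example values from above).

```shell
# Sketch: deterministic primary/backup default routes (requires root).
ip route replace default via 10.0.0.1 dev eth0 metric 100   # primary
ip route replace default via 10.0.1.1 dev wg0 metric 200    # backup
ip route show default    # both routes coexist; lowest metric wins
# For DHCP-managed interfaces, set the metric in the client config so
# renewals keep it, e.g. with NetworkManager:
#   nmcli con mod eth0 ipv4.route-metric 100
```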
10. Not verifying the next-hop is actually reachable¶
You add a static route ip route add 10.200.0.0/16 via 10.0.0.254. The gateway 10.0.0.254 does not exist or is down. The route is accepted silently. Traffic matches this route and is forwarded to a gateway that never responds. ARP shows FAILED for the next-hop.
Fix: Always verify the next-hop is alive before adding a route: ping -c 1 10.0.0.254. Check ip neigh show 10.0.0.254 after adding the route to confirm ARP resolves.
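The verify-then-add step can be wrapped in a small guard. add_route_checked is a hypothetical helper written for illustration, not a standard tool, and the route addition itself requires root.

```shell
# Sketch: refuse to add a route whose next-hop does not answer pings.
add_route_checked() {
  prefix=$1 gw=$2
  if ping -c 1 -W 2 "$gw" > /dev/null 2>&1; then
    ip route add "$prefix" via "$gw"
  else
    echo "refusing: next-hop $gw is unreachable" >&2
    return 1
  fi
}

# Example usage (as root):
#   add_route_checked 10.200.0.0/16 10.0.0.254
#   ip neigh show 10.0.0.254   # should show REACHABLE/STALE, not FAILED
```

The caveat: some gateways drop ICMP echo, so a failed ping is a prompt to investigate, not absolute proof the next-hop is down.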