Incident Replay: Source Routing Policy Miss¶
Setup¶
- System context: Multi-homed server with traffic that should exit via specific WAN links based on source IP. Policy routing rules are configured but some traffic is using the wrong link.
- Time: Tuesday 14:00 UTC
- Your role: Network engineer / SRE
Round 1: Alert Fires¶
[Pressure cue: "Billing API traffic should exit via the dedicated WAN link (low-latency) but is going out the general-purpose link (high-latency). Billing transactions are timing out."]
What you see:
ip rule show has a policy rule for source 10.1.1.0/24 pointing at table 100 (the dedicated link). But ip route get 203.0.113.1 from 10.1.1.50 shows the traffic resolving through the main table, not table 100. The policy rule sits at priority 32766, the same priority as the main table lookup, so it never wins over the main table.
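The observed state can be checked mechanically. A minimal sketch, run here against a captured (hypothetical) ip rule show transcript matching the incident; on a live host you would pipe the real command output instead:

```shell
# Diagnosis sketch: flag any policy rule that ties with the main-table lookup
# at priority 32766. "rules" is a hypothetical captured transcript; on a live
# host, use: rules=$(ip rule show)
rules='0:      from all lookup local
32766:  from 10.1.1.0/24 lookup 100
32766:  from all lookup main
32767:  from all lookup default'

# A non-main rule sharing priority 32766 cannot reliably beat the main lookup.
clash=$(printf '%s\n' "$rules" | awk -F: '$1 == "32766" && $2 !~ /lookup main/')
if [ -n "$clash" ]; then
  echo "priority clash with main table: $clash"
fi
```

An empty clash variable means every policy rule sits at a distinct priority from the main lookup, which is the healthy state.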
Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:¶
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):¶
[Result: Systematic investigation reveals the root cause. The policy rule was added without an explicit high priority, landing at 32766, the same level as the main routing table lookup. Rules that share a priority are evaluated in list order; the main-table lookup matches first (the main table holds a default route), so the policy rule never takes effect. Proceed to Round 2.]
If you chose C:¶
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:¶
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data¶
[Pressure cue: "Root cause identified. Apply the fix."]
What you see: The policy rule was added without an explicit high priority, landing at 32766, the same level as the main routing table lookup. Rules that share a priority are evaluated in list order; the main-table lookup matches first (the main table holds a default route), so the policy rule never takes effect.
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):¶
[Result: Add the policy rule at a higher priority (a lower number): ip rule add from 10.1.1.0/24 table 100 priority 100. Delete the old rule: ip rule del from 10.1.1.0/24 table 100 priority 32766. Verify with ip route get 203.0.113.1 from 10.1.1.50. Service restored and verified. Proceed to Round 3.]
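The recommended fix can be sketched as a dry run. The wrapper below only echoes each command; change it to execute for real (as root) once the commands are reviewed:

```shell
# Fix sketch (dry run): re-add the policy rule at a higher priority (lower
# number), drop the old tied rule, then verify the resolved route.
# run() only echoes here; change it to run() { "$@"; } and execute as root
# to actually apply the change.
run() { echo "+ $*"; }

run ip rule add from 10.1.1.0/24 table 100 priority 100
run ip rule del from 10.1.1.0/24 table 100 priority 32766
run ip route get 203.0.113.1 from 10.1.1.50
```

Adding the new rule before deleting the old one keeps a matching rule in place throughout, so there is no window where billing traffic has no policy rule at all.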
If you chose A:¶
[Result: Fix applied but not verified. May not be complete.]
If you chose C:¶
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:¶
[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
Round 3: Root Cause Identification¶
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see: Root cause confirmed: the policy rule was added without an explicit high priority, landing at 32766, the same level as the main routing table lookup. Rules that share a priority are evaluated in list order; the main-table lookup matches first, so the policy rule never takes effect.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):¶
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
If you chose A:¶
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:¶
[Result: Monitoring detects faster but does not prevent.]
If you chose C:¶
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
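Option B's monitoring can be sketched as a probe that asserts billing-source traffic resolves via table 100. A minimal sketch against a captured (hypothetical) transcript; the next-hop address and device name are assumptions, and a real probe would parse live ip route get output:

```shell
# Monitoring sketch: alert when billing-source traffic is not resolved through
# table 100. "route" is a hypothetical captured transcript of:
#   ip route get 203.0.113.1 from 10.1.1.50
route='203.0.113.1 from 10.1.1.50 via 198.51.100.1 dev wan1 table 100 uid 0'

case "$route" in
  *" table 100 "*) status="OK: policy route in effect" ;;
  *)               status="ALERT: billing traffic not using table 100" ;;
esac
echo "$status"
```

Wired into an existing check framework, the ALERT branch would page before billing timeouts do.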
Round 4: Remediation¶
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
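Runtime ip rule changes do not survive a reboot, so step 3's configuration-management update should persist the rule. A hedged sketch for a host managed by systemd-networkd; the file path, interface name, and match details are assumptions:

```ini
# /etc/systemd/network/10-wan-dedicated.network  (hypothetical path)
[Match]
Name=wan1

[RoutingPolicyRule]
From=10.1.1.0/24
Table=100
Priority=100
```

Hosts using ifupdown, NetworkManager, or a configuration-management tool would express the same rule in their own syntax; the key point is that the explicit Priority=100 is captured somewhere that survives a reboot.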
Damage Report¶
- Total downtime: Varies based on path chosen
- Blast radius: Affected services and dependent systems
- Optimal resolution time: 15 minutes
- If every wrong choice was made: 90 minutes + additional damage
Cross-References¶
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking