Incident Replay: Proxy ARP Bypassing Firewall Policy¶
Setup¶
- System context: Network with a /16 subnet that was split into /24 subnets. Some hosts on different /24s can still reach each other while bypassing inter-subnet firewall policy, because proxy ARP is enabled on the router.
- Time: Friday 15:00 UTC
- Your role: Network engineer
Round 1: Alert Fires¶
[Pressure cue: "Security audit finds hosts in different subnets communicating directly without going through the firewall. 'How is 10.1.1.50 reaching 10.1.2.50 without a route?'"]
What you see: ARP table on 10.1.1.50 shows the MAC address for 10.1.2.50 is the router's MAC. The router is answering ARP requests for 10.1.2.50 on behalf of that host — proxy ARP. Traffic goes through the router but bypasses firewall rules because it appears to be same-subnet traffic.
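The telltale symptom above, one MAC address answering ARP for addresses in other subnets, can be spotted programmatically. A minimal sketch, assuming Linux `ip -4 neigh` output format; the sample lines and MAC addresses are illustrative, not captured from the incident:

```python
# Flag a likely proxy-ARP symptom: one MAC appearing as the link-layer
# address for multiple IPs in the local ARP cache. In a real check you
# would feed in `ip -4 neigh` output via subprocess.
from collections import defaultdict

sample = """\
10.1.1.1 dev eth0 lladdr aa:bb:cc:00:00:01 REACHABLE
10.1.2.50 dev eth0 lladdr aa:bb:cc:00:00:01 STALE
10.1.1.60 dev eth0 lladdr aa:bb:cc:00:00:60 REACHABLE
"""

def macs_answering_for_many_ips(neigh_output, threshold=2):
    """Return MACs that appear as the lladdr for >= threshold IPs."""
    by_mac = defaultdict(list)
    for line in neigh_output.splitlines():
        fields = line.split()
        if "lladdr" in fields:
            mac = fields[fields.index("lladdr") + 1]
            by_mac[mac].append(fields[0])
    return {mac: ips for mac, ips in by_mac.items() if len(ips) >= threshold}

# aa:bb:cc:00:00:01 answers for both the gateway IP and 10.1.2.50,
# which is the proxy-ARP signature described above.
print(macs_answering_for_many_ips(sample))
```

A single shared MAC is not proof by itself (VRRP/HSRP virtual routers also do this), so treat a hit as a lead to confirm on the router, not a conclusion.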
Choose your action:
- A) Apply a quick workaround (e.g., block the flagged traffic)
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:¶
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):¶
[Result: Systematic investigation reveals the root cause. When the /16 was subnetted into /24s, proxy ARP was left enabled on the router. Some hosts still have /16 subnet masks (not updated) and send ARP requests for hosts in other /24s. The router responds with its own MAC, creating a 'transparent' bridge that bypasses security policy. Proceed to Round 2.]
If you chose C:¶
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:¶
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data¶
[Pressure cue: "Root cause identified. Apply the fix."]
What you see: When the /16 was subnetted into /24s, proxy ARP was left enabled on the router. Some hosts still have /16 subnet masks (not updated) and send ARP requests for hosts in other /24s. The router responds with its own MAC, creating a 'transparent' bridge that bypasses security policy.
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):¶
[Result: Disable proxy ARP on every router interface (`no ip proxy-arp`, applied per interface on Cisco IOS). Correct host subnet masks to /24, update DHCP to serve /24 masks, and verify that firewall rules now apply to inter-subnet traffic. Service restored and verified. Proceed to Round 3.]
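Part of the verification in the result above is confirming no host retains a stale /16 mask. A sketch of that audit; the `inventory` dict is hypothetical, and in practice you would pull interface configurations from your CMDB or via SSH/SNMP:

```python
# Audit a host inventory for stale /16 masks left over from the
# resubnetting. Hosts listed here are invented examples.
import ipaddress

inventory = {
    "web-01": "10.1.1.50/24",
    "db-01": "10.1.2.50/16",   # stale mask, should be /24
}

def stale_masks(hosts, expected_prefix=24):
    """Return host names whose configured prefix length is wrong."""
    return [name for name, cidr in hosts.items()
            if ipaddress.ip_interface(cidr).network.prefixlen != expected_prefix]

print(stale_masks(inventory))  # ['db-01']
```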
If you chose A:¶
[Result: Fix applied but not verified. May not be complete.]
If you chose C:¶
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:¶
[Result: Delaying the fix leaves the firewall bypass in place, extending the security exposure. Apply now if possible.]
Round 3: Root Cause Identification¶
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see: Root cause confirmed: stale /16 subnet masks on some hosts, combined with proxy ARP left enabled on the router after the /16 was split into /24s, let inter-subnet traffic bypass firewall policy.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):¶
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
If you chose A:¶
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:¶
[Result: Monitoring detects faster but does not prevent.]
If you chose C:¶
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
Round 4: Remediation¶
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
Damage Report¶
- Total downtime: None as an outage; the cost is the security-exposure window, which varies based on path chosen
- Blast radius: Hosts with stale /16 masks and any inter-subnet traffic that bypassed firewall policy
- Optimal resolution time: 12 minutes
- If every wrong choice was made: 60 minutes + additional damage
Cross-References¶
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking