Incident Replay: Proxy ARP Bypassing Firewall Policy

Setup

  • System context: A /16 network that was split into /24 subnets. Hosts in different /24s can still reach each other even without routes to the new subnets, because proxy ARP is enabled on the router.
  • Time: Friday 15:00 UTC
  • Your role: Network engineer

Round 1: Alert Fires

[Pressure cue: "Security audit finds hosts in different subnets communicating directly without going through the firewall. 'How is 10.1.1.50 reaching 10.1.2.50 without a route?'"]

What you see: The ARP table on 10.1.1.50 shows the router's MAC address for 10.1.2.50. The router is answering ARP requests for 10.1.2.50 on behalf of that host (proxy ARP). Traffic does pass through the router, but it bypasses the inter-subnet firewall rules because the hosts treat it as same-subnet traffic.
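A minimal sketch of the mechanism, using Python's stdlib ipaddress module and the host addresses from the scenario: a host that kept the stale /16 mask considers 10.1.2.50 on-link and ARPs for it directly, which is the request the router's proxy ARP answers.

```python
import ipaddress

def is_on_link(src_iface: str, dst_ip: str) -> bool:
    """True if dst_ip falls inside the source host's configured subnet,
    i.e. the host will ARP for it directly instead of routing."""
    network = ipaddress.ip_interface(src_iface).network
    return ipaddress.ip_address(dst_ip) in network

# Host still carrying the stale /16 mask: 10.1.2.50 looks on-link,
# so the host ARPs for it and the router's proxy ARP answers.
print(is_on_link("10.1.1.50/16", "10.1.2.50"))  # True

# Correctly re-masked host: 10.1.2.50 is off-link, so traffic is routed
# through the gateway and subject to inter-subnet firewall policy.
print(is_on_link("10.1.1.50/24", "10.1.2.50"))  # False
```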

Choose your action:

  • A) Apply a quick workaround to restore service
  • B) Investigate the root cause systematically
  • C) Escalate to the vendor or upstream provider
  • D) Check if a recent change caused the issue

If you chose A:

[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]

If you chose B:

[Result: Systematic investigation reveals the root cause. When the /16 was subnetted into /24s, proxy ARP was left enabled on the router. Some hosts still have /16 subnet masks (not updated) and send ARP requests for hosts in other /24s. The router responds with its own MAC, creating a 'transparent' bridge that bypasses security policy. Proceed to Round 2.]

If you chose C:

[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]

If you chose D:

[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]

Round 2: First Triage Data

[Pressure cue: "Root cause identified. Apply the fix."]

What you see: When the /16 was subnetted into /24s, proxy ARP was left enabled on the router. Some hosts still have /16 subnet masks (not updated) and send ARP requests for hosts in other /24s. The router responds with its own MAC, creating a 'transparent' bridge that bypasses security policy.

Choose your action:

  • A) Apply the targeted fix
  • B) Apply the fix and verify with testing
  • C) Apply a broader fix that addresses the class of problem
  • D) Document and schedule the fix for the next maintenance window

If you chose B:

[Result: Disable proxy ARP on all router interfaces: no ip proxy-arp. Fix host subnet masks to /24. Update DHCP to serve /24 masks. Verify firewall rules apply to inter-subnet traffic. Service restored and verified. Proceed to Round 3.]

If you chose A:

[Result: Fix applied but not verified. May not be complete.]

If you chose C:

[Result: Broader fix is correct long-term but takes longer to implement during an incident.]

If you chose D:

[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
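Part of verifying the Round 2 fix is confirming no host still carries the old /16 mask after the DHCP change. A sketch of that audit, assuming host interface addresses have been collected from some inventory source (DHCP leases, config management); the snapshot data is hypothetical:

```python
import ipaddress

EXPECTED_PREFIX = 24  # post-split design: every host should carry a /24 mask

def stale_masks(host_ifaces):
    """Return interfaces whose prefix is wider than /24 -- the leftover
    /16 masks that made proxy ARP silently bridge the subnets."""
    return [
        iface for iface in host_ifaces
        if ipaddress.ip_interface(iface).network.prefixlen < EXPECTED_PREFIX
    ]

# Hypothetical inventory snapshot taken after pushing the DHCP change:
snapshot = ["10.1.1.50/24", "10.1.2.50/24", "10.1.3.10/16"]
print(stale_masks(snapshot))  # ['10.1.3.10/16'] -- still needs a lease renewal
```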

Round 3: Root Cause Identification

[Pressure cue: "Service restored. Document and prevent recurrence."]

What you see: Root cause confirmed: When the /16 was subnetted into /24s, proxy ARP was left enabled on the router. Some hosts still have /16 subnet masks (not updated) and send ARP requests for hosts in other /24s. The router responds with its own MAC, creating a 'transparent' bridge that bypasses security policy.

Choose your action:

  • A) Document the fix in the runbook
  • B) Add monitoring to detect this condition
  • C) Add the fix to automation/configuration management
  • D) All of the above

If you chose D:

[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]

If you chose A:

[Result: Documentation helps but relies on humans remembering to check it.]

If you chose B:

[Result: Monitoring detects faster but does not prevent.]

If you chose C:

[Result: Automation prevents recurrence but needs monitoring for edge cases.]
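The monitoring check from option B can be sketched as follows: flag ARP entries where an off-subnet IP resolves to the router's MAC, which is the signature proxy ARP leaves behind. The router MAC and ARP table contents below are hypothetical, standing in for data a real check would scrape from hosts (e.g. the output of arp -an or ip neigh):

```python
import ipaddress

ROUTER_MAC = "00:11:22:33:44:55"  # hypothetical router MAC, for illustration

def proxy_arp_suspects(local_net: str, arp_entries):
    """Return IPs outside our own subnet that resolve to the router's MAC --
    evidence that proxy ARP is answering on their behalf."""
    net = ipaddress.ip_network(local_net)
    return [
        ip for ip, mac in arp_entries
        if mac.lower() == ROUTER_MAC and ipaddress.ip_address(ip) not in net
    ]

# Hypothetical ARP table as seen on 10.1.1.50:
table = [
    ("10.1.1.1", "00:11:22:33:44:55"),   # default gateway: expected
    ("10.1.1.60", "aa:bb:cc:dd:ee:01"),  # same-subnet neighbor: expected
    ("10.1.2.50", "00:11:22:33:44:55"),  # off-subnet IP with router MAC: proxy ARP
]
print(proxy_arp_suspects("10.1.1.0/24", table))  # ['10.1.2.50']
```

The gateway itself is excluded because it sits inside the local /24; only off-subnet entries carrying the router's MAC are flagged.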

Round 4: Remediation

[Pressure cue: "Verify everything and close the incident."]

Actions:

  1. Verify service is functioning correctly end-to-end
  2. Verify monitoring detects the condition
  3. Update runbooks and configuration management
  4. Schedule post-mortem review
  5. Check for similar issues across the infrastructure

Damage Report

  • Total downtime: Varies based on path chosen
  • Blast radius: Affected services and dependent systems
  • Optimal resolution time: 12 minutes
  • If every wrong choice was made: 60 minutes + additional damage

Cross-References