Incident Replay: ARP Flux - Duplicate IP Detection
Setup
- System context: Multi-homed server with two network interfaces on the same subnet. ARP responses are coming from both interfaces, causing intermittent connectivity and duplicate IP warnings on the network.
- Time: Monday 14:00 UTC
- Your role: Network engineer / on-call SRE
Round 1: Alert Fires
[Pressure cue: "Network monitoring detects duplicate IP address alerts for 10.1.1.50. Intermittent connectivity to the production database server. DBA team escalating."]
What you see: ARP table on the switch shows the IP 10.1.1.50 flapping between two MAC addresses. The database server has two NICs (eth0 and eth1) both on the 10.1.1.0/24 subnet.
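The flapping described above can be spotted mechanically by collecting ARP table samples over time and flagging any IP seen with more than one MAC. The sketch below assumes a simplified "IP MAC" line format rather than any particular switch's output, and the MAC addresses are made up for illustration.

```python
from collections import defaultdict


def find_flapping_ips(samples):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    Each sample is assumed to be a whitespace-separated "IP MAC" line,
    a simplified stand-in for whatever the switch's ARP table prints.
    """
    macs_by_ip = defaultdict(set)
    for line in samples:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ip, mac = fields[0], fields[1].lower()
        macs_by_ip[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical samples taken a few seconds apart; the MACs are invented.
samples = [
    "10.1.1.50 aa:bb:cc:00:00:01",
    "10.1.1.50 aa:bb:cc:00:00:02",
    "10.1.1.50 aa:bb:cc:00:00:01",
    "10.1.1.1  aa:bb:cc:00:00:ff",
]
flapping = find_flapping_ips(samples)
print(flapping)
```

An IP that shows up in the result is either a real duplicate (another device) or ARP flux from one multi-homed host; telling them apart still requires checking who owns the MACs.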
Choose your action:
- A) Shut down one of the network interfaces on the server
- B) Check the server's ARP and routing configuration for both interfaces
- C) Add a static ARP entry on the switch for the correct MAC
- D) Check for another device using the same IP
If you chose B (recommended):
[Result: The server has net.ipv4.conf.all.arp_announce = 0 and net.ipv4.conf.all.arp_filter = 0 (defaults). With these defaults, the kernel responds to ARP requests for 10.1.1.50 on BOTH interfaces, advertising two different MACs. This is ARP flux. Proceed to Round 2.]
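When checking these values on a host, note that the kernel combines the conf/all value with the per-interface value, so reading "all" alone can mislead. The following is a minimal sketch that reads both from /proc/sys and reports the effective setting per interface. The function names are hypothetical, and `is_flux_prone` is only a rough heuristic based on arp_ignore and arp_filter both being 0.

```python
from pathlib import Path

ARP_KEYS = ("arp_announce", "arp_ignore", "arp_filter")


def effective_arp_settings(iface, root="/proc/sys/net/ipv4/conf"):
    """Return the effective ARP sysctls for one interface.

    For these three keys the kernel combines conf/all with conf/<iface>:
    the higher value is used (for arp_filter, either one enabling it).
    """
    result = {}
    for key in ARP_KEYS:
        values = []
        for scope in ("all", iface):
            raw = (Path(root) / scope / key).read_text().strip()
            values.append(int(raw))
        result[key] = max(values)
    return result


def is_flux_prone(settings):
    """Rough heuristic: True when ARP replies are not restricted per interface."""
    return settings["arp_ignore"] == 0 and settings["arp_filter"] == 0
```

On the affected server, `effective_arp_settings("eth0")` run before the fix would be expected to report 0 for arp_announce and arp_filter, matching the result above.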
If you chose A:
[Result: Disabling one NIC fixes the ARP flux but defeats the purpose of the multi-homed configuration (redundancy/throughput).]
If you chose C:
[Result: Static ARP on the switch pins one MAC but the server still responds with the other MAC to ARP requests from other devices.]
If you chose D:
[Result: No duplicate IP conflict from another device. Both MACs belong to the same server.]
Round 2: First Triage Data
[Pressure cue: "Database connectivity is intermittent — half the connections go to eth0, half to eth1. Some packets arrive on the wrong interface and get dropped."]
What you see: ARP flux occurs when a multi-homed host responds to ARP requests on all interfaces regardless of which interface owns the IP. The fix is to enable ARP filtering so each interface only responds for its own addresses.
Choose your action:
- A) Set net.ipv4.conf.all.arp_filter = 1
- B) Set net.ipv4.conf.all.arp_announce = 2 and net.ipv4.conf.all.arp_ignore = 1
- C) Bond the two interfaces instead of multi-homing
- D) Move each interface to a separate VLAN/subnet
If you chose B (recommended):
[Result: Run sysctl -w net.ipv4.conf.all.arp_announce=2 (use best local address) and sysctl -w net.ipv4.conf.all.arp_ignore=1 (respond only if address is on the incoming interface). ARP flux stops. Each interface advertises only its own MAC. Proceed to Round 3.]
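To see why arp_ignore=1 stops the flux, it can help to model the reply decision. The sketch below is a simplified model, not kernel code: it assumes the ARP broadcast reaches both NICs (they share a subnet) and covers only arp_ignore modes 0 and 1. The second IP and both MACs are hypothetical; only 10.1.1.50 comes from the scenario.

```python
def arp_repliers(interfaces, target_ip, arp_ignore):
    """Return the MACs that would answer an ARP request for target_ip.

    interfaces maps name -> (ip, mac). The request is assumed to be a
    broadcast that reaches every interface, since both sit on one subnet.
    Only arp_ignore modes 0 and 1 are modelled.
    """
    if arp_ignore not in (0, 1):
        raise ValueError("only arp_ignore 0 and 1 are modelled here")
    local_ips = {ip for ip, _mac in interfaces.values()}
    repliers = []
    for _name, (ip, mac) in interfaces.items():
        if arp_ignore == 0:
            # Mode 0: reply for any local address, whichever NIC heard it.
            answers = target_ip in local_ips
        else:
            # Mode 1: reply only if the address is on the receiving NIC.
            answers = target_ip == ip
        if answers:
            repliers.append(mac)
    return repliers


# Hypothetical addressing: eth1's IP and both MACs are invented.
nics = {
    "eth0": ("10.1.1.50", "aa:bb:cc:00:00:01"),
    "eth1": ("10.1.1.51", "aa:bb:cc:00:00:02"),
}
print(arp_repliers(nics, "10.1.1.50", arp_ignore=0))  # both MACs answer
print(arp_repliers(nics, "10.1.1.50", arp_ignore=1))  # only eth0's MAC answers
```

With mode 0 the switch sees two competing answers for one IP, which is the flapping from Round 1; with mode 1 only the interface that owns the address replies.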
If you chose A:
[Result: arp_filter=1 makes the kernel answer an ARP request only if it would route traffic back to the requester through the interface that received the request. With both NICs on the same subnet this generally needs source-based routing to be effective, so arp_announce plus arp_ignore is the more direct fix here.]
If you chose C:
[Result: Bonding is a valid alternative but changes the architecture. The current design intentionally uses separate IPs for different traffic classes.]
If you chose D:
[Result: Separate subnets would fix ARP flux but require switch reconfiguration and IP changes. Large blast radius.]
Round 3: Root Cause Identification
[Pressure cue: "ARP flux resolved. Why was this not configured at provisioning time?"]
What you see: Root cause: The server was provisioned with multi-homed networking for separate application and backup traffic. The sysctl tuning for ARP behavior was not included in the provisioning playbook. Default kernel ARP settings cause flux on multi-homed hosts.
Choose your action:
- A) Add sysctl settings to the Ansible provisioning playbook
- B) Add the sysctl settings to /etc/sysctl.d/ on this server to make them persistent
- C) Document multi-homing requirements in the network architecture guide
- D) All of the above
If you chose D (recommended):
[Result: Persistent sysctl config applied, provisioning playbook updated, architecture guide documents the requirement. Proceed to Round 4.]
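The persistent config can be as small as a two-line drop-in. A minimal sketch that renders and writes one is shown here, using the values from Round 2. The filename 90-arp-flux.conf and the function names are hypothetical, and writing into /etc/sysctl.d/ would need root.

```python
from pathlib import Path

# Values taken from the fix applied in Round 2.
ARP_FLUX_SETTINGS = {
    "net.ipv4.conf.all.arp_announce": 2,
    "net.ipv4.conf.all.arp_ignore": 1,
}


def render_sysctl_dropin(settings):
    """Render settings as sysctl.d lines, one "key = value" per line."""
    lines = ["# Prevent ARP flux on multi-homed hosts"]
    lines += [f"{key} = {value}" for key, value in sorted(settings.items())]
    return "\n".join(lines) + "\n"


def write_sysctl_dropin(directory, filename="90-arp-flux.conf",
                        settings=ARP_FLUX_SETTINGS):
    """Write the drop-in and return its path. The filename is hypothetical."""
    path = Path(directory) / filename
    path.write_text(render_sysctl_dropin(settings))
    return path


print(render_sysctl_dropin(ARP_FLUX_SETTINGS), end="")
```

After writing the file under /etc/sysctl.d/, running sysctl --system reloads all drop-ins without a reboot. The same two keys are what the provisioning playbook would need to set.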
If you chose A:
[Result: Future servers get it right but this server still needs the persistent config.]
If you chose B:
[Result: Fixes this server but other multi-homed servers may have the same issue.]
If you chose C:
[Result: Documentation helps but does not fix existing servers.]
Round 4: Remediation
[Pressure cue: "ARP stable. Verify and close."]
Actions:
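1. Verify ARP is stable: the switch's ARP table (or arp -n on a neighboring Linux host) shows a single MAC per IP
2. Verify sysctl settings are active and persistent: sysctl net.ipv4.conf.all.arp_announce net.ipv4.conf.all.arp_ignore returns 2 and 1, and the same values are present in a file under /etc/sysctl.d/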
3. Test failover by disconnecting each NIC briefly
4. Audit other multi-homed servers for the same issue
5. Update the provisioning playbook
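For the audit in step 4, one way to find candidates is to flag any host that has two or more interfaces in the same subnet. A minimal sketch is given here, assuming interface addresses are already available as a name-to-CIDR mapping (for example from an inventory export). The function name and the sample addresses other than 10.1.1.50 are hypothetical.

```python
import ipaddress
from collections import defaultdict


def interfaces_sharing_subnet(addresses):
    """Return {network: [interface names]} for subnets with 2+ interfaces.

    addresses maps interface name -> address in CIDR form,
    for example "10.1.1.50/24".
    """
    by_network = defaultdict(list)
    for name, cidr in addresses.items():
        network = ipaddress.ip_interface(cidr).network
        by_network[str(network)].append(name)
    return {net: sorted(names) for net, names in by_network.items()
            if len(names) > 1}


# Hypothetical inventory entry; eth1's address and eth2 are invented.
host = {
    "eth0": "10.1.1.50/24",
    "eth1": "10.1.1.51/24",
    "eth2": "192.168.10.5/24",
}
print(interfaces_sharing_subnet(host))
```

Any host with a non-empty result is a candidate for the same arp_announce and arp_ignore settings; hosts with one interface per subnet are not affected by this particular issue.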
Damage Report
- Total downtime: 0 (intermittent connectivity, not full outage)
- Blast radius: Database server intermittently unreachable; application retries masked some failures
- Optimal resolution time: 10 minutes (identify ARP flux -> set sysctl -> verify)
- If every wrong choice was made: 60+ minutes with NIC shutdowns and VLAN reconfigurations
Cross-References
- Primer: ARP
- Primer: Networking
- Primer: Linux Ops
- Footguns: Networking