ARP Footguns

Mistakes that cause connectivity loss, security exposure, or cascading network failures.


1. Flushing ARP on a production host without testing

You run ip neigh flush dev eth0 to fix one stale entry. Every cached MAC is gone, and for the next few seconds the host must re-resolve every destination it is talking to. On a busy server with hundreds of active connections, this causes a burst of broadcast traffic and a spike of delayed or dropped packets.

Fix: Delete the specific entry: ip neigh del 10.0.0.5 dev eth0. Only flush when you understand the blast radius.

Under the hood: After a flush, outbound packets sit queued waiting for ARP resolution until each neighbour re-resolves. Established TCP connections survive thanks to retransmits, but latency-sensitive traffic (VoIP, real-time media) will glitch.
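A minimal sketch of the targeted alternative: generate per-entry delete commands for STALE neighbours only, instead of flushing the whole table. The heredoc sample stands in for live ip neigh show dev eth0 output; the addresses and MACs are illustrative.

```shell
#!/bin/sh
# Sample output standing in for: ip neigh show dev eth0
sample='10.0.0.5 lladdr aa:bb:cc:dd:ee:01 STALE
10.0.0.6 lladdr aa:bb:cc:dd:ee:02 REACHABLE
10.0.0.7 lladdr aa:bb:cc:dd:ee:03 STALE'

# Emit a targeted "ip neigh del" for each STALE entry; review, then run them.
cmds=$(printf '%s\n' "$sample" | awk '$NF == "STALE" { print "ip neigh del " $1 " dev eth0" }')
printf '%s\n' "$cmds"
```

Reviewing the generated commands before running them keeps the blast radius to exactly the entries you meant to touch.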


2. Ignoring FAILED entries in ip neigh show

A FAILED entry means ARP resolution never completed. The host cannot reach that IP at Layer 2. You assume it is a transient glitch and move on. Meanwhile, the application retries silently, latency climbs, and the on-call page fires 30 minutes later.

Fix: Investigate immediately. Check that the target is on the same subnet, the switch port is up, and the target host is actually running.
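FAILED entries are easy to surface in a check script. A minimal sketch, with a heredoc sample standing in for live ip -4 neigh show output (pipe the real command in production):

```shell
#!/bin/sh
# Sample output standing in for: ip -4 neigh show
sample='10.0.0.5 dev eth0 lladdr aa:bb:cc:dd:ee:01 REACHABLE
10.0.0.9 dev eth0 FAILED'

# FAILED entries have no lladdr; the state is the last field either way.
failed=$(printf '%s\n' "$sample" | awk '$NF == "FAILED"' | wc -l)
if [ "$failed" -gt 0 ]; then
  echo "WARN: $failed FAILED ARP entries"
fi
```

Wiring this into cron or a monitoring check turns a silent Layer 2 failure into an alert before the application retries pile up.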


3. Not setting arp_ignore on multi-homed hosts

A Linux host with two NICs on different subnets responds to ARP requests for eth0's IP on eth1. Traffic arrives on the wrong interface, bypasses firewall rules scoped to eth0, and may be dropped by reverse path filtering.

Fix: Set sysctl net.ipv4.conf.all.arp_ignore=1 and net.ipv4.conf.all.arp_announce=2 on any multi-homed host.
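Applied immediately and persisted across reboots, that looks like the following sketch (the sysctl.d filename is illustrative):

```shell
# Apply now:
sysctl -w net.ipv4.conf.all.arp_ignore=1     # reply only if the target IP is on the receiving interface
sysctl -w net.ipv4.conf.all.arp_announce=2   # always use the best local source address in ARP requests

# Persist across reboots (filename illustrative):
cat > /etc/sysctl.d/50-arp-multihome.conf <<'EOF'
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
EOF
```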


4. Relying on proxy ARP instead of fixing routing

Proxy ARP makes things "just work" across misconfigured subnets. Hosts on 10.0.0.0/24 can reach 10.0.1.0/24 without a proper route because the router answers ARP on their behalf. Then a network redesign removes the proxy, and dozens of hosts lose connectivity with no obvious cause.

Fix: Use proper static routes or a routing protocol. Proxy ARP is a band-aid, not a solution. Document it loudly if you must use it.
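As a sketch with illustrative addresses, the proper fix for the scenario above is a real route on the 10.0.0.0/24 hosts, so they stop depending on the router answering ARP for the other subnet:

```shell
# 10.0.0.1 is assumed to be the router's address on this subnet.
ip route add 10.0.1.0/24 via 10.0.0.1 dev eth0
# Persist via your distro's network config (e.g. an ifupdown post-up line,
# a systemd-networkd [Route] section, or a NetworkManager route).
```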


5. Leaving default ARP table limits on large networks

The default gc_thresh3 is 1024 entries. On a flat /22 or larger network you exceed that easily, and the kernel starts evicting ARP entries under pressure. Hosts randomly cannot reach each other for seconds at a time. dmesg logs neighbour table overflow, but nobody checks dmesg proactively.

Fix: Increase thresholds before deployment: gc_thresh1=4096, gc_thresh2=8192, gc_thresh3=16384. Monitor ip neigh show | wc -l in your metrics.

Debug clue: If hosts on a flat network randomly lose connectivity for a few seconds at a time, check dmesg | grep neighbour. The kernel logs neighbour table overflow when gc_thresh3 is hit, but most monitoring tools do not scrape dmesg. Add a node_exporter textfile collector for this metric.
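A minimal sketch of such a textfile collector: emit the neighbour table size in Prometheus text format. The metric name is illustrative, and the heredoc sample stands in for live ip -4 neigh show output; in production, pipe the real command and redirect the output into your node_exporter textfile directory.

```shell
#!/bin/sh
# Sample output standing in for: ip -4 neigh show
sample='10.0.0.5 dev eth0 lladdr aa:bb:cc:dd:ee:01 REACHABLE
10.0.0.6 dev eth0 lladdr aa:bb:cc:dd:ee:02 STALE'

count=$(printf '%s\n' "$sample" | wc -l)
# Illustrative metric name; alert when this approaches gc_thresh3.
echo "node_arp_table_entries $count"
```

Alerting at, say, 80% of gc_thresh3 gives you warning before the kernel starts evicting entries.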


6. Forgetting gratuitous ARP after IP failover

You move a VIP to a new host manually (or your failover script skips the GARP step). The new host owns the IP, but every neighbor still has the old MAC in their cache. Traffic goes to the old host for minutes until ARP entries time out naturally.

Fix: Always send gratuitous ARP after IP moves: arping -U -c 5 -I eth0 <VIP>. Verify that your HA tool (keepalived, pacemaker) does this automatically.

Gotcha: Gratuitous ARP is also the mechanism used in ARP spoofing attacks (MITRE ATT&CK T1557.002). An attacker sends unsolicited ARP replies claiming ownership of the gateway IP. Enable Dynamic ARP Inspection (DAI) on managed switches to prevent spoofed GARPs while still allowing legitimate failover traffic.
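If keepalived manages the VIP, it sends gratuitous ARP on the MASTER transition by default; the repeat and refresh knobs make failover more robust on lossy segments. A configuration sketch with illustrative values:

```
# /etc/keepalived/keepalived.conf (fragment)
global_defs {
    vrrp_garp_master_repeat 5      # send 5 GARPs on becoming MASTER
    vrrp_garp_master_refresh 10    # keep re-sending every 10 seconds
}
```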


7. Adding a static ARP entry with the wrong MAC

You add ip neigh add 10.0.0.1 lladdr aa:bb:cc:dd:ee:ff dev eth0 nud permanent to protect the gateway. The MAC is wrong — maybe copied from the wrong host. All traffic to the gateway goes to a nonexistent MAC and is silently dropped. Because the entry is PERMANENT, the kernel never re-resolves it.

Fix: Verify the MAC before adding: arping -c 1 -I eth0 10.0.0.1. Double-check with the switch MAC table.
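Verification can be scripted so the MAC is parsed from the wire rather than pasted by hand. A minimal sketch: the heredoc sample stands in for the reply line from arping -c 1 -I eth0 10.0.0.1 (iputils arping format); addresses and MAC are illustrative.

```shell
#!/bin/sh
# Sample standing in for: arping -c 1 -I eth0 10.0.0.1
sample='Unicast reply from 10.0.0.1 [AA:BB:CC:DD:EE:01]  0.652ms'

# Extract the bracketed MAC and normalise to lowercase.
mac=$(printf '%s\n' "$sample" | sed -n 's/.*\[\(.*\)\].*/\1/p' | tr 'A-F' 'a-f')
echo "ip neigh replace 10.0.0.1 lladdr $mac dev eth0 nud permanent"
```

Printing the final command instead of running it leaves a human review step before pinning the entry.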


8. Not monitoring ARP in container environments

Calico uses proxy ARP for its routing model. If a node's proxy ARP setting gets reset (kernel upgrade, sysctl override), pods on that node lose connectivity. You troubleshoot at the pod and service level for an hour before discovering it is a Layer 2 issue.

Fix: Include ARP health in node-level monitoring. Alert on unexpected changes to proxy_arp and arp_ignore sysctls.

War story: Calico relies on proxy ARP on its cali* interfaces to answer pods' gateway ARP requests. A kernel upgrade or sysctl override that resets proxy_arp=0 on a node silently breaks pod networking on that node. The symptom looks like a CNI failure, but it is a Layer 2 issue. Check sysctl net.ipv4.conf.cali*.proxy_arp on affected nodes.
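That check can be a one-shot node script. A sketch, assuming Calico's cali* interface naming; on a node without such interfaces the glob matches nothing and the script prints nothing:

```shell
#!/bin/sh
# Flag any Calico interface where proxy_arp has been reset to 0.
warns=""
for f in /proc/sys/net/ipv4/conf/cali*/proxy_arp; do
  [ -e "$f" ] || continue            # glob did not match: skip
  if [ "$(cat "$f")" != "1" ]; then
    warns="${warns}WARN: proxy_arp disabled on $f
"
  fi
done
if [ -n "$warns" ]; then
  printf '%s' "$warns"
fi
```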


9. Assuming ARP works the same in cloud VPCs

You SSH into an AWS EC2 instance and run ip neigh show. The entries look normal, but ARP is answered by the hypervisor's virtual switch, not by real broadcast on a shared segment. You try to use arping to detect duplicates or send gratuitous ARP — it does nothing, because the VPC fabric intercepts ARP before it reaches other instances.

Fix: In cloud environments, use the provider's API for IP conflict detection and failover (Elastic IPs, Azure floating IPs). Do not rely on ARP-based tools.
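On AWS, the equivalent of a GARP-based VIP move is reassociating an Elastic IP via the API. A sketch with illustrative resource IDs:

```
# Move an Elastic IP to the standby instance (IDs illustrative):
aws ec2 associate-address \
  --allocation-id eipalloc-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 \
  --allow-reassociation
```

The --allow-reassociation flag lets the call steal the address from the old instance, which is exactly the takeover semantics GARP provides on a physical LAN.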


10. Running arping without specifying the interface

You run arping 10.0.0.1 on a multi-homed host. arping picks the wrong interface, sends the probe from the wrong source IP, and either gets no reply or gets a reply you cannot interpret. You conclude the target is down when it is actually fine.

Fix: Always specify the interface: arping -I eth0 10.0.0.1. On multi-homed hosts, the interface choice determines the source IP and the broadcast domain.