Anti-Primer: ARP

Everything that can go wrong, will — and in this story, it does.

The Setup

A network engineer is troubleshooting intermittent connectivity in a datacenter VLAN. Servers sporadically lose connection to the gateway. The engineer suspects ARP issues but starts making changes without fully understanding the ARP table state.

The Timeline

Hour 0: Static ARP Entry Typo

The engineer adds a static ARP entry for the gateway with the wrong MAC address. The deadline was looming, and this seemed like the fastest path forward. As a result, all traffic from the server is sent to a nonexistent MAC, and the server is completely isolated from the network.

Footgun #1: Static ARP Entry Typo. Adding a static ARP entry with the wrong MAC address for the gateway sends all traffic from the server to a nonexistent MAC, causing complete network isolation.
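A pre-flight check of the kind that would have caught this typo can be sketched in a few lines. This is a minimal illustration, not a real tool: the function names, the `observed` mapping (assumed to have been collected beforehand, for example from arp -a output or the switch MAC table), and the addresses are all hypothetical.

```python
import re

# Six colon-separated hex octets, lower-case.
MAC_RE = re.compile(r"[0-9a-f]{2}(:[0-9a-f]{2}){5}")


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and convert '-' separators to ':'; raise if malformed."""
    candidate = mac.strip().lower().replace("-", ":")
    if not MAC_RE.fullmatch(candidate):
        raise ValueError(f"malformed MAC address: {mac!r}")
    return candidate


def check_static_entry(ip: str, proposed_mac: str, observed: dict[str, str]) -> bool:
    """Return True only if proposed_mac matches the MAC already observed for ip.

    `observed` is a hypothetical IP -> MAC mapping gathered before the change.
    """
    if ip not in observed:
        return False  # nothing to verify against, so do not add the entry blindly
    return normalize_mac(proposed_mac) == normalize_mac(observed[ip])


# Hypothetical values for illustration only.
observed = {"10.0.0.1": "00:1A:2B:3C:4D:5E"}
print(check_static_entry("10.0.0.1", "00-1a-2b-3c-4d-5e", observed))  # True
print(check_static_entry("10.0.0.1", "00:1a:2b:3c:4d:5f", observed))  # False (typo in last octet)
```

Refusing to proceed when there is no observation to compare against is a deliberate choice here: a static entry overrides dynamic learning, so an unverified one is the riskiest kind.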

Nobody notices yet. The engineer moves on to the next task.

Hour 1: ARP Cache Timeout Too Long

The engineer sets the ARP timeout to 24 hours to 'reduce broadcast traffic'. Under time pressure, the team chose speed over caution. As a result, when a NIC is replaced, the old MAC stays cached for up to a full day and the server is unreachable.

Footgun #2: ARP Cache Timeout Too Long. Setting the ARP timeout to 24 hours to 'reduce broadcast traffic' means that when a NIC is replaced, the old MAC stays cached for up to a full day and the server is unreachable.
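The effect of the timeout can be illustrated with a toy cache in which an entry simply expires a fixed time after it was learned. This is a simplified model under that assumption; real network stacks track neighbor state in more nuanced ways, and the `ArpCache` class and the addresses below are hypothetical.

```python
from typing import Optional


class ArpCache:
    """Toy ARP cache: an entry expires `timeout_s` seconds after it was learned."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._entries: dict[str, tuple[str, float]] = {}

    def learn(self, ip: str, mac: str, now: float) -> None:
        """Record (or refresh) the MAC for ip at time `now`."""
        self._entries[ip] = (mac, now)

    def lookup(self, ip: str, now: float) -> Optional[str]:
        """Return the cached MAC, or None if absent or expired.

        None stands for "would send a fresh ARP request" in this model.
        """
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at >= self.timeout_s:
            del self._entries[ip]
            return None
        return mac


# Hypothetical scenario: MAC learned at t=0, NIC replaced at t=600 seconds.
old_mac, replaced_at = "00:1a:2b:3c:4d:5e", 600.0
for timeout_s in (300.0, 24 * 3600.0):
    cache = ArpCache(timeout_s)
    cache.learn("10.0.0.20", old_mac, now=0.0)
    print(timeout_s, cache.lookup("10.0.0.20", now=replaced_at))
```

With the 5-minute timeout the lookup returns None, so the host would re-resolve and find the new NIC. With the 24-hour timeout it still returns the old MAC, and in this model it would keep doing so until the entry ages out.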

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Gratuitous ARP Ignored

The engineer disables gratuitous ARP acceptance on a failover cluster. Nobody pushed back because the shortcut looked harmless in the moment. As a result, after failover, the new IP-to-MAC mapping announced by the standby server is not learned, and traffic keeps going to the dead primary.

Footgun #3: Gratuitous ARP Ignored. Disabling gratuitous ARP acceptance on a failover cluster means that after failover, the standby server's new IP-to-MAC mapping is not learned and traffic keeps going to the dead primary.
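For readers unfamiliar with what is being ignored here: a gratuitous ARP is an ARP packet in which the sender and target protocol addresses are the same, broadcast so that neighbors update their caches. The sketch below builds one as raw bytes and recognizes it again. It is illustrative only and does not send anything on the wire. The request form with a zeroed target hardware address is one common choice, and the function names and addresses are hypothetical.

```python
import socket
import struct

BROADCAST = b"\xff" * 6


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    ip_bytes = socket.inet_aton(ip)
    eth_header = BROADCAST + mac + struct.pack("!H", 0x0806)  # EtherType: ARP
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request
        mac,          # sender hardware address (the new owner of the IP)
        ip_bytes,     # sender protocol address
        b"\x00" * 6,  # target hardware address (zeroed in this sketch)
        ip_bytes,     # target protocol address equals sender: gratuitous
    )
    return eth_header + arp_payload


def is_gratuitous(frame: bytes) -> bool:
    """True if frame is an ARP packet whose sender and target IPs match."""
    if len(frame) < 42 or frame[12:14] != b"\x08\x06":
        return False
    sender_ip = frame[28:32]  # 14-byte Ethernet header + 14 bytes into ARP
    target_ip = frame[38:42]
    return sender_ip == target_ip


# Hypothetical standby server announcing that it now owns the cluster IP.
frame = build_gratuitous_arp(bytes.fromhex("001a2b3c4d5e"), "10.0.0.10")
print(len(frame), is_gratuitous(frame))  # 42 True
```

A host that drops such frames keeps whatever mapping it already had, which in the failover case is the MAC of the dead primary.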

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: ARP Flood from Misconfigured Host

A misconfigured container sends ARP replies for IPs it does not own. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, ARP tables across the VLAN are poisoned and traffic between servers is misdirected.

Footgun #4: ARP Flood from Misconfigured Host. A misconfigured container sending ARP replies for IPs it does not own poisons ARP tables across the VLAN and misdirects traffic between servers.
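One symptom of this kind of poisoning is a single IP being claimed by more than one MAC. A rough way to spot that in a batch of observed ARP replies is sketched below. It assumes the (IP, MAC) pairs have already been captured somehow; the function name and the sample data are hypothetical, and a check like this is a diagnostic aid rather than a substitute for enforcement on the switch.

```python
from collections import defaultdict
from typing import Iterable


def find_conflicts(observations: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Return IPs that were claimed by more than one MAC, with the MACs sorted."""
    claims: dict[str, set[str]] = defaultdict(set)
    for ip, mac in observations:
        claims[ip].add(mac.lower())  # compare MACs case-insensitively
    return {ip: sorted(macs) for ip, macs in claims.items() if len(macs) > 1}


# Hypothetical capture: the last reply comes from the misconfigured container.
seen = [
    ("10.0.0.1", "00:1a:2b:3c:4d:5e"),
    ("10.0.0.20", "00:1a:2b:aa:bb:cc"),
    ("10.0.0.1", "00:1A:2B:3C:4D:5E"),  # same MAC, different case: not a conflict
    ("10.0.0.20", "02:42:ac:11:00:02"),
]
print(find_conflicts(seen))
# {'10.0.0.20': ['00:1a:2b:aa:bb:cc', '02:42:ac:11:00:02']}
```

A conflict is not always an attack or a bug, since a legitimate failover also moves an IP to a new MAC, so the output is a list of things to investigate rather than a verdict.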

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Static ARP Entry Typo | All traffic from the server is sent to a nonexistent MAC; complete network isolation | Primer: Verify MAC addresses from arp -a or switch MAC table before adding static entries |
| 2 | ARP Cache Timeout Too Long | When a NIC is replaced, the old MAC is cached for a full day; server is unreachable | Primer: Keep ARP timeout at reasonable defaults (minutes, not hours) |
| 3 | Gratuitous ARP Ignored | After failover, the standby server's new IP-to-MAC mapping is not learned; traffic goes to the dead primary | Primer: Enable gratuitous ARP on networks with failover or VRRP |
| 4 | ARP Flood from Misconfigured Host | ARP tables across the VLAN are poisoned; traffic between servers is misdirected | Primer: Enable Dynamic ARP Inspection (DAI) on the switch; use ARP filtering on hosts |

Damage Report

  • Downtime: 1-4 hours of connectivity loss or degraded throughput
  • Data loss: None directly, but dependent services may lose in-flight data
  • Customer impact: Timeouts, connection failures, or complete network unreachability
  • Engineering time to remediate: 8-16 engineer-hours including physical layer verification
  • Reputation cost: Network team credibility damaged; possible SLA credits to internal customers

What the Primer Teaches

  • Footgun #1: The primer's section on Static ARP Entry Typo would have taught the engineer to verify MAC addresses from arp -a or the switch MAC table before adding static entries.
  • Footgun #2: The primer's section on ARP Cache Timeout Too Long would have taught the engineer to keep the ARP timeout at reasonable defaults (minutes, not hours).
  • Footgun #3: The primer's section on Gratuitous ARP Ignored would have taught the engineer to enable gratuitous ARP on networks with failover or VRRP.
  • Footgun #4: The primer's section on ARP Flood from Misconfigured Host would have taught the engineer to enable Dynamic ARP Inspection (DAI) on the switch and use ARP filtering on hosts.

Cross-References