ARP - Street-Level Ops

Real-world ARP diagnosis and resolution workflows for production environments.

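Under the hood: ARP maps IPv4 addresses (L3) to MAC addresses (L2). Before sending to a destination on the local subnet, the kernel needs that destination's MAC; if the neighbor table (ip neigh show) has no usable entry, it broadcasts an ARP request first. Resolved entries start as REACHABLE, age to STALE, and on next use pass through DELAY and PROBE, ending either back at REACHABLE (confirmed) or at FAILED (no answer).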

Task: Host Cannot Reach Gateway on Same Subnet

# Check if gateway MAC is resolved
$ ip neigh show 10.0.0.1
10.0.0.1 dev eth0  FAILED

# ARP resolution is failing — send a manual request
$ arping -c 3 -I eth0 10.0.0.1
ARPING 10.0.0.1 from 10.0.0.50 eth0
Unicast reply from 10.0.0.1 [aa:bb:cc:dd:ee:ff]  1.023ms
Unicast reply from 10.0.0.1 [aa:bb:cc:dd:ee:ff]  0.987ms

# Gateway is responding: flush cached entries on eth0 (including the FAILED one) and retry
$ ip neigh flush dev eth0
$ ping -c 1 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.5 ms

Remember: ARP neighbor states mnemonic: R-S-D-F-P (Reachable, Stale, Delay, Failed, Permanent). A healthy entry cycles R -> S, then on next use is either re-confirmed back to R (via Delay) or expires to F. You may also see INCOMPLETE (first request still unanswered) and PROBE (re-confirmation in progress). If you see lots of FAILED entries, ARP requests are going unanswered: check cabling, VLAN assignment, or whether the target host is actually up.
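Sketch: to judge whether "lots of FAILED entries" applies, it can help to tally the table by state. The helper below is a minimal sketch, not a standard tool: count_neigh_states is a hypothetical name, and it assumes the state is the last field of each ip neigh show line.

```shell
# Tally neighbor entries by state.
# Reads `ip neigh show` style lines on stdin; assumes the state is the last field.
count_neigh_states() {
    awk 'NF { count[$NF]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Live usage (assumes iproute2 is installed):
#   ip -4 neigh show | count_neigh_states
```

A table dominated by FAILED points at unanswered requests; a mix of REACHABLE and STALE is normal.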

Task: Detect Duplicate IP Addresses

# Check if an IP is already in use before assigning it
$ arping -D -c 3 -I eth0 10.0.0.25
ARPING 10.0.0.25 from 0.0.0.0 eth0
Unicast reply from 10.0.0.25 [de:ad:be:ef:00:01]  0.842ms
Received 1 response(s)

# Exit code 1 = duplicate detected. Do not assign this IP.
$ echo $?
1

Gotcha: DHCP servers can hand out an IP that is already statically assigned. Always run arping -D before assigning static IPs, and reserve static ranges in your DHCP scope to prevent overlap.
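Sketch: the arping -D check is easy to wrap so provisioning scripts cannot skip it. This is a sketch under assumptions: check_ip_free is a hypothetical helper name, and the exit-code mapping (0 = no reply, 1 = reply received) is the iputils arping behavior shown above; other arping implementations may differ.

```shell
# Probe an address with duplicate address detection before assigning it.
# Usage: check_ip_free <interface> <ip>
# Assumes iputils arping: -D exits 0 when nobody answered, 1 when a reply arrived.
check_ip_free() {
    rc=0
    arping -D -c 3 -I "$1" "$2" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0) echo "free: $2" ;;
        1) echo "in use: $2" ;;
        *) echo "error: arping exited with $rc" ;;
    esac
    return "$rc"
}

# Live usage (usually needs root or CAP_NET_RAW):
#   check_ip_free eth0 10.0.0.25 && echo "safe to assign"
```

Treating any other exit code as an error keeps a missing interface or missing privileges from being mistaken for a free address.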

Task: Watch ARP Traffic During Failover

# Monitor ARP in real-time during a keepalived failover test
$ tcpdump -i eth0 -nn arp
15:04:22.001 ARP, Reply 10.0.0.100 is-at aa:bb:cc:00:11:22, length 28
15:04:22.002 ARP, Reply 10.0.0.100 is-at aa:bb:cc:00:11:22, length 28

# Gratuitous ARP from new VRRP master — VIP 10.0.0.100 now on new MAC

Debug clue: If you see two different MACs replying for the same IP in tcpdump -nn arp, you have either a duplicate IP or a split-brain failover. Check both hosts immediately — one of them should not own that IP.
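Sketch: spotting two MACs for one IP by eye is unreliable in a busy capture. The filter below is a minimal sketch (find_arp_conflicts is a hypothetical name) that assumes the tcpdump -nn arp text format shown above and only looks at Reply lines, so announcements sent as ARP requests are not counted.

```shell
# Report IPs that were claimed by more than one MAC.
# Reads `tcpdump -nn arp` text on stdin; matches "Reply <ip> is-at <mac>".
find_arp_conflicts() {
    awk '
        {
            for (i = 1; i <= NF - 3; i++) {
                if ($i == "Reply" && $(i + 2) == "is-at") {
                    ip = $(i + 1)
                    mac = $(i + 3)
                    sub(/,$/, "", mac)          # drop the trailing comma
                    key = ip SUBSEP mac
                    if (!(key in seen)) {       # count each ip/mac pair once
                        seen[key] = 1
                        macs[ip] = macs[ip] " " mac
                        count[ip]++
                    }
                }
            }
        }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip ":" macs[ip]
        }
    '
}

# Live usage (needs root; -c bounds the capture so the summary prints):
#   tcpdump -i eth0 -nn arp -c 200 2>/dev/null | find_arp_conflicts
```

During a planned failover, one MAC change for the VIP is expected; the capture only signals trouble if both MACs keep answering afterwards.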

Task: Force ARP Update After NIC Replacement

# Server got a new NIC — neighbors still have old MAC cached
# Send gratuitous ARP to update all neighbors
$ arping -U -c 5 -I eth0 10.0.0.50
ARPING 10.0.0.50 from 10.0.0.50 eth0
Sent 5 probes

# Verify from another host
$ ip neigh show 10.0.0.50
10.0.0.50 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE

Task: Diagnose ARP Table Overflow

# Symptoms: random connectivity drops, dmesg shows errors
$ dmesg | grep -i neighbour
[42098.123] neighbour table overflow

# Check current limits
$ sysctl net.ipv4.neigh.default.gc_thresh3
net.ipv4.neigh.default.gc_thresh3 = 1024

# Check how many entries exist
$ ip neigh show | wc -l
1019

# Increase limits for large flat network
$ sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
$ sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
$ sysctl -w net.ipv4.neigh.default.gc_thresh3=16384

# Make persistent
$ cat >> /etc/sysctl.d/99-arp.conf <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384
EOF

Scale note: Flat L2 networks with thousands of hosts (common in legacy data centers) hit ARP table limits regularly. The default gc_thresh3=1024 is fine for small networks but too low for /16 subnets. Kubernetes nodes in large clusters also hit this, because each pod IP the node resolves at L2 can add a neighbor entry.

Under the hood: The kernel neighbor garbage collector has three thresholds (from man 7 arp and kernel docs): gc_thresh1=128 (minimum entries to keep; GC does not run below this), gc_thresh2=512 (soft max; GC lets the table exceed this for 5 seconds before collecting), gc_thresh3=1024 (hard max; GC always runs above this, and the overflow message appears when no room can be freed for a new entry). On Kubernetes nodes with many pods, set all three higher: 4096/8192/16384 is common for clusters with 100+ pods per node.
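Sketch: the overflow is easier to catch before it happens by comparing the entry count with the hard limit. The helper below is a sketch only: neigh_usage is a hypothetical name and the 80% warning level is an arbitrary assumption, not a kernel value.

```shell
# Print neighbor table usage as a percentage of the hard limit.
# Usage: neigh_usage <entry_count> <gc_thresh3>
# The 80% warning level is an assumption chosen for illustration.
neigh_usage() {
    pct=$(( $1 * 100 / $2 ))
    if [ "$pct" -ge 80 ]; then
        echo "WARN ${pct}% of gc_thresh3 ($1/$2)"
    else
        echo "OK ${pct}% of gc_thresh3 ($1/$2)"
    fi
}

# Live usage (assumes iproute2 and procps sysctl; IPv4 only, since IPv6
# has its own net.ipv6.neigh thresholds):
#   neigh_usage "$(ip -4 neigh show | wc -l)" \
#               "$(sysctl -n net.ipv4.neigh.default.gc_thresh3)"
```

With the numbers from the task above, 1019 entries against a limit of 1024 reports 99%.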

Task: Add Static ARP Entry for Critical Gateway

# Pin the gateway MAC to blunt ARP spoofing
# (replace also works when a dynamic entry already exists; add would fail with "File exists")
$ ip neigh replace 10.0.0.1 lladdr aa:bb:cc:dd:ee:ff dev eth0 nud permanent
$ ip neigh show 10.0.0.1
10.0.0.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff PERMANENT

# Verify it survives ARP flush
$ ip neigh flush dev eth0
$ ip neigh show 10.0.0.1
10.0.0.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff PERMANENT

Interview tip: "How does a host communicate on a local subnet?" Strong answer: the sender checks if the destination IP is in the same subnet, then broadcasts an ARP request ("who has 10.0.0.5?"), the target responds with its MAC, the sender caches the MAC in its neighbor table, and frames are sent directly at L2. Mentioning the ARP cache TTL and gratuitous ARP for failover shows depth.
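Sketch: the first step in that answer, deciding whether the destination is on the same subnet, is a mask-and-compare. The functions below are a minimal illustration in POSIX shell arithmetic; ip_to_int and same_subnet are hypothetical names, and the kernel does this through its routing table rather than anything like this script.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if both addresses fall inside the same prefix.
# Usage: same_subnet <ip_a> <ip_b> <prefix_len>
same_subnet() {
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Same /24: ARP for the destination itself.
if same_subnet 10.0.0.50 10.0.0.5 24; then echo "ARP for 10.0.0.5"; fi
# Different /24: ARP for the gateway instead.
if ! same_subnet 10.0.0.50 10.0.1.5 24; then echo "ARP for the gateway"; fi
```

When the check fails, the sender resolves the gateway's MAC and hands the frame to it, which is why a FAILED gateway entry breaks all off-subnet traffic.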

Task: Debug ARP on Multi-Homed Host

# Host has eth0 (10.0.0.5) and eth1 (10.0.1.5)
# ARP responses are coming from the wrong interface

# Check current arp_ignore setting
$ sysctl net.ipv4.conf.all.arp_ignore
net.ipv4.conf.all.arp_ignore = 0

# Fix: only respond on the interface that owns the IP
$ sysctl -w net.ipv4.conf.all.arp_ignore=1
$ sysctl -w net.ipv4.conf.all.arp_announce=2

# Verify from remote host — ARP reply now comes from correct interface
$ arping -c 1 -I eth0 10.0.0.5
Unicast reply from 10.0.0.5 [correct:mac:here:00:00:01]  0.5ms

Task: Investigate Intermittent Connectivity (Stale ARP)

# Application reports intermittent timeouts to 10.0.0.20
$ ip neigh show 10.0.0.20
10.0.0.20 dev eth0 lladdr ff:ee:dd:cc:bb:aa STALE

# MAC might be stale after a VM migration — force re-resolve
$ ip neigh del 10.0.0.20 dev eth0
$ ping -c 1 10.0.0.20
64 bytes from 10.0.0.20: icmp_seq=1 ttl=64 time=0.6 ms

$ ip neigh show 10.0.0.20
10.0.0.20 dev eth0 lladdr 11:22:33:44:55:66 REACHABLE
# New MAC — the VM migrated to a different hypervisor
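Sketch: the before/after comparison above can be scripted so the MAC change is recorded rather than eyeballed. This is a minimal sketch; neigh_mac and mac_changed are hypothetical helper names, and the parsing assumes the lladdr keyword appears in ip neigh show output as shown.

```shell
# Print the link-layer address from one `ip neigh show` line, or nothing
# if the entry has no lladdr (for example a FAILED entry).
neigh_mac() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit } }'
}

# Compare the MAC seen before and after forcing re-resolution.
# Usage: mac_changed <old_mac> <new_mac>
mac_changed() {
    if [ "$1" != "$2" ]; then
        echo "MAC changed: $1 -> $2"
    else
        echo "MAC unchanged: $1"
    fi
}

# Live usage (assumes iproute2; del needs root):
#   old=$(ip neigh show 10.0.0.20 dev eth0 | neigh_mac)
#   ip neigh del 10.0.0.20 dev eth0 && ping -c 1 10.0.0.20 >/dev/null
#   new=$(ip neigh show 10.0.0.20 dev eth0 | neigh_mac)
#   mac_changed "$old" "$new"
```

An unchanged MAC suggests the timeouts have another cause, since STALE on its own is a normal state.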

Emergency: ARP Storm Flooding Network

# High broadcast traffic — suspect ARP storm
$ tcpdump -i eth0 -nn arp -c 100 | head -20
# Hundreds of ARP requests per second for same IP

# Check if proxy ARP is enabled (common cause)
$ sysctl net.ipv4.conf.eth0.proxy_arp
net.ipv4.conf.eth0.proxy_arp = 1

# Disable proxy ARP if not intentional
$ sysctl -w net.ipv4.conf.eth0.proxy_arp=0
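Sketch: to confirm that the requests really concentrate on one IP, count them per target instead of reading the raw capture. The filter below is a minimal sketch (count_arp_requests is a hypothetical name) that assumes tcpdump's "Request who-has <ip> tell <ip>" text format.

```shell
# Count ARP requests per target IP.
# Reads `tcpdump -nn arp` text on stdin; prints "<count> <target-ip>",
# busiest target first.
count_arp_requests() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "who-has") {
                    target = $(i + 1)
                    sub(/,$/, "", target)
                    hits[target]++
                    break
                }
            }
        }
        END { for (ip in hits) print hits[ip], ip }
    ' | sort -rn
}

# Live usage (needs root; -c bounds the capture):
#   tcpdump -i eth0 -nn arp -c 500 2>/dev/null | count_arp_requests | head -5
```

One target far ahead of the rest usually means many hosts are chasing an address that never answers, or one host is stuck re-requesting it.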

Default trap: proxy_arp defaults to 0, but some VPN and container networking setups enable it. If someone turned it on to "fix" a routing issue and forgot, it will answer ARP requests for IPs it does not own, causing misdirected traffic.
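Sketch: because the setting is per interface, a forgotten proxy_arp=1 is easy to miss when only eth0 is checked. The filter below is a minimal sketch (list_proxy_arp is a hypothetical name) that assumes the "key = value" format printed by sysctl -a.

```shell
# List interfaces whose proxy_arp sysctl is 1.
# Reads `sysctl -a` style lines on stdin; the anchored pattern skips
# related keys such as proxy_arp_pvlan.
list_proxy_arp() {
    sed -n 's/^net\.ipv4\.conf\.\(.*\)\.proxy_arp = 1$/\1/p'
}

# Live usage (assumes procps sysctl):
#   sysctl -a 2>/dev/null | list_proxy_arp
```

If "all" shows up in the output, proxy ARP is effectively on for every interface, since the kernel enables it when either the all or the per-interface value is set.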
