Skip to content

Incident Replay: DHCP Relay Broken

Setup

  • System context: Corporate network with centralized DHCP server. Remote office VLAN clients cannot obtain IP addresses after a router upgrade. DHCP relay agent on the router forwards requests to the central DHCP server.
  • Time: Monday 08:00 UTC
  • Your role: Network engineer

Round 1: Alert Fires

[Pressure cue: "Remote office users reporting 'No network connection.' All 50 workstations in VLAN 200 cannot get DHCP leases. Office is dead. Helpdesk overwhelmed."]

What you see: Workstations show "APIPA address" (169.254.x.x). The DHCP server log shows no DISCOVER packets from VLAN 200. Other VLANs are working fine.

Choose your action: - A) Restart the DHCP server - B) Check the DHCP relay configuration on the router for VLAN 200 - C) Assign static IPs to workstations as a workaround - D) Check if VLAN 200 is configured correctly on the switch

[Result: show running-config interface vlan200 reveals the ip helper-address command is missing. The router upgrade replaced the config and the DHCP relay for VLAN 200 was not included in the new config. Other VLANs still have their helper-address because they were in a different config section that was preserved. Proceed to Round 2.]

If you chose A:

[Result: DHCP server is healthy — it serves other VLANs fine. Not a server issue.]

If you chose C:

[Result: Static IPs for 50 workstations would take hours and creates management overhead.]

If you chose D:

[Result: VLAN 200 is correctly configured on the switches. The L2 path is fine — the issue is L3 relay.]

Round 2: First Triage Data

[Pressure cue: "DHCP relay missing on VLAN 200. Fix it."]

What you see: The router upgrade 3 days ago merged the new config with the old one. The ip helper-address 10.0.0.10 line was dropped from the VLAN 200 SVI during the merge. Existing leases expired over the weekend, and users noticed Monday morning.

Choose your action: - A) Add the ip helper-address back to the VLAN 200 SVI - B) Add the helper-address and also verify all other VLANs still have theirs - C) Roll back the entire router config to the pre-upgrade version - D) Set up a local DHCP server at the remote office instead

[Result: interface vlan200 -> ip helper-address 10.0.0.10. Applied. Also audit all VLANs — VLAN 200 was the only one affected. Workstations begin obtaining leases within 30 seconds. Proceed to Round 3.]

If you chose A:

[Result: Fixes VLAN 200 but you should verify other VLANs too.]

If you chose C:

[Result: Rolling back loses the upgrade benefits (security patches, bug fixes). Overkill for a one-line config fix.]

If you chose D:

[Result: Local DHCP is complex — split-scope, conflict detection, management overhead. Not justified.]

Round 3: Root Cause Identification

[Pressure cue: "Office is back online. Document the failure."]

What you see: Root cause: Router config merge during the upgrade dropped the ip helper-address from VLAN 200's SVI. The pre-upgrade config backup was taken but the post-upgrade config was not diff'd against the backup to verify completeness.

Choose your action: - A) Add a post-upgrade config diff check to the upgrade procedure - B) Use a config management tool (RANCID/Oxidized) to track and alert on config changes - C) Add DHCP lease monitoring to detect relay failures faster - D) All of the above

[Result: Config diff catches missing lines. Config management tracks all changes. DHCP monitoring detects lease failures within minutes. Proceed to Round 4.]

If you chose A:

[Result: Manual diff is good but automated comparison is better.]

If you chose B:

[Result: Config management catches changes but might not flag a missing line as anomalous.]

If you chose C:

[Result: Faster detection but does not prevent the configuration error.]

Round 4: Remediation

[Pressure cue: "Office working. Close the incident."]

Actions: 1. Verify all 50 workstations have DHCP leases: check DHCP server lease table 2. Verify DHCP relay is configured on all required VLANs 3. Take a post-fix config backup 4. Add config diff check to the router upgrade runbook 5. Deploy DHCP lease monitoring for all remote offices

Damage Report

  • Total downtime: ~3 hours Monday morning (users arrived to broken networking)
  • Blast radius: 50 workstations at one remote office; entire office unproductive
  • Optimal resolution time: 5 minutes (check relay config -> add helper-address -> verify)
  • If every wrong choice was made: 4+ hours with server restarts and static IP assignments

Cross-References