Anti-Primer: DNS Ops

Everything that can go wrong, will — and in this story, it does.

The Setup

An on-call engineer is responding to a DNS outage affecting external customers. The primary DNS server is returning SERVFAIL for the company's main domain. Under pressure, the engineer starts editing zone files directly.

The Timeline

Hour 0: Forgetting to Increment Serial

The engineer edits the zone file but does not increment the SOA serial number. The deadline was looming, and this seemed like the fastest path forward. Secondaries compare the SOA serial to decide whether to transfer the zone, so they do not pick up the change, and clients get inconsistent answers depending on which server responds.

Footgun #1: Forgetting to Increment Serial. The engineer edits the zone file but does not increment the SOA serial number, so secondary DNS servers do not pick up the change and answers are inconsistent depending on which server responds.
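A date-based serial can be computed rather than typed by hand. The sketch below assumes the common YYYYMMDDnn convention; the function name next_serial is hypothetical and not part of BIND or the primer.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next SOA serial in YYYYMMDDnn form, always greater than current."""
    # Serial range for today starts at YYYYMMDD00, e.g. 2024-05-01 -> 2024050100.
    base = int(today.strftime("%Y%m%d")) * 100
    if current < base:
        # First edit of the day: use revision 01.
        return base + 1
    # Already edited today (or the serial is ahead of the date): just add one,
    # so the value never moves backwards.
    return current + 1
```

Whatever convention is used, the property that matters is that the new serial is strictly greater than the old one, because that is what prompts secondaries to transfer the zone.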

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Syntax Error in Zone File

The engineer omits the trailing dot at the end of an FQDN in the zone file. Under time pressure, the team chose speed over caution. As a result, BIND treats the name as relative and appends the zone origin, interpreting host.example.com as host.example.com.example.com; records point to the wrong hosts.

Footgun #2: Syntax Error in Zone File. The engineer omits the trailing dot at the end of an FQDN in the zone file, so BIND interprets host.example.com as host.example.com.example.com and records point to the wrong hosts.
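The trailing-dot rule is easier to remember once the expansion is spelled out. The following is a minimal sketch of how a zone-file name is qualified against the zone origin; the function qualify is hypothetical and only mimics the rule, it is not BIND's parser.

```python
def qualify(name: str, origin: str) -> str:
    """Expand a zone-file name against the origin (origin must end with a dot)."""
    if name == "@":
        # "@" is shorthand for the origin itself.
        return origin
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: leave it alone.
        return name
    # No trailing dot: the name is relative, so the origin is appended.
    return f"{name}.{origin}"
```

With an origin of example.com., qualify("host.example.com", "example.com.") yields host.example.com.example.com., while the dotted form host.example.com. is returned unchanged.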

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Reloading Without Validation

The engineer runs rndc reload without checking the zone file syntax first. Nobody pushed back because the shortcut looked harmless in the moment. As a result, a syntax error causes the entire zone to fail to load, and all records for the domain return SERVFAIL.

Footgun #3: Reloading Without Validation. The engineer runs rndc reload without checking the zone file syntax first, so a syntax error causes the entire zone to fail to load and all records for the domain return SERVFAIL.
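One way to make validation hard to skip is to wrap the reload so that it only runs after the checks pass. This is a sketch under assumptions: the wrapper safe_reload and its run parameter are hypothetical, and it assumes named-checkconf, named-checkzone, and rndc are on the PATH.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def safe_reload(zone: str, zone_file: str, run: Runner = _run) -> bool:
    """Reload a zone only if the config and zone file both validate."""
    checks = [
        ["named-checkconf"],                     # validate named.conf
        ["named-checkzone", zone, zone_file],    # validate the zone file
    ]
    for cmd in checks:
        if run(cmd) != 0:
            # A check failed: stop here and leave the running zone untouched.
            return False
    return run(["rndc", "reload", zone]) == 0
```

The runner is injectable so the ordering can be exercised without a real BIND installation; in production the default runner would be used, for example safe_reload("example.com", "/path/to/zone/file") with the real zone file path.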

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Deleting the Wrong Record

The engineer deletes an A record, thinking it is unused; it is the MX target. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, all incoming email bounces for 4 hours until someone notices that the MX record points to a missing A record.

Footgun #4: Deleting the Wrong Record. The engineer deletes an A record, thinking it is unused; it is the MX target, so all incoming email bounces for 4 hours until someone notices that the MX record points to a missing A record.
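A reverse-dependency check can be as simple as searching the zone for records whose target is the name about to be deleted. The sketch below assumes the zone has already been parsed into fully qualified (name, type, rdata) tuples; the Record type and the dependents function are hypothetical helpers, and the list of target-bearing types is deliberately short.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str    # owner name, fully qualified
    rtype: str   # record type, e.g. "A" or "MX"
    rdata: str   # record data as written in the zone file


# Record types whose rdata ends with a target hostname (not exhaustive).
TARGET_TYPES = {"MX", "CNAME", "NS", "SRV"}


def dependents(records: List[Record], name: str) -> List[Record]:
    """Return the records that point at `name` and would break if it vanished."""
    wanted = name.lower()
    found = []
    for rec in records:
        if rec.rtype not in TARGET_TYPES:
            continue
        fields = rec.rdata.split()
        # The target hostname is the last field, e.g. "10 mail.example.com.".
        if fields and fields[-1].lower() == wanted:
            found.append(rec)
    return found
```

An empty result is a necessary condition for a safe delete, not a sufficient one: other zones, application configs, or external records may still reference the name.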

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

# | Mistake | Consequence | Could Have Been Prevented By
1 | Forgetting to Increment Serial | Secondary DNS servers do not pick up the change; inconsistent answers depending on which server responds | Primer: Always increment the serial number when editing zone files; use date-based serials
2 | Syntax Error in Zone File | BIND interprets host.example.com as host.example.com.example.com; records point to wrong hosts | Primer: Always use trailing dots for FQDNs in zone files; run named-checkzone before reloading
3 | Reloading Without Validation | Syntax error causes the entire zone to fail loading; all records for the domain return SERVFAIL | Primer: Run named-checkzone and named-checkconf before any reload
4 | Deleting the Wrong Record | All incoming email bounces for 4 hours until someone notices the MX record points to a missing A record | Primer: Check reverse dependencies before deleting any DNS record

Damage Report

  • Downtime: 1-4 hours of failed or inconsistent name resolution, plus 4 hours of bounced incoming email
  • Data loss: None directly, but dependent services may lose in-flight data
  • Customer impact: Timeouts, connection failures, or services under the domain becoming completely unreachable
  • Engineering time to remediate: 8-16 engineer-hours
  • Reputation cost: Team credibility damaged; possible SLA credits to customers

What the Primer Teaches

  • Footgun #1: The primer's section on forgetting to increment the serial teaches: always increment the serial number when editing zone files, and use date-based serials.
  • Footgun #2: The primer's section on syntax errors in zone files teaches: always use trailing dots for FQDNs in zone files, and run named-checkzone before reloading.
  • Footgun #3: The primer's section on reloading without validation teaches: run named-checkzone and named-checkconf before any reload.
  • Footgun #4: The primer's section on deleting the wrong record teaches: check reverse dependencies before deleting any DNS record.

Cross-References