# Postmortem: DNS CNAME Chain Breaks After Load Balancer Rename
| Field | Value |
|---|---|
| ID | PM-022 |
| Date | 2025-05-07 |
| Severity | SEV-3 |
| Duration | 0h 25m (onset to resolution) |
| Time to Detect | 8m |
| Time to Mitigate | 17m |
| Customer Impact | None — only internal services were affected; all external traffic used a separate load balancer path |
| Revenue Impact | None |
| Teams Involved | Networking, Backend Platform, DevOps |
| Postmortem Author | Fatima Al-Rashid |
| Postmortem Date | 2025-05-09 |
## Executive Summary
On 2025-05-07 at 09:58 UTC, a DevOps engineer renamed an internal AWS Application Load Balancer from api-internal-legacy to api-internal as part of a naming standardization effort. The rename generated a new ELB DNS hostname; the old hostname (api-internal-legacy.us-east-1.elb.amazonaws.com) was immediately retired by AWS. Four internal services held Route53 CNAME records pointing to the old hostname. As cached DNS responses expired over the following five minutes (the records carried a 300-second TTL), all four services began receiving NXDOMAIN responses and returning HTTP 503 to internal callers. External customer traffic was unaffected because it used a separate load balancer referenced via Route53 alias records. The issue was resolved by updating the four CNAME records to point to the new ELB hostname. No data was lost and no customers were impacted.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 2025-05-07 09:58 | DevOps engineer Callum Hennessy applies Terraform change renaming ALB api-internal-legacy to api-internal; terraform apply completes successfully |
| 2025-05-07 09:59 | AWS provisions new ELB DNS name: api-internal.us-east-1.elb.amazonaws.com; old name api-internal-legacy.us-east-1.elb.amazonaws.com ceases to resolve |
| 2025-05-07 09:59 | Four internal CNAME records in Route53 continue to point to the old ELB hostname; DNS TTL on those records is 300 seconds |
| 2025-05-07 10:04 | DNS TTL for approximately half of internal resolvers expires; those resolvers begin returning NXDOMAIN for queries to api-internal-legacy.us-east-1.elb.amazonaws.com |
| 2025-05-07 10:06 | Internal service billing-aggregator begins logging getaddrinfo ENOTFOUND api-internal-legacy.us-east-1.elb.amazonaws.com; returns 503 to callers |
| 2025-05-07 10:07 | Datadog synthetic monitor fires: internal health check against http://api-gateway-internal/health returns 503 |
| 2025-05-07 10:08 | On-call engineer Yuki Tanaka acknowledges; checks API gateway logs; sees DNS resolution failures in upstream connection errors |
| 2025-05-07 10:11 | Yuki runs dig api-internal-legacy.us-east-1.elb.amazonaws.com from a jump host; receives NXDOMAIN; correlates with the ALB rename |
| 2025-05-07 10:13 | Yuki contacts Callum; Callum confirms the rename and identifies that CNAME records were not updated |
| 2025-05-07 10:14 | Yuki queries Route53 for all records referencing api-internal-legacy; finds 4 CNAME records across 2 hosted zones |
| 2025-05-07 10:17 | Callum begins updating CNAME records via Terraform; all 4 records updated to point to api-internal.us-east-1.elb.amazonaws.com |
| 2025-05-07 10:19 | terraform apply completes; new CNAME records propagate; DNS TTL of 60s on new records ensures fast propagation |
| 2025-05-07 10:22 | All four internal services resume successful DNS resolution; 503 errors cease |
| 2025-05-07 10:24 | Datadog synthetic monitor recovers; Yuki marks incident resolved |
| 2025-05-07 10:45 | Callum runs full Route53 audit to confirm no additional records reference the old hostname |
## Impact
### Customer Impact
None. External customer traffic was routed through a separate ALB (api-external) referenced by Route53 alias records. Alias records resolve differently from CNAMEs — they resolve directly to the ALB's current IP addresses via AWS's internal resolution, so they are not affected by ELB hostname changes. External APIs returned normal responses throughout the incident.
### Internal Impact
- Four internal services (`billing-aggregator`, `reporting-service`, `audit-log-consumer`, `internal-dashboard`) were unable to reach backend APIs for approximately 17 minutes.
- `billing-aggregator` queues hourly aggregation jobs; the 10:00 UTC job failed and required a manual rerun, costing approximately 25 minutes of engineering time.
- `internal-dashboard` displayed stale data to ~12 internal users (finance and ops teams) during the window; no decisions were made on stale data.
- On-call engineer Yuki Tanaka and DevOps engineer Callum Hennessy each spent approximately 30 minutes on detection, communication, and remediation.
### Data Impact
No data was lost. Queued events in billing-aggregator were retained in SQS and processed successfully on manual rerun. No writes were dropped.
## Root Cause
### What Happened (Technical)
AWS Application Load Balancers are identified externally by an auto-generated DNS hostname in the form `<name>-<id>.<region>.elb.amazonaws.com`. ALB names are immutable in AWS: changing the name means provisioning a new load balancer, and AWS generates a new hostname tied to the new name. The old hostname is immediately decommissioned — there is no grace period, no redirect, and no alias maintained for the old name.
Internal services at Meridian Systems had historically been connected using Route53 CNAME records that pointed directly to ELB DNS hostnames rather than using Route53 alias records. CNAME records are static: they store the literal target string and resolve it independently. When the target hostname ceases to resolve, all downstream services that depend on the CNAME receive NXDOMAIN and fail immediately (subject to TTL expiry for cached responses).
Callum's Terraform change modified only the name attribute of the aws_lb resource (a change Terraform executes as a destroy-and-recreate, because ALB names cannot be modified in place). No Terraform code existed to enumerate or validate CNAME records that referenced the old hostname. The dependency between the ALB name and the downstream CNAME records was entirely implicit and undocumented. The change was reviewed and approved as a cosmetic rename with no noted dependencies.
For approximately 5 minutes after the rename, the 300-second DNS TTL masked the failure: resolvers with cached responses kept returning the old hostname's last-known addresses, so callers saw no errors. As TTLs expired in waves, services failed one by one as each resolver re-queried and received NXDOMAIN.
The correct long-term architecture uses Route53 alias records instead of CNAMEs for ELB targets. Alias records are AWS-native and resolve to the ALB's current IP addresses using Route53's internal knowledge of the ALB fleet. They do not depend on ELB hostname strings and are not broken by ALB renames.
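In the Route53 API this difference is visible in the record's shape: an alias record carries an AliasTarget block instead of literal ResourceRecords. A minimal sketch of how an audit might separate the two, assuming record dicts shaped like the "ResourceRecordSets" entries returned by Route53's ListResourceRecordSets API (the sample data and function name are illustrative, not from this incident):

```python
# Sketch: flag Route53 record sets that are literal CNAMEs pointing at ELB
# hostnames. Record dicts follow the shape of the "ResourceRecordSets"
# entries returned by the ListResourceRecordSets API (e.g. via boto3).

def is_elb_cname(record_set):
    """True if this record set is a literal CNAME targeting an ELB hostname."""
    if "AliasTarget" in record_set:
        # Alias records track the load balancer itself and survive renames.
        return False
    if record_set.get("Type") != "CNAME":
        return False
    # A literal CNAME stores the target as a string; it breaks if that
    # hostname is ever retired.
    return any(
        ".elb.amazonaws.com" in rr.get("Value", "")
        for rr in record_set.get("ResourceRecords", [])
    )
```

Records for which this returns True are exactly the ones that inherit the blast radius of an ALB replacement.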
### Contributing Factors
- No DNS dependency map: There was no inventory or tooling to answer the question "which DNS records depend on this resource?" before making a change. The four CNAME records were discovered reactively (after the incident) using a Route53 audit script.
- LB rename classified as cosmetic, no impact assessment performed: The PR description read "rename ALB to match naming convention." No reviewer questioned whether downstream resources depended on the ELB hostname. The team's change review process had no checklist item for "does renaming this resource break any DNS references?"
- DNS TTL of 300s provided false safety: The 5-minute TTL delay between the rename and service failure meant the `terraform apply` appeared to succeed without immediate consequence. Engineers often use "it didn't break immediately" as a proxy for "it didn't break," especially for changes perceived as low-risk.
- CNAME records instead of alias records for ELB targets: Route53 alias records are the AWS-recommended pattern for referencing ELBs because they are name-agnostic (they track the ALB by resource ID, not DNS name). The four affected CNAME records were created before this best practice was adopted internally and had never been migrated.
### What We Got Lucky About
- Only internal services used CNAME records pointing to the old ELB. External-facing traffic used Route53 alias records (implemented 18 months earlier as part of a reliability improvement) and was completely unaffected.
- The TTL on the affected CNAME records was 300 seconds, not a longer value. A 3600-second TTL would have masked the failure for up to an hour and then caused a much larger wave of simultaneous failures. The 5-minute TTL meant failures surfaced quickly, and the incident was detected within 8 minutes of the rename.
## Detection
### How We Detected
A Datadog synthetic monitor running every 60 seconds against the internal API gateway health endpoint fired after two consecutive failures (503 responses). The synthetic was configured by the Backend Platform team following PM-019 and was the first signal.
### Why We Didn't Detect Sooner
The 300-second DNS TTL introduced a lag between the root cause (ALB rename at 09:59) and observable impact (DNS failures beginning at 10:04). During that 5-minute window, all services continued to operate normally from cached DNS responses, providing no signal. There was no pre-deployment validation step that checked for DNS record dependencies on the resource being modified.
## Response
### What Went Well
- The Datadog synthetic monitor fired promptly (after two consecutive failed checks), giving the on-call engineer a clear alert within 1 minute of symptoms beginning.
- Yuki's immediate use of `dig` on the failing hostname identified the DNS failure class within 3 minutes of alert acknowledgement, avoiding time spent on application-layer debugging.
- Communication between Yuki and Callum was fast — Callum was reachable in Slack within 2 minutes of Yuki identifying the rename as the likely cause.
- The Route53 CNAME records were managed in Terraform, so the fix was a targeted, reviewable change rather than a manual console edit.
### What Went Poorly
- There was no pre-change dependency check. A simple Route53 audit script (which took Callum 15 minutes to write post-incident) could have been run before the rename and would have caught all 4 CNAME records.
- The change review process did not flag the rename as potentially impactful. The PR was approved in under 5 minutes with no comment on DNS implications.
- The 4 CNAME records had never been migrated to alias records despite the internal best-practice documentation recommending alias records for ELB targets since Q3 2023.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-022-01 | Migrate all Route53 CNAME records targeting ELB hostnames to Route53 alias records; audit identifies 11 additional CNAMEs beyond the 4 in this incident | P1 | Callum Hennessy | In Progress | 2025-05-21 |
| PM-022-02 | Add pre-change script to ALB Terraform module that queries Route53 for CNAME records referencing the current ALB hostname and fails with a warning if any are found | P1 | Fatima Al-Rashid | Open | 2025-05-21 |
| PM-022-03 | Update infrastructure change review checklist to include: "Does renaming this resource invalidate any DNS records, service discovery entries, or TLS certificates?" | P2 | DevOps Team Lead | Open | 2025-05-14 |
| PM-022-04 | Add Route53 dependency map to internal service catalog (map each service to its DNS records and their target resources) | P2 | Networking | Open | 2025-06-04 |
| PM-022-05 | Reduce CNAME TTL to 60s for all records targeting mutable infrastructure resources (ELBs, NLBs) until migration to alias records is complete | P2 | Callum Hennessy | Open | 2025-05-14 |
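A rough sketch of what the PM-022-02 pre-change guard might look like, assuming the hosted zone's record sets have already been fetched (e.g. with boto3's list_resource_record_sets); the function name and data shapes are illustrative assumptions, not the actual script:

```python
# Sketch of a pre-change dependency scan: given a zone's record sets (shaped
# like Route53's ListResourceRecordSets output), find every CNAME that
# references the hostname about to disappear.

def find_dependent_cnames(record_sets, target_hostname):
    """Return names of CNAME records whose value equals target_hostname."""
    # Normalize trailing dots and case so "host." and "HOST" still match.
    target = target_hostname.rstrip(".").lower()
    return [
        rs["Name"]
        for rs in record_sets
        if rs.get("Type") == "CNAME"
        and any(
            rr.get("Value", "").rstrip(".").lower() == target
            for rr in rs.get("ResourceRecords", [])
        )
    ]
```

Run before `terraform apply` and abort if the result is non-empty; in this incident such a check would have surfaced all four dependent records.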
## Lessons Learned
- AWS ELB renames are not backward-compatible. Unlike many rename operations, changing an ALB name in AWS immediately retires the old DNS hostname with no grace period or alias. Any change to an ELB's name attribute must be preceded by a full dependency scan of DNS records, service discovery, and TLS certificates that reference the old name.
- DNS TTL delays are not safety margins. A successful `terraform apply` followed by 5 minutes of silence is not evidence that a change was safe — it may simply be within the TTL window. Post-change validation must explicitly verify resolution of all hostnames associated with the modified resource, not rely on the absence of immediate errors.
- CNAME records for mutable infrastructure are a latent liability. Route53 alias records decouple service discovery from the names of underlying infrastructure resources. Any CNAME record that points directly to a cloud-provider-assigned hostname inherits the full blast radius of any rename or replacement of that resource. Migrating to alias records is a reliability improvement, not just a best-practice preference.
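The post-change validation called for above can be as small as re-resolving every hostname associated with the modified resource. A standard-library sketch; the list of hostnames would come from the DNS dependency map (PM-022-04), which is an assumed input here:

```python
import socket

def verify_hostnames_resolve(hostnames):
    """Return the subset of hostnames that fail DNS resolution."""
    failures = []
    for host in hostnames:
        try:
            # getaddrinfo raises socket.gaierror on NXDOMAIN and other
            # resolution failures.
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            failures.append(host)
    return failures
```

A deploy pipeline could fail the apply step whenever this returns a non-empty list, instead of trusting the TTL-masked silence that followed this incident's rename.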
## Cross-References
- Failure Pattern: Hidden dependency — DNS CNAME chain implicitly coupled to infrastructure naming convention, not surfaced in dependency maps or change review
- Topic Packs: DNS fundamentals, AWS Route53 alias vs CNAME, AWS ELB lifecycle, infrastructure-as-code change safety
- Runbook: `runbooks/networking/route53-elb-dependency-audit.md`
- Decision Tree: Internal service 503 → check upstream DNS → `dig` target hostname → NXDOMAIN? → audit Route53 CNAMEs → identify changed ELB or target resource