
Postmortem: AWS AZ Network Degradation Triggers Cascading Health Check Failures

ID: PM-009
Date: 2025-06-03
Severity: SEV-2
Duration: 1h 15m (onset to resolution; 67m from detection)
Time to Detect: 8m
Time to Mitigate: 1h 15m
Customer Impact: 30% of requests experienced >2s latency; 5% of requests returned 503; approximately 41,000 affected requests over 67 minutes
Revenue Impact: ~$22,500 estimated (503 errors on checkout and API endpoints; SLA credit obligations to 2 enterprise customers)
Teams Involved: Site Reliability Engineering, Cloud Infrastructure, Incident Command, Networking
Postmortem Author: Svetlana Marchetti
Postmortem Date: 2025-06-06

Executive Summary

On 2025-06-03, AWS us-east-1a experienced elevated packet loss (measured at 1.8–2.3%) due to an underlying network issue AWS later attributed to a misconfigured internal router. The company's Application Load Balancer (ALB) health checks, configured with a 5-second timeout and an unhealthy threshold of 2, began marking healthy EC2 instances in us-east-1a as unhealthy and deregistering them. Traffic concentrated on us-east-1b and us-east-1c, which became CPU- and connection-saturated. The autoscaling group was at its maximum configured capacity and could not scale further in the surviving AZs. The result was 30% elevated latency and a 5% 503 error rate for 67 minutes. Recovery required manually loosening the ALB health check thresholds and re-registering the us-east-1a instances in batches once AWS confirmed the network issue was subsiding.

Timeline (All times UTC)

Time Event
11:02 AWS us-east-1a network degradation begins; packet loss of ~2% on intra-AZ and cross-AZ traffic
11:05 ALB health checks to us-east-1a instances begin failing intermittently; with a 5s timeout and an unhealthy threshold of 2, two failed checks 10 seconds apart are enough to trigger deregistration
11:07 First us-east-1a instance deregistered from ALB target group (14 instances in 1a; first 2 deregistered)
11:10 PagerDuty alert: ALB 5xx error rate exceeds 1% threshold; on-call SRE Bogdan Iliescu acknowledges
11:11 Bogdan opens CloudWatch; sees 503 rate at 2.4%, requests shifting to 1b and 1c
11:13 Bogdan checks EC2 instance health; sees 9 of 14 us-east-1a instances in OutOfService state in ALB console
11:14 Bogdan SSHs to a 1a instance marked unhealthy; instance responds normally, application is healthy
11:15 Bogdan pages Incident Commander Svetlana Marchetti and Cloud Infrastructure lead Yusuf Karahan
11:16 Svetlana opens incident bridge; initial hypothesis is application bug on 1a instances
11:18 Yusuf checks AWS Health Dashboard; no us-east-1a event posted yet
11:19 All 14 us-east-1a instances deregistered from ALB; all traffic on 1b (18 instances) and 1c (16 instances)
11:20 us-east-1b and 1c CPU utilization reaches 87–92%; connection queue depths rising
11:21 Autoscaling group at maximum capacity (48 instances total); scale-out not possible
11:22 503 rate peaks at 6.8%; P99 latency at 4.1s (baseline: 320ms)
11:24 Yusuf checks traceroute and mtr from a 1a instance to a 1b instance; sees ~2% packet loss
11:25 Hypothesis shifts from application bug to AZ network degradation
11:26 Svetlana contacts AWS Enterprise Support via phone (severity: production down)
11:28 AWS status page posts: "We are investigating elevated network latency in US-East-1a"
11:29 Team confirms: this is an AWS-side issue; pivot to mitigation rather than root cause investigation
11:31 Decision: loosen ALB health check thresholds temporarily to stop further deregistrations; set unhealthy threshold to 5 (was 2), timeout to 10s (was 5s)
11:33 ALB health check config updated; no immediate re-registration of 1a instances (already deregistered)
11:35 Team debates re-registering 1a instances manually; risk: if network still degraded, re-adding them could route real traffic through lossy path
11:38 AWS Support confirms packet loss is subsiding; team agrees to re-register 1a instances in batches
11:40 First batch of 5 us-east-1a instances re-registered; monitoring for 503 rate increase
11:43 No 503 spike; remaining 9 us-east-1a instances re-registered
11:47 Load redistributes across all 3 AZs; CPU on 1b and 1c drops to 55–60%
12:17 Sustained baseline metrics across all AZs; 503 rate < 0.1%; incident resolved
12:20 All-clear declared; AWS posts resolution on status page
12:35 Post-incident: team begins analysis of health check configuration and autoscaling ceiling

Impact

Customer Impact

Approximately 41,000 requests were affected during the 67-minute impact window (11:10–12:17 UTC): roughly 34,000 experienced P99 latency > 2s and approximately 7,000 received 503 responses. The checkout flow and REST API endpoints were most affected. Two enterprise customers with contractual 99.9% monthly availability SLAs received SLA breach notifications (the incident contributed 0.09% downtime to their monthly calculations). Consumer-facing web traffic saw a 12% increase in bounce rate during the peak degradation window (11:19–11:47 UTC).

Internal Impact

SRE and Cloud Infrastructure teams spent approximately 8 hours total during and after the incident: 2 hours active response, 3 hours post-incident architecture review, 3 hours drafting AWS Support follow-up and SLA credit requests. Networking team spent 2 hours reviewing cross-AZ health check architecture. A scheduled capacity planning review was deferred by one week.

Data Impact

No data loss. All 503 errors were returned at the load balancer tier before reaching application servers; no in-flight transactions were corrupted. CloudWatch logs confirm no write operations were partially committed during the degradation window.

Root Cause

What Happened (Technical)

AWS us-east-1a experienced network-layer packet loss of approximately 1.8–2.3%, later attributed by AWS to a misconfigured internal router affecting a subset of instances in the AZ. The packet loss was not severe enough to prevent TCP connections from establishing but was sufficient to cause intermittent TCP segment retransmissions and timeout events.

The company's ALB target group for the production fleet was configured with health check parameters optimized for detecting crashed application processes rather than transient network degradation: HealthCheckIntervalSeconds: 10, HealthCheckTimeoutSeconds: 5, UnhealthyThresholdCount: 2. Under these settings, two consecutive health check failures within 20 seconds are sufficient to deregister an instance. At 2% packet loss, a 5-second TCP health check has an estimated 4% probability of timing out per attempt. Two consecutive failures, which trigger deregistration, then have approximately 0.16% probability per interval if check outcomes are independent. Across 14 instances checked every 10 seconds, that predicts a first deregistration within roughly 7–8 minutes, consistent with the first deregistration at 11:07. That all 14 instances were deregistered within 14 minutes, however, is far faster than an independence model predicts; real packet loss is bursty and correlated across checks, which makes consecutive timeouts much more likely than the naive per-packet estimate suggests.
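The independence arithmetic above can be checked with a short sketch (the 4% per-check timeout figure is the estimate from this analysis, not a measured value):

```python
# Back-of-envelope model for ALB deregistration under ~2% packet loss,
# assuming each health check times out independently with probability 0.04.

p_timeout = 0.04          # estimated per-check timeout prob at ~2% loss
unhealthy_threshold = 2   # UnhealthyThresholdCount
interval_s = 10           # HealthCheckIntervalSeconds
instances_in_az = 14

# A single instance is deregistered after 2 consecutive timeouts.
p_dereg_per_interval = p_timeout ** unhealthy_threshold   # 0.0016

# Probability that at least one of the 14 instances crosses the threshold
# in a given interval, and the resulting expected wait (geometric
# distribution) until the first deregistration anywhere in the AZ.
p_any = 1 - (1 - p_dereg_per_interval) ** instances_in_az
expected_first_dereg_s = interval_s / p_any

print(f"first deregistration expected after ~{expected_first_dereg_s/60:.1f} min")
```

The ~7.5 minute figure is consistent with the first deregistration at 11:07.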

Once all 14 us-east-1a instances were deregistered, the remaining 34 instances in 1b and 1c absorbed the full production traffic load. Pre-incident, the fleet was operating at 61% average CPU utilization across all 48 instances. With only 34 instances, the per-instance load increased to approximately 86%, well above the autoscaling scale-out threshold of 80% CPU. The autoscaling group's MaxSize was set to 48, which the fleet had already reached, so launching additional capacity in 1b or 1c was blocked by the ASG ceiling. The combination of no spare capacity headroom, tight health check thresholds, and no AZ-aware traffic shaping produced the failure cascade.
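The load redistribution arithmetic can be expressed as a small helper (a sketch assuming even load spread and linear CPU scaling with request share):

```python
def utilization_after_az_loss(total_instances: int,
                              lost_instances: int,
                              avg_cpu_pct: float) -> float:
    """Per-instance CPU after losing an AZ, assuming the load
    redistributes evenly and CPU scales linearly with request share."""
    survivors = total_instances - lost_instances
    return avg_cpu_pct * total_instances / survivors

# This incident: 48 instances at 61% average CPU, 14 lost with us-east-1a.
print(f"{utilization_after_az_loss(48, 14, 61.0):.0f}%")  # ~86%
```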

The ALB does not distinguish between "instance is down" and "instance is unreachable due to AZ network issue." Both conditions result in identical health check failures. The company had no mechanism to detect "AZ-wide health check failure pattern" as distinct from "individual instance failures" and respond differently (e.g., by relaxing thresholds or suspending deregistration when >50% of an AZ fails simultaneously).
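A fleet-level check of this kind could look like the following sketch (the helper name, data shapes, and 50% threshold are illustrative assumptions, not an existing tool):

```python
from collections import Counter

def suspect_az_degradation(unhealthy_targets: list[tuple[str, str]],
                           az_sizes: dict[str, int],
                           fraction_threshold: float = 0.5) -> list[str]:
    """Return AZs where more than `fraction_threshold` of registered
    targets are unhealthy at once, which points to an AZ-level network
    issue rather than independent instance failures.

    unhealthy_targets: (instance_id, az) pairs currently failing checks.
    az_sizes: total registered targets per AZ.
    """
    unhealthy_per_az = Counter(az for _, az in unhealthy_targets)
    return [az for az, n in unhealthy_per_az.items()
            if n / az_sizes[az] > fraction_threshold]

# Example shaped like this incident at 11:13 UTC: 9 of 14 1a targets out.
unhealthy = [(f"i-{k:02d}", "us-east-1a") for k in range(9)]
sizes = {"us-east-1a": 14, "us-east-1b": 18, "us-east-1c": 16}
print(suspect_az_degradation(unhealthy, sizes))  # ['us-east-1a']
```

A signal like this could feed the "relax thresholds or suspend deregistration" response described above.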

Contributing Factors

  1. Health check thresholds too aggressive for cross-AZ network variance: The UnhealthyThresholdCount: 2 with HealthCheckTimeoutSeconds: 5 configuration was designed for a single-AZ environment and never revisited when the architecture went multi-AZ. Cross-AZ health checks traverse more network hops and are more sensitive to transient packet loss. AWS recommends an unhealthy threshold of 3–5 for multi-AZ deployments with health checks crossing AZ boundaries.

  2. No AZ-aware traffic shaping or circuit breaker: When an AZ becomes degraded, the correct response is to gradually reduce traffic to it rather than binary deregistration. The ALB deregisters instances entirely on threshold breach, with no partial-weight reduction. AWS Application Load Balancers do not natively support weighted AZ routing; this would require a separate layer (e.g., Route 53 weighted records or a service mesh). No such layer existed.

  3. Autoscaling group at maximum capacity with no room to absorb AZ loss: The ASG MaxSize: 48 reflected the maximum the account's EC2 service quota allowed at the time (a quota increase request was pending). Operating at maximum capacity with no headroom means any significant reduction in available instances — whether from AZ loss, instance failures, or deployments — cannot be compensated by scaling. The fleet should be sized so that losing an entire AZ leaves remaining capacity at < 70% utilization.
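The sizing rule from factor 3 can be made concrete (a sketch assuming a balanced fleet, even load spread, and linear CPU scaling):

```python
import math

def min_fleet_for_az_loss(n_azs: int,
                          current_total: int,
                          avg_cpu_pct: float,
                          max_post_loss_cpu_pct: float = 70.0) -> int:
    """Smallest balanced fleet size such that losing one full AZ leaves
    the survivors at or below max_post_loss_cpu_pct."""
    # Fixed amount of work the fleet carries, in instance-percent units.
    work = avg_cpu_pct * current_total
    survivors_needed = math.ceil(work / max_post_loss_cpu_pct)
    # Survivors are (n_azs - 1)/n_azs of a balanced fleet.
    return math.ceil(survivors_needed * n_azs / (n_azs - 1))

# This incident: 48 instances at 61% average CPU across 3 AZs.
print(min_fleet_for_az_loss(n_azs=3, current_total=48, avg_cpu_pct=61.0))
# -> 63
```

For this incident's numbers the check yields 63 instances, consistent with the MaxSize of 64 proposed in AI-009-02.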

What We Got Lucky About

  1. AWS posted the us-east-1a issue on their status page within 20 minutes of the network degradation beginning — unusually fast relative to their historical status page update latency (often 30–60 minutes). This confirmed the team's hypothesis, stopped them from pursuing phantom application bugs, and enabled a confident pivot to mitigation. A longer diagnosis delay would have extended the incident significantly.
  2. The two enterprise customers whose SLA thresholds were breached had contractual grace periods for AWS-attributed outages. Legal confirmed that the AWS Health Dashboard event documentation was sufficient to invoke the force majeure clause in both contracts, reducing the credit liability from $18,000 to $4,500 in rebates.

Detection

How We Detected

CloudWatch alarm on the ALB 5xx error rate (HTTPCode_ELB_5XX_Count as a fraction of RequestCount, via metric math) exceeding 1% over a 2-minute period; routed to PagerDuty, fired at 11:10 UTC. This was the primary alerting mechanism and fired 8 minutes after the network degradation began. The 8-minute gap reflects the time needed for packet loss to cause two consecutive health check failures, for enough instances to be deregistered to overload 1b/1c, and for the resulting 503s to accumulate past the alert threshold.

Why We Didn't Detect Sooner

There is no alert on "instances entering OutOfService state in an ALB target group." An alert watching for more than 20% of a target group's instances becoming unhealthy within 5 minutes would have fired at approximately 11:09 UTC, 1–2 minutes earlier. More importantly, it would have provided clearer signal (ALB health check failures) rather than requiring the SRE to infer this from the 503 rate alert. The ALB health check configuration was not treated as a monitoring surface in its own right.
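Such an alarm could be sketched as metric-math parameters for CloudWatch's PutMetricAlarm API (the alarm name, dimension values, and exact window settings here are illustrative assumptions, not the final implementation):

```python
def unhealthy_fraction_alarm(target_group: str, load_balancer: str) -> dict:
    """Build PutMetricAlarm kwargs that fire when more than 20% of a
    target group's instances are unhealthy, via metric math over the
    standard AWS/ApplicationELB host-count metrics."""
    dims = [{"Name": "TargetGroup", "Value": target_group},
            {"Name": "LoadBalancer", "Value": load_balancer}]

    def stat(metric_id: str, name: str) -> dict:
        return {"Id": metric_id,
                "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
                                          "MetricName": name,
                                          "Dimensions": dims},
                               "Period": 60,
                               "Stat": "Average"},
                "ReturnData": False}

    return {"AlarmName": "alb-target-group-unhealthy-fraction",
            "ComparisonOperator": "GreaterThanThreshold",
            "Threshold": 0.20,        # > 20% of the target group unhealthy
            "EvaluationPeriods": 5,   # 5 x 60s periods = the 5-minute window
            "DatapointsToAlarm": 5,
            "Metrics": [stat("unhealthy", "UnHealthyHostCount"),
                        stat("healthy", "HealthyHostCount"),
                        {"Id": "fraction",
                         "Expression": "unhealthy / (unhealthy + healthy)",
                         "Label": "UnhealthyFraction",
                         "ReturnData": True}],
            "AlarmActions": []}       # PagerDuty SNS topic ARN would go here

# Placeholder ARN suffixes for illustration:
params = unhealthy_fraction_alarm("targetgroup/prod-web/0123456789abcdef",
                                  "app/prod-alb/abcdef0123456789")
```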

Response

What Went Well

  1. The mtr test from a 1a instance to 1b at 11:24 UTC was the pivotal diagnostic step that shifted the hypothesis from application failure to network degradation. The Networking team's runbook for AZ degradation scenarios includes this test; the SRE knew to run it.
  2. Calling AWS Enterprise Support by phone at 11:26 UTC rather than opening a ticket was the right call for a production-down scenario. The phone escalation path got a response within 4 minutes; a ticket would have taken 20–30 minutes.
  3. The batch re-registration approach (5 instances first, monitor, then rest) was cautious and correct. Re-adding all 14 instances simultaneously to a still-possibly-degraded AZ could have caused a second wave of 503s if the network had not fully recovered.
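The batch procedure from point 3, generalized as a sketch (the callback shapes are hypothetical; a real version would wrap the elbv2 register_targets call and read the 503 rate from CloudWatch):

```python
import time
from typing import Callable, Sequence

def reregister_in_batches(instance_ids: Sequence[str],
                          register: Callable[[Sequence[str]], None],
                          error_rate: Callable[[], float],
                          first_batch: int = 5,
                          soak_s: float = 180.0,
                          max_503_rate: float = 0.01,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Re-register a canary batch, soak while watching the 503 rate,
    then add the remainder. Returns False and stops if errors rise,
    which would indicate the AZ network is still degraded."""
    canary = list(instance_ids[:first_batch])
    rest = list(instance_ids[first_batch:])
    register(canary)
    sleep(soak_s)                     # let traffic reach the canary batch
    if error_rate() > max_503_rate:   # still degraded: do not add the rest
        return False
    register(rest)
    return True

# Stubbed example (no AWS calls): all 14 instances get re-registered.
added: list[str] = []
ok = reregister_in_batches([f"i-{n:02d}" for n in range(14)],
                           register=added.extend,
                           error_rate=lambda: 0.001,
                           sleep=lambda s: None)
print(ok, len(added))  # True 14
```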

What Went Poorly

  1. The team spent 11 minutes (11:14–11:25) investigating an application-level cause after confirming that individual 1a instances responded normally. A documented decision tree for "ALB deregistered multiple instances simultaneously" should lead directly to AZ-level network checks rather than application investigation.
  2. The autoscaling maximum capacity limit was a known constraint with a pending quota increase request, but no runbook addressed "ASG at MaxSize during AZ loss." The team improvised the re-registration approach during the incident rather than following a pre-planned procedure.
  3. The AWS Health Dashboard check at 11:18 (before the event was posted) yielded a false negative and briefly misdirected the team. The runbook should note that AWS status page latency means an absence of an event does not rule out an AWS issue.

Action Items

AI-009-01 (P0, Cloud Infrastructure, Open, due 2025-06-10): Update ALB health check config: UnhealthyThresholdCount: 5, HealthCheckTimeoutSeconds: 10, HealthCheckIntervalSeconds: 15; document rationale in Terraform comments
AI-009-02 (P0, Cloud Infrastructure, Open, due 2025-06-17): Raise ASG MaxSize from 48 to 64 (EC2 quota increase already submitted; follow up with AWS); update capacity planning to ensure AZ loss leaves remaining AZs at < 70% CPU
AI-009-03 (P1, Site Reliability Engineering, Open, due 2025-06-13): Add CloudWatch alarm: > 25% of ALB target group instances entering OutOfService within 5 minutes; page SRE on-call
AI-009-04 (P1, Site Reliability Engineering, Open, due 2025-06-20): Add runbook section "AZ-Wide Instance Deregistration Pattern" with decision tree and re-registration procedure; link from ALB health check alarm
AI-009-05 (P2, Cloud Infrastructure / Networking, Open, due 2025-07-11): Evaluate Route 53 weighted AZ routing or AWS Global Accelerator for AZ-aware traffic shaping; produce architecture recommendation
AI-009-06 (P2, Site Reliability Engineering, Open, due 2025-06-20): Add note to AWS issue runbook: status page lag is typically 20–60 minutes; absence of an event does not rule out an AWS infrastructure issue; use mtr and AWS Support phone as parallel tracks
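For illustration, AI-009-01's values expressed as ELBv2 ModifyTargetGroup parameters (the target group ARN is a placeholder; per the action item, the real change would land in Terraform):

```python
# Hypothetical kwargs for the ELBv2 ModifyTargetGroup call mirroring
# AI-009-01. The ARN is a placeholder, not a real resource.
new_health_check = {
    "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:"
                      "123456789012:targetgroup/prod-web/0123456789abcdef",
    "HealthCheckIntervalSeconds": 15,   # was 10
    "HealthCheckTimeoutSeconds": 10,    # was 5; must stay below the interval
    "UnhealthyThresholdCount": 5,       # was 2
}

# Rough time from first failed check to deregistration: (threshold - 1)
# further intervals after the first failure, i.e. 4 * 15s = 60s, versus
# the 10s window that drove this incident.
time_to_deregister_s = ((new_health_check["UnhealthyThresholdCount"] - 1)
                        * new_health_check["HealthCheckIntervalSeconds"])
print(time_to_deregister_s)  # 60
```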

Lessons Learned

  1. Health check parameters encode assumptions about your network topology: A 2-failure unhealthy threshold made sense in a single-AZ setup but is too aggressive in a multi-AZ deployment where cross-AZ health checks traverse more network hops. Health check parameters should be revisited every time the network architecture changes.
  2. "ASG at MaxSize" is a latent single-AZ-loss vulnerability: A fleet at maximum capacity has no elasticity to absorb the loss of an availability zone. Capacity planning must treat "lose an AZ" as a design constraint, not just a monitoring scenario, and ensure the remaining fleet has headroom to handle full load redistribution.
  3. AZ-correlated failures are distinct from instance failures and require different detection: Monitoring that watches individual instance health will always lag behind an AZ-wide event. A fleet-level view — "what fraction of an AZ's instances became unhealthy simultaneously?" — is a faster and cleaner signal for network-layer AZ degradation.

Cross-References

  • Failure Pattern: Cascading Health Check Failure — AZ Network Degradation; Capacity Ceiling During AZ Loss
  • Topic Packs: AWS ALB and Target Groups; EC2 Auto Scaling; Multi-AZ Architecture; Cloud Incident Response
  • Runbook: runbooks/cloud/alb-target-group-triage.md; runbooks/cloud/aws-az-degradation.md; runbooks/cloud/asg-capacity-runbook.md
  • Decision Tree: ALB 5xx Spike → Check Target Group Health → If >20% instances OutOfService in same AZ → Suspect AZ network → Run mtr from AZ instance → Call AWS Support + Check status page in parallel → Loosen health check thresholds → Re-register in batches after confirmation