# Postmortem: Kernel TCP Regression After Security Patch
| Field | Value |
|---|---|
| ID | PM-024 |
| Date | 2025-09-03 |
| Severity | SEV-3 |
| Duration | 1h 20m (rollout start to full recovery) |
| Time to Detect | 18m |
| Time to Mitigate | 1h 2m |
| Customer Impact | None — affected services were internal data pipeline infrastructure; external APIs were unaffected |
| Revenue Impact | None |
| Teams Involved | Infrastructure Engineering, Data Platform, Security Operations |
| Postmortem Author | Nkechi Adeyemi |
| Postmortem Date | 2025-09-06 |
## Executive Summary
On 2025-09-03 at 02:00 UTC, a scheduled monthly security kernel patching job rolled out Linux kernel 5.15.149 across the production fleet, replacing 5.15.147. The new kernel contained a regression in TCP window scaling that caused the receive window to be calculated incorrectly for connections with round-trip times greater than 10ms — a characteristic of all cross-AZ and cross-datacenter TCP connections. High-throughput, long-lived internal services (centralized log shipping via Fluent Bit and three Kafka-based data pipeline legs) experienced an immediate ~60% drop in throughput. External-facing services were unaffected because their traffic pattern (short-lived HTTP request/response) did not exercise the regression code path. The incident was caught by a data pipeline SLA alert 18 minutes after the kernel rollout began. Mitigation took a further 62 minutes: the kernel had to be rolled back on affected hosts, which was slowed by fleet size and by the need to coordinate with the Security Operations team before bypassing the security patch.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 2025-09-03 02:00 | Automated kernel patching job (Ansible) begins rolling out kernel 5.15.147 → 5.15.149 across 847 production hosts |
| 2025-09-03 02:00 | Patching applies to all hosts simultaneously (no canary configured for this patch due to critical security classification) |
| 2025-09-03 02:06 | Kernel rollout complete; all hosts running 5.15.149 and restarted |
| 2025-09-03 02:07 | Fluent Bit log shipping throughput begins dropping; Kafka broker replication lag begins rising on all 9 brokers |
| 2025-09-03 02:18 | Prometheus alert fires: `kafka_consumer_lag_sum > 500000` sustained for 3 minutes on the data pipeline consumer group |
| 2025-09-03 02:19 | On-call data platform engineer Marcus Johansson acknowledges; checks Kafka broker metrics |
| 2025-09-03 02:22 | Marcus observes all 9 Kafka brokers showing degraded throughput simultaneously; rules out single-broker failure |
| 2025-09-03 02:25 | Marcus checks Fluent Bit metrics; log ship rate has dropped from 450 MB/s to 180 MB/s fleet-wide |
| 2025-09-03 02:27 | Marcus escalates to infrastructure on-call (Nkechi Adeyemi); both begin correlating timing with recent changes |
| 2025-09-03 02:30 | Nkechi queries change log; identifies kernel patch rollout completed at 02:06 — 1 minute before degradation began |
| 2025-09-03 02:33 | Nkechi uses `ss -ti` on a Fluent Bit host; observes `rcv_wnd` values far below expected for cross-AZ traffic; suspects TCP window scaling regression |
| 2025-09-03 02:38 | Nkechi locates upstream kernel bug report: bugzilla.kernel.org #218374 — TCP window scaling regression in 5.15.149 for RTT > 10ms; 5.15.147 confirmed unaffected |
| 2025-09-03 02:41 | Nkechi contacts Security Operations lead (Prashant Gupta) to request authorization to roll back 5.15.149; security classification review required |
| 2025-09-03 02:53 | Prashant approves rollback after confirming the CVE addressed by 5.15.149 does not affect internal-only services in current threat model |
| 2025-09-03 02:55 | Ansible rollback job initiated; targets the 214 hosts running Kafka brokers and Fluent Bit aggregators first (highest impact) |
| 2025-09-03 03:08 | First rollback batch (214 hosts) complete; Kafka replication lag begins recovering; Fluent Bit throughput begins climbing |
| 2025-09-03 03:20 | Kafka consumer lag returns to baseline; Fluent Bit throughput at 440 MB/s; data pipeline SLA alert resolves |
| 2025-09-03 03:28 | Remainder of fleet rollback complete (633 remaining hosts); all hosts back on 5.15.147 |
| 2025-09-03 04:15 | Security Operations opens ticket to evaluate 5.15.150 (expected upstream fix release) and schedule re-patching |
## Impact
### Customer Impact
None. All customer-facing APIs run on a separate host pool with different traffic characteristics (short-lived HTTP connections, RTT <2ms within the same AZ). The TCP window scaling regression required RTT >10ms to manifest, which does not match external API traffic patterns. External API latency, throughput, and error rates were normal throughout.
### Internal Impact
- Data pipeline lag: Three Kafka consumer groups processing internal telemetry, audit events, and analytics data fell behind by approximately 2.1 million events during the 1h 20m window. All were replayed and processed after recovery; no events were dropped (Kafka retention policy is 7 days).
- Log shipping gap: Centralized log ingestion dropped to 40% of normal rate for 73 minutes. Logs were buffered on-host by Fluent Bit's local buffer (configured for 4 hours of capacity) and flushed after recovery; no log data was lost.
- Engineering time: Nkechi Adeyemi and Marcus Johansson each spent approximately 2 hours on detection, diagnosis, coordination, and rollback. Prashant Gupta (Security Operations) spent 45 minutes on the security review and authorization. Total: approximately 5 person-hours.
- Data pipeline SLA: The data pipeline has an internal SLA of 15-minute end-to-end latency for analytics events. This SLA was breached for the full 80-minute incident window, affecting internal dashboards used by the Product Analytics team.
### Data Impact
No data was lost. Kafka's retention and Fluent Bit's local disk buffer fully absorbed the throughput drop. All queued data was processed after kernel rollback. Event ordering was preserved within each partition.
## Root Cause
### What Happened (Technical)
Linux kernel 5.15.149 introduced a regression in the TCP receive window scaling implementation. The specific behavior: when TCP window scaling is negotiated (RFC 1323), the receive window advertised to the sender should scale proportionally with the amount of buffer space available and the connection's bandwidth-delay product. In 5.15.149, a change to the window scaling calculation (`tcp_select_window()`) introduced an off-by-one error in the shift factor applied when the connection RTT exceeds approximately 10ms and the socket receive buffer is larger than 87380 bytes (the default).
The practical effect was that the kernel advertised a receive window approximately 60% smaller than the available buffer space. The sending side, obeying the advertised window, throttled its send rate accordingly. For high-throughput senders (Kafka producers, Fluent Bit forwarders) this manifested as a throughput reduction of ~60% per connection — matching the observed degradation. Short-lived connections (HTTP request/response) were not affected because the window scaling issue only manifested approximately 500ms after connection establishment, longer than most HTTP transactions last.
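The window-to-throughput relationship can be sketched with bandwidth-delay arithmetic. This is illustrative only: the 20ms RTT is picked from the cross-AZ range measured in this incident, and the 1 MiB healthy window is an assumed value, not a measurement from the fleet.

```python
# Illustrative arithmetic only: window sizes are assumptions chosen to
# mirror the ~60% reduction described above, not incident measurements.

def throughput_cap_bytes_per_sec(rcv_wnd_bytes: int, rtt_seconds: float) -> float:
    """A sender can have at most one receive window in flight per RTT,
    so the advertised window caps throughput at rcv_wnd / RTT."""
    return rcv_wnd_bytes / rtt_seconds

rtt = 0.020                             # 20 ms, a typical cross-AZ RTT here
healthy_wnd = 1_048_576                 # assume a 1 MiB advertised window
regressed_wnd = int(healthy_wnd * 0.4)  # window ~60% smaller, per the bug

healthy = throughput_cap_bytes_per_sec(healthy_wnd, rtt)
regressed = throughput_cap_bytes_per_sec(regressed_wnd, rtt)
print(f"cap drops from {healthy/1e6:.1f} MB/s to {regressed/1e6:.1f} MB/s")
# → cap drops from 52.4 MB/s to 21.0 MB/s
```

The same arithmetic shows why sub-2ms same-AZ connections never hit the cap: at 2ms RTT even the shrunken window permits over 200 MB/s per connection.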
All cross-AZ TCP connections at Meridian Systems have measured RTTs between 12ms and 35ms depending on AZ pair. This placed all data pipeline traffic squarely in the affected RTT range. Same-AZ connections (RTT <2ms) were unaffected, which explains why external API traffic (clustered within a single AZ) showed no degradation.
The upstream kernel community identified this regression in the 5.15.149 release shortly after GA (bug report filed 4 hours before the patch was released to distributions). A fix was included in 5.15.150 released 9 days later. The regression was present only in 5.15.149.
### Contributing Factors
- No canary phase for security-critical kernel patches: The monthly patching process normally uses a 24-hour canary bake on 5% of hosts before fleet-wide rollout. This canary phase was skipped for 5.15.149 because the Security Operations team had classified the underlying CVE as critical severity and wanted the patch deployed within 12 hours of release. The canary skip was a known policy exception, but it removed the only mechanism that could have caught the regression before fleet-wide impact.
- No TCP-level regression tests in the patching pipeline: The patching pipeline ran basic health checks (HTTP endpoint reachability, systemd service status) but had no network-layer regression tests. A simple `iperf3` throughput test between hosts in different AZs with RTT >10ms would have caught the window scaling regression during the canary phase — if a canary phase had been run.
- Security urgency overrode operational risk assessment: The decision to skip the canary was made unilaterally by the Security Operations team based on CVE severity, without consulting the Data Platform team on the risk of fleet-wide simultaneous restarts and unvalidated kernel behavior. There was no cross-team sign-off process for canary exemptions.
- Lack of kernel-specific regression awareness: The team had no process for checking the upstream kernel bugzilla or CVE databases for regressions in candidate kernel versions before deploying. The bug report for the regression was publicly available before the kernel was pushed to the fleet.
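The missing network-layer gate (later formalized as action item PM-024-02) could look roughly like the sketch below. A sketch, not production code: the helper names and the wiring of the 85% threshold are assumptions; the JSON path `end.sum_sent.bits_per_second` is standard `iperf3 --json` output.

```python
# Hypothetical canary gate: compare cross-AZ iperf3 throughput on a
# canary host against a recorded baseline before fleet rollout proceeds.
import json
import subprocess

def cross_az_throughput_bps(server_host: str, seconds: int = 10) -> float:
    """Run iperf3 in JSON mode against a host in another AZ and return
    the sender-side throughput in bits per second."""
    out = subprocess.run(
        ["iperf3", "-c", server_host, "-t", str(seconds), "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["end"]["sum_sent"]["bits_per_second"]

def passes_canary_gate(measured_bps: float, baseline_bps: float,
                       threshold: float = 0.85) -> bool:
    """Fail the rollout if canary throughput drops below 85% of baseline."""
    return measured_bps >= threshold * baseline_bps
```

Run against a peer in another AZ (so the path RTT exceeds 10ms), this gate would have flagged a ~60% throughput drop well inside any canary window.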
### What We Got Lucky About
- The regression only affected high-throughput, long-lived connections with RTT >10ms. Short-lived HTTP request/response transactions — which describe virtually all customer-facing traffic — were unaffected. If the regression had impacted all TCP connections regardless of RTT or connection lifetime, every service on the fleet would have degraded simultaneously, including customer-facing APIs.
- Fluent Bit's local buffer was configured for 4 hours of capacity, and Kafka's retention policy is 7 days. This meant that despite 73 minutes of reduced throughput, no data was lost. If the incident had lasted 4+ hours, Fluent Bit buffers would have overflowed and log data would have been dropped.
## Detection
### How We Detected
A Prometheus alert on Kafka consumer group lag fired after the data-pipeline-analytics consumer group exceeded 500,000 messages of lag for 3 consecutive minutes. The alert was the primary detection signal. Secondary confirmation came from the Fluent Bit throughput dashboard, which showed a fleet-wide simultaneous drop beginning at 02:07.
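The lag alert can be reconstructed as a Prometheus rule. A sketch: the group name, severity label, and consumer-group label value are assumed from this report rather than copied from the live config.

```yaml
groups:
  - name: data-pipeline
    rules:
      - alert: KafkaConsumerLagHigh
        # 3 minutes of sustained breach, matching the 02:18 alert behavior
        expr: kafka_consumer_lag_sum{consumer_group="data-pipeline-analytics"} > 500000
        for: 3m
        labels:
          severity: page
        annotations:
          summary: "Consumer lag above 500k for 3m on {{ $labels.consumer_group }}"
```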
### Why We Didn't Detect Sooner
The 18-minute detection time (rollout start at 02:00 to alert at 02:18) had two components: consumer lag had to accumulate past the 500,000-message threshold after degradation began at 02:07, and the alert then required 3 minutes of sustained breach before firing. There were no kernel-level or TCP-level metrics in the monitoring stack — no alerts on `rcv_wnd` values, TCP retransmit rates, or socket buffer utilization that would have fired closer to the root cause.
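One low-cost way to close that gap is a node_exporter textfile collector that samples socket state directly. A sketch under assumptions: it presumes, as the timeline does, that this fleet's `ss -ti` output includes a `rcv_wnd:<bytes>` token per socket (field availability varies by iproute2 version), and the metric names are illustrative, not the ones from action item PM-024-04.

```python
# Hypothetical textfile collector: summarize advertised receive windows
# so an alert can fire on an abnormal fleet-wide drop in rcv_wnd.
import re
import subprocess

RCV_WND_RE = re.compile(r"rcv_wnd:(\d+)")

def parse_rcv_wnd(ss_output: str) -> list[int]:
    """Extract every advertised receive window (bytes) from `ss -ti` output."""
    return [int(m) for m in RCV_WND_RE.findall(ss_output)]

def to_textfile_metrics(windows: list[int]) -> str:
    """Render min/avg/max as Prometheus textfile-collector gauges."""
    if not windows:
        return ""
    return (
        f"tcp_rcv_wnd_bytes_min {min(windows)}\n"
        f"tcp_rcv_wnd_bytes_avg {sum(windows) / len(windows):.0f}\n"
        f"tcp_rcv_wnd_bytes_max {max(windows)}\n"
    )

if __name__ == "__main__":
    try:
        out = subprocess.run(["ss", "-ti"], capture_output=True, text=True).stdout
    except FileNotFoundError:  # ss is Linux-only; emit nothing elsewhere
        out = ""
    print(to_textfile_metrics(parse_rcv_wnd(out)), end="")
```

Written to node_exporter's textfile directory from cron, this would have made the anomalous window values visible at 02:07 instead of requiring manual `ss -ti` inspection at 02:33.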
## Response
### What Went Well
- Marcus correctly identified that the simultaneous degradation across all 9 Kafka brokers ruled out a single-node failure within the first 3 minutes of investigation, immediately raising the scope of the inquiry to fleet-level changes.
- Nkechi used `ss -ti` (TCP socket info) to inspect TCP window state at the kernel level — a diagnostic step that directly revealed the anomalous `rcv_wnd` values and pointed to a kernel-level issue. This level of diagnostic depth avoided spending time on application-layer debugging.
- The upstream kernel bug report was locatable within 5 minutes of suspecting the kernel, shortening root cause confirmation significantly.
- Communication with Security Operations for rollback authorization was smooth despite being an atypical request; Prashant had the authority and information to approve within 12 minutes.
### What Went Poorly
- The canary phase was waived without a cross-team risk assessment. A 24-hour canary on 5% of hosts is a low-cost safety net; waiving it should require documented acknowledgement of the increased risk from all affected teams.
- The team had no pre-deployment process for reviewing kernel changelogs or upstream bug reports before fleet-wide deployment. This information is publicly available and the regression bug was filed before the kernel was shipped to the fleet.
- The patching pipeline's health checks were insufficient — HTTP endpoint checks do not validate network-layer behavior. Adding `iperf3` or similar throughput regression tests during canary phases would catch this class of issue.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-024-01 | Add mandatory cross-team sign-off (Data Platform + Infrastructure) to canary-skip requests for kernel patches; document the operational risk explicitly before approval | P1 | Nkechi Adeyemi | In Progress | 2025-09-10 |
| PM-024-02 | Add `iperf3` cross-AZ throughput regression test to canary validation suite; must pass before fleet rollout proceeds (threshold: >85% of baseline throughput) | P1 | Infrastructure Engineering | Open | 2025-09-17 |
| PM-024-03 | Add pre-deployment step to patching runbook: query bugzilla.kernel.org and cve.mitre.org for the target kernel version; summarize open regressions in the patch ticket | P2 | Security Operations | Open | 2025-09-17 |
| PM-024-04 | Add TCP-level metrics to Prometheus: `node_netstat_TcpExt_TCPWantZeroWindowAdv`, `node_netstat_TcpExt_TCPToZeroWindowAdv`, and socket `rcv_wnd` histograms via node_exporter custom collectors | P2 | Nkechi Adeyemi | Open | 2025-09-24 |
| PM-024-05 | Schedule re-patching to kernel 5.15.150 (which contains the regression fix) with full canary phase; coordinate with Security Operations on timeline | P1 | Nkechi Adeyemi | Open | 2025-09-12 |
| PM-024-06 | Review Fluent Bit buffer capacity policy: confirm all fleet hosts have minimum 4-hour local buffer; add alert if buffer utilization exceeds 50% during normal operation | P3 | Data Platform | Open | 2025-10-01 |
## Lessons Learned
- Security urgency and operational risk assessment are not mutually exclusive. A critical CVE requires fast patching, but "fast" should mean canary-first within hours, not skip-canary and push fleet-wide. A canary phase of even 2-4 hours catches most regressions without materially increasing the CVE exposure window for internal infrastructure.
- Kernel regressions are publicly documented before they are widely deployed. Upstream kernel bugzillas and mailing lists often contain regression reports within hours of a release. Adding a lightweight pre-deployment check against public kernel bug databases is low-cost and can prevent fleet-wide rollouts of known-bad kernels.
- TCP-layer regressions are invisible to application-layer health checks. An HTTP 200 response tells you the application is alive; it does not tell you anything about socket buffer state, window scaling, or retransmit rates. For infrastructure that depends on high-throughput TCP (Kafka, data pipelines, log shippers), network-layer regression tests are a necessary part of the change validation suite.
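The second lesson can be made concrete with a small pre-deployment check. A sketch only: the endpoint shape follows the generic Bugzilla REST API, and the `quicksearch` parameter and response field names should be verified against bugzilla.kernel.org before this is relied on in the runbook.

```python
# Hypothetical pre-deployment check (action item PM-024-03): search the
# upstream Bugzilla for open reports that mention the candidate kernel.
import json
import urllib.parse
import urllib.request

def regression_reports(version: str, bugs: list[dict]) -> list[str]:
    """Filter fetched bug summaries down to ones naming the target version."""
    return [
        f"#{bug['id']}: {bug['summary']}"
        for bug in bugs
        if version in bug.get("summary", "")
    ]

def fetch_bugs(version: str) -> list[dict]:
    """Query bugzilla.kernel.org for bugs quick-matching the version string.
    (Assumed endpoint; confirm against the live API before use.)"""
    query = urllib.parse.urlencode({"quicksearch": version, "limit": "50"})
    url = f"https://bugzilla.kernel.org/rest/bug?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["bugs"]

# Offline demonstration using the bug from this incident's timeline:
sample = [{"id": 218374,
           "summary": "TCP window scaling regression in 5.15.149 for RTT > 10ms"}]
print(regression_reports("5.15.149", sample))
```

Run as part of the patch ticket workflow, this would have surfaced bug #218374, which was filed publicly before 5.15.149 reached the fleet.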
## Cross-References
- Failure Pattern: Software regression — kernel update introduces latent behavior change that degrades a specific traffic pattern without affecting general connectivity or application-layer health checks
- Topic Packs: Linux kernel TCP stack, TCP window scaling (RFC 1323), Kafka replication internals, kernel patching operations, canary deployment strategy
- Runbook: `runbooks/infrastructure/kernel-rollback-procedure.md`
- Decision Tree: Fleet-wide simultaneous throughput drop → correlate with recent fleet-level changes (kernel, config, package) → isolate to specific hosts or traffic patterns → kernel version diff → upstream bug search → rollback vs. workaround decision