Postmortem: Memory Leak in Log Shipping Agent Causes Fleet-Wide OOM Kills¶
| Field | Value |
|---|---|
| ID | PM-008 |
| Date | 2025-05-14 |
| Severity | SEV-2 |
| Duration | 52m (end to end: 08:15 deploy to 09:07 API recovery) |
| Time to Detect | 14m |
| Time to Mitigate | 52m |
| Customer Impact | API error rate reached 4.2% for 18 minutes; approximately 6,300 requests returned 502 or 504 |
| Revenue Impact | ~$5,800 estimated (failed API transactions, partner SLA credits) |
| Teams Involved | Observability Platform, Core Infrastructure, Python Services, Incident Command |
| Postmortem Author | Amara Osei-Bonsu |
| Postmortem Date | 2025-05-17 |
Executive Summary¶
On 2025-05-14, a newly deployed Python microservice (catalog-indexer) began emitting approximately 5% malformed JSON log lines due to a bug in its custom logger. These malformed lines triggered a memory leak in the fleet-wide Fluentd JSON parsing plugin, causing Fluentd DaemonSet pods across all 22 nodes to gradually exhaust their memory; 14 were OOM-killed by the Linux kernel before mitigation took effect. As Fluentd pods entered crash loops, the kubelet on affected nodes began experiencing memory pressure, leading to cascading pod evictions. The incident lasted 52 minutes end to end, from the 08:15 deployment to API recovery at 09:07. Because Fluentd pods had Burstable QoS (no memory limit set), they absorbed the OOM kills before Guaranteed-class customer-facing workloads were affected on most nodes.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 08:15 | catalog-indexer v1.4.0 deployed to production via ArgoCD; service uses a custom Python logger class that formats JSON manually |
| 08:17 | catalog-indexer begins emitting log lines; ~5% of lines are malformed (missing closing }) due to exception path in the custom logger not properly closing the JSON object |
| 08:17 | Fluentd pods on all 22 nodes begin processing catalog-indexer logs via the fluent-plugin-json-parse plugin (version 0.1.2, last updated 8 months prior) |
| 08:19 | Memory growth in Fluentd pods begins; plugin holds references to partially parsed JSON buffers that are never freed on parse failure |
| 08:31 | Fluentd pod on node-09 reaches 512 MiB RSS (no memory limit configured); kernel OOM killer selects Fluentd process |
| 08:33 | Fluentd on node-09 OOM-killed; pod enters CrashLoopBackOff; kubelet begins log-shipping backpressure on node-09 |
| 08:35 | Fluentd pods on node-03, node-11, node-17 OOM-killed within 90 seconds of each other |
| 08:36 | Memory pressure condition on node-03 triggers kubelet eviction; two Burstable-class pods evicted from node-03 |
| 08:37 | On 14 of 22 nodes, Fluentd pods are OOM-killed; kubelet memory pressure on 3 nodes triggers pod evictions |
| 08:38 | First 502 errors appear in API gateway logs as evicted pods are not immediately replaced |
| 08:39 | PagerDuty alert: API error rate exceeds 1% threshold (observed: 2.1%) |
| 08:40 | On-call engineer Rashida Mbeki acknowledges alert; opens Grafana |
| 08:42 | Rashida observes API errors correlate with pods on specific nodes; checks node health |
| 08:44 | kubectl get pods -A -o wide | grep OOMKilled shows 14 Fluentd pods OOM-killed; Rashida pages Observability Platform team |
| 08:45 | Amara Osei-Bonsu (Observability) joins; immediately suspects log plugin issue |
| 08:46 | Amara runs kubectl top pods -n logging; remaining 8 Fluentd pods showing 480–510 MiB memory, approaching OOM |
| 08:47 | Amara and Rashida agree on immediate mitigation: restart Fluentd DaemonSet with a temporary memory limit and identify the malformed log source |
| 08:49 | Amara applies kubectl set resources daemonset fluentd -n logging --limits=memory=256Mi — too low; Fluentd pods immediately OOM again |
| 08:51 | Amara raises limit to 600Mi and disables the JSON parse plugin temporarily by patching the ConfigMap |
| 08:53 | Fluentd pods begin restarting with updated config; JSON parse plugin disabled, memory stable |
| 08:55 | Rashida identifies catalog-indexer as source of malformed JSON by grepping raw logs on a node |
| 08:56 | Incident Commander Kwame Asante joins; decision to roll back catalog-indexer to v1.3.8 |
| 08:58 | catalog-indexer rollback initiated via ArgoCD |
| 09:00 | All Fluentd pods healthy; memory stabilizing at 80–120 MiB |
| 09:04 | catalog-indexer v1.3.8 running; malformed JSON emission stops |
| 09:07 | API error rate returns to baseline (<0.1%); all evicted pods rescheduled |
| 09:10 | All-clear declared |
| 09:30 | Post-incident: Fluentd plugin updated to v0.2.1 (upstream fix for this exact leak) in staging; validation begins |
Impact¶
Customer Impact¶
API error rate peaked at 4.2% and remained elevated from 08:38 to 09:07 UTC (approximately 29 minutes, with the peak sustained for roughly 18 minutes). Approximately 6,300 API requests returned 502 (upstream not available) or 504 (gateway timeout) during this window, based on gateway access logs. Affected endpoints included the product search API, catalog browsing, and the order submission endpoint. Three enterprise API partners received SLA-breach notifications (SLA threshold: 99.5% success rate per hour; all three dropped below 99% for the affected hour).
Internal Impact¶
Observability Platform spent 6 hours post-incident validating the updated Fluentd plugin in staging and rolling it to production. Core Infrastructure spent 2 hours auditing node memory headroom and eviction policies. Python Services spent 3 hours reviewing the custom logger and fixing the malformed JSON bug. Incident response: approximately 4 engineer-hours during the active window. Total: ~15 engineer-hours.
Data Impact¶
No customer data loss. Logs from the 52-minute incident window on the 14 affected nodes were partially lost (Fluentd buffers were not flushed before the OOM kills). Approximately 18 minutes of application logs from those nodes are missing from the centralized logging cluster. The gap is noted in the logging SLO tracking dashboard.
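Quantifying a gap like this amounts to scanning each node's ingested log timestamps for the largest hole. A minimal sketch, assuming epoch-second timestamps (the cluster's actual log format may differ):

```python
def largest_gap(timestamps, min_gap_s=60):
    """Return (start, end) of the largest gap between consecutive ingested
    log timestamps (epoch seconds), or None if no gap exceeds min_gap_s."""
    ts = sorted(timestamps)
    best = None
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev >= min_gap_s and (best is None or cur - prev > best[1] - best[0]):
            best = (prev, cur)
    return best
```

Run per node against the centralized logging cluster's index, this turns "logs were partially lost" into a concrete missing window that can be posted to the incident record.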
Root Cause¶
What Happened (Technical)¶
The catalog-indexer service v1.4.0 introduced a new exception handler that used a hand-rolled JSON serializer rather than the standard library's json.dumps(). The serializer opened a JSON object ({), wrote key-value pairs iteratively, and was supposed to close the object (}) after the final field. A missing finally block meant that if an exception occurred mid-serialization (which happened on ~5% of catalog entries that contained Unicode characters outside the BMP), the log line was emitted without the closing brace: {"level":"error","msg":"index failed","id":12345 instead of {"level":"error","msg":"index failed","id":12345}.
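The failure mode can be illustrated with a minimal sketch. The real serializer is not reproduced in this report, so all names here are hypothetical, and the escape routine's non-BMP failure is an assumption standing in for whatever actually raised:

```python
import io
import json

def _escape(value: str) -> str:
    # Hypothetical escape routine standing in for the real one; assume it
    # chokes on code points outside the Basic Multilingual Plane (> U+FFFF).
    for ch in value:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"cannot escape {ch!r}")
    return value.replace('"', '\\"')

def emit_buggy(fields: dict) -> str:
    """Hand-rolled serializer mirroring the reported bug: fields are written
    incrementally, the exception path swallows the error, and no finally
    block ever appends the closing brace."""
    buf = io.StringIO()
    buf.write("{")
    first = True
    try:
        for key, value in fields.items():
            if not first:
                buf.write(",")
            buf.write(f'"{key}":"{_escape(str(value))}"')
            first = False
        buf.write("}")  # only reached when every field serialized cleanly
    except ValueError:
        pass  # exception path: fall through and emit the partial line
    return buf.getvalue()

def emit_fixed(fields: dict) -> str:
    """The straightforward fix: serialize the whole object atomically with
    the standard library, which handles non-BMP characters without issue."""
    return json.dumps(fields)
```

On the ~95% of clean entries `emit_buggy` produces valid JSON; on an entry containing a non-BMP character it emits a line with no closing brace, exactly the shape Fluentd choked on.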
Fluentd's JSON parse plugin (version 0.1.2) uses a streaming parser that maintains a buffer of partially parsed JSON. When a line fails to parse, the plugin is supposed to discard the buffer and continue. A known bug in v0.1.2 (fixed in v0.2.0, released 6 months prior) causes the buffer reference to be retained in a Ruby hash that is never cleared, even after the parse failure is logged. Each malformed line leaks approximately 2–4 KiB of buffer memory. At catalog-indexer's log emission rate (~800 lines/minute across all instances, 5% malformed = ~40 leaking lines/minute), each Fluentd pod accumulated roughly 80–160 KiB of leaked memory per minute. After 14 minutes, pods had grown by 1.1–2.2 MiB — modest in absolute terms, but Fluentd pods were already running at 460–480 MiB RSS under normal load with no headroom.
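The per-pod leak arithmetic above is easy to sanity-check; a quick sketch using only the figures in this report:

```python
# Figures taken directly from the root-cause analysis above.
lines_per_min = 800            # catalog-indexer emission rate, all instances
malformed_fraction = 0.05      # ~5% of lines missing the closing brace
leak_per_line_kib = (2, 4)     # each malformed line leaks 2-4 KiB of buffer

malformed_per_min = lines_per_min * malformed_fraction          # ~40 lines/min
leak_per_min_kib = tuple(k * malformed_per_min for k in leak_per_line_kib)

minutes_to_first_oom = 14
growth_mib = tuple(k * minutes_to_first_oom / 1024 for k in leak_per_min_kib)
# ~1.1-2.2 MiB of leaked buffer after 14 minutes
```

The leaked amount alone is tiny; the outage came from pods already sitting at 460–480 MiB RSS on nodes with no headroom, so even a small push tipped them into the OOM killer's sights.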
Because no memory limit was set on the Fluentd DaemonSet, pods were classified as Burstable QoS by Kubernetes. The kernel OOM killer targets Burstable processes before Guaranteed ones when node memory is under pressure. This ordering meant Fluentd was killed first across most nodes, rather than customer-facing pods with Guaranteed QoS (explicitly set resource requests == limits). On three nodes, the OOM kills caused sufficient memory pressure that the kubelet's eviction manager also evicted some Burstable customer-facing pods before the Guaranteed workloads took over.
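The QoS ordering that saved most customer workloads can be captured in a simplified model. This sketch considers memory only; the real Kubernetes rules also consider CPU and apply per resource, per container:

```python
def qos_class(containers):
    """Simplified model of Kubernetes QoS classification (memory only).
    Guaranteed: every container has request == limit set.
    Burstable: at least one request or limit is set somewhere.
    BestEffort: nothing is set."""
    if containers and all(
        c.get("request") is not None and c.get("request") == c.get("limit")
        for c in containers
    ):
        return "Guaranteed"
    if any(c.get("request") or c.get("limit") for c in containers):
        return "Burstable"
    return "BestEffort"

# Fluentd as deployed: a request but no limit -> Burstable, so the kernel
# OOM killer sacrifices it before Guaranteed workloads under node pressure.
fluentd = [{"request": "256Mi", "limit": None}]
api_pod = [{"request": "512Mi", "limit": "512Mi"}]
```

This is why setting `requests == limits` on customer-facing pods, and leaving the limit off Fluentd, determined the kill order during the incident.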
Contributing Factors¶
- Fluentd plugin was 8 months out of date with a known upstream fix: The `fluent-plugin-json-parse` plugin had a known memory leak fix available in v0.2.0 (released 2024-11-08). The Observability Platform team's plugin update policy requires quarterly reviews, but the last review was skipped due to Q1 roadmap pressure. A dependency update bot (Renovate) was not configured for the Fluentd image or its plugins.
- No memory limit on the Fluentd DaemonSet: Fluentd had no `resources.limits.memory` set, only a `resources.requests.memory: 256Mi`. This meant the pods were Burstable and could consume arbitrary amounts of node memory before the OOM killer intervened. A memory limit would have caused Fluentd to OOM earlier and more predictably, and would have isolated the fault to the logging subsystem rather than triggering node-level memory pressure.
- The malformed JSON came from a custom logger instead of the standard library: The Python Services team's coding standards recommend using `structlog` with the standard `json` module for structured logging. The author of `catalog-indexer` v1.4.0 implemented a custom JSON serializer to add performance profiling fields, bypassing the standard library. No code review comment flagged this deviation. A linter rule or test asserting that all log output is valid JSON would have caught the bug before deployment.
What We Got Lucky About¶
- Fluentd's Burstable QoS classification meant it was OOM-killed before Guaranteed customer-facing pods on 19 of 22 nodes. On the 3 nodes where Guaranteed pods were eventually affected, the workloads rescheduled within 90 seconds on other nodes. The Kubernetes scheduler had sufficient capacity across the remaining healthy nodes to absorb the evicted pods without saturating any node.
- The upstream fix for this exact Fluentd plugin bug was already available and had been tested against the same Fluentd version in a community fork. The upgrade path was straightforward and validated in staging within 2 hours, allowing production rollout the same day.
Detection¶
How We Detected¶
PagerDuty alert on API error rate exceeding 1% (08:39 UTC), triggered by the API gateway's error rate metric in Datadog. This was 22 minutes after catalog-indexer began emitting malformed JSON (20 minutes after Fluentd memory began growing) and 8 minutes after the first OOM kill. The alert was the primary detection mechanism.
Why We Didn't Detect Sooner¶
Fluentd memory growth was not monitored with an alert. The Observability Platform team had a Grafana dashboard for Fluentd pod memory, but no alert was configured on it. Memory grew gradually over 14 minutes before the first OOM kill; a slow-ramp alert (e.g., memory growth rate > X MiB/min over 5 minutes) would have detected the issue before any pods were killed. The malformed JSON source was not detectable from existing log parsing error metrics because the plugin silently dropped malformed lines without incrementing a counter.
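The slow-ramp detector described above can be sketched over per-minute RSS samples; the threshold parameter is the `X` left open in the text, not a value the team has chosen:

```python
def memory_ramp_alert(samples_mib, window_min=5, max_mib_per_min=1.0):
    """Fire when memory grew faster than max_mib_per_min averaged over the
    trailing window. samples_mib holds one RSS reading per minute."""
    if len(samples_mib) < window_min + 1:
        return False  # not enough history to judge a trend yet
    growth = samples_mib[-1] - samples_mib[-1 - window_min]
    return growth / window_min > max_mib_per_min
```

The same logic expressed as a rate-of-change query in the team's metrics system (per AI-008-03) avoids running custom code; this sketch just shows the shape of the check.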
Response¶
What Went Well¶
- Rashida's pivot from "API errors" to "node-level OOM" was fast and correct. Checking `kubectl get pods -A | grep OOMKilled` as a first step after seeing node-correlated errors is good instinct and reflects the on-call runbook's guidance.
- The decision to disable the JSON parse plugin as a temporary mitigation (rather than trying to find and fix the malformed log source first) was the right call. It stopped the bleeding in 2 minutes. The root cause investigation could safely proceed after stabilization.
- ArgoCD rollback for `catalog-indexer` was available and fast (90 seconds to full rollback). The team did not attempt an in-place hotfix.
What Went Poorly¶
- The first memory limit Amara applied (256Mi) was too conservative and caused immediate re-OOM of the restarting Fluentd pods, adding 2 minutes of unnecessary churn. The correct limit (600Mi, based on observed stable-state RSS) should have been derived before applying the change.
- The Fluentd plugin was 8 months out of date. A Renovate or Dependabot configuration for the Fluentd Docker image and plugin manifest would have surfaced the available patch automatically.
- Log loss during the incident was not quantified until 4 hours after the all-clear. The team should have a defined procedure for auditing log gaps immediately after log-shipping incidents, as part of the incident runbook.
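Deriving a limit from observed stable-state RSS before applying it is a one-liner; a hedged sketch, where the 25% headroom factor is illustrative rather than a team standard:

```python
import math

def derive_memory_limit_mib(stable_rss_samples_mib, headroom=1.25):
    """Pick a memory limit from observed stable-state RSS samples: the worst
    observed value times a headroom factor, rounded up to a whole MiB."""
    return math.ceil(max(stable_rss_samples_mib) * headroom)
```

With the 460–480 MiB stable-state RSS reported above, this yields 600 MiB, the limit the team eventually applied, rather than the 256 MiB first guess that caused the re-OOM churn.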
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-008-01 | Update `fluent-plugin-json-parse` to v0.2.1 in production Fluentd image; add to Renovate config for automatic patch version updates | P0 | Observability Platform | Open | 2025-05-16 |
| AI-008-02 | Set `resources.limits.memory: 600Mi` on Fluentd DaemonSet; configure Kubernetes eviction priority to ensure Fluentd is evicted before all other workloads | P0 | Core Infrastructure | Open | 2025-05-16 |
| AI-008-03 | Add Grafana alert: Fluentd pod memory growth rate > 20 MiB/min over 3 minutes; page Observability on-call | P1 | Observability Platform | Open | 2025-05-21 |
| AI-008-04 | Add pre-commit lint rule in Python Services repos: run `python -c "import json; json.loads(line)"` on sampled log output in the test suite; fail if any line is malformed | P1 | Python Services | Open | 2025-05-28 |
| AI-008-05 | Add log gap audit step to incident runbook for log-shipping incidents: quantify missing log window and post to incident record | P2 | Observability Platform | Open | 2025-05-30 |
| AI-008-06 | Add Fluentd plugin parse error counter metric; alert if parse error rate exceeds 0.1% of ingested lines over 5 minutes | P2 | Observability Platform | Open | 2025-06-06 |
Lessons Learned¶
- Observability infrastructure is not exempt from resource limits: The instinct to leave monitoring agents unconstrained ("we don't want the log shipper to OOM") backfires when the agent has a bug. A memory limit causes a controlled, isolated failure. No limit causes node-level memory pressure that cascades to customer workloads. Set limits on all DaemonSet pods.
- A slow memory leak is harder to detect than a crash: A process that crashes immediately produces an alert. A process that leaks memory at 100 KiB/min for 14 minutes before dying looks healthy in most dashboards right up until the OOM kill. Rate-of-change alerts on memory are necessary to catch leaks before they become outages.
- Deviating from standard library tooling in hot paths needs review scrutiny: The root cause was a hand-rolled JSON serializer in a logging path. Logging code is high-frequency and high-consequence for observability. Code review should flag any deviation from `structlog`/`json.dumps()` in a logging context and require explicit justification.
Cross-References¶
- Failure Pattern: Memory Leak — Unbounded Accumulation from Malformed Input; Cascading OOM from Shared Infrastructure Component
- Topic Packs: Kubernetes Resource QoS and Eviction; Fluentd and Log Shipping Architecture; Python Structured Logging; DaemonSet Operations
- Runbook: `runbooks/observability/fluentd-oom-recovery.md`; `runbooks/kubernetes/oom-kill-triage.md`
- Decision Tree: OOM Kills on Node → Check DaemonSet pods first → Isolate to logging/monitoring tier → Disable plugin vs Restart pod → Identify malformed input source → Roll back emitting service