
Postmortem: Memory Leak in Log Shipping Agent Causes Fleet-Wide OOM Kills

ID: PM-008
Date: 2025-05-14
Severity: SEV-2
Duration: 52m (onset of the leak to all-clear)
Time to Detect: 22m (first malformed log lines at 08:17 to page at 08:39)
Time to Mitigate: 36m (onset to JSON parse plugin disabled and memory stable at 08:53)
Customer Impact: API error rate peaked at 4.2%; approximately 6,300 requests returned 502 or 504 during the 29-minute elevated-error window
Revenue Impact: ~$5,800 estimated (failed API transactions, partner SLA credits)
Teams Involved: Observability Platform, Core Infrastructure, Python Services, Incident Command
Postmortem Author: Amara Osei-Bonsu
Postmortem Date: 2025-05-17

Executive Summary

On 2025-05-14, a newly deployed Python microservice (catalog-indexer) began emitting approximately 5% malformed JSON log lines due to a bug in its custom logger. These malformed lines triggered a memory leak in the fleet-wide Fluentd JSON parsing plugin, causing all 22 Fluentd DaemonSet pods to gradually exhaust their memory and be OOM-killed by the Linux kernel. As Fluentd pods entered crash loops, the kubelet on affected nodes began experiencing memory pressure, leading to cascading pod evictions. The incident lasted 52 minutes from the onset of the leak to full fleet recovery. Because Fluentd pods have Burstable QoS (no memory limit set), they absorbed the OOM kills before Guaranteed-class customer-facing workloads were affected on most nodes.

Timeline (All times UTC)

08:15 catalog-indexer v1.4.0 deployed to production via ArgoCD; service uses a custom Python logger class that formats JSON manually
08:17 catalog-indexer begins emitting log lines; ~5% of lines are malformed (missing closing }) due to exception path in the custom logger not properly closing the JSON object
08:17 Fluentd pods on all 22 nodes begin processing catalog-indexer logs via the fluent-plugin-json-parse plugin (version 0.1.2, last updated 8 months prior)
08:19 Memory growth in Fluentd pods begins; plugin holds references to partially parsed JSON buffers that are never freed on parse failure
08:31 Fluentd pod on node-09 reaches 512 MiB RSS (no memory limit configured); kernel OOM killer selects Fluentd process
08:33 Fluentd on node-09 OOM-killed; pod enters CrashLoopBackOff; kubelet begins log-shipping backpressure on node-09
08:35 Fluentd pods on node-03, node-11, node-17 OOM-killed within 90 seconds of each other
08:36 Memory pressure condition on node-03 triggers kubelet eviction; two Burstable-class pods evicted from node-03
08:37 On 14 of 22 nodes, Fluentd pods are OOM-killed; kubelet memory pressure on 3 nodes triggers pod evictions
08:38 First 502 errors appear in API gateway logs as evicted pods are not immediately replaced
08:39 PagerDuty alert: API error rate exceeds 1% threshold (observed: 2.1%)
08:40 On-call engineer Rashida Mbeki acknowledges alert; opens Grafana
08:42 Rashida observes API errors correlate with pods on specific nodes; checks node health
08:44 kubectl get pods -A -o wide | grep OOMKilled shows 14 Fluentd pods OOM-killed; Rashida pages Observability Platform team
08:45 Amara Osei-Bonsu (Observability) joins; immediately suspects log plugin issue
08:46 Amara runs kubectl top pods -n logging; remaining 8 Fluentd pods showing 480–510 MiB memory, approaching OOM
08:47 Amara and Rashida agree on immediate mitigation: restart Fluentd DaemonSet with a temporary memory limit and identify the malformed log source
08:49 Amara applies kubectl set resources daemonset fluentd -n logging --limits=memory=256Mi — too low; Fluentd pods immediately OOM again
08:51 Amara raises limit to 600Mi and disables the JSON parse plugin temporarily by patching the ConfigMap
08:53 Fluentd pods begin restarting with updated config; JSON parse plugin disabled, memory stable
08:55 Rashida identifies catalog-indexer as source of malformed JSON by grepping raw logs on a node
08:56 Incident Commander Kwame Asante joins; decision to roll back catalog-indexer to v1.3.8
08:58 catalog-indexer rollback initiated via ArgoCD
09:00 All Fluentd pods healthy; memory stabilizing at 80–120 MiB
09:04 catalog-indexer v1.3.8 running; malformed JSON emission stops
09:07 API error rate returns to baseline (<0.1%); all evicted pods rescheduled
09:10 All-clear declared
09:30 Post-incident: Fluentd plugin updated to v0.2.1 (upstream fix for this exact leak) in staging; validation begins

Impact

Customer Impact

API error rate peaked at 4.2% during the elevated-error window, which lasted approximately 29 minutes (08:38–09:07 UTC). Approximately 6,300 API requests returned 502 (upstream unavailable) or 504 (gateway timeout) during this window, based on gateway access logs. Affected endpoints included the product search API, catalog browsing, and the order submission endpoint. Three enterprise API partners received SLA-breach notifications (SLA threshold: 99.5% success rate per hour; all three dropped below 99% for the affected hour).

Internal Impact

Observability Platform spent 6 hours post-incident validating the updated Fluentd plugin in staging and rolling it to production. Core Infrastructure spent 2 hours auditing node memory headroom and eviction policies. Python Services spent 3 hours reviewing the custom logger and fixing the malformed JSON bug. Incident response: approximately 4 engineer-hours during the active window. Total: ~15 engineer-hours.

Data Impact

No customer or business data was lost. Application logs from the 52-minute incident window on the 14 affected nodes were partially lost (Fluentd buffers were not flushed before the OOM kills). Approximately 18 minutes of application logs from those nodes are missing from the centralized logging cluster. The gap is noted in the logging SLO tracking dashboard.

Root Cause

What Happened (Technical)

The catalog-indexer service v1.4.0 introduced a new exception handler that used a hand-rolled JSON serializer rather than the standard library's json.dumps(). The serializer opened a JSON object ({), wrote key-value pairs iteratively, and was supposed to close the object (}) after the final field. A missing finally block meant that if an exception occurred mid-serialization (which happened on ~5% of catalog entries that contained Unicode characters outside the BMP), the log line was emitted without the closing brace: {"level":"error","msg":"index failed","id":12345 instead of {"level":"error","msg":"index failed","id":12345}.
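The failure mode is easy to reproduce in a few lines. The sketch below is a hypothetical reconstruction, not the actual catalog-indexer source: `encode_bmp_only` stands in for whatever raised on non-BMP entries, and the essential bug is that the closing brace is written only on the success path.

```python
import io
import json


def encode_bmp_only(value):
    """Illustrative stand-in for the custom encoder: rejects string values
    containing characters outside the Basic Multilingual Plane, as ~5% of
    catalog entries did."""
    if isinstance(value, str) and any(ord(ch) > 0xFFFF for ch in value):
        raise ValueError("non-BMP character")
    return json.dumps(value)


def emit_log_buggy(fields, stream, encode=encode_bmp_only):
    """Hypothetical reconstruction of the v1.4.0 bug: key-value pairs are
    written to the stream as they are encoded, and the closing brace is
    written only after the loop completes."""
    stream.write("{")
    try:
        first = True
        for key, value in fields.items():
            if not first:
                stream.write(",")
            first = False
            stream.write(f'"{key}":{encode(value)}')  # may raise mid-loop
        stream.write("}")  # skipped entirely if encode() raised above
    except ValueError:
        pass  # exception path swallows the error; the truncated line stands


def emit_log_fixed(fields, stream):
    """Safe form: json.dumps either returns a complete object or raises
    before anything reaches the stream."""
    stream.write(json.dumps(fields))
```

Delegating to the standard library, as `emit_log_fixed` does, removes the entire class of partial-output bugs rather than patching one code path.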

Fluentd's JSON parse plugin (version 0.1.2) uses a streaming parser that maintains a buffer of partially parsed JSON. When a line fails to parse, the plugin is supposed to discard the buffer and continue. A known bug in v0.1.2 (fixed in v0.2.0, released 6 months prior) causes the buffer reference to be retained in a Ruby hash that is never cleared, even after the parse failure is logged. Each malformed line leaks approximately 2–4 KiB of buffer memory. At catalog-indexer's log emission rate (~800 lines/minute across all instances, 5% malformed = ~40 leaking lines/minute), each Fluentd pod accumulated roughly 80–160 KiB of leaked memory per minute, so after 14 minutes the direct leak accounted for only 1.1–2.2 MiB of growth. That figure is modest in absolute terms, but Fluentd pods already ran at 460–480 MiB RSS under normal load on tightly provisioned nodes; the leak, on top of normal load-driven buffer fluctuation, was enough to push the busiest pods past the point at which the kernel's OOM killer intervened.
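The plugin is written in Ruby, but the retention pattern translates directly. The sketch below is a Python analogue of the v0.1.2 bug as described above, not the plugin's actual source: the failure path stashes the buffer in a hash that is never cleared.

```python
import json


class LeakyJSONParser:
    """Python analogue of the fluent-plugin-json-parse v0.1.2 leak (a sketch,
    not the actual Ruby source). On parse failure the buffer is stashed in a
    dict and never removed, so every malformed line pins its buffer for the
    lifetime of the process."""

    def __init__(self):
        self._failed = {}  # grows without bound: the leak
        self._seq = 0

    def parse(self, line):
        try:
            return json.loads(line)
        except ValueError:
            self._seq += 1
            self._failed[self._seq] = line  # retained, never cleared
            return None


class FixedJSONParser:
    """v0.2.0 behavior: the malformed buffer is dropped on failure, so the
    only reference dies with the local variable and the memory is freed."""

    def parse(self, line):
        try:
            return json.loads(line)
        except ValueError:
            return None
```

The fix is a one-line difference in the failure path; the cost of missing it was proportional to the malformed-line rate times uptime.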

Because no memory limit was set on the Fluentd DaemonSet, pods were classified as Burstable QoS by Kubernetes. The kernel OOM killer targets Burstable processes before Guaranteed ones when node memory is under pressure. This ordering meant Fluentd was killed first across most nodes, rather than customer-facing pods with Guaranteed QoS (explicitly set resource requests == limits). On three nodes, the OOM kills caused sufficient memory pressure that the kubelet's eviction manager also evicted some Burstable customer-facing pods before the Guaranteed workloads took over.
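The QoS classes follow mechanically from the resource stanzas. A minimal sketch (values illustrative apart from Fluentd's 256Mi request, which is stated under Contributing Factors):

```yaml
# Burstable: request only, no limit (Fluentd's configuration at incident time).
# The pod may grow until node-level memory pressure invokes the OOM killer.
resources:
  requests:
    memory: 256Mi

# Guaranteed: requests == limits for every resource (customer-facing workloads).
# Lowest OOM-kill priority; exceeding the limit kills only this container.
resources:
  requests:
    memory: 1Gi
    cpu: 500m
  limits:
    memory: 1Gi
    cpu: 500m
```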

Contributing Factors

  1. Fluentd plugin was 8 months out of date with a known upstream fix: The fluent-plugin-json-parse plugin had a known memory leak fix available in v0.2.0 (released 2024-11-08). The Observability Platform team's plugin update policy requires quarterly reviews but the last review was skipped due to Q1 roadmap pressure. A dependency update bot (Renovate) was not configured for the Fluentd image or its plugins.

  2. No memory limit on the Fluentd DaemonSet: Fluentd had no resources.limits.memory set, only a resources.requests.memory: 256Mi. This meant the pods were Burstable and could consume arbitrary amounts of node memory before the OOM killer intervened. A memory limit would have caused Fluentd to OOM earlier and more predictably, and would have isolated the fault to the logging subsystem rather than triggering node-level memory pressure.

  3. The malformed JSON came from a custom logger instead of the standard library: The Python Services team's coding standards recommend using structlog with the standard json module for structured logging. The author of catalog-indexer v1.4.0 implemented a custom JSON serializer to add performance profiling fields, bypassing the standard library. No code review comment flagged this deviation. A linter rule or test asserting that all log output is valid JSON would have caught the bug before deployment.
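The guardrail suggested in the third factor is a one-function test. A sketch, under the assumption that the test suite can sample the service's emitted log lines (the capture hook is service-specific and omitted here):

```python
import json


def assert_logs_are_valid_json(lines):
    """Fail the test suite if any sampled log line is not a complete JSON
    document. Sketch of the check proposed in action item AI-008-04."""
    bad = []
    for lineno, line in enumerate(lines, 1):
        try:
            json.loads(line)
        except ValueError:
            bad.append((lineno, line))
    assert not bad, f"malformed log lines: {bad!r}"
```

Run against output captured from the buggy serializer above, this fails on the first truncated line; it would have blocked the v1.4.0 deploy.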

What We Got Lucky About

  1. Fluentd's Burstable QoS classification meant it was OOM-killed before Guaranteed customer-facing pods on 19 of 22 nodes. On the 3 nodes where Guaranteed pods were eventually affected, the workloads rescheduled within 90 seconds on other nodes. The Kubernetes scheduler had sufficient capacity across the remaining healthy nodes to absorb the evicted pods without saturating any node.
  2. The upstream fix for this exact Fluentd plugin bug was already available and had been tested against the same Fluentd version in a community fork. The upgrade path was straightforward and validated in staging within 2 hours, allowing production rollout the same day.

Detection

How We Detected

PagerDuty alert on API error rate exceeding 1% (08:39 UTC), triggered by the API gateway's error rate metric in Datadog. This was 20 minutes after Fluentd began leaking memory (08:19) and 6 minutes after the first OOM kill (08:33). The alert was the primary detection mechanism.

Why We Didn't Detect Sooner

Fluentd memory growth was not monitored with an alert. The Observability Platform team had a Grafana dashboard for Fluentd pod memory, but no alert was configured on it. Memory grew gradually over 14 minutes before the first OOM kill; a slow-ramp alert (e.g., memory growth rate > X MiB/min over 5 minutes) would have detected the issue before any pods were killed. The malformed JSON source was not detectable from existing log parsing error metrics because the plugin silently dropped malformed lines without incrementing a counter.
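The slow-ramp detection described here (formalized in action item AI-008-03) reduces to a slope check over a sliding window of memory samples. A sketch in Python, with the caveat that the production version would be a Grafana/PromQL rule over container RSS metrics rather than in-process code:

```python
from collections import deque


class MemoryRampDetector:
    """Flag sustained memory growth above a threshold rate. Sketch of the
    AI-008-03 alert logic, assuming one RSS sample per minute."""

    def __init__(self, window_minutes=3, threshold_mib_per_min=20.0):
        self.window = window_minutes
        self.threshold = threshold_mib_per_min
        # Keep window+1 samples so the oldest and newest span `window` minutes.
        self.samples = deque(maxlen=window_minutes + 1)

    def observe(self, rss_mib):
        """Record a once-per-minute RSS sample; return True if mean growth
        over the window exceeds the threshold."""
        self.samples.append(rss_mib)
        if len(self.samples) <= self.window:
            return False  # not enough history yet
        rate = (self.samples[-1] - self.samples[0]) / self.window
        return rate > self.threshold
```

A detector like this alerts while the process still looks healthy, which is exactly the window a slow leak hides in.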

Response

What Went Well

  1. Rashida's pivot from "API errors" to "node-level OOM" was fast and correct. Checking kubectl get pods -A | grep OOMKilled as a first step after seeing node-correlated errors is good instinct and reflects the on-call runbook's guidance.
  2. The decision to disable the JSON parse plugin as a temporary mitigation (rather than trying to find and fix the malformed log source first) was the right call. It stopped the bleeding in 2 minutes. The root cause investigation could safely proceed after stabilization.
  3. ArgoCD rollback for catalog-indexer was available and fast (90 seconds to full rollback). The team did not attempt an in-place hotfix.

What Went Poorly

  1. The first memory limit Amara applied (256Mi) was too conservative and caused immediate re-OOM of the restarting Fluentd pods, adding 2 minutes of unnecessary churn. The correct limit (600Mi, based on observed stable-state RSS) should have been derived before applying the change.
  2. The Fluentd plugin was 8 months out of date. A Renovate or Dependabot configuration for the Fluentd Docker image and plugin manifest would have surfaced the available patch automatically.
  3. Log loss during the incident was not quantified until 4 hours after the all-clear. The team should have a defined procedure for auditing log gaps immediately after log-shipping incidents, as part of the incident runbook.

Action Items

AI-008-01 (P0, Observability Platform, open, due 2025-05-16): Update fluent-plugin-json-parse to v0.2.1 in the production Fluentd image; add the image and its plugin manifest to the Renovate config for automatic patch updates
AI-008-02 (P0, Core Infrastructure, open, due 2025-05-16): Set resources.limits.memory: 600Mi on the Fluentd DaemonSet; assign a low PriorityClass so Fluentd is evicted before customer-facing workloads
AI-008-03 (P1, Observability Platform, open, due 2025-05-21): Add Grafana alert: Fluentd pod memory growth rate > 20 MiB/min sustained over 3 minutes; page Observability on-call
AI-008-04 (P1, Python Services, open, due 2025-05-28): Add a test-suite check that runs json.loads() over sampled log output from each service and fails on any malformed line
AI-008-05 (P2, Observability Platform, open, due 2025-05-30): Add a log-gap audit step to the incident runbook for log-shipping incidents: quantify the missing log window and post it to the incident record
AI-008-06 (P2, Observability Platform, open, due 2025-06-06): Add a Fluentd plugin parse-error counter metric; alert if the parse error rate exceeds 0.1% of ingested lines over 5 minutes

Lessons Learned

  1. Observability infrastructure is not exempt from resource limits: The instinct to leave monitoring agents unconstrained ("we don't want the log shipper to OOM") backfires when the agent has a bug. A memory limit causes a controlled, isolated failure. No limit causes node-level memory pressure that cascades to customer workloads. Set limits on all DaemonSet pods.
  2. A slow memory leak is harder to detect than a crash: A process that crashes immediately produces an alert. A process that leaks memory at 100 KiB/min for 14 minutes before dying looks healthy in most dashboards right up until the OOM kill. Rate-of-change alerts on memory are necessary to catch leaks before they become outages.
  3. Deviating from standard library tooling in hot paths needs review scrutiny: The root cause was a hand-rolled JSON serializer in a logging path. Logging code is high-frequency and high-consequence for observability. Code review should flag any deviation from structlog / json.dumps() in a logging context and require explicit justification.

Cross-References

  • Failure Pattern: Memory Leak — Unbounded Accumulation from Malformed Input; Cascading OOM from Shared Infrastructure Component
  • Topic Packs: Kubernetes Resource QoS and Eviction; Fluentd and Log Shipping Architecture; Python Structured Logging; DaemonSet Operations
  • Runbook: runbooks/observability/fluentd-oom-recovery.md; runbooks/kubernetes/oom-kill-triage.md
  • Decision Tree: OOM Kills on Node → Check DaemonSet pods first → Isolate to logging/monitoring tier → Disable plugin vs Restart pod → Identify malformed input source → Roll back emitting service