Log Pipeline Footguns

  1. Running with unbounded memory buffers. You configure memory-only buffers with no size limit. A traffic spike or destination outage causes the agent to consume all available memory, triggering the OOM killer. Now you have no logs AND a dead agent. Fix: Always set Mem_Buf_Limit (Fluentbit), total_limit_size (Fluentd), or equivalent. Use file-backed buffers as overflow. Decide on an explicit overflow policy: block (safe) or drop (lossy but stable).

    Default trap: Fluentbit's Mem_Buf_Limit is unset (unlimited) by default. Fluentd's total_limit_size defaults to 512MB for memory buffers and 64GB for file buffers, which is better than unlimited but still large enough to hurt a small host during a downstream outage. Set explicit limits on day one, not after your first OOM kill.
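
    A minimal Fluentbit sketch of this setup (the paths, hostnames, and the 50MB/5G limits are illustrative assumptions, not recommendations):

    ```
    [SERVICE]
        # File-backed overflow buffer
        storage.path              /var/lib/fluent-bit/buffer
        storage.max_chunks_up     128

    [INPUT]
        Name              tail
        Path              /var/log/app/*.log
        # Cap in-memory buffering; the tail input pauses (blocks) when full
        Mem_Buf_Limit     50MB
        # Spill to disk instead of growing in RAM
        storage.type      filesystem

    [OUTPUT]
        Name                      forward
        Match                     *
        Host                      aggregator.internal
        # Cap the on-disk backlog too, or a long outage fills the disk
        storage.total_limit_size  5G
    ```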

  2. Not tracking file positions (offsets) across restarts. The agent restarts and either re-reads the entire log file (flooding the pipeline with duplicates) or starts from the end (losing everything written during the restart window). Fix: Always configure a position database: DB in Fluentbit, pos_file in Fluentd, data_dir in Vector. Verify it works by restarting the agent and checking that no duplicates or gaps appear in the destination.
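
    In Fluentbit, for example (the DB path is an assumption; the key point is that it must live on persistent storage, not tmpfs, or offsets vanish with the restart):

    ```
    [INPUT]
        Name     tail
        Path     /var/log/app/*.log
        # SQLite file tracking per-file offsets across restarts
        DB       /var/lib/fluent-bit/tail.db
        DB.sync  normal
    ```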

  3. Using regex parsing for everything. You write complex regex parsers for every log format. CPU usage on the agent climbs to 100% on one core and throughput drops to a fraction of what JSON parsing achieves. You scale the agent horizontally to compensate. Fix: Emit structured logs (JSON) from applications you control. Reserve regex parsing for third-party or system logs where you cannot change the format. Benchmark parsing performance and know your per-agent throughput ceiling.

    Scale note: JSON parsing in Fluentbit is 5-10x faster than regex parsing. On a single core, expect ~80K events/sec for JSON vs ~8K-15K for complex regex. If you control the application, switching from unstructured to JSON logs is the single biggest throughput win.
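
    The two approaches side by side, as Fluentbit parser definitions (these live in the parsers file referenced by Parsers_File in [SERVICE]; the regex and time formats are illustrative):

    ```
    # Fast path: application already emits JSON
    [PARSER]
        Name        app_json
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L

    # Slow path: reserve regex for formats you cannot change
    [PARSER]
        Name        legacy_app
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<level>[A-Z]+) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S
    ```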

  4. Ignoring multiline log entries. Java stack traces, Python tracebacks, and multi-line SQL queries each get split into separate log entries. A 20-line stack trace becomes 20 independent, useless log records. Fix: Configure multiline parsers at the input stage. Match the start pattern (timestamp) and continuation pattern (whitespace/indentation). Test with real stack traces, not just happy-path single-line logs.
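
    A sketch of a Fluentbit multiline parser for timestamp-prefixed logs with indented continuations (the patterns assume a `YYYY-MM-DD`-prefixed format; adjust to your real logs):

    ```
    [MULTILINE_PARSER]
        name          java_stack
        type          regex
        # Flush a pending record if no continuation arrives within 1s
        flush_timeout 1000
        # A line starting with a timestamp begins a new record...
        rule      "start_state"   "/^\d{4}-\d{2}-\d{2}/"   "cont"
        # ...indented lines (stack frames) are continuations
        rule      "cont"          "/^\s+/"                  "cont"

    [INPUT]
        Name              tail
        Path              /var/log/app/app.log
        multiline.parser  java_stack
    ```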

  5. No monitoring of the log pipeline itself. You have extensive monitoring for your application but zero visibility into the log pipeline. When the pipeline silently drops logs, you do not find out until an incident when you search for logs that do not exist. Fix: Monitor input rate, output rate, buffer usage, retry count, and drop count. Alert on output rate dropping below input rate for more than N minutes. Alert on buffer usage exceeding 80%.
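
    Fluentbit exposes these numbers through its built-in HTTP server; enabling it is a three-line change:

    ```
    [SERVICE]
        HTTP_Server  On
        HTTP_Listen  0.0.0.0
        HTTP_Port    2020
    ```

    Scrape /api/v1/metrics/prometheus and compare fluentbit_input_records_total against fluentbit_output_proc_records_total; a growing gap is your silent drop. Retries and drops show up as fluentbit_output_retries_total and fluentbit_output_dropped_records_total.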

  6. Setting retry_forever without backoff on all outputs. The destination is down and the agent retries as fast as possible, generating massive CPU and network load. When the destination recovers, it gets slammed with retry traffic and goes down again. Fix: Use exponential backoff with a max interval (retry_type exponential_backoff, retry_max_interval 60s). Only use retry_forever for critical log streams; for debug logs, set a finite retry limit and accept data loss.
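
    A hedged Fluentd sketch of the two policies (tags, host, and the retry_max_times value are illustrative assumptions):

    ```
    <match app.critical.**>
      @type forward
      <server>
        host aggregator.internal
      </server>
      <buffer>
        retry_type         exponential_backoff
        retry_max_interval 60s
        # Critical stream: never give up
        retry_forever      true
      </buffer>
    </match>

    <match app.debug.**>
      @type forward
      <server>
        host aggregator.internal
      </server>
      <buffer>
        retry_type         exponential_backoff
        retry_max_interval 60s
        # Debug stream: bounded retries, accept loss
        retry_max_times    10
      </buffer>
    </match>
    ```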

  7. Forgetting about log rotation compatibility. The log agent tails /var/log/app/app.log. Logrotate renames it to app.log.1 and creates a new app.log. The agent keeps reading the old file handle (the renamed file) and never picks up the new file. Fix: Configure Rotate_Wait and Refresh_Interval in the tail input, and ensure the agent detects inode changes so it follows the rename-and-create cycle. Logrotate's copytruncate avoids the rename but can silently lose lines written between the copy and the truncate, so prefer rename-based rotation when the agent handles it. Test rotation behavior explicitly.
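
    In Fluentbit, both knobs live on the tail input (the 30s/10s values are illustrative starting points):

    ```
    [INPUT]
        Name              tail
        Path              /var/log/app/app.log
        # Keep reading a renamed file for 30s so in-flight lines drain
        Rotate_Wait       30
        # Re-scan the Path pattern every 10s to pick up the new file
        Refresh_Interval  10
    ```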

  8. Sending all log levels to expensive hot storage. Debug and trace logs make up 80% of volume but are rarely queried. They all go to Elasticsearch hot nodes, driving up storage costs and degrading query performance for the logs that actually matter. Fix: Route logs by level or source. Send error and warn to hot storage with short retention. Send info to warm storage. Send debug to cold/archive storage (S3) or drop it in production entirely. Tag and route in the pipeline.

    Scale note: At scale, debug logs can cost 10-50x more than error logs in storage and indexing. A team sending 500GB/day of debug to Elasticsearch hot nodes at $0.10/GB/day spends $18K/year on logs nobody queries. Route by level and watch your bill drop.
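
    One way to tag and route in Fluentbit is the rewrite_tag filter (tags, hostnames, and bucket name are illustrative; this assumes records carry a parsed `level` field):

    ```
    [FILTER]
        Name    rewrite_tag
        Match   app.*
        # Retag debug/trace records; drop the original copy (false)
        Rule    $level ^(debug|trace)$ cold.$TAG false

    [OUTPUT]
        Name    es
        Match   app.*
        Host    es-hot.internal

    [OUTPUT]
        Name    s3
        Match   cold.*
        bucket  app-logs-archive
        region  us-east-1
    ```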

  9. Running Fluentd as a high-throughput aggregator without multi-worker. Fluentd's Ruby GIL means one worker thread for processing. At 30K+ events/sec, a single Fluentd worker maxes out and becomes the bottleneck. Adding more sources does not help. Fix: Enable workers N in the <system> block (Fluentd 1.x). Or switch the aggregation layer to Vector (Rust, natively multi-threaded). Or run multiple Fluentd instances behind a TCP load balancer.

    Under the hood: Ruby's Global Interpreter Lock (GIL) means only one thread executes Ruby code at a time, even on multi-core machines. Fluentd's workers N spawns N separate processes (not threads) to work around this. Vector avoids the problem entirely — Rust has no GIL and scales linearly across cores.
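
    The multi-worker fix is a small change to the Fluentd config (4 is illustrative; size to your cores):

    ```
    <system>
      # Spawns 4 worker processes (not threads), sidestepping the GIL
      workers 4
    </system>
    ```

    Note that not every plugin is multi-worker safe; pin those to a single worker with a `<worker 0>` block rather than disabling workers entirely.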

  10. Testing pipeline changes in production. You update a parser regex in production, reload the agent, and the new regex does not match. All logs from that source are now unparsed blobs, or worse, dropped due to parse errors. You find out when someone searches for recent logs and finds nothing. Fix: Validate config syntax before deploy (--dry-run / validate). Test with sample data using stdout output. Deploy to a canary node first. Monitor parse error rate after every pipeline change.
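
    The validation step for each agent, as CLI commands (config paths are illustrative):

    ```
    # Fluentbit: parse the config without starting the pipeline
    fluent-bit --dry-run --config /etc/fluent-bit/fluent-bit.conf

    # Fluentd: same idea
    fluentd --dry-run -c /etc/fluentd/fluent.conf

    # Vector: checks config syntax and component wiring
    vector validate /etc/vector/vector.toml
    ```

    Syntax validation only proves the config loads; it says nothing about whether your new regex matches real logs, which is what the stdout test and canary node are for.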