
Portal | Level: L2: Operations | Topics: Log Pipelines, Logging, Loki | Domain: Observability

Log Pipelines - Primer

Why This Matters

Logs are the exhaust of your infrastructure. Every process, service, and kernel event writes log data somewhere. The challenge is not generating logs — it is collecting them from hundreds of sources, parsing them into something queryable, routing them to the right destination, and doing it all without losing data or drowning your storage.

A log pipeline is the plumbing between "application wrote a line to a file" and "engineer runs a query in Grafana." If the pipeline is broken, you are blind. If the pipeline is slow, your incident response is slow. If the pipeline drops data, you cannot do forensics.

Analogy: A log pipeline is plumbing. Sources are faucets, buffers are water tanks, destinations are sinks. If the drain is slow (destination overloaded), water backs up into the tank (buffer fills). When the tank overflows, you either flood the house (block the app) or let water run onto the floor (drop logs). Good plumbing means right-sized pipes and overflow drains.


Log Pipeline Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Sources    │────▶│  Collection  │────▶│  Processing  │────▶│ Destinations │
│              │     │              │     │              │     │              │
│ App logs     │     │ Fluent Bit   │     │ Parse        │     │ Elasticsearch│
│ System logs  │     │ Fluentd      │     │ Filter       │     │ S3 / GCS     │
│ Container    │     │ Vector       │     │ Enrich       │     │ Loki         │
│ stdout/err   │     │ Filebeat     │     │ Route        │     │ Kafka        │
│ Syslog       │     │ Promtail     │     │ Buffer       │     │ Datadog      │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

Key Concepts

Concept              What It Means
Structured logging   JSON or key-value pairs vs free-text lines
Parsing              Extracting fields from unstructured text
Routing              Sending different logs to different destinations
Buffering            Holding logs in memory/disk when the destination is slow
Backpressure         What happens when the pipeline is full
At-least-once        Guarantee: logs may be duplicated but not lost
Exactly-once         Holy grail; rarely achievable in practice

Structured vs Unstructured Logs

Unstructured (Bad for Pipelines)

Mar 15 14:23:01 web1 nginx: 192.168.1.1 - - [15/Mar/2024:14:23:01 +0000] "GET /api/users HTTP/1.1" 200 1234

You need a regex to extract the IP, status code, and path. If the format changes, the regex breaks.

Structured (Good for Pipelines)

{
  "timestamp": "2024-03-15T14:23:01Z",
  "host": "web1",
  "service": "nginx",
  "client_ip": "192.168.1.1",
  "method": "GET",
  "path": "/api/users",
  "status": 200,
  "bytes": 1234
}

Fields are already extracted. No parsing needed. Every tool in the pipeline can work with it directly.

Make the Decision at the Source

Application code  ──▶  Write JSON logs  ──▶  Pipeline reads JSON  ──▶  No parsing needed
Application code  ──▶  Write free text  ──▶  Pipeline parses with regex  ──▶  Fragile, slow

If you control the application, always emit structured logs. Parsing is for things you do not control (system logs, third-party apps).

Remember: "Structure at the source, parse at the edge." The cheapest place to add structure to logs is in the application code (JSON output). The most expensive place is in the pipeline (regex parsing). Every regex parser is a maintenance burden and a latency cost. Convince developers to emit JSON and you eliminate an entire category of pipeline problems.
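Under that principle, emitting JSON from the application itself is usually a one-time change. Here is a minimal sketch using Python's stdlib logging; the JsonFormatter class and its field names are illustrative choices, not a standard:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, so the pipeline needs no parsing."""
    converter = time.gmtime  # timestamps in UTC to match the "Z" suffix

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("user logged in")  # prints one JSON line to stderr
```

Any collector in the pipeline can now apply its built-in JSON parser instead of a bespoke regex.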


Parsing Strategies

When you must parse, you have options:

Regex Parsing

# Nginx combined log format
^(?<client>[^ ]+) [^ ]+ (?<user>[^ ]+) \[(?<time>[^\]]+)\] "(?<method>\w+) (?<path>[^ ]+) [^"]+" (?<status>\d+) (?<bytes>\d+)

Pros: Flexible, handles any format. Cons: Slow, hard to maintain, breaks when format changes.

JSON Parsing

# Built into every log tool
# Fluent Bit: parser "json"
# Vector: codec = "json"

Pros: Fast, reliable. Cons: Only works if the source emits JSON.

Key-Value Parsing

# Input: user=alice action=login status=success duration=0.45s
# Output: {user: "alice", action: "login", status: "success", duration: "0.45s"}

Pros: Common in application logs (logfmt), easy to parse. Cons: No universally followed escaping convention for values containing spaces; logfmt quotes such values, but many emitters do not.
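A naive key-value parser fits in a few lines; this sketch deliberately ignores quoting, which is exactly the limitation noted above:

```python
def parse_kv(line: str) -> dict:
    """Naive logfmt-style parser: split on whitespace, then on the
    first '='. Values containing spaces would need quote handling,
    which this sketch does not attempt."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

parse_kv("user=alice action=login status=success duration=0.45s")
# -> {'user': 'alice', 'action': 'login', 'status': 'success', 'duration': '0.45s'}
```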

Delimiter / CSV Parsing

# Split on space, tab, comma, pipe
# Good for access logs with predictable formats

Grok Patterns (Logstash Heritage)

# Named patterns that compose
%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:status}

Pros: Readable, reusable. Cons: Still regex under the hood, still slow.


The Big Three: Fluent Bit, Fluentd, Vector

Fluent Bit

  • Written in C, tiny memory footprint (~2MB)
  • Ideal for edge collection (run on every node)
  • Config is INI-style (simple)
  • Limited transformation capabilities
  • Common in Kubernetes (DaemonSet)
# /etc/fluent-bit/fluent-bit.conf

[SERVICE]
    Flush        5
    Log_Level    info
    Daemon       off
    Parsers_File parsers.conf

[INPUT]
    Name         tail
    Path         /var/log/app/*.log
    Tag          app.*
    Parser       json
    Refresh_Interval 5
    Mem_Buf_Limit 10MB

[FILTER]
    Name         modify
    Match        app.*
    Add          hostname ${HOSTNAME}
    Add          environment production

[OUTPUT]
    Name         forward
    Match        *
    Host         fluentd-aggregator
    Port         24224

[OUTPUT]
    Name         es
    Match        app.*
    Host         elasticsearch
    Port         9200
    Index        app-logs
    Type         _doc

Fluentd

  • Written in Ruby + C
  • Rich plugin ecosystem (800+ plugins)
  • Good as an aggregator (central processing)
  • Higher memory usage than Fluent Bit
  • Config is XML-like (more verbose)
# /etc/fluentd/fluent.conf

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<filter app.**>
  @type parser
  key_name log
  <parse>
    @type json
  </parse>
</filter>

<filter app.**>
  @type record_transformer
  <record>
    cluster production-east
  </record>
</filter>

<match app.**>
  @type elasticsearch
  host elasticsearch.internal
  port 9200
  index_name app-logs
  <buffer>
    @type file
    path /var/log/fluentd/buffer/es
    flush_interval 5s
    chunk_limit_size 8MB
    total_limit_size 2GB
    retry_max_interval 30s
    overflow_action block
  </buffer>
</match>

Vector

  • Written in Rust, high performance
  • Single binary, does collection + aggregation
  • Config is TOML (or YAML)
  • Strong typing and transforms (VRL language)
  • Growing ecosystem, fewer plugins than Fluentd
# /etc/vector/vector.toml

[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]
read_from = "beginning"

[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '''
  . = parse_json!(.message)
  .timestamp = now()
  .hostname = get_hostname!()
'''

[transforms.filter_errors]
type = "filter"
inputs = ["parse_json"]
condition = '.level == "error" || .level == "fatal"'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoints = ["http://elasticsearch:9200"]
bulk.index = "app-logs-%Y-%m-%d"

[sinks.error_alerts]
type = "http"
inputs = ["filter_errors"]
uri = "https://alerts.example.com/webhook"
encoding.codec = "json"

Choosing Between Them

> **Who made it:** Fluentd was created by Sadayuki Furuhashi (known as "frsyuki") at Treasure Data in 2011. It became a CNCF Graduated project in 2019. Fluent Bit was created by Eduardo Silva at Treasure Data in 2015 as a lightweight C alternative for resource-constrained environments. Vector was created by Timber Technologies (now part of Datadog) in 2019, written in Rust for maximum throughput.

Need a lightweight edge agent?       → Fluent Bit
Need a rich plugin ecosystem?        → Fluentd
Need high throughput + transforms?   → Vector
Kubernetes DaemonSet?                → Fluent Bit (or Vector)
Central aggregation layer?           → Fluentd or Vector
Already in the Fluentd ecosystem?    → Fluent Bit → Fluentd
Greenfield deployment?               → Vector (modern, fast)

Buffering and Backpressure

The most critical part of a log pipeline is what happens when the destination is slow or down.

                    Normal Flow
Source ──▶ Buffer ──▶ Destination (fast)

                    Backpressure
Source ──▶ Buffer ──▶ Destination (slow/down)
          Buffer fills up
          What happens?

Option 1: Block    → Source slows down (safe, but app may stall)
Option 2: Drop     → Oldest or newest logs discarded (data loss)
Option 3: Overflow → Write to disk when memory buffer is full

Buffer Configuration Pattern

Memory buffer (fast, limited):
  - First tier, handles normal traffic
  - Set a cap (e.g., 64MB)

File buffer (slower, larger):
  - Overflow from memory buffer
  - Survives process restarts
  - Set a cap (e.g., 2GB)

When both are full:
  - Block: stop accepting new logs (protect data, risk app stall)
  - Drop: discard logs (protect app, lose data)
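The block-vs-drop choice can be sketched with a bounded queue. This is a toy model, not any real collector's API; the class and its counter are illustrative:

```python
import queue

class LogBuffer:
    """Toy in-memory buffer illustrating the two overflow policies above."""

    def __init__(self, capacity: int, policy: str = "block"):
        self.q = queue.Queue(maxsize=capacity)
        self.policy = policy
        self.dropped = 0

    def push(self, event: str, timeout: float = 1.0) -> bool:
        if self.policy == "block":
            # Backpressure: the caller stalls (up to `timeout`) until
            # space frees up -- the app slows down, but no data is lost.
            try:
                self.q.put(event, timeout=timeout)
                return True
            except queue.Full:
                return False
        # Drop policy: discard the newest event when full, but count it
        # so the loss is at least observable.
        try:
            self.q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False
```

The important detail is the `dropped` counter: dropping may be an acceptable policy, but dropping without counting never is.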

Routing and Tagging

Most pipelines use tags or labels to route logs to different destinations:

app.web.access  ──▶  Elasticsearch (hot, 7 days)
app.web.error   ──▶  Elasticsearch (hot, 30 days) + PagerDuty
infra.syslog    ──▶  S3 (cold, 365 days)
security.auth   ──▶  SIEM + S3 (compliance, 7 years)
debug.*         ──▶  /dev/null (in production)

This is what makes pipelines powerful — different log types get different treatment. Debug logs go to cheap storage (or nowhere). Security logs go to the SIEM with long retention. Application errors go to the place your team actually searches.
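A tag router like the one above can be sketched with shell-style glob matching; the route table and destination names here are hypothetical:

```python
from fnmatch import fnmatch

# First matching pattern wins; a tag may fan out to several destinations.
ROUTES = [
    ("app.web.error", ["elasticsearch-hot", "pagerduty"]),
    ("app.web.*",     ["elasticsearch-hot"]),
    ("infra.syslog",  ["s3-cold"]),
    ("security.*",    ["siem", "s3-compliance"]),
    ("debug.*",       []),  # dropped in production
]

def route(tag: str) -> list[str]:
    for pattern, destinations in ROUTES:
        if fnmatch(tag, pattern):
            return destinations
    return ["default-sink"]

route("app.web.error")  # -> ['elasticsearch-hot', 'pagerduty']
route("debug.trace")    # -> []
```

Note the ordering: the more specific `app.web.error` rule must precede the `app.web.*` wildcard, just as in real collectors' match blocks.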


Multiline Log Handling

Stack traces and multi-line log entries need special treatment:

# A Java stack trace is one logical entry split across many lines:
2024-03-15 14:23:01 ERROR NullPointerException
    at com.example.Service.process(Service.java:42)
    at com.example.Handler.handle(Handler.java:18)
    at java.lang.Thread.run(Thread.java:829)
# Fluent Bit multiline parser
[MULTILINE_PARSER]
    name          java_stack
    type          regex
    flush_timeout 1000
    rule          "start_state" "/^\d{4}-\d{2}-\d{2}/" "cont"
    rule          "cont"        "/^\s+(at|Caused|\.{3})/" "cont"

[INPUT]
    Name              tail
    Path              /var/log/app/app.log
    multiline.parser  java_stack

Without multiline handling, each line of a stack trace becomes a separate log entry — useless for debugging.
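The same grouping logic the Fluent Bit rules express (timestamped line starts an entry, indented lines continue it) can be sketched in a few lines of Python; the function name is ours:

```python
import re

START_RE = re.compile(r"^\d{4}-\d{2}-\d{2}")  # same start rule as above

def group_multiline(lines):
    """Yield logical entries: a new entry starts at a timestamped line;
    continuation lines (e.g. stack frames) attach to the current one."""
    entry = []
    for line in lines:
        if START_RE.match(line) and entry:
            yield "\n".join(entry)
            entry = []
        entry.append(line)
    if entry:
        yield "\n".join(entry)

raw = [
    "2024-03-15 14:23:01 ERROR NullPointerException",
    "    at com.example.Service.process(Service.java:42)",
    "    at com.example.Handler.handle(Handler.java:18)",
    "2024-03-15 14:23:02 INFO request handled",
]
entries = list(group_multiline(raw))
# len(entries) == 2; the stack trace stays attached to the ERROR line
```

Real collectors also need the flush timeout shown in the Fluent Bit config: a stream may end mid-entry, so the buffered entry must be emitted after a deadline rather than held forever.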


Metrics to Monitor Your Pipeline

A log pipeline without monitoring is a log pipeline you will not know is broken:

- Input rate (events/sec)
- Output rate (events/sec)
- Buffer usage (bytes, % full)
- Retry count (destination failures)
- Drop count (lost events)
- Parse error count (bad format)
- Latency (time from input to destination)

When input rate > output rate persistently, your buffer is filling. Act before it overflows.

Gotcha: The most dangerous log pipeline failure is silent data loss. If your pipeline drops logs without alerting, you will not know until an incident investigation comes up empty. Always monitor the drop count metric and alert on it. A pipeline that blocks (slows the app) is safer than one that silently drops -- at least you notice the slowdown.

