
Pattern: Unstructured Logging

ID: FP-045
Family: Observability Gap
Frequency: Very Common
Blast Radius: Entire service (incident response degraded)
Detection Difficulty: Subtle

The Shape

Logs written as free-form strings ("Processing order 12345 for user 67890") cannot be reliably parsed, aggregated, or alerted on. During an incident, engineers must grep through gigabytes of text. Aggregation is impossible (how many orders failed?). Alerting on log content requires fragile regex. Correlation across services is manual. Structured logs (JSON with consistent field names) enable log-based alerting, aggregation, and correlation — turning logs from an incident debugger into an observability system.

How You'll See It

In Kubernetes

# Unstructured (can't alert, aggregate, or correlate):
2024-01-15 10:23:45 ERROR Failed to process payment for order 12345 user 67890 error: timeout

# Structured (alertable, aggregatable, correlatable):
{"timestamp":"2024-01-15T10:23:45Z","level":"ERROR","event":"payment_failed","order_id":"12345","user_id":"67890","error":"timeout","duration_ms":5000,"trace_id":"abc123"}
With structured logs, count where level=ERROR and event=payment_failed group by error is a single query. With unstructured logs, the same question requires a regex that is brittle and breaks whenever someone rewords the log message.
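The structured line above can be produced with nothing beyond the standard library. A minimal sketch, using stdlib logging with a custom JSON formatter (structlog for Python or zap for Go would do the same with less ceremony); the field list is illustrative:

```python
import json
import logging
import time

class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON line with consistent field names."""

    # Illustrative field set; an org-wide schema would define this list.
    STRUCTURED_FIELDS = ("order_id", "user_id", "error", "duration_ms", "trace_id")

    def format(self, record):
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            # Prefer an explicit machine-readable event name over the message text.
            "event": getattr(record, "event", record.getMessage()),
        }
        # Merge any structured fields passed via the `extra=` dict.
        for key in self.STRUCTURED_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "payment failed",  # human-readable message; queries match on the fields below
    extra={"event": "payment_failed", "order_id": "12345", "user_id": "67890",
           "error": "timeout", "duration_ms": 5000, "trace_id": "abc123"},
)
```

Note the split: the message string stays readable for humans, while every queryable value lives in its own field.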

In Linux/Infrastructure

500GB of /var/log/app.log with free-form text. During an incident: grep "ERROR" app.log | tail -100 shows 100 lines; no context about frequency, affected users, or correlation. With structured logs to a log aggregation system (Loki, Elasticsearch): instant count, filter by user, group by error type.
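What the aggregation system buys you can be sketched in a few lines. The sample log lines below are hypothetical, and in production the equivalent query would run inside Loki or Elasticsearch rather than in application code:

```python
import json
from collections import Counter

# Hypothetical structured log lines (in practice, streamed from the aggregator).
log_lines = [
    '{"level":"ERROR","event":"payment_failed","error":"timeout","user_id":"1"}',
    '{"level":"ERROR","event":"payment_failed","error":"timeout","user_id":"2"}',
    '{"level":"ERROR","event":"payment_failed","error":"card_declined","user_id":"3"}',
    '{"level":"INFO","event":"payment_ok","user_id":"4"}',
]

records = [json.loads(line) for line in log_lines]
failed = [r for r in records if r.get("event") == "payment_failed"]

by_error = Counter(r["error"] for r in failed)   # group by error type
affected_users = {r["user_id"] for r in failed}  # distinct affected users

print(dict(by_error))       # error breakdown
print(len(affected_users))  # affected-user count
```

Every question from the grep scenario — frequency, affected users, error breakdown — falls out of a filter, a group-by, and a distinct count, none of which is possible against free-form text.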

In CI/CD

CI build logs are unstructured text. Test failures are lines in a multi-megabyte log file. No structured output means no automated failure categorization, no trend analysis on flaky tests, no alerting on increased failure rates.
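One way to make CI output trendable is to emit one JSON record per test alongside the human-readable log. A sketch; the field names are illustrative, not any CI system's standard:

```python
import json

def emit_test_record(name, status, duration_ms, branch="main"):
    """Emit one JSON line per test so CI tooling can categorize failures,
    trend flaky tests, and alert on rising failure rates.

    Hypothetical schema: "status" is one of "passed", "failed", "flaky_retry".
    """
    return json.dumps({
        "event": "test_finished",
        "test": name,
        "status": status,
        "duration_ms": duration_ms,
        "branch": branch,
    })

# One line per test; a dashboard can now group by test name and status.
print(emit_test_record("test_checkout_total", "failed", 412))
print(emit_test_record("test_login", "passed", 38))
```

With this in place, "which tests flaked most this month?" becomes a group-by over the test field instead of an archaeology session in multi-megabyte log files.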

The Tell

Incident postmortem contains: "we spent 2 hours grepping through logs." Log aggregation queries require fragile regex patterns. Correlation between services requires manually matching timestamps in different log files. No log-based alerts exist (because alerts require parseable fields).

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Insufficient log volume | Insufficient log structure | More logs don't help if you can't query them efficiently |
| Application working fine | Issues invisible in unstructured logs | Add structured logging; errors that were invisible become quantifiable |
| Logging infrastructure insufficient | Log format insufficient | The infrastructure is fine; the format is unqueryable |

The Fix (Generic)

  1. Immediate: For the most critical service, switch to structured JSON logging: use a logging library (structlog for Python, zap for Go, winston for Node.js) that outputs JSON by default.
  2. Short-term: Add minimum required fields to all log lines: timestamp, level, event, trace_id (for correlation), and domain-specific fields (order_id, user_id).
  3. Long-term: Define an organization-wide log schema; enforce it with log validation in CI; create standard Loki/Elasticsearch queries for common incident patterns.
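The CI enforcement in step 3 can be as simple as a validator that rejects log lines missing the required fields from step 2. A minimal sketch, assuming the field names above:

```python
import json

# Minimum schema from step 2; domain-specific fields are allowed on top.
REQUIRED_FIELDS = {"timestamp", "level", "event", "trace_id"}

def validate_log_line(line):
    """Return a list of problems with one log line (empty list = valid).

    A CI check could run this over sample output captured from each
    service and fail the build on any violation.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    missing = REQUIRED_FIELDS - record.keys()
    return [f"missing field: {f}" for f in sorted(missing)]

good = ('{"timestamp":"2024-01-15T10:23:45Z","level":"ERROR",'
        '"event":"payment_failed","trace_id":"abc123"}')
bad = "2024-01-15 10:23:45 ERROR Failed to process payment for order 12345"
print(validate_log_line(good))  # valid: empty list
print(validate_log_line(bad))   # free-form text fails immediately
```

Because the check runs in CI, a service cannot quietly regress to free-form logging without breaking its build.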

Real-World Examples

  • Example 1: E-commerce incident: "how many orders failed in the last hour?" With unstructured logs: 2 engineers, 30 minutes, fragile awk command. With structured logs: single Kibana query, 10 seconds.
  • Example 2: Authentication failures were invisible because the log message format had changed 6 months earlier. The alert was looking for "auth failed" but the new message was "authentication error." Structured logging with an event: auth_failure field doesn't break when message text changes.
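Example 2's failure mode is easy to demonstrate: an alert predicate matched on the stable event field survives any rewording of the message text. A sketch (the predicate and field names are hypothetical, mirroring Example 2):

```python
import json

def should_alert(line):
    """Alert on the machine-readable `event` field, never the message text,
    so rewording the human-readable message cannot silently break the alert."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False  # free-form lines can't be reliably matched at all
    return record.get("event") == "auth_failure"

# Message wording changed six months apart; the alert still fires on both.
old = '{"event":"auth_failure","message":"auth failed","user_id":"1"}'
new = '{"event":"auth_failure","message":"authentication error","user_id":"1"}'
print(should_alert(old), should_alert(new))
```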

War Story

3am page: payment service errors. "How bad is it?" We didn't know. Our logs were giant strings. We ran grep -c "ERROR" app.log — 15,000 errors in the last hour. Were they all the same? Different? From one user or all users? grep "payment failed" | sort | uniq -c | sort -rn — took 4 minutes to run on 20GB. We made decisions based on vibes. Next week we switched to structlog (Python); every log line was JSON. Same incident 3 months later: Loki query in 2 seconds, breakdown by error type, affected user count, average duration. Resolved in 15 minutes instead of 90.

Cross-References