Pattern: Unstructured Logging¶
ID: FP-045 Family: Observability Gap Frequency: Very Common Blast Radius: Entire service (incident response degraded) Detection Difficulty: Subtle
The Shape¶
Logs written as free-form strings ("Processing order 12345 for user 67890") cannot
be reliably parsed, aggregated, or alerted on. During an incident, engineers must
grep through gigabytes of text. Aggregation is impossible (how many orders failed?).
Alerting on log content requires fragile regex. Correlation across services is manual.
Structured logs (JSON with consistent field names) enable log-based alerting, aggregation,
and correlation — turning logs from an incident debugger into an observability system.
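The aggregation gap described above can be sketched in a few lines. This is a minimal illustration with hypothetical log lines (stdlib only), not any particular vendor's query language:

```python
import json
from collections import Counter

# Two hypothetical structured log lines for the same class of failure.
structured = [
    '{"level":"ERROR","event":"payment_failed","order_id":"12345","error":"timeout"}',
    '{"level":"ERROR","event":"payment_failed","order_id":"12346","error":"card_declined"}',
]

# "How many orders failed, broken down by error?" is trivial once every
# line parses as JSON with stable field names:
records = [json.loads(line) for line in structured]
by_error = Counter(r["error"] for r in records if r["event"] == "payment_failed")
print(dict(by_error))  # {'timeout': 1, 'card_declined': 1}
```

The same question against free-form strings requires a regex tuned to the exact message wording, which is precisely what makes it brittle.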
How You'll See It¶
In Kubernetes¶
```text
# Unstructured (can't alert, aggregate, or correlate):
2024-01-15 10:23:45 ERROR Failed to process payment for order 12345 user 67890 error: timeout

# Structured (alertable, aggregatable, correlatable):
{"timestamp":"2024-01-15T10:23:45Z","level":"ERROR","event":"payment_failed","order_id":"12345","user_id":"67890","error":"timeout","duration_ms":5000,"trace_id":"abc123"}
```
With structured logs, `count where level=ERROR and event=payment_failed group by error` is
a single query. With unstructured logs, the same question requires a regex that is brittle
and breaks the moment someone rewords the log message.
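The failure mode is easy to demonstrate. In this sketch (hypothetical messages and pattern), a regex alert written against the old wording silently stops matching after a reword, while a field-based check keys on a stable identifier:

```python
import json
import re

# A hypothetical regex alert written against the old message wording:
alert_pattern = re.compile(r"Failed to process payment")

old_line = "2024-01-15 10:23:45 ERROR Failed to process payment for order 12345"
new_line = "2024-01-15 10:23:45 ERROR Payment processing error for order 12345"  # reworded

print(bool(alert_pattern.search(old_line)))  # True
print(bool(alert_pattern.search(new_line)))  # False: the alert silently goes dark

# The structured equivalent keys on a stable field, not on prose:
structured_line = '{"level":"ERROR","event":"payment_failed","order_id":"12345"}'
record = json.loads(structured_line)
print(record["event"] == "payment_failed")  # True regardless of human-readable text
```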
In Linux/Infrastructure¶
500GB of `/var/log/app.log` filled with free-form text. During an incident, `grep "ERROR" app.log | tail -100`
shows 100 lines with no context about frequency, affected users, or correlation. With structured
logs shipped to a log aggregation system (Loki, Elasticsearch): instant counts, filter by user, group
by error type.
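Emitting structured lines doesn't require a new dependency to start. A minimal sketch using only the standard library follows; the `JsonFormatter` class and its field list are illustrative, not a specific library's API (production setups typically reach for structlog or similar instead):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Carry through structured fields passed via the `extra=` mechanism.
        for key in ("order_id", "user_id", "error", "trace_id", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Emits one JSON line, queryable by event, order_id, and error:
logger.error("payment_failed", extra={"order_id": "12345", "error": "timeout"})
```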
In CI/CD¶
CI build logs are unstructured text. Test failures are lines in a multi-megabyte log file. No structured output means no automated failure categorization, no trend analysis on flaky tests, no alerting on increased failure rates.
The Tell¶
Incident postmortem contains: "we spent 2 hours grepping through logs." Log aggregation queries require fragile regex patterns. Correlation between services requires manually matching timestamps in different log files. No log-based alerts exist (because alerts require parseable fields).
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Insufficient log volume | Insufficient log structure | More logs don't help if you can't query them efficiently |
| Application working fine | Issues invisible in unstructured logs | Add structured logging; errors that were invisible become quantifiable |
| Logging infrastructure insufficient | Log format insufficient | The infrastructure is fine; the format is unqueryable |
The Fix (Generic)¶
- Immediate: For the most critical service, switch to structured JSON logging: use a logging library (structlog for Python, zap for Go, winston for Node.js) that outputs JSON by default.
- Short-term: Add minimum required fields to all log lines: `timestamp`, `level`, `event`, `trace_id` (for correlation), and domain-specific fields (`order_id`, `user_id`).
- Long-term: Define an organization-wide log schema; enforce it with log validation in CI; create standard Loki/Elasticsearch queries for common incident patterns.
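The CI-enforcement step can start as small as a line validator. A minimal sketch, assuming a hypothetical required-field set of `timestamp`, `level`, `event`, and `trace_id`:

```python
import json

# Hypothetical minimum schema; real schemas are organization-specific.
REQUIRED_FIELDS = {"timestamp", "level", "event", "trace_id"}

def validate_log_line(line: str) -> list[str]:
    """Return a list of problems with the line; an empty list means it conforms."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    missing = REQUIRED_FIELDS - record.keys()
    return [f"missing field: {name}" for name in sorted(missing)]

good = '{"timestamp":"2024-01-15T10:23:45Z","level":"ERROR","event":"payment_failed","trace_id":"abc123"}'
bad = "ERROR something went wrong"
print(validate_log_line(good))  # []
print(validate_log_line(bad))   # ['not valid JSON']
```

A CI job could run every log statement's sample output through this check and fail the build on violations, which is how the schema stays enforced rather than aspirational.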
Real-World Examples¶
- Example 1: E-commerce incident: "how many orders failed in the last hour?" With unstructured logs: 2 engineers, 30 minutes, fragile awk command. With structured logs: single Kibana query, 10 seconds.
- Example 2: Authentication failures were invisible because the log message format had changed 6 months earlier. The alert was looking for "auth failed" but the new message was "authentication error." Structured logging with an `event: auth_failure` field doesn't break when the message text changes.
War Story¶
3am page: payment service errors. "How bad is it?" We didn't know. Our logs were giant strings. We ran
`grep -c "ERROR" app.log`: 15,000 errors in the last hour. Were they all the same? Different? From one user or all users? `grep "payment failed" app.log | sort | uniq -c | sort -rn` took 4 minutes to run on 20GB. We made decisions based on vibes. The next week we switched to structlog (Python); every log line became JSON. Same incident 3 months later: Loki query in 2 seconds, breakdown by error type, affected user count, average duration. Resolved in 15 minutes instead of 90.
Cross-References¶
- Topic Packs: observability-deep-dive
- Footguns: observability-deep-dive/footguns.md — "Logs with no structure"
- Related Patterns: FP-043 (percentile blindness — complementary observability gap), FP-042 (missing absent alert — another monitoring blind spot)