---
tags:
  - observability
  - l2
  - runbook
  - log-pipelines
  - loki
---
# Runbook: Log Pipeline Backpressure / Logs Not Appearing
| Field | Value |
|---|---|
| Domain | Observability |
| Alert | loki_ingester_blocks_per_chunk_sum > threshold or logs missing in Grafana for >5 minutes |
| Severity | P2 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, access to logging namespace (Loki/Promtail/Fluentd), Grafana Loki data source access |
## Quick Assessment (30 seconds)
```bash
# Run this first — it tells you the scope of the problem
kubectl get pods -n logging && kubectl logs -n logging daemonset/promtail --tail=20 2>/dev/null || kubectl logs -n logging daemonset/fluentd --tail=20 2>/dev/null
```
## Step 1: Confirm Whether Logs Are Missing for All Services or Just One
Why: If only one service's logs are missing, the problem is likely that service's log format or labels. If all service logs are missing, the problem is the shared log pipeline — Loki, the shipper daemon, or storage.
```bash
# In Grafana, open Explore → select the Loki data source
# Query for logs from a known healthy service:
#   {namespace="kube-system", app="coredns"}
# Then query for logs from the affected service:
#   {namespace="<AFFECTED_NAMESPACE>"}

# Or query Loki directly via API
kubectl port-forward -n logging svc/loki 3100:3100 &
curl -s -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="kube-system"}' \
  --data-urlencode "start=$(date -d '5 minutes ago' +%s%N)" \
  --data-urlencode "end=$(date +%s%N)" | python3 -m json.tool | grep '"status"'
```
Expected: If kube-system returns results but the affected namespace does not, the problem is isolated to that service. If neither namespace returns results, the whole pipeline is broken.
If this fails: If port-forward fails, check the Loki service name: `kubectl get svc -n logging | grep loki`
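Getting the nanosecond start/end parameters right is the most common stumbling block with query_range. A small helper makes the window explicit (a sketch; `loki_window` is a hypothetical name, and it assumes a shell with 64-bit arithmetic):

```shell
# Print start=/end= query_range parameters for "N minutes ago" to now.
# Loki expects Unix timestamps in nanoseconds.
loki_window() {
  local minutes_ago=$1
  local now
  now=$(date +%s)
  echo "start=$(( now - minutes_ago * 60 ))000000000"
  echo "end=${now}000000000"
}

# Usage: pass each printed line to curl via --data-urlencode
loki_window 5
```

Appending nine zeros to whole seconds is enough here; Loki accepts any nanosecond timestamp inside the query window.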
## Step 2: Check Whether Logs Are Appearing in Grafana at All
Why: Grafana may have a stale or misconfigured Loki data source. Verifying whether any label returns results isolates whether the problem is Grafana → Loki or Loki → storage.
```bash
# Port-forward to Grafana and open the Explore page
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Navigate to Explore → select Loki data source → run: {job=~".+"}
# This query matches any log stream that has a job label

# Check Grafana's Loki data source config
curl -s -H "Authorization: Bearer <GRAFANA_API_KEY>" \
  http://localhost:3000/api/datasources | python3 -m json.tool | grep -A5 '"type": "loki"'
```
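To check the data source config without clicking through the UI, the `/api/datasources` response can be filtered for Loki entries. A sketch (the `loki_datasource_urls` helper is hypothetical; it assumes `python3` is available and the API key can read data sources):

```shell
# Print the URL of every Loki-type data source from Grafana's
# /api/datasources JSON response read on stdin.
loki_datasource_urls() {
  python3 -c '
import json, sys
for ds in json.load(sys.stdin):
    if ds.get("type") == "loki":
        print(ds.get("url", ""))
'
}

# Usage:
# curl -s -H "Authorization: Bearer <GRAFANA_API_KEY>" \
#   http://localhost:3000/api/datasources | loki_datasource_urls
```

If the printed URL does not match the Loki service you port-forwarded in Step 1, the data source is pointing at the wrong backend.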
## Step 3: Check Log Shipper (Promtail/Fluentd/Vector) Pod Status and Logs
Why: The log shipper (a DaemonSet running on every node) is responsible for collecting logs and forwarding them to Loki. If the shipper is crashing or logging errors, logs are being dropped at the source before they ever reach Loki.
```bash
# Check all log shipper pods
kubectl get pods -n logging -o wide
# Check DaemonSet rollout status
kubectl rollout status daemonset/promtail -n logging
# Get recent logs from the shipper — look for errors, backpressure, or 429/503 responses
kubectl logs -n logging daemonset/promtail --tail=100 | grep -E 'error|Error|ERRO|429|503|backpressure|dropped|failed'
# If using Fluentd instead:
kubectl logs -n logging daemonset/fluentd --tail=100 | grep -E 'error|Error|warn|Warn|retry|retry_count'
# Check if specific node shippers are failing
kubectl get pods -n logging -o wide | grep -v Running
```
Healthy Promtail output looks like:

```text
level=info msg="Tailing new file" path=/var/log/pods/production_myapp-.../myapp/0.log
level=info msg="successfully sent entries" url=http://loki:3100
```

Unhealthy output contains 429 Too Many Requests responses and warnings such as "dropping entry", "channel is full", or "retry queue full".
If this fails: If all DaemonSet pods are in CrashLoopBackOff, check the shipper configuration ConfigMap for syntax errors: `kubectl describe configmap -n logging promtail-config`
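The grep patterns above can be condensed into a quick tally. This sketch (the `shipper_error_summary` function is hypothetical) reads shipper logs on stdin and counts the failure signatures that matter for this runbook:

```shell
# Count rate-limit (429), server error (5xx), and dropped-entry
# signatures in log shipper output read on stdin.
shipper_error_summary() {
  awk '
    /429/     { r429++ }
    /50[023]/ { r5xx++ }
    /drop/    { drops++ }
    END { printf "429=%d 5xx=%d dropped=%d\n", r429, r5xx, drops }
  '
}

# Usage:
# kubectl logs -n logging daemonset/promtail --tail=500 | shipper_error_summary
```

A high 429 count points to Step 6 (rate limiting); 5xx counts point to Step 4 (Loki itself).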
## Step 4: Check Loki Ingester Status and Pod Health
Why: Loki's ingesters are the write path — they receive log streams from shippers and buffer them before writing to storage. If ingesters are overloaded, crashing, or restarting, logs queue up in the shippers and eventually get dropped.
```bash
# Check Loki pods
kubectl get pods -n logging -l app=loki -o wide
# Check Loki logs for ingestion errors or saturation
kubectl logs -n logging -l app=loki --tail=100 | grep -E 'error|Error|ERRO|ingester|ratelimit|compactor'
# Check Loki metrics via API (if available)
curl -s http://localhost:3100/metrics | grep -E 'loki_ingester_(blocks|chunks|streams|appended)' | head -20
# Check the ingester ring status (for distributed Loki)
curl -s http://localhost:3100/ring | python3 -m json.tool | grep '"state"' | sort | uniq -c
```
If this fails: If Loki pods are restarting, run `kubectl describe pod -n logging <LOKI_POD>` and look for OOMKilled in the Last State section. An OOMKilled ingester means you need to scale Loki or tune `ingester.chunk_target_size`.
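Restart counts in `kubectl get pods` output are easy to skim past. A small filter helps (a sketch; `flag_unhealthy_pods` is a hypothetical helper, and it assumes the default `NAME READY STATUS RESTARTS AGE` column layout):

```shell
# Print pods that are not Running or have restarted at least once,
# given `kubectl get pods` output on stdin.
flag_unhealthy_pods() {
  awk 'NR > 1 && ($3 != "Running" || $4 + 0 > 0) { print $1, $3, "restarts=" $4 }'
}

# Usage:
# kubectl get pods -n logging -l app=loki | flag_unhealthy_pods
```

An empty result means every ingester pod is Running with zero restarts, and you can move on to Step 5.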
## Step 5: Check Loki Disk Space or Object Storage Connectivity
Why: Loki writes WAL (Write-Ahead Log) to local disk and flushes chunks to object storage (S3, GCS). If either is full or unreachable, ingesters block and the whole write path stalls.
```bash
# Check disk space on Loki PersistentVolumes
kubectl get pvc -n logging
kubectl exec -n logging <LOKI_POD> -- df -h /data
# Check object storage connectivity (Loki will log errors like "failed to upload chunk")
kubectl logs -n logging -l app=loki --tail=200 | grep -E 'S3|GCS|azure|object.storage|upload|chunk.flush'
# Check if object storage credentials are still valid (look for 403/401 errors)
kubectl logs -n logging -l app=loki --tail=200 | grep -E '401|403|AccessDenied|Forbidden'
# Verify the object storage bucket name in the Loki config
kubectl get configmap -n logging loki-config -o yaml | grep -A5 'storage_config'
```

Healthy signs:

- PVC usage around 50-70%
- /data has free space
- No 401/403 errors in the logs
- Recent chunk flush timestamps in the logs
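A threshold check over `df` output turns the disk inspection into a yes/no answer. A sketch (`df_over_threshold` is a hypothetical helper; it assumes the usual `df -h` column layout with Use% in column 5):

```shell
# Print mount points whose Use% exceeds the given threshold,
# reading `df -h` output on stdin. Default threshold: 80%.
df_over_threshold() {
  local limit=${1:-80}
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 > limit) print $6, $5
  }'
}

# Usage:
# kubectl exec -n logging <LOKI_POD> -- df -h /data | df_over_threshold 80
```

Any output here means the WAL volume needs attention before the ingesters will unblock.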
## Step 6: Check for Rate Limiting on Loki Ingestion
Why: Loki enforces per-tenant ingestion rate limits to protect the cluster. If a noisy application is logging heavily, it will hit the rate limit, receive 429 responses, and the shipper will back off or drop entries. This is intentional behaviour — the question is whether the limit needs raising or the application needs silencing.
```bash
# Look for 429 errors in the shipper logs
kubectl logs -n logging daemonset/promtail --tail=200 | grep '429'
# Check Loki's rate limit configuration
kubectl get configmap -n logging loki-config -o yaml | grep -E 'ingestion_rate|burst_size|max_streams'
# Check which tenant/namespace is hitting the limit most
# (the metric value is the last whitespace-separated field, so sort on field 2)
curl -s http://localhost:3100/metrics | grep 'loki_discarded_samples_total' | sort -k2 -rn | head -10
```
Resolution options:

- Raise the per-tenant limit in the `limits_config` section of the Loki config
- Reduce log verbosity in the noisy application
- Add Promtail pipeline stages to filter or sample high-volume log streams
If this fails: If Loki metrics are not exposed externally, exec into the Loki pod: kubectl exec -n logging <LOKI_POD> -- wget -qO- http://localhost:3100/metrics | grep discarded
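To see which tenant is being throttled hardest, the metric's tenant label can be pulled out and ranked. A sketch (the `top_discarded_tenants` helper is hypothetical; it assumes the metric carries a `tenant="..."` label, which depends on your Loki configuration):

```shell
# Rank tenants by discarded sample count, reading Prometheus
# exposition-format text on stdin.
top_discarded_tenants() {
  grep '^loki_discarded_samples_total' |
    sed -E 's/.*tenant="([^"]*)".* ([0-9.e+]+)$/\1 \2/' |
    sort -k2 -rn | head -5
}

# Usage:
# curl -s http://localhost:3100/metrics | top_discarded_tenants
```

The top entry is the tenant to either silence or grant a higher limit in `limits_config`.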
## Verification
```bash
# Confirm logs are flowing again — query for recent logs
curl -s -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="<AFFECTED_NAMESPACE>"}' \
  --data-urlencode "start=$(date -d '2 minutes ago' +%s%N)" \
  --data-urlencode "end=$(date +%s%N)" | python3 -m json.tool | grep '"values"' | head -3
```
Expected: The "values" array contains log entries with timestamps within the last 2 minutes.
If still broken: Escalate — see below.
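After fixing backpressure, logs can take a minute or two to resume, so rather than rerunning the query by hand, wrap it in a retry loop. A minimal sketch (`retry` is a hypothetical helper):

```shell
# Run a command up to N times with a fixed delay between attempts;
# succeed as soon as the command does.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Usage: poll Loki every 10s, up to 12 times (~2 minutes):
# retry 12 10 sh -c \
#   'curl -s -G http://localhost:3100/loki/api/v1/query_range \
#      --data-urlencode "query={namespace=\"<AFFECTED_NAMESPACE>\"}" | grep -q values'
```

If the loop exhausts all attempts without matching, escalate rather than waiting longer.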
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Observability / Platform team | "Log pipeline broken: logs missing for >30 min; Loki ingester status: <INGESTER_STATUS_FROM_STEP_4>" |
| Data loss suspected | Observability lead | "Possible log data loss: Promtail has been dropping entries since <TIMESTAMP>" |
| Scope expanding | Platform team | "Log pipeline backpressure affecting all namespaces; Loki ingesters may be overloaded or object storage is unreachable" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add a Prometheus alert on `loki_discarded_samples_total > 0` to detect rate limiting before users notice
- Add a Prometheus alert on Promtail/Fluentd pod restarts in the logging namespace
- Review and tune Loki's `ingestion_rate_mb` and `ingestion_burst_size_mb` per-tenant limits based on peak observed throughput
- Consider adding Promtail pipeline stages to drop noisy debug/trace logs before they reach Loki
## Common Mistakes
- Checking Grafana configuration when Loki itself is healthy: If Loki is responding correctly to API queries but Grafana shows no data, the problem is the Grafana data source configuration or a query syntax error — not the log pipeline. Test Loki directly first before touching Grafana.
- Not checking the log shipper (it is often the bottleneck): Operators often jump to Loki first, but Promtail or Fluentd on the nodes is the more common failure point. A single DaemonSet pod crashing on one node causes that node's logs to disappear silently.
- Forgetting that Loki rate limits by tenant: If one namespace is logging at 100 MB/s, it will exhaust the per-tenant rate limit and no further logs will be accepted. The fix is not to restart Loki — it is to silence the noisy tenant or raise the limit for it specifically.
- Assuming a log gap means logs are gone forever: Loki's ingesters buffer logs in memory before flushing to storage. If the ingester was overloaded but not crashed, logs already received may be flushed once pressure drops. Wait 5 minutes after fixing backpressure before concluding that logs are permanently lost.
## Cross-References
- Topic Pack: Loki Log Pipeline Architecture (deep background on Promtail pipeline stages, Loki ingesters, chunk flushing, and multi-tenancy)
- Related Runbook: prometheus-target-down.md — log pipeline issues and metric scrape failures sometimes share the same network-level root cause
- Related Runbook: grafana-blank.md — log gaps appear in Grafana Explore as blank time ranges, which can be confused with a Grafana data source issue