
Runbook: Log Pipeline Backpressure / Logs Not Appearing

| Field | Value |
| --- | --- |
| Domain | Observability |
| Alert | `loki_ingester_blocks_per_chunk_sum > threshold` or logs missing in Grafana for >5 minutes |
| Severity | P2 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 30 minutes; page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, access to the logging namespace (Loki/Promtail/Fluentd), Grafana Loki data source access |

Quick Assessment (30 seconds)

```
# Run this first — it tells you the scope of the problem
kubectl get pods -n logging && kubectl logs -n logging daemonset/promtail --tail=20 2>/dev/null || kubectl logs -n logging daemonset/fluentd --tail=20 2>/dev/null
```

  • If shipper pods are crashing or showing backpressure/429 errors → the log shipper is the bottleneck; continue to Step 3.
  • If shipper pods are healthy but logs are still missing → the problem is further down the pipeline (Loki or storage); skip to Step 4.

Step 1: Confirm Whether Logs Are Missing for All Services or Just One

Why: If only one service's logs are missing, the problem is likely that service's log format or labels. If all service logs are missing, the problem is the shared log pipeline — Loki, the shipper daemon, or storage.

```
# In Grafana, open Explore → select the Loki data source
# Query for logs from a known healthy service:
#   {namespace="kube-system", app="coredns"}
# Then query for logs from the affected service:
#   {namespace="<AFFECTED_NAMESPACE>"}

# Or query Loki directly via API
kubectl port-forward -n logging svc/loki 3100:3100 &
curl -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="kube-system"}' \
  --data-urlencode "start=$(date -d '5 minutes ago' +%s%N)" \
  --data-urlencode "end=$(date +%s%N)" | python3 -m json.tool | grep '"status"'
```
Expected output:
```
"status": "success"
```
If the query returns results for kube-system but not the affected namespace, the problem is isolated to that service. If neither namespace returns results, the whole pipeline is broken.

If this fails: If the port-forward fails, check the Loki service name: `kubectl get svc -n logging | grep loki`
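Note that `date -d '5 minutes ago' +%s%N` relies on GNU coreutils; on macOS or BusyBox the `-d` and `%N` forms typically fail or print a literal N. A portable way to build the same nanosecond timestamps (a sketch using only the Python standard library, nothing Loki-specific assumed):

```shell
# Nanosecond epoch timestamps for the Loki query_range API, without GNU date
start=$(python3 -c 'import time; print(time.time_ns() - 300 * 10**9)')  # 5 minutes ago
end=$(python3 -c 'import time; print(time.time_ns())')                  # now
echo "start=$start end=$end"
```

These values can be passed directly to the `start=` and `end=` parameters in the curl above.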

Step 2: Check Whether Logs Are Appearing in Grafana at All

Why: Grafana may have a stale or misconfigured Loki data source. Verifying whether any label returns results isolates whether the problem is Grafana → Loki or Loki → storage.

```
# Port-forward to Grafana and open the Explore page
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Navigate to Explore → select Loki data source → run: {job=~".+"}
# This query matches any log stream that has a job label

# Check Grafana's Loki data source config
curl -s -H "Authorization: Bearer <GRAFANA_API_KEY>" \
  http://localhost:3000/api/datasources | python3 -m json.tool | grep -A5 '"type": "loki"'
```
Expected output:
```
"type": "loki",
"url": "http://loki:3100",
"access": "proxy"
```
If this fails: If the Loki data source test fails in Grafana, the URL is wrong or Loki is not responding. Proceed to Step 4 to check Loki's health directly.
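If grepping raw JSON feels brittle, the data source check can be done with a small Python filter. The payload below is a made-up example of the `/api/datasources` response shape, not output from a real cluster:

```shell
# Extract the URL of every Loki-type data source from a Grafana /api/datasources payload
cat <<'EOF' > /tmp/datasources.json
[
  {"name": "Prometheus", "type": "prometheus", "url": "http://prometheus:9090"},
  {"name": "Loki", "type": "loki", "url": "http://loki:3100", "access": "proxy"}
]
EOF
python3 - <<'EOF'
import json

with open("/tmp/datasources.json") as f:
    sources = json.load(f)

# Print name and URL for each Loki data source
for ds in sources:
    if ds.get("type") == "loki":
        print(f'{ds["name"]}: {ds["url"]}')
EOF
```

If jq is available, `jq '.[] | select(.type == "loki") | .url'` achieves the same thing.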

Step 3: Check Log Shipper (Promtail/Fluentd/Vector) Pod Status and Logs

Why: The log shipper (a DaemonSet running on every node) is responsible for collecting logs and forwarding them to Loki. If the shipper is crashing or logging errors, logs are being dropped at the source before they ever reach Loki.

```
# Check all log shipper pods
kubectl get pods -n logging -o wide

# Check DaemonSet rollout status
kubectl rollout status daemonset/promtail -n logging

# Get recent logs from the shipper — look for errors, backpressure, or 429/503 responses
kubectl logs -n logging daemonset/promtail --tail=100 | grep -E 'error|Error|ERRO|429|503|backpressure|dropped|failed'

# If using Fluentd instead:
kubectl logs -n logging daemonset/fluentd --tail=100 | grep -E 'error|Error|warn|Warn|retry|retry_count'

# Check if specific node shippers are failing
kubectl get pods -n logging -o wide | grep -v Running
```
Expected output (healthy Promtail):
```
level=info msg="Tailing new file" path=/var/log/pods/production_myapp-.../myapp/0.log
level=info msg="successfully sent entries" url=http://loki:3100
```
Backpressure indicators: `level=warn msg="dropping entry"`, `429 Too Many Requests`, `channel is full`, `retry queue full`.

If this fails: If all DaemonSet pods are in CrashLoopBackOff, check the shipper configuration ConfigMap for syntax errors: `kubectl describe configmap -n logging promtail-config`
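To see which failure mode dominates, the grep above can be turned into a quick tally. The sample log lines here are invented for illustration; run the same pipeline against real `kubectl logs` output:

```shell
# Count occurrences of each known backpressure symptom in a log extract
cat <<'EOF' > /tmp/promtail.log
level=warn msg="error sending batch, will retry" status=429
level=warn msg="error sending batch, will retry" status=429
level=error msg="final error sending batch" status=503
level=info msg="successfully sent entries" url=http://loki:3100
EOF
grep -oE '429|503|dropped|backpressure' /tmp/promtail.log | sort | uniq -c | sort -rn
```

A column dominated by 429s points at rate limiting (Step 6); 503s point at Loki itself being down or overloaded (Step 4).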

Step 4: Check Loki Ingester Status and Pod Health

Why: Loki's ingesters are the write path — they receive log streams from shippers and buffer them before writing to storage. If ingesters are overloaded, crashing, or restarting, logs queue up in the shippers and eventually get dropped.

```
# Check Loki pods
kubectl get pods -n logging -l app=loki -o wide

# Check Loki logs for ingestion errors or saturation
kubectl logs -n logging -l app=loki --tail=100 | grep -E 'error|Error|ERRO|ingester|ratelimit|compactor'

# Check Loki metrics via API (if available)
curl -s http://localhost:3100/metrics | grep -E 'loki_ingester_(blocks|chunks|streams|appended)' | head -20

# Check the ingester ring status (for distributed Loki)
curl -s http://localhost:3100/ring | python3 -m json.tool | grep '"state"' | sort | uniq -c
```
Expected output (healthy):
```
# All ingester pods Running
# Loki logs: no error lines
# Ring status: all ingesters "ACTIVE"
```
If this fails: If Loki pods are OOMKilled, the ingester is running out of memory. Check with kubectl describe pod -n logging <LOKI_POD> and look for OOMKilled in the Last State section. This requires scaling Loki or tuning ingester.chunk_target_size.
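The ring check can also be summarized per state. The JSON shape below is illustrative only (field names may differ by Loki version; check a real ring dump first):

```shell
# Tally ingester states from a ring JSON dump (shape is illustrative, not exact Loki output)
cat <<'EOF' > /tmp/ring.json
{"shards": [
  {"id": "ingester-0", "state": "ACTIVE"},
  {"id": "ingester-1", "state": "ACTIVE"},
  {"id": "ingester-2", "state": "LEAVING"}
]}
EOF
python3 - <<'EOF'
import collections
import json

ring = json.load(open("/tmp/ring.json"))
# Count how many ingesters are in each ring state
counts = collections.Counter(s["state"] for s in ring["shards"])
for state, n in counts.most_common():
    print(n, state)
EOF
```

Any state other than ACTIVE (LEAVING, UNHEALTHY, PENDING) on a stable cluster is worth investigating.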

Step 5: Check Loki Disk Space or Object Storage Connectivity

Why: Loki writes WAL (Write-Ahead Log) to local disk and flushes chunks to object storage (S3, GCS). If either is full or unreachable, ingesters block and the whole write path stalls.

```
# Check disk space on Loki PersistentVolumes
kubectl get pvc -n logging
kubectl exec -n logging <LOKI_POD> -- df -h /data

# Check object storage connectivity (Loki will log errors like "failed to upload chunk")
kubectl logs -n logging -l app=loki --tail=200 | grep -E 'S3|GCS|azure|object.storage|upload|chunk.flush'

# Check if object storage credentials are still valid (look for 403/401 errors)
kubectl logs -n logging -l app=loki --tail=200 | grep -E '401|403|AccessDenied|Forbidden'

# Verify the object storage bucket name in the Loki config
kubectl get configmap -n logging loki-config -o yaml | grep -A5 'storage_config'
```
Expected output (healthy):
```
# PVC: ~50-70% used
# df: /data has free space
# No 401/403 errors in logs
# Last chunk flush: recent timestamp in logs
```
If this fails: If the PVC is nearly full, the immediate fix is to expand the PersistentVolume (if your storage class supports it) or adjust Loki's retention settings to reduce stored data.
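If adjusting retention is the chosen fix, the relevant knobs live in the compactor and limits sections of the Loki config. A sketch assuming Loki 2.x with compactor-based retention; verify the field names against your deployed version:

```yaml
# Illustrative retention settings for loki-config; adjust to your storage budget
compactor:
  retention_enabled: true        # let the compactor delete expired chunks
  retention_delete_delay: 2h     # grace period before chunks are actually deleted
limits_config:
  retention_period: 168h         # keep 7 days of logs
```

Shortening `retention_period` reduces object storage usage over time but does not free local WAL disk immediately.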

Step 6: Check for Rate Limiting on Loki Ingestion

Why: Loki enforces per-tenant ingestion rate limits to protect the cluster. If a noisy application is logging heavily, it will hit the rate limit, receive 429 responses, and the shipper will back off or drop entries. This is intentional behaviour — the question is whether the limit needs raising or the application needs silencing.

```
# Look for 429 errors in the shipper logs
kubectl logs -n logging daemonset/promtail --tail=200 | grep '429'

# Check Loki's rate limit configuration
kubectl get configmap -n logging loki-config -o yaml | grep -E 'ingestion_rate|burst_size|max_streams'

# Check which tenant/namespace is hitting the limit most
# (sort numerically on the sample count, the second whitespace-separated field)
curl -s http://localhost:3100/metrics | grep 'loki_discarded_samples_total' | sort -k2 -rn | head -10
```
Expected output:
```
loki_discarded_samples_total{reason="rate_limit_exceeded", tenant="production"} 15234
```
This tells you which tenant is being throttled. The fix is one of:

  • Raise the rate limit for the tenant in Loki's limits_config
  • Reduce log verbosity in the noisy application
  • Add Promtail pipeline stages to filter or sample high-volume log streams

If this fails: If Loki metrics are not exposed externally, exec into the Loki pod: `kubectl exec -n logging <LOKI_POD> -- wget -qO- http://localhost:3100/metrics | grep discarded`
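One way to silence a noisy tenant at the source is a Promtail `drop` pipeline stage. A sketch, with the job name and regex as placeholders to adapt:

```yaml
# Illustrative Promtail pipeline stage: drop debug/trace lines before they reach Loki
scrape_configs:
  - job_name: kubernetes-pods               # placeholder job name
    pipeline_stages:
      - drop:
          expression: "level=(debug|trace)" # drop lines matching this regex
          drop_counter_reason: noisy_debug_logs
```

Dropped lines are counted by Promtail under the given reason, so the effect of the filter stays observable.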

Verification

```
# Confirm logs are flowing again — query for recent logs
curl -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="<AFFECTED_NAMESPACE>"}' \
  --data-urlencode "start=$(date -d '2 minutes ago' +%s%N)" \
  --data-urlencode "end=$(date +%s%N)" | python3 -m json.tool | grep '"values"' | head -3
```
Success looks like: "values" array contains log entries with timestamps within the last 2 minutes. If still broken: Escalate — see below.

Escalation

| Condition | Who to Page | What to Say |
| --- | --- | --- |
| Not resolved in 30 min | Observability / Platform team | "Log pipeline broken: logs missing for >30 min; Loki ingester status: ; shipper errors: " |
| Data loss suspected | Observability lead | "Possible log data loss: Promtail has been dropping entries since due to backpressure; gap in audit trail for " |
| Scope expanding | Platform team | "Log pipeline backpressure affecting all namespaces; Loki ingesters may be overloaded or object storage is unreachable" |

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Add a Prometheus alert on loki_discarded_samples_total > 0 to detect rate limiting before users notice
  • Add a Prometheus alert on Promtail/Fluentd pod restarts in the logging namespace
  • Review and tune Loki's ingestion_rate_mb and ingestion_burst_size_mb per-tenant limits based on peak observed throughput
  • Consider adding Promtail pipeline stages to drop noisy debug/trace logs before they reach Loki
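The alerting bullets above can be sketched as a Prometheus rule; the group name, threshold, and labels are placeholders to adapt:

```yaml
# Illustrative Prometheus alert: fire when any tenant is being rate limited by Loki
groups:
  - name: log-pipeline                      # placeholder group name
    rules:
      - alert: LokiTenantRateLimited
        expr: sum by (tenant) (rate(loki_discarded_samples_total{reason="rate_limit_exceeded"}[5m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Loki is discarding samples for tenant {{ $labels.tenant }}"
```

Alerting on the rate rather than the raw counter avoids firing forever on a counter that incremented once in the past.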

Common Mistakes

  1. Checking Grafana configuration when Loki itself is healthy: If Loki is responding correctly to API queries but Grafana shows no data, the problem is the Grafana data source configuration or a query syntax error — not the log pipeline. Test Loki directly first before touching Grafana.
  2. Not checking the log shipper (it is often the bottleneck): Operators often jump to Loki first, but Promtail or Fluentd on the nodes is the more common failure point. A single DaemonSet pod crashing on one node causes that node's logs to disappear silently.
  3. Forgetting that Loki rate limits by tenant: If one namespace is logging at 100 MB/s, it will exhaust the per-tenant rate limit and no further logs will be accepted. The fix is not to restart Loki — it is to silence the noisy tenant or raise the limit for it specifically.
  4. Assuming a log gap means logs are gone forever: Loki's ingesters buffer logs in memory before flushing to storage. If the ingester was overloaded but not crashed, logs already received may be flushed once pressure drops. Wait 5 minutes after fixing backpressure before concluding that logs are permanently lost.

Cross-References

  • Topic Pack: Loki Log Pipeline Architecture (deep background on Promtail pipeline stages, Loki ingesters, chunk flushing, and multi-tenancy)
  • Related Runbook: prometheus-target-down.md — log pipeline issues and metric scrape failures sometimes share the same network-level root cause
  • Related Runbook: grafana-blank.md — log gaps appear in Grafana Explore as blank time ranges, which can be confused with a Grafana data source issue
