Portal | Level: L2: Operations | Topics: Loki | Domain: Observability

Scenario: Logs Disappeared from Grafana Loki

The Prompt

"Our developers report that application logs stopped appearing in Grafana about 30 minutes ago. The app is running and writing to stdout. Prometheus metrics are fine. What would you investigate?"

Initial Report

Developer ticket: "We can't see any application logs in Grafana Explore since about 2:00 PM. We need logs to debug a customer-reported bug. Prometheus metrics still work. This is blocking our investigation."

Constraints

  • Time pressure: You have 15 minutes before the next escalation. The dev team is blocked on a P1 customer bug.
  • Limited access: You can view pods in the monitoring namespace but cannot modify Promtail's DaemonSet without platform team approval. No SSH access to nodes.

Observable Evidence

  • Dashboard: Grafana Explore with LogQL queries returns zero results for the last 30 minutes. Older logs still appear.
  • Promtail: Either a Promtail pod is in CrashLoopBackOff, or its logs show msg="error sending batch" status=429 — HTTP 429 means Loki is rate-limiting and rejecting the pushed batches.
  • Logs: kubectl logs -n grokdevops deploy/grokdevops produces output, confirming the app still writes to stdout normally.
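
The two Promtail failure modes above leave distinct signatures in its logs, and telling them apart decides where you look next. A minimal sketch of the classification — the sample line is hypothetical; with cluster access you would pipe kubectl logs -n monitoring -l app.kubernetes.io/name=promtail into the same grep:

```shell
# Classify a Promtail error line: HTTP 429 means Loki is rate-limiting
# pushes; connection errors mean Loki itself is unreachable.
# Hypothetical sample line -- substitute real Promtail output.
logs='level=warn msg="error sending batch" status=429 error="rate limit exceeded"'
if echo "$logs" | grep -q 'status=429'; then
  echo "loki-rejecting-writes"    # Loki up, but refusing ingestion
elif echo "$logs" | grep -Eq 'connection refused|no such host'; then
  echo "loki-unreachable"         # Loki down or service/DNS broken
fi
```

For the sample line this prints loki-rejecting-writes, which points at Loki-side limits or load rather than at Promtail scheduling.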

Expected Investigation Path

# 1. Confirm app is producing logs
kubectl logs -n grokdevops deploy/grokdevops --tail=5

# 2. Check Promtail (the log collector)
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
kubectl logs -n monitoring -l app.kubernetes.io/name=promtail --tail=20

# 3. Check Loki health
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2   # give the port-forward a moment to establish
curl http://localhost:3100/ready

# 4. Check if Promtail is scheduled on the same node as the app
kubectl get pods -n grokdevops -o wide
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail -o wide
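
Step 4 still means eyeballing two listings side by side. A sketch of automating the cross-check, using hypothetical node names — with cluster access, feed it the NODE columns extracted from the two -o wide listings above (the column position can vary with kubectl output):

```shell
# Find nodes that run the app but have no Promtail pod.
# Hypothetical node lists -- with cluster access, extract the NODE column
# from the two `kubectl get pods ... -o wide` listings in step 4.
app_nodes='node-a
node-b'
promtail_nodes='node-a
node-c'
# Any node in the first list but not the second runs the app with no
# Promtail DaemonSet pod -- that node ships no logs at all.
for n in $app_nodes; do
  echo "$promtail_nodes" | grep -qx "$n" || echo "no promtail on $n"
done
# -> no promtail on node-b
```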

Strong Answer

"The log pipeline here is: app stdout → Promtail (DaemonSet on each node) → Loki → Grafana. Since metrics work fine, it's not a general cluster issue. I'd investigate the Promtail → Loki path. First, I'd confirm Promtail pods are running on every node — as a DaemonSet, each node needs one. If a Promtail pod is missing from the node running our app, that explains the gap. Common causes: someone patched the DaemonSet with a bad nodeSelector, the pod was evicted due to resource pressure, or a toleration was removed. I'd also check Promtail logs for connection errors to Loki, and verify Loki itself is healthy and not out of storage."

Common Traps

  • Blaming the application — the app writes to stdout fine; the failure is in the collection pipeline
  • Not knowing Promtail is a DaemonSet — if it's not on the right node, no logs
  • Forgetting Loki storage — Loki can silently drop logs if storage is full
  • Not distinguishing historical vs new logs — old logs might still be queryable even if new ones aren't flowing

Related Resources

  • Lab: training/interactive/runtime-labs/lab-runtime-04-loki-no-logs/
  • Runbook: training/library/runbooks/observability/loki_no_logs.md
  • Quest: training/interactive/exercises/levels/level-50/k8s-monitoring/
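
The storage-full trap is checkable without node access if you can exec into the Loki pod. The pod name and mount path below are assumptions (filesystem storage under /loki is a common chart default); the sketch parses a hypothetical df line:

```shell
# With access: kubectl exec -n monitoring loki-0 -- df -h /loki
# Hypothetical df output line for the sketch:
df_line='/dev/sda1  10G  10G  0  100%  /loki'
usage=$(echo "$df_line" | awk '{print $5}' | tr -d '%')
if [ "$usage" -ge 95 ]; then
  echo "loki storage nearly full -- new chunks may be dropped"
fi
```

A near-full volume explains the "historical logs visible, new logs missing" pattern: already-flushed chunks remain queryable while new writes fail.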
