
Observability Debugging Decision Flow

A systematic approach to diagnosing issues with metrics, logs, and traces.

START: Observability problem
  |
  +-- Metrics missing?  --> [1] PROMETHEUS
  +-- Logs missing?     --> [2] LOKI
  +-- Traces missing?   --> [3] TEMPO
  +-- Dashboard empty?  --> [4] GRAFANA
  |
  v

[1] PROMETHEUS TARGETS
  kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
  kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
  # Open http://localhost:9090/targets
  |
  +-- Prometheus pod not running? --> check monitoring namespace, helm release
  +-- Target shows DOWN?          --> [1a] SERVICE DISCOVERY
  +-- Target not listed?          --> [1b] SERVICEMONITOR
  |
  v

[1a] SERVICE DISCOVERY (target DOWN)
  kubectl get endpoints -n <ns> <svc>
  kubectl exec -n <ns> <pod> -- wget -qO- http://localhost:<port>/metrics   # or curl -s, if the image lacks wget
  |
  +-- No endpoints?       --> pod not ready, check readiness probe
  +-- /metrics not found? --> app not exposing metrics endpoint
  +-- /metrics works?     --> check port name in ServiceMonitor
  See: training/library/runbooks/prometheus_target_down.md
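
  The port-name check in the last branch trips people up: a ServiceMonitor endpoint
  references the Service port by *name*, not by number. A minimal sketch (port name
  and number are illustrative):

  ```yaml
  # Service (excerpt) - the port name is what the ServiceMonitor matches:
  ports:
    - name: http-metrics     # <-- this name...
      port: 8080
      targetPort: 8080
  ---
  # ServiceMonitor (excerpt):
  endpoints:
    - port: http-metrics     # <-- ...must match here, by name
      path: /metrics
  ```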

[1b] SERVICEMONITOR MATCHING
  kubectl get servicemonitor -n <ns> -o yaml
  kubectl get svc -n <ns> --show-labels
  |
  Compare: ServiceMonitor.spec.selector.matchLabels vs Service labels
  +-- Labels don't match? --> fix ServiceMonitor selector or Service labels
  +-- Namespace wrong?    --> check serviceMonitorNamespaceSelector on Prometheus
  See: training/interactive/runtime-labs/lab-runtime-03-observability-target-down/
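
  The label comparison can be scripted. A minimal sketch, with both label sets
  inlined as assumptions; in a real cluster you would pull them with the kubectl
  commands above:

  ```shell
  # ServiceMonitor selects a Service when every matchLabels entry is
  # present on the Service. Label sets below are hypothetical examples.
  sm_labels="app=payments"                    # ServiceMonitor.spec.selector.matchLabels
  svc_labels="app=payments,team=platform"     # Service .metadata.labels

  missing=""
  for kv in ${sm_labels//,/ }; do
    case ",$svc_labels," in
      *",$kv,"*) ;;                           # selector label present on the Service
      *) missing="$missing $kv" ;;            # selector label the Service lacks
    esac
  done

  if [ -z "$missing" ]; then
    echo "MATCH: ServiceMonitor selects this Service"
  else
    echo "NO MATCH: Service is missing labels:$missing"
  fi
  ```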

[2] LOKI LOG PIPELINE
  App --> stdout --> Promtail (DaemonSet) --> Loki --> Grafana
  |
  [2a] Is app producing logs?
  kubectl logs -n <ns> deploy/<name>
  +-- No output? --> app issue, not observability
  +-- Has output --> [2b]
  |
  [2b] Is Promtail running on app's node?
  kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail -o wide
  kubectl get pods -n <ns> -o wide   # compare nodes
  +-- Promtail not on node? --> check DaemonSet nodeSelector, tolerations
  +-- Promtail crashed?     --> kubectl logs -n monitoring <promtail-pod>
  |
  [2c] Can Promtail reach Loki?
  kubectl logs -n monitoring <promtail-pod> | grep -i error
  kubectl port-forward -n monitoring svc/loki 3100:3100
  curl http://localhost:3100/ready
  +-- Loki not ready? --> check Loki pod, storage
  +-- Connection refused? --> check Loki service/endpoint
  |
  [2d] Label pipeline correct?
  # Pipeline stages live in Promtail's config Secret/ConfigMap, not the pod spec:
  kubectl get secret -n monitoring -l app.kubernetes.io/name=promtail -o name
  # decode its promtail.yaml key and inspect pipeline_stages / relabel_configs
  See: training/library/runbooks/observability/loki_no_logs.md
       training/interactive/runtime-labs/lab-runtime-04-loki-no-logs/
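
  For reference, a sketch of a Promtail scrape config with pipeline stages
  (cri, json, and labels are real Promtail stages; the job name and the JSON
  field being extracted are illustrative):

  ```yaml
  scrape_configs:
    - job_name: kubernetes-pods
      pipeline_stages:
        - cri: {}              # parse containerd/CRI-formatted log lines
        - json:
            expressions:
              level: level     # extract "level" from JSON log bodies
        - labels:
            level:             # promote it to a queryable Loki label
  ```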

[3] TEMPO TRACES
  [3a] Is Tempo running?
  kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
  kubectl port-forward -n monitoring svc/tempo 3200:3200
  curl http://localhost:3200/ready
  +-- Not ready? --> check Tempo pod logs, storage
  |
  [3b] Is app instrumented?
  kubectl get deploy -n <ns> <name> -o yaml | grep OTEL
  +-- No OTEL env vars? --> app needs OpenTelemetry instrumentation
  +-- Wrong endpoint?   --> fix OTEL_EXPORTER_OTLP_ENDPOINT
  |
  [3c] Is Grafana data source configured?
  # Grafana > Configuration > Data Sources > Tempo
  +-- Missing data source? --> add Tempo data source pointing to http://tempo:3200
  See: training/library/runbooks/observability/tempo_no_traces.md
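
  Note that apps export traces to Tempo's OTLP receiver, not to the 3200 query
  port used above. A hypothetical Deployment env block (the service name, and
  OTLP gRPC on port 4317, are assumptions about your setup):

  ```yaml
  env:
    - name: OTEL_SERVICE_NAME
      value: payments                 # illustrative service name
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: http://tempo:4317        # OTLP gRPC; use 4318 for OTLP HTTP
  ```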

[4] GRAFANA
  kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
  # Open http://localhost:3000 (admin/prom-operator)
  |
  +-- Can't connect?     --> check Grafana pod status
  +-- Dashboard empty?   --> check data source configuration
  +-- Query returns nothing? --> go to specific pipeline ([1], [2], or [3])
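
  A missing data source can be added in the UI or via a Grafana provisioning
  file. A sketch (URLs assume the in-cluster service names used elsewhere in
  this flow; kube-prometheus-stack normally provisions Prometheus itself):

  ```yaml
  apiVersion: 1
  datasources:
    - name: Loki
      type: loki
      url: http://loki:3100
    - name: Tempo
      type: tempo
      url: http://tempo:3200
  ```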

Pipeline Summary

Signal    Producer       Collector            Store             Query
Metrics   App /metrics   Prometheus scrape    Prometheus TSDB   PromQL in Grafana
Logs      App stdout     Promtail DaemonSet   Loki              LogQL in Grafana
Traces    App OTLP SDK   OTLP receiver        Tempo             TraceQL in Grafana

Key Ports

Service      Port   Forward Command
Prometheus   9090   kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
Grafana      3000   kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
Loki         3100   kubectl port-forward -n monitoring svc/loki 3100:3100
Tempo        3200   kubectl port-forward -n monitoring svc/tempo 3200:3200

See also:
  - devops/docs/observability.md
  - training/library/skillchecks/observability.skillcheck.md