# Observability Debugging Decision Flow

A systematic approach to diagnosing issues with metrics, logs, and traces.
START: Observability problem
|
+-- Metrics missing? --> [1] PROMETHEUS
+-- Logs missing? --> [2] LOKI
+-- Traces missing? --> [3] TEMPO
+-- Dashboard empty? --> [4] GRAFANA
|
v
[1] PROMETHEUS TARGETS
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets
|
+-- Prometheus pod not running? --> check monitoring namespace, helm release
+-- Target shows DOWN? --> [1a] SERVICE DISCOVERY
+-- Target not listed? --> [1b] SERVICEMONITOR
|
v
[1a] SERVICE DISCOVERY (target DOWN)
kubectl get endpoints -n <ns> <svc>
kubectl exec -n <ns> <pod> -- wget -qO- http://localhost:<port>/metrics
|
+-- No endpoints? --> pod not ready, check readiness probe
+-- /metrics not found? --> app not exposing metrics endpoint
+-- /metrics works? --> check port name in ServiceMonitor
See: training/library/runbooks/prometheus_target_down.md
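The "port name" check above depends on the Service exposing a *named* port that the ServiceMonitor can reference. A minimal sketch of such a Service (names, labels, and port numbers are illustrative, not from this environment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # illustrative name
  namespace: my-ns
  labels:
    app: my-app           # the ServiceMonitor selector must match these labels
spec:
  selector:
    app: my-app
  ports:
    - name: metrics       # ServiceMonitor endpoints reference this *name*
      port: 8080
      targetPort: 8080
```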
[1b] SERVICEMONITOR MATCHING
kubectl get servicemonitor -n <ns> -o yaml
kubectl get svc -n <ns> --show-labels
|
Compare: ServiceMonitor.spec.selector.matchLabels vs Service labels
+-- Labels don't match? --> fix ServiceMonitor selector or Service labels
+-- Namespace wrong? --> check serviceMonitorNamespaceSelector on Prometheus
See: training/interactive/runtime-labs/lab-runtime-03-observability-target-down/
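A matching ServiceMonitor sketch for the comparison above (all names illustrative; with kube-prometheus-stack, Prometheus's `serviceMonitorSelector` commonly also requires a `release` label on the ServiceMonitor itself):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-ns
  labels:
    release: kube-prometheus-stack   # often required by the Prometheus selector
spec:
  selector:
    matchLabels:
      app: my-app        # must equal the Service's labels exactly
  endpoints:
    - port: metrics      # the Service port *name*, not the number
      path: /metrics
      interval: 30s
```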
[2] LOKI LOG PIPELINE
App --> stdout --> Promtail (DaemonSet) --> Loki --> Grafana
|
[2a] Is app producing logs?
kubectl logs -n <ns> deploy/<name>
+-- No output? --> app issue, not observability
+-- Has output? --> [2b]
|
[2b] Is Promtail running on app's node?
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail -o wide
kubectl get pods -n <ns> -o wide # compare nodes
+-- Promtail not on node? --> check DaemonSet nodeSelector, tolerations
+-- Promtail crashed? --> kubectl logs -n monitoring <promtail-pod>
|
[2c] Can Promtail reach Loki?
kubectl logs -n monitoring <promtail-pod> | grep -i error
kubectl port-forward -n monitoring svc/loki 3100:3100
curl http://localhost:3100/ready
+-- Loki not ready? --> check Loki pod, storage
+-- Connection refused? --> check Loki service/endpoint
|
[2d] Label pipeline correct?
kubectl exec -n monitoring <promtail-pod> -- cat /etc/promtail/promtail.yaml | grep -A20 pipeline_stages
See: training/library/runbooks/observability/loki_no_logs.md
training/interactive/runtime-labs/lab-runtime-04-loki-no-logs/
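For reference when reading the output of [2d], a minimal `pipeline_stages` fragment as it might appear in the rendered Promtail config (stage names and the extracted field are illustrative, assuming containerd/CRI log format and JSON log lines):

```yaml
pipeline_stages:
  - cri: {}            # parse the CRI log-line envelope (containerd runtimes)
  - json:
      expressions:
        level: level   # extract "level" from JSON-formatted log lines
  - labels:
      level:           # promote it to a Loki label (keep label cardinality low)
```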
[3] TEMPO TRACES
[3a] Is Tempo running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
+-- Not ready? --> check Tempo pod logs, storage
|
[3b] Is app instrumented?
kubectl get deploy -n <ns> <name> -o yaml | grep OTEL
+-- No OTEL env vars? --> app needs OpenTelemetry instrumentation
+-- Wrong endpoint? --> fix OTEL_EXPORTER_OTLP_ENDPOINT
|
[3c] Is Grafana data source configured?
# Grafana > Configuration > Data Sources > Tempo
+-- Missing data source? --> add Tempo data source pointing to http://tempo:3200
See: training/library/runbooks/observability/tempo_no_traces.md
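For [3b], a Deployment environment fragment that points an OpenTelemetry SDK at Tempo; the service DNS name is illustrative, and 4317 is the standard OTLP gRPC port (assuming Tempo's OTLP receiver is enabled):

```yaml
# Deployment container env fragment (values are assumptions, not confirmed config)
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://tempo.monitoring.svc.cluster.local:4317   # OTLP gRPC receiver
  - name: OTEL_SERVICE_NAME
    value: my-app
```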
[4] GRAFANA
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000 (admin/prom-operator)
|
+-- Can't connect? --> check Grafana pod status
+-- Dashboard empty? --> check data source configuration
+-- Query returns nothing? --> go to specific pipeline ([1], [2], or [3])
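When a data source is missing, it can be added in the UI as noted in [3c], or provisioned declaratively. A provisioning sketch (`datasources.yaml`), assuming in-cluster service names in the monitoring namespace:

```yaml
# Grafana datasource provisioning sketch; URLs are assumptions for this cluster
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo.monitoring.svc.cluster.local:3200
  - name: Loki
    type: loki
    url: http://loki.monitoring.svc.cluster.local:3100
```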
## Pipeline Summary
| Signal | Producer | Collector | Store | Query |
|---|---|---|---|---|
| Metrics | App /metrics | Prometheus scrape | Prometheus TSDB | PromQL in Grafana |
| Logs | App stdout | Promtail DaemonSet | Loki | LogQL in Grafana |
| Traces | App OTLP SDK | OTLP receiver | Tempo | TraceQL in Grafana |
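To sanity-check each store from Grafana's Explore view, one minimal query per signal (label and service names are illustrative):

```
# PromQL: is anything being scraped in the namespace?
up{namespace="my-ns"}

# LogQL: are any log lines arriving for the namespace?
{namespace="my-ns"} |= ""

# TraceQL: any traces from the service?
{ resource.service.name = "my-app" }
```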
## Key Ports
| Service | Port | Forward Command |
|---|---|---|
| Prometheus | 9090 | kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 |
| Grafana | 3000 | kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80 |
| Loki | 3100 | kubectl port-forward -n monitoring svc/loki 3100:3100 |
| Tempo | 3200 | kubectl port-forward -n monitoring svc/tempo 3200:3200 |
See also:

- devops/docs/observability.md
- training/library/skillchecks/observability.skillcheck.md