# Observability Debugging Decision Flow

A systematic approach to diagnosing issues with metrics, logs, and traces.
START: Observability problem
|
+-- Metrics missing? --> [1] PROMETHEUS
+-- Logs missing? --> [2] LOKI
+-- Traces missing? --> [3] TEMPO
+-- Dashboard empty? --> [4] GRAFANA
|
v
[1] PROMETHEUS TARGETS
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets
|
+-- Prometheus pod not running? --> check monitoring namespace, helm release
+-- Target shows DOWN? --> [1a] SERVICE DISCOVERY
+-- Target not listed? --> [1b] SERVICEMONITOR
|
v
[1a] SERVICE DISCOVERY (target DOWN)
kubectl get endpoints -n <ns> <svc>
kubectl exec -n <ns> <pod> -- wget -qO- http://localhost:<port>/metrics
|
+-- No endpoints? --> pod not ready, check readiness probe
+-- /metrics not found? --> app not exposing metrics endpoint
+-- /metrics works? --> check port name in ServiceMonitor
See: training/library/runbooks/prometheus_target_down.md
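The "port name" check above depends on the Service exposing a *named* port that the ServiceMonitor can reference. A minimal sketch of such a Service (names, labels, and port numbers are illustrative, not from this environment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # illustrative name
  namespace: my-ns
  labels:
    app: my-app           # the ServiceMonitor selector must match these labels
spec:
  selector:
    app: my-app
  ports:
    - name: metrics       # ServiceMonitor endpoints reference this *name*
      port: 8080
      targetPort: 8080
```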
[1b] SERVICEMONITOR MATCHING
kubectl get servicemonitor -n <ns> -o yaml
kubectl get svc -n <ns> --show-labels
|
Compare: ServiceMonitor.spec.selector.matchLabels vs Service labels
+-- Labels don't match? --> fix ServiceMonitor selector or Service labels
+-- Namespace wrong? --> check serviceMonitorNamespaceSelector on Prometheus
See: training/interactive/runtime-labs/lab-runtime-03-observability-target-down/
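A matching ServiceMonitor sketch for the comparison above (all names illustrative; with kube-prometheus-stack, Prometheus's `serviceMonitorSelector` commonly also requires a `release` label on the ServiceMonitor itself):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-ns
  labels:
    release: kube-prometheus-stack   # often required by the Prometheus selector
spec:
  selector:
    matchLabels:
      app: my-app        # must equal the Service's labels exactly
  endpoints:
    - port: metrics      # the Service port *name*, not the number
      path: /metrics
      interval: 30s
```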
[2] LOKI LOG PIPELINE
App --> stdout --> Promtail (DaemonSet) --> Loki --> Grafana
|
[2a] Is app producing logs?
kubectl logs -n <ns> deploy/<name>
+-- No output? --> app issue, not observability
+-- Has output? --> [2b]
|
[2b] Is Promtail running on app's node?
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail -o wide
kubectl get pods -n <ns> -o wide # compare nodes
+-- Promtail not on node? --> check DaemonSet nodeSelector, tolerations
+-- Promtail crashed? --> kubectl logs -n monitoring <promtail-pod>
|
[2c] Can Promtail reach Loki?
kubectl logs -n monitoring <promtail-pod> | grep -i error
kubectl port-forward -n monitoring svc/loki 3100:3100
curl http://localhost:3100/ready
+-- Loki not ready? --> check Loki pod, storage
+-- Connection refused? --> check Loki service/endpoint
|
[2d] Label pipeline correct?
kubectl exec -n monitoring <promtail-pod> -- cat /etc/promtail/promtail.yaml | grep -A20 pipeline_stages
See: training/library/runbooks/observability/loki_no_logs.md
training/interactive/runtime-labs/lab-runtime-04-loki-no-logs/
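For reference when reading the output of [2d], a minimal `pipeline_stages` fragment as it might appear in the rendered Promtail config (stage names and the extracted field are illustrative, assuming containerd/CRI log format and JSON log lines):

```yaml
pipeline_stages:
  - cri: {}            # parse the CRI log-line envelope (containerd runtimes)
  - json:
      expressions:
        level: level   # extract "level" from JSON-formatted log lines
  - labels:
      level:           # promote it to a Loki label (keep label cardinality low)
```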
[3] TEMPO TRACES
[3a] Is Tempo running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
+-- Not ready? --> check Tempo pod logs, storage
|
[3b] Is app instrumented?
kubectl get deploy -n <ns> <name> -o yaml | grep OTEL
+-- No OTEL env vars? --> app needs OpenTelemetry instrumentation
+-- Wrong endpoint? --> fix OTEL_EXPORTER_OTLP_ENDPOINT
|
[3c] Is Grafana data source configured?
# Grafana > Configuration > Data Sources > Tempo
+-- Missing data source? --> add Tempo data source pointing to http://tempo:3200
See: training/library/runbooks/observability/tempo_no_traces.md
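For [3b], a Deployment environment fragment that points an OpenTelemetry SDK at Tempo; the service DNS name is illustrative, and 4317 is the standard OTLP gRPC port (assuming Tempo's OTLP receiver is enabled):

```yaml
# Deployment container env fragment (values are assumptions, not confirmed config)
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://tempo.monitoring.svc.cluster.local:4317   # OTLP gRPC receiver
  - name: OTEL_SERVICE_NAME
    value: my-app
```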
[4] GRAFANA
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000 (admin/prom-operator)
|
+-- Can't connect? --> check Grafana pod status
+-- Dashboard empty? --> check data source configuration
+-- Query returns nothing? --> go to specific pipeline ([1], [2], or [3])
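When a data source is missing, it can be added in the UI as noted in [3c], or provisioned declaratively. A provisioning sketch (`datasources.yaml`), assuming in-cluster service names in the monitoring namespace:

```yaml
# Grafana datasource provisioning sketch; URLs are assumptions for this cluster
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo.monitoring.svc.cluster.local:3200
  - name: Loki
    type: loki
    url: http://loki.monitoring.svc.cluster.local:3100
```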
## Pipeline Summary
| Signal | Producer | Collector | Store | Query |
|---|---|---|---|---|
| Metrics | App /metrics | Prometheus scrape | Prometheus TSDB | PromQL in Grafana |
| Logs | App stdout | Promtail DaemonSet | Loki | LogQL in Grafana |
| Traces | App OTLP SDK | OTLP receiver | Tempo | TraceQL in Grafana |
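To sanity-check each store from Grafana's Explore view, one minimal query per signal (label and service names are illustrative):

```
# PromQL: is anything being scraped in the namespace?
up{namespace="my-ns"}

# LogQL: are any log lines arriving for the namespace?
{namespace="my-ns"} |= ""

# TraceQL: any traces from the service?
{ resource.service.name = "my-app" }
```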
## Key Ports
| Service | Port | Forward Command |
|---|---|---|
| Prometheus | 9090 | kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 |
| Grafana | 3000 | kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80 |
| Loki | 3100 | kubectl port-forward -n monitoring svc/loki 3100:3100 |
| Tempo | 3200 | kubectl port-forward -n monitoring svc/tempo 3200:3200 |
See also:

- devops/docs/observability.md
- training/library/skillchecks/observability.skillcheck.md