Observability Architecture¶
Overview¶
The GrokDevOps observability stack provides metrics, logging, and tracing using open-source tools deployed on Kubernetes.
Stack Components¶
| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Grafana | Dashboards and visualization | 3000 |
| Alertmanager | Alert routing | 9093 |
| Loki | Log aggregation | 3100 |
| Promtail | Log collection (DaemonSet) | - |
| Tempo | Distributed tracing | 3200 |
Helm Releases¶
All components are installed using curated Helm values files from devops/observability/values/.
| Release Name | Chart | Values File |
|---|---|---|
| kube-prometheus-stack | prometheus-community/kube-prometheus-stack | values-prometheus.yaml |
| loki | grafana/loki | values-loki.yaml |
| promtail | grafana/promtail | values-promtail.yaml |
| tempo | grafana/tempo | values-tempo.yaml |
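As a rough sketch, the table above corresponds to Helm commands along the following lines. The `monitoring` namespace is taken from the port-forward commands later on this page, the repository URLs are the public defaults for these charts, and the helper names `install_release` and `install_stack` are hypothetical. The canonical installer script may pass additional flags.

```shell
# Sketch only: helper names are illustrative, not part of the repository.
VALUES_DIR="devops/observability/values"
NAMESPACE="${NAMESPACE:-monitoring}"   # assumed namespace

install_release() {
  # $1 = release name, $2 = chart, $3 = values file name
  helm upgrade --install "$1" "$2" \
    --namespace "$NAMESPACE" --create-namespace \
    -f "$VALUES_DIR/$3"
}

install_stack() {
  # Public chart repositories for the four releases.
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo add grafana https://grafana.github.io/helm-charts
  helm repo update

  install_release kube-prometheus-stack prometheus-community/kube-prometheus-stack values-prometheus.yaml
  install_release loki grafana/loki values-loki.yaml
  install_release promtail grafana/promtail values-promtail.yaml
  install_release tempo grafana/tempo values-tempo.yaml
}
```

Running `install_stack` from the repository root would install or upgrade all four releases. `helm upgrade --install` makes the call safe to repeat.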
Data Flow¶
Metrics: Application -> Prometheus -> Grafana¶
FastAPI (/metrics)
|
v
ServiceMonitor (selects app by labels)
|
v
Prometheus (scrapes every 30s)
|
v
Grafana (dashboards, alerts)
The application exposes a /metrics endpoint using prometheus-client. Metrics include:
- http_requests_total: counter with labels method, endpoint, status
- http_request_duration_seconds: histogram with labels method, endpoint
The Helm chart includes a ServiceMonitor template (gated by monitoring.serviceMonitor.enabled) that tells Prometheus how to scrape the application.
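Once Prometheus is scraping the application, the two metrics above can be queried through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at `http://localhost:9090` (for example through a port-forward). The helper name `prom_query` and the 5-minute rate window are illustrative choices.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumed local port-forward

# Request rate per status code over the last 5 minutes.
RATE_QUERY='sum by (status) (rate(http_requests_total[5m]))'

# Approximate 95th-percentile latency, computed from the histogram buckets.
P95_QUERY='histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'

prom_query() {
  # $1 = PromQL expression, sent to the instant-query endpoint.
  curl -sS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

`prom_query "$RATE_QUERY"` returns the result as JSON. The same expressions can be pasted into Grafana Explore with the Prometheus data source selected.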
Logging: Application -> Promtail -> Loki -> Grafana¶
FastAPI (stdout/stderr)
|
v
Promtail (DaemonSet, reads container logs)
|
v
Loki (stores and indexes by labels)
|
v
Grafana (LogQL queries)
Promtail runs as a DaemonSet on every node. It tails container log files from /var/log/pods and ships them to Loki with Kubernetes metadata labels (namespace, pod, container).
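One way to confirm that Promtail is delivering logs with the expected labels is to ask Loki which label values it has seen. This sketch uses Loki's label-values HTTP endpoint and assumes Loki is reachable at `http://localhost:3100`, for instance through a port-forward to the Loki service (whose exact name depends on the chart values). The helper name `loki_label_values` is hypothetical.

```shell
LOKI_URL="${LOKI_URL:-http://localhost:3100}"   # assumed local port-forward

loki_label_values() {
  # $1 = label name, for example namespace, pod, or container
  curl -sS "$LOKI_URL/loki/api/v1/label/$1/values"
}
```

`loki_label_values namespace` should include the application's namespace once its pods have written log lines. If Loki runs with multi-tenancy enabled, the request also needs an `X-Scope-OrgID` header.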
Tracing: Application -> Tempo -> Grafana¶
Tempo is deployed and ready to receive traces, but the application does not emit any yet. Application-side OpenTelemetry instrumentation is a future addition.
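The Tempo backend can still be checked on its own. A minimal sketch, assuming Tempo's HTTP port 3200 is forwarded to localhost (the service name for the port-forward depends on the release and is not given on this page); `tempo_ready` is a hypothetical helper name.

```shell
TEMPO_URL="${TEMPO_URL:-http://localhost:3200}"   # assumed local port-forward

tempo_ready() {
  # Tempo serves a readiness endpoint on its HTTP port.
  # -f makes curl exit non-zero on an HTTP error status.
  curl -sS -f "$TEMPO_URL/ready"
}
```

For example, `tempo_ready && echo "Tempo is ready"` prints the confirmation only when the endpoint answers successfully.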
Installation¶
The stack has one canonical installer script, with an Ansible playbook as an alternative entry point:
# Install the full observability stack
./devops/scripts/install-observability.sh
# Or via Ansible
cd devops/ansible
ansible-playbook playbooks/install-addons.yml
Both paths use the same values files and Helm release names.
Verification¶
Check all pods are running¶
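A sketch of this check, assuming the stack runs in the `monitoring` namespace used by the port-forward commands below. `kubectl get pods -n monitoring` lists everything; the hypothetical helper `list_unhealthy_pods` narrows the output to pods whose status is neither `Running` nor `Completed`.

```shell
NAMESPACE="${NAMESPACE:-monitoring}"   # assumed namespace

list_unhealthy_pods() {
  # Columns without headers: NAME READY STATUS RESTARTS AGE
  kubectl get pods -n "$NAMESPACE" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Empty output from `list_unhealthy_pods` means every pod in the namespace is running or has completed.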
Check Prometheus targets¶
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets
# Look for serviceMonitor/grokdevops/grokdevops
Query metrics in Grafana¶
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000 (admin/admin)
# Explore -> Prometheus -> http_requests_total
Query logs in Grafana¶
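A sketch mirroring the metrics check above: forward the Grafana service, then run a LogQL query in Explore with the Loki data source. The namespace value `grokdevops` is an assumption inferred from the ServiceMonitor target name; `namespace` is one of the labels Promtail attaches.

```shell
# Assumed application namespace; adjust to your deployment.
APP_NAMESPACE="${APP_NAMESPACE:-grokdevops}"

# LogQL: all lines from that namespace containing the word "error".
LOGQL="{namespace=\"$APP_NAMESPACE\"} |= \"error\""
echo "$LOGQL"

# Then, as for metrics:
#   kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
#   Open http://localhost:3000 -> Explore -> Loki and paste the printed query.
```

Dropping the `|= "error"` filter shows every log line for the namespace, which is a useful first check that logs arrive at all.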
Uninstall¶
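A minimal sketch of a teardown, assuming the four release names from the Helm Releases table and the `monitoring` namespace. `uninstall_stack` is a hypothetical helper, and the repository may ship its own uninstall script.

```shell
NAMESPACE="${NAMESPACE:-monitoring}"   # assumed namespace

uninstall_stack() {
  # Remove the log and trace components first, the Prometheus stack last.
  for release in tempo promtail loki kube-prometheus-stack; do
    helm uninstall "$release" --namespace "$NAMESPACE"
  done
}
```

Helm does not delete CRDs on uninstall, so the Prometheus Operator CRDs installed by kube-prometheus-stack remain until removed explicitly. PersistentVolumeClaims may also outlive the releases.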
Values Files Reference¶
| File | Chart | Purpose |
|---|---|---|
| devops/observability/values/values-prometheus.yaml | kube-prometheus-stack | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| devops/observability/values/values-loki.yaml | grafana/loki | Loki log aggregation (single-binary mode) |
| devops/observability/values/values-promtail.yaml | grafana/promtail | Promtail log collection DaemonSet |
| devops/observability/values/values-tempo.yaml | grafana/tempo | Tempo distributed tracing backend |
ServiceMonitor¶
The preferred approach is the Helm-managed ServiceMonitor. Enable it in your values file:
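Given the gate name `monitoring.serviceMonitor.enabled`, the setting presumably has the shape shown below. The sketch writes it to a hypothetical override file, `servicemonitor-override.yaml`, rather than editing an environment file. Any other keys under `monitoring.serviceMonitor` (scrape interval, extra labels) are chart-specific and not assumed here.

```shell
# Hypothetical override file; the key path follows the gate name
# monitoring.serviceMonitor.enabled.
cat > servicemonitor-override.yaml <<'EOF'
monitoring:
  serviceMonitor:
    enabled: true
EOF
```

Pass the file with `-f servicemonitor-override.yaml` on the application's `helm upgrade`, then confirm the resource exists with `kubectl get servicemonitors --all-namespaces`.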
This is already enabled in values-dev.yaml, values-staging.yaml, and values-prod.yaml.
A legacy standalone manifest exists at devops/k8s/monitoring/servicemonitor.legacy.yaml for environments not using the Helm chart.
HPA and Metrics Server¶
Production values (values-prod.yaml) enable a HorizontalPodAutoscaler. The HPA requires metrics-server to read CPU utilization. Without metrics-server, the HPA will not scale.
k3s bundles metrics-server by default. If it was disabled (for example with --disable metrics-server), install it manually:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
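A quick check that the metrics pipeline feeding the HPA works might look like this. It assumes metrics-server runs as a Deployment named `metrics-server` in `kube-system`, which is where the upstream manifest places it; `check_hpa_metrics` is a hypothetical helper name.

```shell
check_hpa_metrics() {
  # metrics-server should report an available replica.
  kubectl get deployment metrics-server -n kube-system
  # Fails until metrics-server is serving the resource metrics API.
  kubectl top nodes
  # The TARGETS column stays unknown while CPU metrics are unavailable.
  kubectl get hpa --all-namespaces
}
```

If `kubectl top nodes` returns usage figures, the HPA has the CPU data it needs to make scaling decisions.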