Ops Archaeology: The Alerts That Stopped Firing¶
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L2 Estimated time: 25 min Domains: Prometheus, Alertmanager, Grafana, Observability
Artifact 1: CLI Output¶
$ kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-server-0 2/2 Running 0 34d
alertmanager-0 1/1 Running 0 34d
grafana-7f8d9c4b56-k2m4n 1/1 Running 0 34d
node-exporter-4h7jk 1/1 Running 0 34d
node-exporter-8m2np 1/1 Running 0 34d
node-exporter-q5r9t 1/1 Running 0 34d
kube-state-metrics-5d6f7a8b43-v3w5x 1/1 Running 0 34d
$ kubectl exec -n monitoring prometheus-server-0 -c prometheus -- promtool query instant http://localhost:9090 'up{job="api-server"}'
up{instance="api-server-1:8080", job="api-server"} => 1 @[1733842800]
up{instance="api-server-2:8080", job="api-server"} => 1 @[1733842800]
up{instance="api-server-3:8080", job="api-server"} => 1 @[1733842800]
$ kubectl exec -n monitoring prometheus-server-0 -c prometheus -- promtool query instant http://localhost:9090 'prometheus_target_scrape_pool_sync_total{scrape_pool="api-server"}'
prometheus_target_scrape_pool_sync_total{scrape_pool="api-server"} => 48 @[1733842800]
Artifact 2: Metrics¶
# Prometheus self-monitoring metrics
# Scrape duration (how long each scrape takes)
prometheus_target_scrape_pool_target_limit{scrape_pool="api-server"} 0
scrape_duration_seconds{job="api-server",instance="api-server-1:8080",quantile="0.99"} 0.023
# Scrape interval info (from config)
prometheus_target_interval_length_seconds{interval="5m0s",quantile="0.99"} 300.12
# Rule evaluation
prometheus_rule_group_last_duration_seconds{rule_group="api-server.rules"} 0
prometheus_rule_group_last_evaluation_timestamp_seconds{rule_group="api-server.rules"} 0
# Alertmanager notification log
alertmanager_notifications_total{integration="slack"} 847
alertmanager_notifications_failed_total{integration="slack"} 0
Artifact 3: Infrastructure Code¶
# From: helm/prometheus-values.yaml (recently modified section)
serverFiles:
prometheus.yml:
scrape_configs:
- job_name: 'api-server'
scrape_interval: 5m
scrape_timeout: 30s
static_configs:
- targets:
- 'api-server-1:8080'
- 'api-server-2:8080'
- 'api-server-3:8080'
rule_files:
- /etc/prometheus/rules/*.yaml
# NOTE: rule_files key was moved during a config refactor on Dec 1
# Previously it was under 'serverFiles.alerting.rules'
Artifact 4: Log Lines¶
[2024-12-10T12:00:05Z] prometheus | level=info msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
[2024-12-10T12:00:05Z] prometheus | level=warn msg="No rule files found matching pattern" pattern=/etc/prometheus/rules/*.yaml
[2024-12-01T14:30:11Z] alertmanager | level=info msg="No alerts to notify about" receiver=slack
Your Mission¶
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?