Ops Archaeology: The Alerts That Stopped Firing

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L2 | Estimated time: 25 min | Domains: Prometheus, Alertmanager, Grafana, Observability


Artifact 1: CLI Output

$ kubectl get pods -n monitoring
NAME                                    READY   STATUS    RESTARTS   AGE
prometheus-server-0                     2/2     Running   0          34d
alertmanager-0                          1/1     Running   0          34d
grafana-7f8d9c4b56-k2m4n                1/1     Running   0          34d
node-exporter-4h7jk                     1/1     Running   0          34d
node-exporter-8m2np                     1/1     Running   0          34d
node-exporter-q5r9t                     1/1     Running   0          34d
kube-state-metrics-5d6f7a8b43-v3w5x     1/1     Running   0          34d

$ kubectl exec -n monitoring prometheus-server-0 -c prometheus -- promtool query instant http://localhost:9090 'up{job="api-server"}'
up{instance="api-server-1:8080", job="api-server"} => 1 @[1733842800]
up{instance="api-server-2:8080", job="api-server"} => 1 @[1733842800]
up{instance="api-server-3:8080", job="api-server"} => 1 @[1733842800]

$ kubectl exec -n monitoring prometheus-server-0 -c prometheus -- promtool query instant http://localhost:9090 'prometheus_target_scrape_pool_sync_total{scrape_pool="api-server"}'
prometheus_target_scrape_pool_sync_total{scrape_pool="api-server"} => 48 @[1733842800]

Artifact 2: Metrics

# Prometheus self-monitoring metrics

# Scrape pool target limit (0 = unlimited)
prometheus_target_scrape_pool_target_limit{scrape_pool="api-server"} 0

# Scrape duration (how long each scrape takes)
scrape_duration_seconds{job="api-server",instance="api-server-1:8080"} 0.023

# Observed scrape interval (should track the configured interval)
prometheus_target_interval_length_seconds{interval="5m0s",quantile="0.99"} 300.12

# Rule evaluation
prometheus_rule_group_last_duration_seconds{rule_group="api-server.rules"} 0
prometheus_rule_group_last_evaluation_timestamp_seconds{rule_group="api-server.rules"} 0

# Alertmanager notification counters
alertmanager_notifications_total{integration="slack"} 847
alertmanager_notifications_failed_total{integration="slack"} 0

Artifact 3: Infrastructure Code

# From: helm/prometheus-values.yaml (recently modified section)
serverFiles:
  prometheus.yml:
    scrape_configs:
      - job_name: 'api-server'
        scrape_interval: 5m
        scrape_timeout: 30s
        static_configs:
          - targets:
            - 'api-server-1:8080'
            - 'api-server-2:8080'
            - 'api-server-3:8080'

    rule_files:
      - /etc/prometheus/rules/*.yaml

# NOTE: rule_files key was moved during a config refactor on Dec 1
# Previously it was under 'serverFiles.alerting.rules'

Artifact 4: Log Lines

[2024-12-10T12:00:05Z] prometheus   | level=info msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
[2024-12-10T12:00:05Z] prometheus   | level=warn msg="No rule files found matching pattern" pattern=/etc/prometheus/rules/*.yaml
[2024-12-01T14:30:11Z] alertmanager | level=info msg="No alerts to notify about" receiver=slack
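That warn line is what Prometheus emits when a `rule_files` glob expands to zero files: it loads whatever matches and only warns when nothing does. The condition is easy to reproduce outside Prometheus; a minimal sketch, where `check_rule_glob` is an illustrative helper (not a Prometheus tool) and the pattern is copied from the artifact:

```shell
#!/usr/bin/env bash
# Reproduce the "no rule files found" condition: expand a glob and warn
# when it matches nothing. check_rule_glob is an illustrative helper.
check_rule_glob() {
  local pattern="$1"
  local matches
  # compgen -G expands the glob; wc -l counts matches (tr strips padding)
  matches=$(compgen -G "$pattern" | wc -l | tr -d ' ')
  if [ "$matches" -eq 0 ]; then
    echo "warn: no rule files found matching pattern $pattern"
  else
    echo "ok: $matches rule file(s) match $pattern"
  fi
}

check_rule_glob "/etc/prometheus/rules/*.yaml"
```

On a machine without that directory, this prints the same warning the log shows. Note that, like Prometheus, it treats an empty match as a warning rather than a hard error, which is exactly why a mistake like this can go unnoticed.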

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?
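One mechanical starting point for step 3, whatever your diagnosis: scan any captured Prometheus logs for rule-loading problems before touching the config. A hedged sketch; the function name is illustrative and the message patterns are assumptions based on Artifact 4, not an exhaustive list:

```shell
#!/usr/bin/env bash
# Illustrative triage helper: scan a saved Prometheus log for
# rule-loading warnings like the one in Artifact 4.
scan_rule_warnings() {
  local logfile="$1"
  # grep prints matching lines and succeeds if any were found
  if grep -E 'No rule files found|Error loading rule' "$logfile"; then
    echo "rule-loading problems found in $logfile"
  else
    echo "no rule-loading warnings in $logfile"
  fi
}
```

Pointing it at a log dump narrows the search before anyone starts editing Helm values.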