
Answer Key: The Alerts That Stopped Firing

The System

A standard Prometheus monitoring stack watching a 3-node API server cluster:

[api-server-1:8080] --\
[api-server-2:8080] ----> [Prometheus] --evaluates--> [Alert Rules]
[api-server-3:8080] --/       |                            |
                              |                       [Alertmanager]
                         [Grafana]                         |
                              |                       [Slack Webhook]
                         Dashboards

Prometheus scrapes metrics from the API servers, evaluates alerting rules, and fires alerts to Alertmanager, which routes to Slack. Grafana queries Prometheus for dashboard visualization.
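For context, here is a minimal sketch of the kind of rule file Prometheus is expected to load from the rule_files path. It reuses the api-server.rules group and api-server job that appear later in this document; the alert name and thresholds are illustrative, not taken from the incident:

```yaml
# rules/api-server.yaml -- picked up by the rule_files glob
groups:
  - name: api-server.rules
    rules:
      - alert: ApiServerDown                 # illustrative alert name
        expr: up{job="api-server"} == 0      # target failed its last scrape
        for: 2m                              # must hold for 2 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 2 minutes"
```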

What's Broken

Root cause: On December 1, a configuration refactor moved the rule_files directive in the Prometheus Helm values. The configured path (/etc/prometheus/rules/*.yaml) does not match where the Helm chart actually mounts the rule files. Prometheus starts successfully and scrapes targets normally, but finds no rule files at the specified path.
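In Helm values form, the mismatch looks roughly like this (a sketch; /etc/config/ is where this chart mounts serverFiles, as the fix below confirms):

```yaml
serverFiles:
  prometheus.yml:
    rule_files:
      - /etc/prometheus/rules/*.yaml   # configured path: nothing is mounted here
# The chart mounts serverFiles under /etc/config/, so this glob matches zero
# files; Prometheus logs a warning and continues with no rules loaded.
```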

The result:

1. Zero alerting rules are loaded — prometheus_rule_group_last_evaluation_timestamp_seconds is 0
2. Alertmanager receives no alerts — "No alerts to notify about"
3. The system could be on fire and no one would be paged

Secondary issue: the scrape interval was changed to 5 minutes (visible as prometheus_target_interval_length_seconds{interval="5m0s"}). If the previous interval was 15s or 30s, Grafana dashboards that use rate() or irate() with short range windows will find too few samples per window and render sparse gaps or no data at all.
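Why rate() breaks: it needs at least two samples inside its range window to compute a slope. A quick back-of-the-envelope check (plain shell arithmetic, no cluster access needed):

```shell
# How many scrapes can land inside a rate() window?
scrape_interval=300   # 5m, the accidental interval, in seconds
window=60             # a typical dashboard query: rate(metric[1m])
max_samples=$(( window / scrape_interval + 1 ))
echo "at most ${max_samples} sample(s) per window"
# Fewer than 2 samples means rate()/irate() return no data for that series.
```

With the original 15s interval the same 1m window holds up to 5 samples, which is why these dashboards used to work.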

Key clue: The Prometheus log warning "No rule files found matching pattern /etc/prometheus/rules/*.yaml" is the smoking gun. The rule evaluation metrics at 0 confirm no rules are loaded.
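The warning is easy to isolate from the log stream; a minimal sketch using a captured sample line (in practice, pipe kubectl logs -n monitoring prometheus-server-0 -c prometheus into the grep):

```shell
# Count occurrences of the rule-file warning in a log stream.
# The sample line stands in for real `kubectl logs` output.
sample_log='level=warn msg="No rule files found matching pattern /etc/prometheus/rules/*.yaml"'
echo "$sample_log" | grep -c 'No rule files found'
```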

The Fix

Immediate (find and fix the rule path)

# Find where rules are actually mounted
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- find / -name "*.yaml" -path "*/rules/*" 2>/dev/null

# Check the Helm chart's actual mount paths
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- ls -la /etc/config/

# Check Prometheus config to see what it expects
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- cat /etc/config/prometheus.yml | grep -A5 rule_files

Permanent (fix in Helm values)

Update rule_files to match the Helm chart's mount path:

serverFiles:
  prometheus.yml:
    rule_files:
      - /etc/config/rules/*.yaml    # Match the actual Helm chart mount point

Also verify the scrape interval is intentional:

    scrape_configs:
      - job_name: 'api-server'
        scrape_interval: 15s    # Restore if 5m was accidental
        scrape_timeout: 10s

Then redeploy:

helm upgrade prometheus prometheus-community/prometheus \
  -n monitoring -f helm/prometheus-values.yaml

# Verify rules are loaded
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
  promtool query instant http://localhost:9090 'prometheus_rule_group_rules{rule_group="api-server.rules"}'

Verification

# Check rules are loaded
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
  curl -s http://localhost:9090/api/v1/rules | jq '.data.groups | length'

# Check rule evaluation is happening
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
  promtool query instant http://localhost:9090 'prometheus_rule_group_last_evaluation_timestamp_seconds'

# List any alerts currently in the firing state
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
  curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.state=="firing")'

# Check Alertmanager received something
kubectl logs -n monitoring alertmanager-0 --tail=10

Artifact Decoder

Artifact    | What It Revealed                                                                              | What Was Misleading
----------- | --------------------------------------------------------------------------------------------- | -------------------
CLI Output  | up returns 1 for all targets — scraping works; all monitoring pods Running                    | Everything looks healthy — the problem is invisible from standard status checks
Metrics     | Rule evaluation timestamp = 0 means no rules loaded; scrape interval is 5m (possibly changed) | up{} => 1 and scrape working normally makes Prometheus look fully functional
IaC Snippet | rule_files path does not match Helm chart mount; comment reveals recent refactor              | The config looks syntactically correct — the path just does not match the filesystem
Log Lines   | "No rule files found matching pattern" is the definitive clue                                 | "Completed loading of configuration file" makes it sound like everything loaded successfully

Skills Demonstrated

  • Understanding the Prometheus rule evaluation pipeline (scrape -> evaluate -> alert)
  • Using Prometheus self-monitoring metrics to diagnose configuration issues
  • Reading Prometheus logs for configuration warnings
  • Understanding Helm chart file mounting mechanics
  • Recognizing the difference between "monitoring is up" and "alerting is functional"

Prerequisite Topic Packs