Answer Key: The Alerts That Stopped Firing¶
The System¶
A standard Prometheus monitoring stack watching a 3-node API server cluster:
[api-server-1:8080] --\
[api-server-2:8080] ----> [Prometheus] --evaluates--> [Alert Rules]
[api-server-3:8080] --/        |                           |
                               |                    [Alertmanager]
                           [Grafana]                       |
                               |                    [Slack Webhook]
                          Dashboards
Prometheus scrapes metrics from the API servers, evaluates alerting rules, and fires alerts to Alertmanager, which routes to Slack. Grafana queries Prometheus for dashboard visualization.
What's Broken¶
Root cause: On December 1, a configuration refactor moved the rule_files directive in the Prometheus Helm values. The configured path (/etc/prometheus/rules/*.yaml) does not match where the Helm chart actually mounts the rule files. Prometheus starts successfully and scrapes targets normally, but finds no rule files at the specified path.
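This failure mode is easy to reproduce outside the cluster: a rule_files glob that matches nothing is not an error, so Prometheus boots cleanly with zero rules. A minimal sketch, using hypothetical /tmp paths as a stand-in for the container filesystem:

```shell
# Stand-in for the container filesystem: the chart mounts rules here...
mkdir -p /tmp/rules-demo/etc/config/rules
touch /tmp/rules-demo/etc/config/rules/api-server.yaml

# ...but the config still points at the old path. The glob silently matches nothing.
ls /tmp/rules-demo/etc/prometheus/rules/*.yaml 2>/dev/null | wc -l   # 0 files found

# The same glob against the real mount point finds the rules.
ls /tmp/rules-demo/etc/config/rules/*.yaml 2>/dev/null | wc -l       # 1 file found
```

The asymmetry is the whole bug: an empty glob and a populated glob are both "success" from the process's point of view, which is why only the log warning and the self-monitoring metrics reveal it.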
The result:
1. Zero alerting rules are loaded — prometheus_rule_group_last_evaluation_timestamp_seconds is 0
2. Alertmanager receives no alerts — "No alerts to notify about"
3. The system could be on fire and no one would be paged
Secondary issue: the scrape interval was changed to 5 minutes (visible in prometheus_target_interval_length_seconds{interval="5m0s"}). If the previous interval was 15s or 30s, Grafana dashboards that use rate() or irate() with short ranges will show sparse or incorrect data.
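The arithmetic behind the dashboard breakage: rate() and irate() need at least two samples inside the query's range window. A quick sketch, assuming a typical 2-minute dashboard range:

```shell
# Samples that land inside a rate() window = range / scrape_interval.
# rate() and irate() need at least 2 samples in the window to produce a value.
range=120                       # a rate(...[2m]) dashboard query, in seconds
echo $((range / 15))            # 15s scrape -> 8 samples: rate() works
echo $((range / 300))           # 5m scrape  -> 0 samples: rate() returns no data
```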
Key clue: The Prometheus log warning "No rule files found matching pattern /etc/prometheus/rules/*.yaml" is the smoking gun. The rule evaluation metrics at 0 confirm no rules are loaded.
The Fix¶
Immediate (find and fix the rule path)¶
# Find where rules are actually mounted
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- find / -name "*.yaml" -path "*/rules/*" 2>/dev/null
# Check the Helm chart's actual mount paths
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- ls -la /etc/config/
# Check Prometheus config to see what it expects
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- cat /etc/config/prometheus.yml | grep rule_files -A5
Permanent (fix in Helm values)¶
Update rule_files to match the Helm chart's mount path:
serverFiles:
  prometheus.yml:
    rule_files:
      - /etc/config/rules/*.yaml  # Match the actual Helm chart mount point
Also verify the scrape interval is intentional:
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s  # Restore if 5m was accidental
    scrape_timeout: 10s
Then redeploy:
helm upgrade prometheus prometheus-community/prometheus \
  -n monitoring -f helm/prometheus-values.yaml
# Verify rules are loaded
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
promtool query instant http://localhost:9090 'prometheus_rule_group_rules{rule_group="api-server.rules"}'
Verification¶
# Check rules are loaded
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Check rule evaluation is happening
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
promtool query instant http://localhost:9090 'prometheus_rule_group_last_evaluation_timestamp_seconds'
# Look for any currently firing alerts (inspects state; does not trigger anything)
kubectl exec -n monitoring prometheus-server-0 -c prometheus -- \
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.state=="firing")'
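For reference, this is the shape of a healthy /api/v1/rules response and how the jq filters above read it. The payload below is a hand-written sample, not captured output, and the group and rule names are illustrative:

```shell
# Hand-written sample of a healthy /api/v1/rules response (names are illustrative).
sample='{"status":"success","data":{"groups":[{"name":"api-server.rules",
  "rules":[{"name":"HighErrorRate","state":"firing"},
           {"name":"InstanceDown","state":"inactive"}]}]}}'

# Same filters as the live checks: count loaded groups, then list firing rules.
echo "$sample" | jq '.data.groups | length'                                      # 1
echo "$sample" | jq -r '.data.groups[].rules[] | select(.state=="firing") | .name'  # HighErrorRate
```

In the broken state, the first filter returns 0 — the groups array is empty even though the API answers successfully.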
# Check Alertmanager received something
kubectl logs -n monitoring alertmanager-0 --tail=10
Artifact Decoder¶
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | `up` returns 1 for all targets — scraping works; all monitoring pods Running | Everything looks healthy — the problem is invisible from standard status checks |
| Metrics | Rule evaluation timestamp = 0 means no rules loaded; scrape interval is 5m (possibly changed) | `up{}` => 1 and scraping working normally make Prometheus look fully functional |
| IaC Snippet | `rule_files` path does not match Helm chart mount; comment reveals recent refactor | The config looks syntactically correct — the path just does not match the filesystem |
| Log Lines | "No rule files found matching pattern" is the definitive clue | "Completed loading of configuration file" makes it sound like everything loaded successfully |
Skills Demonstrated¶
- Understanding the Prometheus rule evaluation pipeline (scrape -> evaluate -> alert)
- Using Prometheus self-monitoring metrics to diagnose configuration issues
- Reading Prometheus logs for configuration warnings
- Understanding Helm chart file mounting mechanics
- Recognizing the difference between "monitoring is up" and "alerting is functional"
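One way to harden against this class of failure is a deadman-switch alert: a rule that always fires, routed to a system that pages when the heartbeat stops arriving. A minimal sketch — the kube-prometheus stack ships a similar Watchdog rule; the group name, labels, and annotations here are illustrative:

```yaml
groups:
  - name: meta.rules
    rules:
      # Fires unconditionally. If this alert ever stops arriving at the
      # receiving end, the evaluate -> Alertmanager -> notify pipeline is
      # broken — exactly the silent failure in this incident.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing heartbeat; its absence means alerting is broken"
```

Had a rule like this been in place on December 1, the broken rule_files path would have surfaced as a missing heartbeat within minutes instead of lying dormant.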