# Synthetic Monitoring — Street-Level Ops

## Quick Diagnosis Commands
```bash
# Manually trigger a blackbox probe (bypass Prometheus — direct test)
curl -v "http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx"

# Check probe_success for a specific target in Prometheus
promtool query instant http://localhost:9090 \
  'probe_success{instance="https://api.example.com/health"}'

# List all endpoints being probed and their current status
promtool query instant http://localhost:9090 \
  'probe_success{job=~"blackbox.*"}'

# Find endpoints that have been down in the last hour
promtool query range \
  --start=$(date -d '1 hour ago' +%s) \
  --end=$(date +%s) \
  --step=60 \
  http://localhost:9090 \
  'min_over_time(probe_success{job="blackbox-http"}[1h]) == 0'

# Check SSL certificate expiry (days remaining)
promtool query instant http://localhost:9090 \
  '(probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) / 86400'

# Check blackbox exporter health
curl -s http://blackbox-exporter:9115/-/healthy
curl -s http://blackbox-exporter:9115/-/ready

# View the currently loaded blackbox exporter configuration
curl -s http://blackbox-exporter:9115/config

# Check probe response time breakdown by phase (DNS, TCP, TLS, processing)
promtool query instant http://localhost:9090 \
  'probe_http_duration_seconds{job="blackbox-http", instance="https://api.example.com/health"}'

# Find the slowest endpoints
promtool query instant http://localhost:9090 \
  'sort_desc(probe_duration_seconds{job="blackbox-http"})'

# Count probes per job to verify scraping is running
promtool query instant http://localhost:9090 \
  'count by (job) (probe_success)'
```
## Gotcha: probe_success Returns No Data (Not 0 — Missing)
`probe_success == 0` means the probe ran and failed. `probe_success` returning no data at all means Prometheus is not scraping the Blackbox Exporter, or the relabeling is broken.
Diagnosis:
```bash
# Step 1: Check that Prometheus is scraping the blackbox exporter job
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job | startswith("blackbox"))'

# Step 2: Look for scrape errors
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job | startswith("blackbox")) | .lastError'

# Step 3: Check the Prometheus config (multi-target pattern):
#   the probe target must be in __param_target, NOT __address__.
#   Common mistake: forgetting the relabel_configs entirely.

# Step 4: Test the exporter directly
curl -s "http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx" | grep probe_success

# Step 5: Check the exporter is reachable from Prometheus
kubectl exec -n monitoring prometheus-0 -- curl -s http://blackbox-exporter:9115/-/healthy
```
The most common cause is a missing or wrong `relabel_configs` block in `prometheus.yml`. Without it, Prometheus tries to scrape `https://example.com:9115` directly instead of routing the probe through the exporter.
Under the hood: the Blackbox Exporter uses Prometheus's multi-target exporter pattern. The `__address__` label must point to the exporter (e.g., `blackbox:9115`), not the target; the actual probe target goes into `__param_target`. The `relabel_configs` swap these. If you skip the relabeling, Prometheus scrapes the target itself on port 9115, gets connection refused, and you see no data at all (not even a failed probe).
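A minimal sketch of a correct multi-target job, assuming the exporter is reachable as `blackbox-exporter:9115` (adjust names to your deployment):

```yaml
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://example.com"]   # the thing you want probed
    relabel_configs:
      - source_labels: [__address__]       # move the target into a query param
        target_label: __param_target
      - source_labels: [__param_target]    # keep a readable instance label
        target_label: instance
      - target_label: __address__          # scrape the exporter itself
        replacement: blackbox-exporter:9115
```

The last rule is the one that makes Prometheus scrape the exporter rather than the target.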
## Gotcha: Probe Succeeds But Service Is Actually Down
Your probe checks `https://api.example.com/health` and returns `probe_success = 1`. But users report errors. The health endpoint returns 200 even when backends are unhealthy — it only checks that the process is alive, not that the database is connected.

Rule: Do not use health/readiness endpoints as synthetic probe targets unless they perform deep health checks. A pod's `/healthz` is designed for Kubernetes liveness, not for availability monitoring.
```bash
# Better probe targets (confirm real user paths work):
# - Login page (renders correctly, includes auth tokens)
# - A read-only API endpoint that requires database access
# - A page served through your CDN (confirms the CDN is passing traffic)

# Test what your probe actually hits
curl -v https://api.example.com/v1/status  # prefer over /health
# Confirm it exercises a real code path, not a shortcut
```

Add body validation to your blackbox module to confirm the response content is correct, not just the status code:

```yaml
modules:
  http_2xx_with_body:
    prober: http
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - '"status":"ok"'  # verify body content, not just status code
```
## Pattern: Dependency Probing in Kubernetes
```yaml
# Probe all critical dependencies via TCP to confirm they're reachable.
# Add to your blackbox Prometheus scrape config:
scrape_configs:
  - job_name: "blackbox-dependencies"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres.production.svc.cluster.local:5432
          - redis-primary.production.svc.cluster.local:6379
          - kafka.production.svc.cluster.local:9092
          - elasticsearch.production.svc.cluster.local:9200
        labels:
          environment: production
          probe_type: dependency
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.monitoring.svc.cluster.local:9115
```
Alert on dependency failures:
```yaml
- alert: DatabaseUnreachable
  expr: probe_success{job="blackbox-dependencies", instance=~"postgres.*"} == 0
  for: 1m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "PostgreSQL is unreachable from probe"
    description: "TCP probe to {{ $labels.instance }} is failing. Check network policies and database health."
```
## Gotcha: ICMP Probes Fail in Kubernetes Without NET_RAW Capability
ICMP (ping) probes require raw socket access. The Blackbox Exporter pod will silently fail ICMP probes if it does not have the NET_RAW capability.
```yaml
# Blackbox Exporter deployment — ICMP requires NET_RAW
spec:
  template:
    spec:
      containers:
        - name: blackbox-exporter
          image: prom/blackbox-exporter:latest  # pin a specific version in production
          securityContext:
            capabilities:
              add:
                - NET_RAW  # required for ICMP probes
```
If your cluster uses a restricted PodSecurityAdmission policy, replace ICMP probes with TCP probes to port 80 or another known-open port:
```yaml
# Replace ICMP ping with a plain TCP connect check (e.g., to port 80)
modules:
  tcp_ping:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - send: ""  # just connect, don't send anything
```
## Scenario: Certificate Expiry Alert — Investigate and Renew

An alert fires: `SSLCertificateExpiryWarning` for `https://api.example.com`.
```bash
# Step 1: Confirm the expiry timestamp from Prometheus
promtool query instant http://localhost:9090 \
  'probe_ssl_earliest_cert_expiry{instance="https://api.example.com/health"}'

# Returns a Unix timestamp — convert it via the HTTP API
# (promtool's default output is plain text, not JSON, so jq needs the API):
date -d @$(curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=probe_ssl_earliest_cert_expiry{instance="https://api.example.com/health"}' \
  | jq -r '.data.result[0].value[1]')

# Step 2: Verify directly with openssl
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates
# notAfter= shows the expiry date

# Step 3: Check the certificate chain (confirm intermediate certs are also valid)
echo | openssl s_client -showcerts -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  grep -E "(subject|issuer|notAfter)"
```
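The `notAfter=` string from Step 2 can be turned into a days-remaining figure with GNU `date` arithmetic; a sketch with a placeholder date (substitute the real value from openssl):

```shell
# Placeholder notAfter value — replace with the string openssl printed
not_after="Dec 31 23:59:59 2030 GMT"

# GNU date parses the openssl format directly; compute whole days left
days_left=$(( ( $(date -d "$not_after" +%s) - $(date +%s) ) / 86400 ))
echo "$days_left days until expiry"
```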
```bash
# Step 4: If using cert-manager in Kubernetes
kubectl get certificate -A
kubectl describe certificate api-tls-cert -n production
# Look for: Status Ready, and the expiry date

# Step 5: Force cert-manager to renew
kubectl annotate certificate api-tls-cert cert-manager.io/force-renew="$(date +%s)" -n production

# Step 6: Monitor renewal
kubectl get certificaterequest -n production --watch
```
## Pattern: Blackbox Exporter with Service Discovery (Kubernetes)

Instead of `static_configs`, discover all Services carrying a specific annotation:
```yaml
# prometheus.yml — Kubernetes service discovery for blackbox
scrape_configs:
  - job_name: "blackbox-k8s-services"
    metrics_path: /probe
    params:
      module: [http_2xx]
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names: [production, staging]
    relabel_configs:
      # Only probe services with the synthetic monitoring annotation
      - source_labels: [__meta_kubernetes_service_annotation_synthetic_monitoring_enabled]
        action: keep
        regex: "true"
      # Default: probe the service address at /health
      - source_labels: [__address__]
        target_label: __param_target
        regex: (.+)
        replacement: "http://$1/health"
      # Override with the annotated path when one is set
      # (must come AFTER the default, and regex (.+) skips services
      # without the annotation — otherwise every target becomes "https://")
      - source_labels: [__meta_kubernetes_service_annotation_synthetic_monitoring_path]
        target_label: __param_target
        regex: (.+)
        replacement: "https://$1"
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
      # Preserve useful service metadata as labels
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```
Enable probing for a Service by adding the annotations:

```yaml
metadata:
  annotations:
    synthetic-monitoring/enabled: "true"
    synthetic-monitoring/path: "api.example.com/health"
```
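Note that Prometheus maps annotation keys to meta labels by replacing every character outside `[a-zA-Z0-9_]` with an underscore, which is why the dash-and-slash annotation key matches the underscore-only relabel rule above. A quick sketch of the mapping:

```shell
annotation="synthetic-monitoring/enabled"

# Replicate Prometheus's label-name sanitization: non [a-zA-Z0-9_] -> "_"
label="__meta_kubernetes_service_annotation_$(echo "$annotation" | sed 's/[^a-zA-Z0-9_]/_/g')"
echo "$label"
```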
## Emergency: All Probes Failing Simultaneously

All `probe_success` metrics drop to 0 at the same time. This is almost certainly a monitoring infrastructure problem, not a mass outage.
```bash
# Step 1: Rule out the Blackbox Exporter being down
curl -s http://blackbox-exporter:9115/-/healthy
kubectl get pods -n monitoring -l app=blackbox-exporter

# Step 2: Rule out a Prometheus scrape issue
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "blackbox-http") | .health'

# Step 3: Check whether external connectivity from the monitoring namespace is broken
kubectl exec -n monitoring blackbox-exporter-xxx -- \
  curl -s https://google.com -o /dev/null -w "%{http_code}"
# If this fails, the monitoring namespace has lost external connectivity (NetworkPolicy, egress rule)

# Step 4: Check network policies
kubectl get networkpolicy -n monitoring
kubectl describe networkpolicy -n monitoring

# Step 5: If connectivity is fine, check DNS from the exporter pod
kubectl exec -n monitoring blackbox-exporter-xxx -- nslookup api.example.com

# Step 6: Silence the false alerts while investigating
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --duration=30m \
  --comment="Investigating mass probe failure — likely monitoring infra issue" \
  job="blackbox-http"
```
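A sketch of a meta-alert for this failure mode, assuming the job name `blackbox-http`: alert on the exporter scrape itself (`up`), so that mass `probe_success` drops can be attributed to monitoring infrastructure rather than a real outage:

```yaml
- alert: BlackboxExporterScrapeFailing
  # If every scrape of the blackbox job fails, the exporter (or the path
  # to it) is down and probe data for this job is untrustworthy.
  expr: avg(up{job="blackbox-http"}) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Blackbox Exporter scrapes are failing — probe data untrustworthy"
    description: "Treat concurrent probe_success alerts as suspect until the exporter scrape recovers."
```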
## Useful One-Liners
```bash
# Manual probe test with timing breakdown
curl -w "dns:%{time_namelookup} tcp:%{time_connect} tls:%{time_appconnect} total:%{time_total}\n" \
  -o /dev/null -s https://api.example.com/health

# Check SSL expiry for a domain directly (no Prometheus needed; -servername sets SNI)
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2

# Check if the HTTP response body matches expected content
curl -s https://api.example.com/health | jq '.status'

# Test DNS resolution speed
time nslookup api.example.com 8.8.8.8

# Check response time from multiple endpoints at once
for url in https://api.example.com/health https://app.example.com https://payments.example.com/status; do
  printf "%s: " "$url"
  curl -o /dev/null -s -w "%{time_total}s %{http_code}\n" "$url"
done

# Port-forward the blackbox UI (shows debug info)
kubectl port-forward -n monitoring svc/blackbox-exporter 9115:9115

# List all endpoints with their 30-day availability (percent)
promtool query instant http://localhost:9090 \
  'sort(avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100)'

# Find certificates expiring in under 14 days
promtool query instant http://localhost:9090 \
  'probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 14 * 86400'

# Reload blackbox exporter config without a restart
curl -X POST http://blackbox-exporter:9115/-/reload
```
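The 30-day availability one-liner pairs naturally with an error-budget calculation; a sketch, where the SLO target and window are assumptions to replace with your own:

```shell
slo=99.9          # target availability percent (assumption)
window_days=30    # SLO window (assumption)

# Allowed downtime = window minutes * (100 - SLO) / 100
budget_minutes=$(awk -v slo="$slo" -v d="$window_days" \
  'BEGIN { printf "%.1f", d * 24 * 60 * (100 - slo) / 100 }')
echo "$budget_minutes minutes of allowed downtime per $window_days days"
```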
Gotcha: Synthetic probes from a single location give you uptime from that location's perspective, not your users'. A probe in us-east-1 will not detect a routing issue that only affects European users. For real availability measurement, run probes from at least two geographically distinct locations and alert on concurrent failures to reduce false positives.
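One way to express "alert on concurrent failures", sketched under the assumption that each probe location produces its own `probe_success` series per `instance` (e.g., distinguished by a `region` label added via federation or remote write):

```yaml
- alert: EndpointDownMultiRegion
  # Counts how many probe locations currently see the target down;
  # requires at least two to agree before firing.
  expr: count by (instance) (probe_success{job="blackbox-http"} == 0) >= 2
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is down from multiple probe locations"
```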