# Synthetic Monitoring — Street-Level Ops

## Quick Diagnosis Commands
```bash
# Manually trigger a blackbox probe (bypass Prometheus — direct test)
curl -v "http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx"

# Check probe_success for a specific target in Prometheus
promtool query instant http://localhost:9090 \
  'probe_success{instance="https://api.example.com/health"}'

# List all endpoints being probed and their current status
promtool query instant http://localhost:9090 \
  'probe_success{job=~"blackbox.*"}'

# Find endpoints that have been down in the last hour
promtool query range \
  --start=$(date -d '1 hour ago' +%s) \
  --end=$(date +%s) \
  --step=60 \
  http://localhost:9090 \
  'min_over_time(probe_success{job="blackbox-http"}[1h]) == 0'

# Check SSL certificate expiry (days remaining)
promtool query instant http://localhost:9090 \
  '(probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) / 86400'

# Check blackbox exporter health
curl -s http://blackbox-exporter:9115/-/healthy
curl -s http://blackbox-exporter:9115/-/ready

# View the currently loaded blackbox exporter configuration
curl -s http://blackbox-exporter:9115/config

# Check probe response time breakdown by phase (DNS, TCP, TLS, processing)
promtool query instant http://localhost:9090 \
  'probe_http_duration_seconds{job="blackbox-http", instance="https://api.example.com/health"}'

# Find the slowest endpoints
promtool query instant http://localhost:9090 \
  'sort_desc(probe_duration_seconds{job="blackbox-http"})'

# Count probes per job to verify scraping is running
promtool query instant http://localhost:9090 \
  'count by (job) (probe_success)'
```
## Gotcha: probe_success Returns No Data (Not 0 — Missing)
`probe_success == 0` means the probe ran and failed. `probe_success` returning no data at all means Prometheus is not scraping the Blackbox Exporter, or the relabeling is broken.
Diagnosis:
```bash
# Step 1: Check that Prometheus is scraping the blackbox exporter job
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job | startswith("blackbox"))'

# Step 2: Look for scrape errors
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job | startswith("blackbox")) | .lastError'

# Step 3: Check the Prometheus config (multi-target pattern):
#   the probe target must be in __param_target, NOT __address__.
#   Common mistake: forgetting the relabel_configs entirely.

# Step 4: Test the exporter directly
curl -s "http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx" | grep probe_success

# Step 5: Check the exporter is reachable from Prometheus
kubectl exec -n monitoring prometheus-0 -- curl -s http://blackbox-exporter:9115/-/healthy
```
The most common cause is a missing or wrong `relabel_configs` block in `prometheus.yml`. Without it, Prometheus tries to scrape `https://example.com:9115` directly instead of routing the probe through the exporter.
Under the hood: the Blackbox Exporter uses Prometheus's multi-target exporter pattern. The `__address__` label must point to the exporter (e.g., `blackbox:9115`), not the target; the actual probe target goes into `__param_target`. The `relabel_configs` swap these. If you skip the relabeling, Prometheus scrapes the target itself on port 9115, gets connection refused, and you see no data at all (not even a failed probe).
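A minimal sketch of a correct multi-target job, assuming the exporter is reachable as `blackbox-exporter:9115` (adjust names to your deployment):

```yaml
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://example.com"]   # the thing you want probed
    relabel_configs:
      - source_labels: [__address__]       # move the target into a query param
        target_label: __param_target
      - source_labels: [__param_target]    # keep a readable instance label
        target_label: instance
      - target_label: __address__          # scrape the exporter itself
        replacement: blackbox-exporter:9115
```

The last rule is the one that makes Prometheus scrape the exporter rather than the target.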
## Gotcha: Probe Succeeds But Service Is Actually Down
Your probe checks `https://api.example.com/health` and returns `probe_success = 1`. But users report errors. The health endpoint returns 200 even when backends are unhealthy — it only checks that the process is alive, not that the database is connected.

Rule: Do not use health/readiness endpoints as synthetic probe targets unless they perform deep health checks. A pod's `/healthz` is designed for Kubernetes liveness, not for availability monitoring.
```bash
# Better probe targets (confirm real user paths work):
# - Login page (renders correctly, includes auth tokens)
# - A read-only API endpoint that requires database access
# - A page served through your CDN (confirms the CDN is passing traffic)

# Test what your probe actually hits
curl -v https://api.example.com/v1/status  # prefer over /health
# Confirm it exercises a real code path, not a shortcut
```

Add body validation to your blackbox module to confirm the response content is correct, not just the status code:

```yaml
modules:
  http_2xx_with_body:
    prober: http
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - '"status":"ok"'  # verify body content, not just status code
```
## Pattern: Dependency Probing in Kubernetes
```yaml
# Probe all critical dependencies via TCP to confirm they're reachable.
# Add to your blackbox Prometheus scrape config:
scrape_configs:
  - job_name: "blackbox-dependencies"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres.production.svc.cluster.local:5432
          - redis-primary.production.svc.cluster.local:6379
          - kafka.production.svc.cluster.local:9092
          - elasticsearch.production.svc.cluster.local:9200
        labels:
          environment: production
          probe_type: dependency
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.monitoring.svc.cluster.local:9115
```
Alert on dependency failures:
```yaml
- alert: DatabaseUnreachable
  expr: probe_success{job="blackbox-dependencies", instance=~"postgres.*"} == 0
  for: 1m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "PostgreSQL is unreachable from probe"
    description: "TCP probe to {{ $labels.instance }} is failing. Check network policies and database health."
```
## Gotcha: ICMP Probes Fail in Kubernetes Without NET_RAW Capability
ICMP (ping) probes require raw socket access. The Blackbox Exporter pod will silently fail ICMP probes if it does not have the NET_RAW capability.
```yaml
# Blackbox Exporter deployment — ICMP requires NET_RAW
spec:
  template:
    spec:
      containers:
        - name: blackbox-exporter
          image: prom/blackbox-exporter:latest  # pin a specific version in production
          securityContext:
            capabilities:
              add:
                - NET_RAW  # required for ICMP probes
```
If your cluster uses a restricted PodSecurityAdmission policy, replace ICMP probes with TCP probes to port 80 or another known-open port:
```yaml
# Replace ICMP ping with a plain TCP connect check (e.g., to port 80)
modules:
  tcp_ping:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - send: ""  # just connect, don't send anything
```
## Scenario: Certificate Expiry Alert — Investigate and Renew

An alert fires: `SSLCertificateExpiryWarning` for `https://api.example.com`.
```bash
# Step 1: Confirm the expiry timestamp from Prometheus
promtool query instant http://localhost:9090 \
  'probe_ssl_earliest_cert_expiry{instance="https://api.example.com/health"}'

# Returns a Unix timestamp — convert it via the HTTP API
# (promtool's default output is plain text, not JSON, so jq needs the API):
date -d @$(curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=probe_ssl_earliest_cert_expiry{instance="https://api.example.com/health"}' \
  | jq -r '.data.result[0].value[1]')

# Step 2: Verify directly with openssl
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates
# notAfter= shows the expiry date

# Step 3: Check the certificate chain (confirm intermediate certs are also valid)
echo | openssl s_client -showcerts -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  grep -E "(subject|issuer|notAfter)"
```
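The `notAfter=` string from Step 2 can be turned into a days-remaining figure with GNU `date` arithmetic; a sketch with a placeholder date (substitute the real value from openssl):

```shell
# Placeholder notAfter value — replace with the string openssl printed
not_after="Dec 31 23:59:59 2030 GMT"

# GNU date parses the openssl format directly; compute whole days left
days_left=$(( ( $(date -d "$not_after" +%s) - $(date +%s) ) / 86400 ))
echo "$days_left days until expiry"
```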
```bash
# Step 4: If using cert-manager in Kubernetes
kubectl get certificate -A
kubectl describe certificate api-tls-cert -n production
# Look for: Status Ready, and the expiry date

# Step 5: Force cert-manager to renew
kubectl annotate certificate api-tls-cert cert-manager.io/force-renew="$(date +%s)" -n production

# Step 6: Monitor renewal
kubectl get certificaterequest -n production --watch
```
## Pattern: Blackbox Exporter with Service Discovery (Kubernetes)

Instead of `static_configs`, discover all Services carrying a specific annotation:
```yaml
# prometheus.yml — Kubernetes service discovery for blackbox
scrape_configs:
  - job_name: "blackbox-k8s-services"
    metrics_path: /probe
    params:
      module: [http_2xx]
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names: [production, staging]
    relabel_configs:
      # Only probe services with the synthetic monitoring annotation
      - source_labels: [__meta_kubernetes_service_annotation_synthetic_monitoring_enabled]
        action: keep
        regex: "true"
      # Default: probe the service address at /health
      - source_labels: [__address__]
        target_label: __param_target
        regex: (.+)
        replacement: "http://$1/health"
      # Override with the annotated path when one is set
      # (must come AFTER the default, and regex (.+) skips services
      # without the annotation — otherwise every target becomes "https://")
      - source_labels: [__meta_kubernetes_service_annotation_synthetic_monitoring_path]
        target_label: __param_target
        regex: (.+)
        replacement: "https://$1"
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
      # Preserve useful service metadata as labels
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```
Enable probing for a Service by adding the annotations:

```yaml
metadata:
  annotations:
    synthetic-monitoring/enabled: "true"
    synthetic-monitoring/path: "api.example.com/health"
```
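Note that Prometheus maps annotation keys to meta labels by replacing every character outside `[a-zA-Z0-9_]` with an underscore, which is why the dash-and-slash annotation key matches the underscore-only relabel rule above. A quick sketch of the mapping:

```shell
annotation="synthetic-monitoring/enabled"

# Replicate Prometheus's label-name sanitization: non [a-zA-Z0-9_] -> "_"
label="__meta_kubernetes_service_annotation_$(echo "$annotation" | sed 's/[^a-zA-Z0-9_]/_/g')"
echo "$label"
```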
## Emergency: All Probes Failing Simultaneously

All `probe_success` metrics drop to 0 at the same time. This is almost certainly a monitoring infrastructure problem, not a mass outage.
```bash
# Step 1: Rule out the Blackbox Exporter being down
curl -s http://blackbox-exporter:9115/-/healthy
kubectl get pods -n monitoring -l app=blackbox-exporter

# Step 2: Rule out a Prometheus scrape issue
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "blackbox-http") | .health'

# Step 3: Check whether external connectivity from the monitoring namespace is broken
kubectl exec -n monitoring blackbox-exporter-xxx -- \
  curl -s https://google.com -o /dev/null -w "%{http_code}"
# If this fails, the monitoring namespace has lost external connectivity (NetworkPolicy, egress rule)

# Step 4: Check network policies
kubectl get networkpolicy -n monitoring
kubectl describe networkpolicy -n monitoring

# Step 5: If connectivity is fine, check DNS from the exporter pod
kubectl exec -n monitoring blackbox-exporter-xxx -- nslookup api.example.com

# Step 6: Silence the false alerts while investigating
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --duration=30m \
  --comment="Investigating mass probe failure — likely monitoring infra issue" \
  job="blackbox-http"
```
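A sketch of a meta-alert for this failure mode, assuming the job name `blackbox-http`: alert on the exporter scrape itself (`up`), so that mass `probe_success` drops can be attributed to monitoring infrastructure rather than a real outage:

```yaml
- alert: BlackboxExporterScrapeFailing
  # If every scrape of the blackbox job fails, the exporter (or the path
  # to it) is down and probe data for this job is untrustworthy.
  expr: avg(up{job="blackbox-http"}) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Blackbox Exporter scrapes are failing — probe data untrustworthy"
    description: "Treat concurrent probe_success alerts as suspect until the exporter scrape recovers."
```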
## Useful One-Liners
```bash
# Manual probe test with timing breakdown
curl -w "dns:%{time_namelookup} tcp:%{time_connect} tls:%{time_appconnect} total:%{time_total}\n" \
  -o /dev/null -s https://api.example.com/health

# Check SSL expiry for a domain directly (no Prometheus needed; -servername sets SNI)
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2

# Check if the HTTP response body matches expected content
curl -s https://api.example.com/health | jq '.status'

# Test DNS resolution speed
time nslookup api.example.com 8.8.8.8

# Check response time from multiple endpoints at once
for url in https://api.example.com/health https://app.example.com https://payments.example.com/status; do
  printf "%s: " "$url"
  curl -o /dev/null -s -w "%{time_total}s %{http_code}\n" "$url"
done

# Port-forward the blackbox UI (shows debug info)
kubectl port-forward -n monitoring svc/blackbox-exporter 9115:9115

# List all endpoints with their 30-day availability (percent)
promtool query instant http://localhost:9090 \
  'sort(avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100)'

# Find certificates expiring in under 14 days
promtool query instant http://localhost:9090 \
  'probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 14 * 86400'

# Reload blackbox exporter config without a restart
curl -X POST http://blackbox-exporter:9115/-/reload
```
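The 30-day availability one-liner pairs naturally with an error-budget calculation; a sketch, where the SLO target and window are assumptions to replace with your own:

```shell
slo=99.9          # target availability percent (assumption)
window_days=30    # SLO window (assumption)

# Allowed downtime = window minutes * (100 - SLO) / 100
budget_minutes=$(awk -v slo="$slo" -v d="$window_days" \
  'BEGIN { printf "%.1f", d * 24 * 60 * (100 - slo) / 100 }')
echo "$budget_minutes minutes of allowed downtime per $window_days days"
```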
Gotcha: Synthetic probes from a single location give you uptime from that location's perspective, not your users'. A probe in us-east-1 will not detect a routing issue that only affects European users. For real availability measurement, run probes from at least two geographically distinct locations and alert on concurrent failures to reduce false positives.
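One way to express "alert on concurrent failures", sketched under the assumption that each probe location produces its own `probe_success` series per `instance` (e.g., distinguished by a `region` label added via federation or remote write):

```yaml
- alert: EndpointDownMultiRegion
  # Counts how many probe locations currently see the target down;
  # requires at least two to agree before firing.
  expr: count by (instance) (probe_success{job="blackbox-http"} == 0) >= 2
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is down from multiple probe locations"
```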