
Synthetic Monitoring Footguns


1. Probing Health Endpoints That Bypass Real Logic

You probe /health or /readyz and consider the service monitored. These endpoints return 200 as long as the process is alive — they do not verify database connections, cache availability, or actual request processing. The database goes down; /health keeps returning 200; your synthetic monitor reports the service is up while real users get 500 errors.

Fix: Probe endpoints that exercise real application logic. Choose a lightweight read-only API endpoint that requires a database query (e.g., GET /v1/ping that queries SELECT 1 and returns the result), not a process-alive check. Supplement with dependency probes (TCP to the database port directly) so you have layered visibility: is the process alive? Is the database reachable? Can the process query the database?
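A sketch of a Blackbox Exporter HTTP module that fails unless the response body confirms the database round trip (the module name and the `"db": "ok"` response shape are assumptions; adapt the regexp to whatever your ping endpoint actually returns):

```yaml
modules:
  http_db_ping:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: [200]
      # Fail the probe unless the body proves the DB query succeeded,
      # not merely that the process answered.
      fail_if_body_not_matches_regexp:
        - '"db":\s*"ok"'
```

Pointed at GET /v1/ping, this turns "process alive" into "process can query the database."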


2. Not Using for: Duration — Single-Probe Failure Triggers Page

You configure an alert on probe_success == 0 with no for: duration, so a single failed probe pages on-call immediately. One probe failure (network hiccup, probe timeout, DNS flap) wakes someone at 3 AM. The endpoint was never actually down: one probe missed and auto-recovered. False positives destroy trust in synthetic monitoring faster than anything else.

Fix: Always set a minimum for: duration before alerting. For 99.9% SLO services, for: 2m with 30-second probe interval means you need 4 consecutive probe failures before paging — filters out transient flaps while still detecting real outages quickly. For 99.99% SLO services, you may need a shorter for: (1 minute) but ensure your probe interval is correspondingly tighter (10–15 seconds).
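A minimal rule following this guidance (job name assumed; tune for: to your SLO tier):

```yaml
- alert: EndpointDown
  expr: probe_success{job="blackbox-http"} == 0
  for: 2m               # ~4 consecutive failures at a 30s probe interval
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} failing probes for 2+ minutes"
```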


3. Wrong Relabeling in Multi-Target Pattern

You set up the Blackbox Exporter multi-target pattern but forget the relabel_configs. Without them, Prometheus treats each target URL as the scrape address and tries to fetch /probe from the target itself, never touching the exporter. The scrape fails with a connection or 404 error. You see no probe_success metrics at all, yet you think the endpoint is being monitored.

Fix: The multi-target pattern requires exactly these three relabel rules in order — this is not optional boilerplate:

relabel_configs:
  # 1. Move the target URL to the probe parameter
  - source_labels: [__address__]
    target_label: __param_target
  # 2. Set the instance label to the URL (for human-readable labels)
  - source_labels: [__param_target]
    target_label: instance
  # 3. Set the scrape address to the actual blackbox exporter
  - target_label: __address__
    replacement: blackbox-exporter:9115
After any prometheus.yml change, verify with curl "http://blackbox-exporter:9115/probe?target=YOUR_URL&module=http_2xx" to confirm the exporter can reach the target.
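For context, the three rules sit inside a scrape job like this (job name, module, targets, and exporter address are placeholders):

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe            # the exporter's probe endpoint, not /metrics
    params:
      module: [http_2xx]            # which blackbox module to run
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```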


4. Alerting from a Single Geographic Region

Your Blackbox Exporter runs in us-east-1. Your CDN has a routing bug that affects only European users — traffic from Europe is hitting a broken origin but US traffic is fine. Your probe returns success. European users experience outages for hours before a support ticket surfaces it.

Fix: Run synthetic probes from at least three geographic regions that match your user distribution. Alert only when probes from 2+ regions fail simultaneously (reducing false positives). Managed services like Checkly and Grafana Synthetic Monitoring include multi-region as a built-in feature. For self-hosted, deploy a lightweight Blackbox Exporter in each region and aggregate in a central Prometheus with a probe_region label.
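One way to express the 2-of-N-regions rule in PromQL, assuming each region's probes carry a probe_region label and land in the same central Prometheus:

```yaml
- alert: EndpointDownMultiRegion
  # Count how many regions report the target down; fire only at 2+,
  # so a single-region network blip cannot page on its own.
  expr: count by (instance) (probe_success{job="blackbox-http"} == 0) >= 2
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} down from multiple regions"
```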

War story: During the 2017 S3 outage in us-east-1, teams monitoring only from us-east-1 saw their own monitoring infrastructure fail alongside the services it was monitoring. Multi-region probes from eu-west-1 and ap-southeast-1 correctly detected the outage and triggered alerts, while single-region setups went dark.


5. Setting Probe Timeout Shorter Than Your SLO Latency Threshold

Your latency SLO is p99 < 3 seconds. Your blackbox probe timeout is 5 seconds. Under normal conditions this works. During a slowdown where responses take 4 seconds (violating SLO but not causing errors), your probe still succeeds within its 5-second window. You report 100% availability while your users experience SLO breaches.

Fix: Set the probe timeout to match or be slightly above your normal expected response time, and use the probe_duration_seconds metric with a separate alert to catch slowdowns:

- alert: EndpointSlow
  expr: probe_duration_seconds{job="blackbox-http"} > 2.0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Endpoint responding slowly (>2s)"
Probe timeout and latency alerting serve different purposes: timeout detects total unavailability; duration metrics detect degradation.


6. Checking the Wrong Protocol (HTTP When Service Uses HTTPS)

You probe http://api.example.com instead of https://api.example.com. The HTTP endpoint redirects to HTTPS. Your blackbox module has follow_redirects: false (default in some configs). The probe returns a 301 response code. Your module expects 2xx. probe_success = 0. You get paged for a non-existent outage.

Fix: Always probe the exact URL your users access, including protocol. Use follow_redirects: true in your module if redirects are expected, or explicitly allow the redirect status code. Run the manual probe first before committing it to your config:

curl -I http://api.example.com  # check if it redirects
curl "http://blackbox-exporter:9115/probe?target=http://api.example.com&module=http_2xx" | grep probe_http_status_code
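A module sketch that tolerates the HTTP-to-HTTPS redirect but still insists the final response is served over TLS (module name assumed; follow_redirects and fail_if_not_ssl are standard blackbox http options):

```yaml
modules:
  http_2xx_follow:
    prober: http
    timeout: 5s
    http:
      follow_redirects: true   # follow the 301 from http:// to https://
      valid_status_codes: [200]
      fail_if_not_ssl: true    # the final hop must be HTTPS
```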


7. Probing Internal Endpoints From an External Monitor

You configure Checkly to probe https://internal-api.corp.example.com — an endpoint only accessible within your VPC. All Checkly checks fail because they run from external infrastructure. You interpret the failures as outages and disable the checks. You lose visibility into that service entirely.

Fix: Match the probe location to the endpoint's accessibility. Internal endpoints need probes running inside the same network (VPC, cluster). Use the Blackbox Exporter running inside Kubernetes for internal endpoints, and external services (Checkly, Grafana Cloud Synthetic Monitoring) only for public-facing endpoints. If you need external-style probes for an internal endpoint, you need an externally accessible URL or a probe agent deployed inside your network.
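A sketch of an internal-only scrape job, assuming a Blackbox Exporter deployed as a Service in a monitoring namespace; the relabel rules are the same multi-target pattern, only the exporter address changes:

```yaml
- job_name: blackbox-internal
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://internal-api.corp.example.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter.monitoring.svc:9115   # in-cluster address
```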


8. Not Monitoring DNS Separately From HTTP

Your website goes down because DNS resolution is failing. Your HTTP probe also fails, correctly reporting probe_success = 0. But your alert says "HTTP endpoint down" — the on-call engineer starts debugging the application and web server before realizing DNS is the problem. This adds 15 minutes to your MTTR.

Fix: Run separate DNS probes that report DNS resolution specifically:

modules:
  dns_check:
    prober: dns
    timeout: 5s
    dns:
      query_name: api.example.com
      query_type: A
      valid_rcodes: [NOERROR]
      validate_answer_rrs:
        fail_if_none_match_regexp:
          - ".*"  # fail if no answer records
When DNS fails, the DNS alert fires first, telling on-call exactly what is wrong before the HTTP probe even has a chance to fire.
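Pair the module with a dedicated alert so DNS failures page under the right name (job name assumed):

```yaml
- alert: DNSResolutionFailing
  expr: probe_success{job="blackbox-dns"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "DNS resolution failing for {{ $labels.instance }}"
```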


9. Ignoring Certificate Expiry Until It Is Critical

Your certificate expiry alert fires at 7 days warning. The team is in a product sprint. The ticket sits in the backlog. On day 6, the auto-renewal fails due to a Let's Encrypt rate limit. The certificate expires. Users see browser warnings. You scramble to emergency-renew under time pressure.

Fix: Alert at 30 days (warning, create ticket immediately), 14 days (escalate, block other work), and 7 days (critical, page on-call). The 30-day warning gives enough runway to handle failures in the renewal pipeline. If using cert-manager, also alert on CertificateNotReady conditions:

kubectl get certificate -A -o json | \
  jq -r '.items[] | select(.status.conditions[]? | select(.type=="Ready" and .status!="True")) | .metadata.name'
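The tiered thresholds translate directly into rules on probe_ssl_earliest_cert_expiry, a standard metric the blackbox HTTP prober exports for TLS targets (severities follow the schedule above):

```yaml
- alert: CertExpiring30d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 30 * 24 * 3600
  for: 1h
  labels:
    severity: warning
- alert: CertExpiring7d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 7 * 24 * 3600
  for: 1h
  labels:
    severity: critical
```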


10. Using Synthetic Monitoring as a Substitute for Real Alerting

You have Blackbox probes on all your endpoints and feel your observability is complete. You remove your Prometheus alerting rules based on application metrics because "synthetic monitoring catches everything." A database query starts timing out intermittently — requests take 8 seconds instead of 200ms. Your probe timeout is 10 seconds, so probe_success = 1. No alert fires. Users are experiencing severe degradation.

Fix: Synthetic monitoring complements application metrics — it does not replace them. Synthetic probes detect total unavailability and connectivity issues. Application metrics (error rates, latency histograms, queue depths) detect degradation, partial failures, and tail latency issues. Run both in parallel. Your observability strategy needs at least: synthetic probes for availability, application metrics for SLO, structured logs for debugging, and traces for request flow.


11. Probe Targets Include Staging Endpoints Mixed With Production

Your monitoring tracks 50 endpoints. Half are production, half are staging. Staging goes down during a deploy. An alert fires. On-call wakes up, investigates, discovers it is staging, goes back to sleep. After three weeks of this, on-call starts ignoring all synthetic alerts because false positives from staging have destroyed the signal.

Fix: Separate production and staging scrape jobs with different alerting rules. Staging probes fire as low-severity tickets (not pages) or route to a different Slack channel. Use the environment label to distinguish environments, and scope your PagerDuty routes to only fire on environment="production".
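A sketch of the Alertmanager side of this routing, assuming receivers named pagerduty-oncall and slack-staging already exist in your config:

```yaml
route:
  receiver: slack-staging          # default: non-production alerts stay in Slack
  routes:
    - matchers:
        - environment="production" # only production alerts reach PagerDuty
      receiver: pagerduty-oncall
```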


12. Not Testing Probe Config Changes Before Deploying

You update blackbox.yml to add a new module, make a typo in the YAML, and restart the exporter. The exporter refuses to start on the broken config. All probes stop. probe_success disappears from Prometheus. You have no availability monitoring for your production services until the error is found and fixed.

Fix: Always validate blackbox.yml before deploying. Recent blackbox_exporter releases support a --config.check flag that parses the config and exits:

# Validate config before applying (requires a release with --config.check)
docker run --rm \
  -v "$(pwd)/blackbox.yml:/config/blackbox.yml" \
  prom/blackbox-exporter:latest \
  --config.file=/config/blackbox.yml \
  --config.check

# Reload a running exporter, then check its logs for reload errors
curl -s -X POST http://blackbox-exporter:9115/-/reload
In Kubernetes, use a ConfigMap with a pre-apply validation step in your CI pipeline before the Helm upgrade or kubectl apply.

Default trap: The Blackbox Exporter rejects probe requests for a module that doesn't exist — if you request ?module=htpt_2xx (typo), the exporter returns an "Unknown module" error instead of probe metrics, so the scrape fails and probe_success disappears for that target rather than reporting a clean failure. Unless you alert on the metric being absent, you silently lose coverage. Always verify module names match exactly and add a startup check that exercises each module.