Synthetic Monitoring — Primer

Why This Matters

Real User Monitoring (RUM) tells you what your users experienced. Synthetic monitoring tells you what they would experience right now, before they try. You send automated probes to your endpoints every 30 seconds and alert on failures — without waiting for a user to hit the broken endpoint and complain.

This is especially important for:

- Detecting outages before users notice: A probe hitting your login page every 30 seconds catches it before your support inbox fills up
- External availability: Your service might look healthy internally (Kubernetes pods running, internal metrics fine) but be unreachable from the internet due to a firewall rule, DNS change, or certificate expiry
- Certificate expiry: Synthetic probes can alert 30 days before a certificate expires, before it causes an outage
- Dependency health: Probing your database connectivity, cache, and downstream APIs to confirm they are reachable, not just that Kubernetes says the pods are running

The Prometheus Blackbox Exporter is the standard open-source tool. Checkly and k6 cover browser-based and scripted checks.


Core Concepts

1. Real User Monitoring vs Synthetic Monitoring

| Dimension | Real User Monitoring (RUM) | Synthetic Monitoring |
|---|---|---|
| Data source | Actual user traffic | Automated probes |
| Latency to detect | Seconds to minutes after user impact | Near-real-time (probe interval) |
| Pre-traffic detection | No, needs users | Yes, probes run without traffic |
| Coverage | Only pages users visit | Any endpoint, any time |
| Browser simulation | Yes | Optional (Checkly, Playwright) |
| Off-hours monitoring | Only if users active | Always |
| Cost | Low (piggybacks on traffic) | Per-probe compute cost |
| Best for | User experience data | Availability monitoring |

Both are complementary. RUM shows you the real user experience distribution; synthetic shows you external availability.

2. Blackbox Exporter — Overview

Under the hood: The Blackbox Exporter follows the multi-target exporter pattern, which it shares with a few other exporters such as the SNMP exporter. Instead of exposing metrics about itself at a fixed endpoint, it receives the target as a query parameter on each scrape and probes that target on demand. This means one exporter instance can monitor thousands of endpoints without any per-target configuration in the exporter itself.

The Prometheus Blackbox Exporter probes endpoints and returns metrics. Prometheus scrapes the exporter, which runs a probe on demand and returns results:

┌──────────────┐           ┌──────────────────────┐          ┌────────────┐
│  Prometheus  │──scrape──▶│  Blackbox Exporter   │──probe──▶│   Target   │
│              │◀─metrics──│ (HTTP/TCP/ICMP/DNS)  │◀──resp───│  Endpoint  │
└──────────────┘           └──────────────────────┘          └────────────┘

Prometheus sends the target URL as a parameter in the scrape request. The exporter probes it and returns probe_success (0 or 1) plus timing metrics.
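
The request that Prometheus ends up making can be sketched in a few lines of Python. This is a minimal illustration, not part of the exporter: `build_probe_url` and `parse_samples` are hypothetical helper names, the exporter address is the one used in the configs in this primer, and the sample response is hand-written rather than captured from a real probe.

```python
from urllib.parse import urlencode


def build_probe_url(exporter: str, target: str, module: str) -> str:
    """Return the /probe URL that carries the target and module as query parameters."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"


def parse_samples(exposition: str) -> dict[str, float]:
    """Parse un-labelled samples from Prometheus text exposition output."""
    samples: dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # skip blanks, HELP/TYPE comments, and labelled series
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


url = build_probe_url("blackbox-exporter:9115", "https://example.com", "http_2xx")

# Hand-written example of the kind of text the exporter returns for one probe.
SAMPLE_RESPONSE = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_duration_seconds 0.143
probe_http_status_code 200
"""

metrics = parse_samples(SAMPLE_RESPONSE)
print(url)
print("up" if metrics["probe_success"] == 1 else "down")
```

Fetching the printed URL with any HTTP client against a running exporter should return the same kind of text, which makes this a convenient way to debug a module before wiring it into Prometheus.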

Install Blackbox Exporter:

# Docker
docker run -d \
  --name blackbox-exporter \
  -p 9115:9115 \
  -v $(pwd)/blackbox.yml:/etc/blackbox_exporter/config.yml \
  prom/blackbox-exporter:latest

# Kubernetes (via Prometheus community Helm chart)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install blackbox-exporter \
  prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring \
  --set config.modules.http_2xx.prober=http \
  --set config.modules.http_2xx.http.valid_status_codes={}

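# Binary (release assets are versioned, so pick a version from the releases page)
VERSION=0.25.0  # example version; check https://github.com/prometheus/blackbox_exporter/releases
wget "https://github.com/prometheus/blackbox_exporter/releases/download/v${VERSION}/blackbox_exporter-${VERSION}.linux-amd64.tar.gz"
tar -xzf "blackbox_exporter-${VERSION}.linux-amd64.tar.gz"
"./blackbox_exporter-${VERSION}.linux-amd64/blackbox_exporter" --config.file=blackbox.yml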

3. Blackbox Exporter — Prober Modules

Blackbox configuration file:

# blackbox.yml
modules:
  # HTTP probe — checks for 2xx response
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: []  # defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false
      headers:
        User-Agent: "BlackboxExporter/1.0"

  # HTTP probe with authentication
  http_2xx_auth:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      method: GET
      # The exporter reads the token from this file and sends it as
      # "Authorization: Bearer <token>". Header values are not templated.
      bearer_token_file: /etc/blackbox/token

  # POST probe — API endpoint check
  http_post_2xx:
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"healthcheck": true}'
      valid_status_codes: [200, 201]

  # TCP probe — checks port is open
  tcp_connect:
    prober: tcp
    timeout: 5s

  # ICMP probe — ping check (requires NET_RAW capability)
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

  # DNS probe: checks that the domain resolves to an acceptable address
  dns_a:
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: example.com
      query_type: A  # A query, so the answer records contain addresses to validate
      validate_answer_rrs:
        fail_if_matches_regexp:
          - '.*127\.0\.0\.1.*'  # fail if the answer points at loopback (possible poisoning)

  # HTTPS with certificate validation
  https_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      tls_config:
        insecure_skip_verify: false
      fail_if_ssl: false
      fail_if_not_ssl: true  # require HTTPS

4. Prometheus Scrape Config — Multi-Target Pattern

The canonical Prometheus configuration for the Blackbox Exporter is the multi-target pattern: a single Prometheus job probes many targets through one exporter instance, and relabeling rewrites each target into a request to the exporter:

# prometheus.yml

scrape_configs:
  # HTTP endpoint checks
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://app.example.com/login
          - https://payments.example.com/status
    relabel_configs:
      # Move target URL to probe param
      - source_labels: [__address__]
        target_label: __param_target
      # Set instance label to the URL being probed
      - source_labels: [__param_target]
        target_label: instance
      # Send all requests to the exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # TCP connectivity checks (database, cache, etc.)
  - job_name: "blackbox-tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres.internal:5432
          - redis.internal:6379
          - kafka.internal:9092
        labels:
          probe_type: dependency
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ICMP ping checks for network nodes
  - job_name: "blackbox-icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 10.0.0.1   # default gateway
          - 8.8.8.8    # external DNS
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
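
The three relabel rules are easier to follow when traced on one target. The Python sketch below imitates them on a plain dictionary of labels; it is a simplified model of relabeling (no regex, no separator handling), and `relabel_for_blackbox` is a hypothetical helper name.

```python
def relabel_for_blackbox(labels: dict[str, str], exporter_addr: str) -> dict[str, str]:
    """Apply the three relabel steps used by the blackbox scrape jobs."""
    out = dict(labels)
    # 1. Copy the configured target into __param_target. Prometheus turns
    #    every __param_<name> label into a URL query parameter on the scrape.
    out["__param_target"] = out["__address__"]
    # 2. Keep the probed URL as the instance label so dashboards and alerts
    #    show the endpoint rather than the exporter.
    out["instance"] = out["__param_target"]
    # 3. Point the actual scrape at the exporter.
    out["__address__"] = exporter_addr
    return out


before = {"__address__": "https://api.example.com/health", "job": "blackbox-http"}
after = relabel_for_blackbox(before, "blackbox-exporter:9115")
for name, value in sorted(after.items()):
    print(f"{name} = {value}")
```

After step 3 the scrape goes to `blackbox-exporter:9115/probe` with `target` set to the original URL, while the stored series carry that URL in the `instance` label.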

5. Key Blackbox Exporter Metrics

probe_success                    # 1 = probe succeeded, 0 = failed (most important)
probe_duration_seconds           # total probe duration (useful for SLO)
probe_http_status_code           # HTTP status code returned
probe_http_duration_seconds      # HTTP phase breakdown (DNS, TCP, TLS, processing)
probe_http_ssl                   # 1 = connection used SSL
probe_ssl_earliest_cert_expiry   # Unix timestamp of nearest expiring cert
probe_dns_lookup_time_seconds    # DNS resolution time
probe_duration_seconds{job="blackbox-tcp"}  # TCP probes: total duration doubles as connect time (no separate TCP connect metric)
probe_failed_due_to_regex        # 1 if body validation regex failed

6. SSL Certificate Expiry Alerting

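Certificate expiry alerting is one of the most valuable uses of the Blackbox Exporter. The rule group below pairs the two expiry alerts with basic availability, latency, and dependency alerts:

# Prometheus alerting rules: certificate expiry, availability, latency, dependencies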
groups:
  - name: ssl.certificate
    rules:
      # Certificate expires in less than 7 days — page
      - alert: SSLCertificateExpiryCritical
        expr: |
          probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in < 7 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

      # Certificate expires in less than 30 days — warn
      - alert: SSLCertificateExpiryWarning
        expr: |
          probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires in < 30 days"
          description: "Certificate for {{ $labels.instance }} expires {{ $value | humanizeDuration }} from now. Renew before it expires."

      # Endpoint is down
      - alert: EndpointDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes"

      # Slow response time
      - alert: SlowEndpoint
        expr: probe_duration_seconds{job="blackbox-http"} > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint response time > 2 seconds"
          description: "{{ $labels.instance }} is responding slowly: {{ $value }}s"

      # Dependency TCP port unreachable
      - alert: DependencyUnreachable
        expr: probe_success{job="blackbox-tcp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Dependency port unreachable"
          description: "TCP probe to {{ $labels.instance }} ({{ $labels.probe_type }}) is failing"
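
The arithmetic behind the two certificate rules can be checked outside Prometheus. The following Python sketch mirrors the expressions with a fixed "current time" so the result is reproducible; `days_remaining` and `cert_severity` are hypothetical helper names, and the function reports only the highest severity, whereas in Prometheus the warning rule also stays active once the critical threshold is crossed.

```python
from typing import Optional

SECONDS_PER_DAY = 24 * 3600


def days_remaining(expiry_ts: float, now_ts: float) -> float:
    """Mirror of (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    return (expiry_ts - now_ts) / SECONDS_PER_DAY


def cert_severity(expiry_ts: float, now_ts: float) -> Optional[str]:
    """Return the highest severity whose threshold is crossed, or None."""
    remaining = expiry_ts - now_ts
    if remaining < 7 * SECONDS_PER_DAY:
        return "critical"
    if remaining < 30 * SECONDS_PER_DAY:
        return "warning"
    return None


now = 1_700_000_000.0  # fixed Unix timestamp standing in for time()
for days in (45, 10, 3):
    expiry = now + days * SECONDS_PER_DAY
    print(days_remaining(expiry, now), cert_severity(expiry, now))
```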

7. Grafana Dashboard for Blackbox Exporter

Key panels for a Blackbox dashboard:

# Availability heatmap (for each endpoint, 0 or 1 over time)
probe_success{job="blackbox-http"}

# HTTP response time by endpoint
probe_duration_seconds{job="blackbox-http"}

# HTTP response breakdown (DNS, TCP, TLS, processing)
# DNS resolution time
probe_http_duration_seconds{phase="resolve"}
# TCP connect time
probe_http_duration_seconds{phase="connect"}
# TLS handshake time
probe_http_duration_seconds{phase="tls"}
# Server processing time
probe_http_duration_seconds{phase="processing"}

# SSL certificate expiry (days remaining)
(probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) / 86400

# 30-day availability percentage per endpoint
avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100

# Current status of all endpoints (useful for status page panel)
min by (instance) (probe_success{job="blackbox-http"})
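
As a sanity check on the availability query, the sketch below reproduces what `avg_over_time(probe_success[...]) * 100` computes: the mean of a series of 0/1 samples. The sample data is invented for illustration (one day of probes at a 30-second interval with two failures), and both function names are hypothetical.

```python
def availability_percent(samples: list[int]) -> float:
    """Mean of 0/1 probe_success samples, expressed as a percentage."""
    if not samples:
        raise ValueError("no samples in the window")
    return sum(samples) / len(samples) * 100


def estimated_downtime_seconds(samples: list[int], probe_interval_s: int) -> int:
    """Rough downtime estimate: each failed probe stands for one interval."""
    return samples.count(0) * probe_interval_s


# One day at a 30-second probe interval is 2880 samples; two of them failed.
one_day = [1] * 2878 + [0] * 2

print(round(availability_percent(one_day), 2))
print(estimated_downtime_seconds(one_day, 30))
```

The downtime figure is only as precise as the probe interval: an outage shorter than one interval can be missed entirely, and a single failed probe is counted as a full interval of downtime.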

8. Checkly — Browser and API Checks

Checkly provides managed synthetic monitoring with:

- API checks: HTTP request monitoring with assertions
- Browser checks: Playwright scripts running in real browsers
- Alert channels: Slack, PagerDuty, email, webhooks
- Multi-region: Run from multiple geographic locations simultaneously

Checkly API check (via Checkly CLI):

// checkly.config.ts
import { defineConfig } from "checkly";

export default defineConfig({
  projectName: "order-service-checks",
  logicalId: "order-service-checks",
  checks: {
    frequency: 1,     // minutes between checks
    locations: ["us-east-1", "eu-west-1", "ap-southeast-1"],
    tags: ["production"],
    runtimeId: "2024.02",
  },
  cli: {
    runLocation: "us-east-1",
  },
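});

// checks/api-health.check.ts
import { ApiCheck, AssertionBuilder, PagerdutyAlertChannel } from "checkly/constructs";

// Alert channel referenced by the check below. The logical ID, account,
// service name, and environment variable are placeholders.
const pagerduty = new PagerdutyAlertChannel("pagerduty-oncall", {
  account: "example-account",
  serviceName: "order-service",
  serviceKey: process.env.PAGERDUTY_SERVICE_KEY ?? "",
});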

new ApiCheck("order-service-health", {
  name: "Order Service Health Check",
  request: {
    url: "https://api.example.com/health",
    method: "GET",
    assertions: [
      AssertionBuilder.statusCode().equals(200),
      AssertionBuilder.jsonBody("$.status").equals("ok"),
      AssertionBuilder.responseTime().lessThan(500),
    ],
  },
  frequency: 1,
  locations: ["us-east-1", "eu-west-1", "ap-southeast-1"],
  alertChannels: [pagerduty],
});

// checks/checkout-flow.check.ts: browser check with Playwright
import { BrowserCheck } from "checkly/constructs";

new BrowserCheck("checkout-flow", {
  name: "Checkout Flow",
  code: {
    entrypoint: "./scripts/checkout.spec.ts",
  },
  frequency: 10,  // every 10 minutes
  locations: ["us-east-1", "eu-west-1"],
});

// scripts/checkout.spec.ts
import { test, expect } from "@playwright/test";

test("complete checkout flow", async ({ page }) => {
  await page.goto("https://shop.example.com");

  // Add item to cart
  await page.click('[data-testid="add-to-cart-button"]');
  await expect(page.locator('[data-testid="cart-count"]')).toHaveText("1");

  // Proceed to checkout
  await page.click('[data-testid="checkout-button"]');
  await expect(page).toHaveURL(/checkout/);

  // Verify payment form loads
  await expect(page.locator('[data-testid="payment-form"]')).toBeVisible();
});

9. k6 as Synthetic Runner

k6 can act as a synthetic runner when a scheduler (Grafana Cloud k6, formerly k6 Cloud, or a cron or CI job) runs a short script at a fixed interval:

// k6/api-probe.js
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  // Synthetic monitoring: 1 VU, 1 iteration, run on schedule
  vus: 1,
  iterations: 1,
  thresholds: {
    http_req_failed: ["rate<0.01"],           // <1% errors
    http_req_duration: ["p(95)<500"],          // 95th percentile < 500ms
    "http_req_duration{endpoint:health}": ["max<200"],  // health < 200ms always
  },
};

export default function () {
  // Health check
  const healthRes = http.get("https://api.example.com/health", {
    tags: { endpoint: "health" },
  });
  check(healthRes, {
    "health returns 200": (r) => r.status === 200,
    "health body has ok status": (r) => r.json("status") === "ok",
  });

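  // API endpoint check. 422 is an expected answer to the test payload, so
  // mark it as expected; otherwise k6 counts it toward http_req_failed.
  const apiRes = http.post(
    "https://api.example.com/v1/orders",
    JSON.stringify({ test: true }),
    {
      headers: { "Content-Type": "application/json" },
      responseCallback: http.expectedStatuses(200, 422),
    }
  );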
  check(apiRes, {
    "orders API returns 200 or 422": (r) =>
      r.status === 200 || r.status === 422,
  });

  sleep(1);
}

10. Monitoring From Multiple Geographic Regions

Regional monitoring catches CDN issues, routing problems, and regional outages:

# Prometheus with multiple Blackbox Exporter instances in different regions
scrape_configs:
  - job_name: "blackbox-us-east"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
        labels:
          probe_region: us-east-1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-us-east:9115

  - job_name: "blackbox-eu-west"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
        labels:
          probe_region: eu-west-1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-eu-west:9115

Alert only when multiple regions fail (avoids single-region flap):

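# Alert: endpoint down from 2+ regions.
# Select on probe_region so that only the regional jobs are counted;
# job=~"blackbox-.*" would also match blackbox-http, blackbox-tcp, and blackbox-icmp.
- alert: EndpointDownMultiRegion
  expr: |
    count by (instance) (
      probe_success{probe_region=~".+"} == 0
    ) >= 2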
  for: 2m
  labels:
    severity: critical
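
The quorum logic in that expression can be spelled out in Python. This is a sketch of the counting step only, with invented probe results; `instances_down_in_quorum` is a hypothetical name, and the `for: 2m` hold time is not modelled here.

```python
from collections import defaultdict


def instances_down_in_quorum(
    results: dict[tuple[str, str], int], min_regions: int = 2
) -> set[str]:
    """Return instances whose probe_success is 0 in at least min_regions regions.

    results maps (instance, probe_region) to the latest probe_success value.
    """
    failing_regions: dict[str, set[str]] = defaultdict(set)
    for (instance, region), success in results.items():
        if success == 0:
            failing_regions[instance].add(region)
    return {
        instance
        for instance, regions in failing_regions.items()
        if len(regions) >= min_regions
    }


latest = {
    ("https://api.example.com/health", "us-east-1"): 0,
    ("https://api.example.com/health", "eu-west-1"): 0,
    ("https://app.example.com/login", "us-east-1"): 0,  # single-region blip
    ("https://app.example.com/login", "eu-west-1"): 1,
}
print(instances_down_in_quorum(latest))
```

With only two probe regions, a threshold of 2 means both must fail. Adding a third region lets the alert tolerate one probe location being unhealthy itself.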

11. Alerting Thresholds for Availability SLOs

Calibrate probe frequency and alert for duration to match your SLO:

| SLO Target | Max Downtime/Month | Probe Interval | Alert for: |
|---|---|---|---|
| 99.0% | 7.2 hours | 60s | 5m |
| 99.5% | 3.6 hours | 60s | 3m |
| 99.9% | 43 minutes | 30s | 2m |
| 99.99% | 4.3 minutes | 10s | 1m |

For 99.9% SLO: probing every 30 seconds with for: 2m means you detect an outage within ~2.5 minutes (2 minutes for duration + time for first probe after outage starts). That consumes about 6% of your monthly 43-minute budget per incident. This is acceptable for most services.
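
The same calculation can be repeated for every row of the table. The Python sketch below assumes a 30-day month and a worst-case detection time of one probe interval plus the `for:` duration; it ignores the rule evaluation interval and notification delay, so real detection is somewhat slower. All function names are hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60


def monthly_budget_minutes(slo_percent: float) -> float:
    """Allowed downtime per 30-day month for a given availability SLO."""
    return MINUTES_PER_30_DAYS * (100 - slo_percent) / 100


def worst_case_detection_minutes(probe_interval_s: float, for_s: float) -> float:
    """One full probe interval until the first failed probe, plus the for: hold."""
    return (probe_interval_s + for_s) / 60


def budget_share_percent(slo_percent: float, probe_interval_s: float, for_s: float) -> float:
    """Share of the monthly budget spent before the alert even fires."""
    detection = worst_case_detection_minutes(probe_interval_s, for_s)
    return 100 * detection / monthly_budget_minutes(slo_percent)


# (SLO %, probe interval in seconds, for: duration in seconds) from the table
ROWS = [(99.0, 60, 300), (99.5, 60, 180), (99.9, 30, 120), (99.99, 10, 60)]
for slo, interval_s, for_s in ROWS:
    share = budget_share_percent(slo, interval_s, for_s)
    print(f"{slo}%: budget {monthly_budget_minutes(slo):.1f} min, detection uses {share:.1f}%")
```

Under these assumptions the 99.99% row spends roughly a quarter of the monthly budget on detection alone, which is why tighter SLOs need both shorter probe intervals and shorter `for:` durations.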

12. Common False Positive Patterns

Synthetic monitoring is prone to false positives that cause alert fatigue:

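Network flaps: A single probe failure followed by success. Set for: on availability alerts so that it spans several probe intervals (for example for: 2m with a 30-second interval); one failed probe then cannot fire the alert on its own.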
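
To see why a for: duration suppresses flaps, here is a small Python sketch of the pending-then-firing logic. It assumes the alert condition is evaluated at every probe sample, which simplifies how Prometheus schedules rule evaluations, and `first_firing_time` is a hypothetical helper name.

```python
from typing import Optional


def first_firing_time(
    samples: list[tuple[float, int]], for_seconds: float
) -> Optional[float]:
    """Return the timestamp at which the alert would fire, or None.

    samples is a time-ordered list of (timestamp_seconds, probe_success).
    The alert becomes pending at the first failure and fires only if every
    sample stays failed until for_seconds have elapsed.
    """
    pending_since: Optional[float] = None
    for timestamp, success in samples:
        if success == 1:
            pending_since = None  # condition cleared, pending state resets
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= for_seconds:
            return timestamp
    return None


flap = [(0, 1), (30, 0), (60, 1), (90, 1)]
outage = [(0, 1), (30, 0), (60, 0), (90, 0), (120, 0), (150, 0)]

print(first_firing_time(flap, for_seconds=120))
print(first_firing_time(outage, for_seconds=120))
```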

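Probe timeouts under load: Your endpoint is slow but reachable during traffic peaks. The probe times out and fires a false alert. Solution: set the probe timeout well above your SLO latency threshold and alert on latency separately. The exporter never waits longer than the scrape_timeout of the Prometheus job (10s by default), so a 15s module timeout only takes effect if scrape_timeout is raised to at least 15s as well: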

modules:
  http_2xx:
    prober: http
    timeout: 15s  # generous timeout — we only care about total unavailability
    http:
      valid_status_codes: []

TLS certificate renewal race: Let's Encrypt certificates are valid for 90 days, and ACME clients such as Certbot renew by default once fewer than 30 days remain. A 30-day warning can therefore fire just before the automatic renewal and resolve the next day. This is expected behavior, so make sure the 30-day warning routes to a ticket, not a page.

War story: The opposite failure, a false negative, is a common outage pattern. A team sets up Blackbox probes against their /health endpoint, which returns 200 as long as the web server process is alive. The database goes down and the app returns 500 on every real request, but /health still returns 200 because it only checks "am I running?" Lesson: always probe a path that exercises the real request path, or use fail_if_body_not_matches_regexp to validate the response content.

Probes hitting a load balancer health-check endpoint: The same false negative appears when the probe targets a /health path that the load balancer or web tier answers without touching the backends. The probe reports the service as up while every backend is down. Solution: probe a real user-visible endpoint, not a health-check endpoint that bypasses the stack.
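
The body-validation idea from the war story can be illustrated with a short Python sketch. It models the rule that a probe passes only when the status is 2xx and every required pattern is found in the body, which is how fail_if_body_not_matches_regexp treats its list. The JSON shape and the `"db"` field are invented for the example, `probe_passes` is a hypothetical name, and the exporter uses Go regular expressions, whose syntax differs from Python's in some details.

```python
import re


def probe_passes(status_code: int, body: str, must_match: list[str]) -> bool:
    """Pass only on a 2xx status AND when every required pattern is in the body."""
    if not 200 <= status_code < 300:
        return False
    return all(re.search(pattern, body) is not None for pattern in must_match)


REQUIRED = [r'"db":\s*"up"']  # assumed field reported by a deeper health check

healthy = '{"status": "ok", "db": "up"}'
db_down = '{"status": "ok", "db": "down"}'

print(probe_passes(200, healthy, REQUIRED))  # True
print(probe_passes(200, db_down, REQUIRED))  # False: 200 alone is not enough
```

This only helps if the endpoint actually reports dependency state in its body; a /health handler that always returns the same static payload defeats the check.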


Quick Reference

| Task | Command / Config |
|---|---|
| Test a blackbox probe manually | curl "http://blackbox:9115/probe?target=https://example.com&module=http_2xx" |
| View available modules | curl http://blackbox:9115/ |
| Check probe_success for all endpoints | PromQL: probe_success |
| Days until cert expiry | PromQL: (probe_ssl_earliest_cert_expiry - time()) / 86400 |
| 30-day availability % | PromQL: avg_over_time(probe_success[30d]) * 100 |
| Install Blackbox Helm | helm install blackbox-exporter prometheus-community/prometheus-blackbox-exporter -n monitoring |
| Checkly CLI deploy | npx checkly deploy |
| k6 run synthetic probe | k6 run k6/api-probe.js |
