Portal | Level: L1: Foundations | Topics: Synthetic Monitoring | Domain: Observability
Synthetic Monitoring — Primer¶
Why This Matters¶
Real User Monitoring (RUM) tells you what your users experienced. Synthetic monitoring tells you what they would experience right now, before they try. You send automated probes to your endpoints every 30 seconds and alert on failures — without waiting for a user to hit the broken endpoint and complain.
This is especially important for:
- Detecting outages before users notice: a probe hitting your login page every 30 seconds catches a failure before your support inbox fills up
- External availability: your service might look healthy internally (Kubernetes pods running, internal metrics fine) but be unreachable from the internet due to a firewall rule, DNS change, or certificate expiry
- Certificate expiry: synthetic probes can alert 30 days before a certificate expires, before it causes an outage
- Dependency health: probing your database connectivity, cache, and downstream APIs confirms they are reachable, not just that Kubernetes says the pods are running
The Prometheus Blackbox Exporter is the standard open-source tool. Checkly and k6 cover browser-based and scripted checks.
Core Concepts¶
1. Real User Monitoring vs Synthetic Monitoring¶
| Dimension | Real User Monitoring (RUM) | Synthetic Monitoring |
|---|---|---|
| Data source | Actual user traffic | Automated probes |
| Latency to detect | Seconds to minutes after user impact | Near-real-time (probe interval) |
| Pre-traffic detection | No — needs users | Yes — probes run without traffic |
| Coverage | Only pages users visit | Any endpoint, any time |
| Browser simulation | Yes | Optional (Checkly, Playwright) |
| Off-hours monitoring | Only if users active | Always |
| Cost | Low (piggybacks on traffic) | Per-probe compute cost |
| Best for | User experience data | Availability monitoring |
Both are complementary. RUM shows you the real user experience distribution; synthetic shows you external availability.
2. Blackbox Exporter — Overview¶
Under the hood: the Blackbox Exporter uses a "multi-target" pattern. Instead of exposing metrics about itself or a single fixed endpoint, the exporter receives the target URL as a query parameter on each scrape and probes it on demand. One exporter instance can therefore monitor thousands of endpoints without any per-target configuration in the exporter itself.
The Prometheus Blackbox Exporter probes endpoints and returns metrics. Prometheus scrapes the exporter, which runs a probe on demand and returns results:
┌──────────────┐          ┌────────────────────┐          ┌────────────┐
│  Prometheus  │──scrape─▶│ Blackbox Exporter  │──probe──▶│   Target   │
│              │◀─metrics─│(HTTP/TCP/ICMP/DNS) │◀──resp───│  Endpoint  │
└──────────────┘          └────────────────────┘          └────────────┘
Prometheus sends the target URL as a parameter in the scrape request. The exporter probes it and returns probe_success (0 or 1) plus timing metrics.
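The probe response is plain Prometheus text exposition. As a rough illustration, here is a tiny parser over a hypothetical response body (the sample metrics below are illustrative, not captured from a live exporter):

```python
def parse_probe_metrics(text: str) -> dict:
    """Parse Prometheus text exposition from /probe into {name: value}.

    Simplified sketch: ignores labels and HELP/TYPE comment lines.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip comments and TYPE/HELP lines
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Hypothetical body of GET /probe?target=https://example.com&module=http_2xx
sample = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_duration_seconds 0.182
probe_http_status_code 200
"""
result = parse_probe_metrics(sample)
print(result["probe_success"])  # 1.0
```

Real probe output also carries labeled series such as `probe_http_duration_seconds{phase="connect"}`; a production parser would keep the labels.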
Install Blackbox Exporter:
# Docker
docker run -d \
  --name blackbox-exporter \
  -p 9115:9115 \
  -v $(pwd)/blackbox.yml:/etc/blackbox_exporter/config.yml \
  prom/blackbox-exporter:latest

# Kubernetes (via the Prometheus community Helm chart)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install blackbox-exporter \
  prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring
# Custom probe modules are usually supplied via a values file
# (--values values.yaml with a config.modules block) rather than --set

# Binary (release assets are versioned; substitute the current version from the releases page)
wget https://github.com/prometheus/blackbox_exporter/releases/latest/download/blackbox_exporter-linux-amd64.tar.gz
tar -xzf blackbox_exporter-*.tar.gz
./blackbox_exporter --config.file=blackbox.yml
3. Blackbox Exporter — Prober Modules¶
Blackbox configuration file:
# blackbox.yml
modules:
  # HTTP probe: checks for a 2xx response
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: []  # defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false
      headers:
        User-Agent: "BlackboxExporter/1.0"

  # HTTP probe with bearer-token authentication
  http_2xx_auth:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      method: GET
      bearer_token_file: /etc/blackbox/token  # exporter adds the Authorization: Bearer header

  # POST probe: API endpoint check
  http_post_2xx:
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"healthcheck": true}'
      valid_status_codes: [200, 201]

  # TCP probe: checks that the port accepts connections
  tcp_connect:
    prober: tcp
    timeout: 5s

  # ICMP probe: ping check (requires the NET_RAW capability)
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

  # DNS probe: checks the domain resolves correctly
  dns_soa:
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: example.com
      query_type: SOA
      validate_answer_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1.*"  # alert if DNS is poisoned

  # HTTPS with certificate validation
  https_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      tls_config:
        insecure_skip_verify: false
      fail_if_ssl: false
      fail_if_not_ssl: true  # require HTTPS
4. Prometheus Scrape Config — Multi-Target Pattern¶
The canonical Blackbox Exporter setup uses the multi-target pattern: a single Prometheus job scrapes many targets through one exporter instance:
# prometheus.yml
scrape_configs:
  # HTTP endpoint checks
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://app.example.com/login
          - https://payments.example.com/status
    relabel_configs:
      # Move the target URL into the ?target= probe parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Set the instance label to the URL being probed
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # TCP connectivity checks (database, cache, etc.)
  - job_name: "blackbox-tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres.internal:5432
          - redis.internal:6379
          - kafka.internal:9092
        labels:
          probe_type: dependency
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ICMP ping checks for network nodes
  - job_name: "blackbox-icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 10.0.0.1  # default gateway
          - 8.8.8.8   # external DNS
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
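The three relabel rules amount to a mechanical URL rewrite. A sketch of the net effect in Python (the helper is hypothetical and only mirrors the config above; it is not how Prometheus is implemented):

```python
from urllib.parse import urlencode

def blackbox_scrape_url(target: str, module: str = "http_2xx",
                        exporter: str = "blackbox-exporter:9115") -> tuple[str, str]:
    """Mimic the relabel_configs: the listed target becomes a ?target= query
    parameter, the scrape address becomes the exporter, and the instance
    label keeps the original target URL."""
    query = urlencode({"module": module, "target": target})
    scrape_url = f"http://{exporter}/probe?{query}"
    instance_label = target
    return scrape_url, instance_label

url, instance = blackbox_scrape_url("https://example.com")
print(url)       # http://blackbox-exporter:9115/probe?module=http_2xx&target=https%3A%2F%2Fexample.com
print(instance)  # https://example.com
```

This is why dashboards show the probed URL in the `instance` label even though every scrape physically goes to the exporter.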
5. Key Blackbox Exporter Metrics¶
probe_success # 1 = probe succeeded, 0 = failed (most important)
probe_duration_seconds # total probe duration (useful for SLO)
probe_http_status_code # HTTP status code returned
probe_http_duration_seconds # HTTP phase breakdown (DNS, TCP, TLS, processing)
probe_http_ssl # 1 = connection used SSL
probe_ssl_earliest_cert_expiry # Unix timestamp of nearest expiring cert
probe_dns_lookup_time_seconds # DNS resolution time
probe_tcp_connect_duration_seconds # TCP connect time
probe_failed_due_to_regex # 1 if body validation regex failed
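Note that probe_ssl_earliest_cert_expiry is a raw Unix timestamp, so dashboards and alerts convert it to time remaining. The arithmetic behind the `(probe_ssl_earliest_cert_expiry - time()) / 86400` expression, sketched in Python with made-up timestamps:

```python
def days_until_expiry(cert_expiry_ts: float, now_ts: float) -> float:
    """Equivalent of (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    return (cert_expiry_ts - now_ts) / 86400

# Hypothetical values: the certificate expires 21 days from "now"
now = 1_700_000_000
print(days_until_expiry(now + 21 * 86400, now))  # 21.0
```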
6. SSL Certificate Expiry Alerting¶
One of the most valuable uses of the Blackbox Exporter:
# Prometheus alerting rules for SSL certificates and probe failures
groups:
  - name: ssl.certificate
    rules:
      # Certificate expires in less than 7 days: page
      - alert: SSLCertificateExpiryCritical
        expr: |
          probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in < 7 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

      # Certificate expires in less than 30 days: warn
      - alert: SSLCertificateExpiryWarning
        expr: |
          probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires in < 30 days"
          description: "Certificate for {{ $labels.instance }} expires {{ $value | humanizeDuration }} from now. Renew before it expires."

      # Endpoint is down
      - alert: EndpointDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes"

      # Slow response time
      - alert: SlowEndpoint
        expr: probe_duration_seconds{job="blackbox-http"} > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint response time > 2 seconds"
          description: "{{ $labels.instance }} is responding slowly: {{ $value }}s"

      # Dependency TCP port unreachable
      - alert: DependencyUnreachable
        expr: probe_success{job="blackbox-tcp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Dependency port unreachable"
          description: "TCP probe to {{ $labels.instance }} ({{ $labels.probe_type }}) is failing"
7. Grafana Dashboard for Blackbox Exporter¶
Key panels for a Blackbox dashboard:
# Availability heatmap (for each endpoint, 0 or 1 over time)
probe_success{job="blackbox-http"}
# HTTP response time by endpoint
probe_duration_seconds{job="blackbox-http"}
# HTTP response breakdown (DNS, TCP, TLS, processing)
# DNS resolution time
probe_http_duration_seconds{phase="resolve"}
# TCP connect time
probe_http_duration_seconds{phase="connect"}
# TLS handshake time
probe_http_duration_seconds{phase="tls"}
# Server processing time
probe_http_duration_seconds{phase="processing"}
# SSL certificate expiry (days remaining)
(probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) / 86400
# 30-day availability percentage per endpoint
avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100
# Current status of all endpoints (useful for status page panel)
min by (instance) (probe_success{job="blackbox-http"})
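Since probe_success is a 0/1 series, avg_over_time is just the mean, so the 30-day availability query reduces to simple arithmetic. A sketch over hypothetical probe samples:

```python
def availability_percent(samples: list[int]) -> float:
    """Equivalent of avg_over_time(probe_success[...]) * 100 over raw 0/1 samples."""
    return 100 * sum(samples) / len(samples)

# Hypothetical day of 30-second probes: 2878 successes, 2 failures out of 2880
samples = [1] * 2878 + [0] * 2
print(round(availability_percent(samples), 3))  # 99.931
```

Two failed probes in a day already cost about 0.07% availability, which is why probe interval matters for tight SLOs.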
8. Checkly — Browser and API Checks¶
Checkly provides managed synthetic monitoring with:
- API checks: HTTP request monitoring with assertions
- Browser checks: Playwright scripts running in real browsers
- Alert channels: Slack, PagerDuty, email, webhooks
- Multi-region: run checks from multiple geographic locations simultaneously
Checkly API check (via Checkly CLI):
// checkly.config.ts
import { defineConfig } from "@checkly/cli";

export default defineConfig({
  projectName: "order-service-checks",
  logicalId: "order-service-checks",
  checks: {
    frequency: 1, // minutes between checks
    locations: ["us-east-1", "eu-west-1", "ap-southeast-1"],
    tags: ["production"],
    runtimeId: "2024.02",
  },
  cli: {
    runLocation: "us-east-1",
  },
});

// checks/api-health.check.ts
import { ApiCheck, AssertionBuilder } from "@checkly/cli/constructs";

new ApiCheck("order-service-health", {
  name: "Order Service Health Check",
  request: {
    url: "https://api.example.com/health",
    method: "GET",
    assertions: [
      AssertionBuilder.statusCode().equals(200),
      AssertionBuilder.jsonBody("$.status").equals("ok"),
      AssertionBuilder.responseTime().lessThan(500),
    ],
  },
  frequency: 1,
  locations: ["us-east-1", "eu-west-1", "ap-southeast-1"],
  alertChannels: [pagerduty], // assumes an AlertChannel named `pagerduty` is defined and imported elsewhere in the project
});
// checks/checkout-flow.check.ts: browser check with Playwright
import { BrowserCheck } from "@checkly/cli/constructs";

new BrowserCheck("checkout-flow", {
  name: "Checkout Flow",
  code: {
    entrypoint: "./scripts/checkout.spec.ts",
  },
  frequency: 10, // every 10 minutes
  locations: ["us-east-1", "eu-west-1"],
});

// scripts/checkout.spec.ts
import { test, expect } from "@playwright/test";

test("complete checkout flow", async ({ page }) => {
  await page.goto("https://shop.example.com");

  // Add item to cart
  await page.click('[data-testid="add-to-cart-button"]');
  await expect(page.locator('[data-testid="cart-count"]')).toHaveText("1");

  // Proceed to checkout
  await page.click('[data-testid="checkout-button"]');
  await expect(page).toHaveURL(/checkout/);

  // Verify the payment form loads
  await expect(page.locator('[data-testid="payment-form"]')).toBeVisible();
});
9. k6 as Synthetic Runner¶
k6 can run as a scheduled synthetic check using k6 Cloud or Grafana Cloud k6:
// k6/api-probe.js
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  // Synthetic monitoring: 1 VU, 1 iteration, run on a schedule
  vus: 1,
  iterations: 1,
  thresholds: {
    http_req_failed: ["rate<0.01"], // <1% errors
    http_req_duration: ["p(95)<500"], // 95th percentile < 500ms
    "http_req_duration{endpoint:health}": ["max<200"], // health < 200ms always
  },
};

export default function () {
  // Health check
  const healthRes = http.get("https://api.example.com/health", {
    tags: { endpoint: "health" },
  });
  check(healthRes, {
    "health returns 200": (r) => r.status === 200,
    "health body has ok status": (r) => r.json("status") === "ok",
  });

  // API endpoint check
  const apiRes = http.post(
    "https://api.example.com/v1/orders",
    JSON.stringify({ test: true }),
    { headers: { "Content-Type": "application/json" } }
  );
  check(apiRes, {
    "orders API returns 200 or 422": (r) =>
      r.status === 200 || r.status === 422,
  });

  sleep(1);
}
10. Monitoring From Multiple Geographic Regions¶
Regional monitoring catches CDN issues, routing problems, and regional outages:
# Prometheus with multiple Blackbox Exporter instances in different regions
scrape_configs:
  - job_name: "blackbox-us-east"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
        labels:
          probe_region: us-east-1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-us-east:9115

  - job_name: "blackbox-eu-west"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
        labels:
          probe_region: eu-west-1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-eu-west:9115
Alert only when multiple regions fail (avoids single-region flap):
# Alert: endpoint down from 2+ regions
- alert: EndpointDownMultiRegion
  expr: |
    count by (instance) (
      probe_success{job=~"blackbox-.*"} == 0
    ) >= 2
  for: 2m
  labels:
    severity: critical
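The count-by-instance expression reduces to: fire only when at least two regional probes report failure for the same instance. The same decision logic in Python (function name and region results are hypothetical):

```python
def should_page(region_results: dict[str, int], min_failing: int = 2) -> bool:
    """Mirror the multi-region rule: region_results maps region name to the
    latest probe_success sample (1 = up, 0 = down); page only when probes
    from at least min_failing regions report failure."""
    failing = sum(1 for success in region_results.values() if success == 0)
    return failing >= min_failing

print(should_page({"us-east-1": 0, "eu-west-1": 1, "ap-southeast-1": 1}))  # False: single-region flap
print(should_page({"us-east-1": 0, "eu-west-1": 0, "ap-southeast-1": 1}))  # True: likely a real outage
```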
11. Alerting Thresholds for Availability SLOs¶
Calibrate the probe interval and the alert's for: duration to match your SLO:
| SLO Target | Max Downtime/Month | Probe Interval | Alert for |
|---|---|---|---|
| 99.0% | 7.2 hours | 60s | 5m |
| 99.5% | 3.6 hours | 60s | 3m |
| 99.9% | 43 minutes | 30s | 2m |
| 99.99% | 4.3 minutes | 10s | 1m |
For a 99.9% SLO: probing every 30 seconds with for: 2m means you detect an outage within roughly 2.5 minutes (up to one 30-second probe interval before the first failed probe, plus the 2-minute for: window). That consumes about 6% of your monthly 43-minute budget per incident, which is acceptable for most services.
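That arithmetic can be made explicit. A small sketch (hypothetical helper) of worst-case detection time and the fraction of monthly error budget it consumes:

```python
def detection_budget_fraction(probe_interval_s: float, for_duration_s: float,
                              monthly_budget_s: float) -> tuple[float, float]:
    """Worst-case detection time = one probe interval (waiting for the first
    failed probe) + the alert's `for:` duration. Returns (detection time in
    seconds, fraction of the monthly error budget spent before anyone is paged)."""
    detection = probe_interval_s + for_duration_s
    return detection, detection / monthly_budget_s

# 99.9% SLO: 43-minute monthly budget, 30s probes, for: 2m
detection, fraction = detection_budget_fraction(30, 120, 43 * 60)
print(detection / 60)         # 2.5 minutes to detect
print(round(fraction * 100))  # 6 (percent of budget consumed per incident)
```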
12. Common False Positive Patterns¶
Synthetic monitoring is prone to false positives that cause alert fatigue:
Network flaps: A single probe failure followed by success. Always use for: 2m minimum on availability alerts.
Probe timeouts under load: your endpoint is slow but reachable during traffic peaks, the probe times out, and a false alert fires. Solution: set the probe timeout longer than your SLO latency threshold:
modules:
  http_2xx:
    prober: http
    timeout: 15s  # generous timeout; this probe only cares about total unavailability
    http:
      valid_status_codes: []
TLS certificate renewal race: Let's Encrypt renews within 30 days of expiry. Your probe fires a warning, cert renews the next day, alert resolves. Expected behavior — but ensure your 30-day warning is routing to a ticket, not paging.
War story: a common outage pattern is a team setting up Blackbox probes against their /health endpoint, which returns 200 as long as the web server process is alive. The database goes down, the app returns 500 on every real request, but /health still returns 200 because it only answers "am I running?". Lesson: always probe a path that exercises the real request path, or use fail_if_body_not_matches_regexp to validate the response content.
Probes aimed at a load balancer's health-check endpoint: your probe hits /health, which returns 200 regardless of backend state. You think the service is up; in fact the backends are all down and the load balancer is serving cached health responses. Solution: probe a real user-visible endpoint, not a health-check endpoint that bypasses the stack.
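The for: clause that suppresses network flaps is equivalent to requiring an unbroken run of failed probes before firing. A sketch of that logic (hypothetical function, sample streams made up):

```python
def alert_fires(samples: list[int], probe_interval_s: int = 30,
                for_duration_s: int = 120) -> bool:
    """True if probe_success stays 0 for an unbroken run of probes at least
    as long as the `for:` window (here: 120s / 30s = 4 consecutive failures)."""
    needed = for_duration_s // probe_interval_s
    run = 0
    for s in samples:
        run = run + 1 if s == 0 else 0
        if run >= needed:
            return True
    return False

print(alert_fires([1, 0, 1, 1, 0, 1]))  # False: isolated flaps never accumulate
print(alert_fires([1, 0, 0, 0, 0, 1]))  # True: 4 consecutive failures span the 2m window
```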
Quick Reference¶
| Task | Command / Config |
|---|---|
| Test a blackbox probe manually | curl "http://blackbox:9115/probe?target=https://example.com&module=http_2xx" |
| View available modules | curl http://blackbox:9115/ |
| Check probe_success for all endpoints | PromQL: probe_success |
| Days until cert expiry | PromQL: (probe_ssl_earliest_cert_expiry - time()) / 86400 |
| 30-day availability % | PromQL: avg_over_time(probe_success[30d]) * 100 |
| Install Blackbox Helm | helm install blackbox-exporter prometheus-community/prometheus-blackbox-exporter -n monitoring |
| Checkly CLI deploy | npx checkly deploy |
| k6 run synthetic probe | k6 run k6/api-probe.js |
Wiki Navigation¶
Prerequisites¶
- Observability Deep Dive (Topic Pack, L2)