Portal | Level: L1: Foundations | Topics: Synthetic Monitoring | Domain: Observability
Synthetic Monitoring — Primer¶
Why This Matters¶
Real User Monitoring (RUM) tells you what your users experienced. Synthetic monitoring tells you what they would experience right now, before they try. You send automated probes to your endpoints every 30 seconds and alert on failures — without waiting for a user to hit the broken endpoint and complain.
This is especially important for:
- Detecting outages before users notice: a probe hitting your login page every 30 seconds catches a failure before your support inbox fills up
- External availability: your service might look healthy internally (Kubernetes pods running, internal metrics fine) but be unreachable from the internet due to a firewall rule, DNS change, or certificate expiry
- Certificate expiry: synthetic probes can alert 30 days before a certificate expires, before it causes an outage
- Dependency health: probing your database connectivity, cache, and downstream APIs confirms they are reachable, not just that Kubernetes says the pods are running
The Prometheus Blackbox Exporter is the standard open-source tool. Checkly and k6 cover browser-based and scripted checks.
Core Concepts¶
1. Real User Monitoring vs Synthetic Monitoring¶
| Dimension | Real User Monitoring (RUM) | Synthetic Monitoring |
|---|---|---|
| Data source | Actual user traffic | Automated probes |
| Latency to detect | Seconds to minutes after user impact | Near-real-time (probe interval) |
| Pre-traffic detection | No — needs users | Yes — probes run without traffic |
| Coverage | Only pages users visit | Any endpoint, any time |
| Browser simulation | Yes | Optional (Checkly, Playwright) |
| Off-hours monitoring | Only if users active | Always |
| Cost | Low (piggybacks on traffic) | Per-probe compute cost |
| Best for | User experience data | Availability monitoring |
Both are complementary. RUM shows you the real user experience distribution; synthetic shows you external availability.
2. Blackbox Exporter — Overview¶
Under the hood: the Blackbox Exporter uses a "multi-target" pattern. Instead of exposing metrics about itself or a single fixed endpoint, the exporter receives the target URL as a query parameter on each scrape and probes it on demand. One exporter instance can therefore monitor thousands of endpoints without any per-target configuration in the exporter itself.
The Prometheus Blackbox Exporter probes endpoints and returns metrics. Prometheus scrapes the exporter, which runs a probe on demand and returns results:
┌──────────────┐          ┌────────────────────┐          ┌────────────┐
│  Prometheus  │──scrape─▶│ Blackbox Exporter  │──probe──▶│   Target   │
│              │◀─metrics─│(HTTP/TCP/ICMP/DNS) │◀──resp───│  Endpoint  │
└──────────────┘          └────────────────────┘          └────────────┘
Prometheus sends the target URL as a parameter in the scrape request. The exporter probes it and returns probe_success (0 or 1) plus timing metrics.
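The probe response is plain Prometheus text exposition. As a rough illustration, here is a tiny parser over a hypothetical response body (the sample metrics below are illustrative, not captured from a live exporter):

```python
def parse_probe_metrics(text: str) -> dict:
    """Parse Prometheus text exposition from /probe into {name: value}.

    Simplified sketch: ignores labels and HELP/TYPE comment lines.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip comments and TYPE/HELP lines
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Hypothetical body of GET /probe?target=https://example.com&module=http_2xx
sample = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_duration_seconds 0.182
probe_http_status_code 200
"""
result = parse_probe_metrics(sample)
print(result["probe_success"])  # 1.0
```

Real probe output also carries labeled series such as `probe_http_duration_seconds{phase="connect"}`; a production parser would keep the labels.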
Install Blackbox Exporter:
# Docker
docker run -d \
  --name blackbox-exporter \
  -p 9115:9115 \
  -v $(pwd)/blackbox.yml:/etc/blackbox_exporter/config.yml \
  prom/blackbox-exporter:latest

# Kubernetes (via the Prometheus community Helm chart)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install blackbox-exporter \
  prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring
# Custom probe modules are usually supplied via a values file
# (--values values.yaml with a config.modules block) rather than --set

# Binary (release assets are versioned; substitute the current version from the releases page)
wget https://github.com/prometheus/blackbox_exporter/releases/latest/download/blackbox_exporter-linux-amd64.tar.gz
tar -xzf blackbox_exporter-*.tar.gz
./blackbox_exporter --config.file=blackbox.yml
3. Blackbox Exporter — Prober Modules¶
Blackbox configuration file:
# blackbox.yml
modules:
  # HTTP probe: checks for a 2xx response
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: []  # defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false
      headers:
        User-Agent: "BlackboxExporter/1.0"

  # HTTP probe with bearer-token authentication
  http_2xx_auth:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      method: GET
      bearer_token_file: /etc/blackbox/token  # exporter adds the Authorization: Bearer header

  # POST probe: API endpoint check
  http_post_2xx:
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"healthcheck": true}'
      valid_status_codes: [200, 201]

  # TCP probe: checks that the port accepts connections
  tcp_connect:
    prober: tcp
    timeout: 5s

  # ICMP probe: ping check (requires the NET_RAW capability)
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

  # DNS probe: checks the domain resolves correctly
  dns_soa:
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: example.com
      query_type: SOA
      validate_answer_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1.*"  # alert if DNS is poisoned

  # HTTPS with certificate validation
  https_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      tls_config:
        insecure_skip_verify: false
      fail_if_ssl: false
      fail_if_not_ssl: true  # require HTTPS
4. Prometheus Scrape Config — Multi-Target Pattern¶
The canonical Blackbox Exporter setup uses the multi-target pattern: a single Prometheus job scrapes many targets through one exporter instance:
# prometheus.yml
scrape_configs:
  # HTTP endpoint checks
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://app.example.com/login
          - https://payments.example.com/status
    relabel_configs:
      # Move the target URL into the ?target= probe parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Set the instance label to the URL being probed
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # TCP connectivity checks (database, cache, etc.)
  - job_name: "blackbox-tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres.internal:5432
          - redis.internal:6379
          - kafka.internal:9092
        labels:
          probe_type: dependency
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ICMP ping checks for network nodes
  - job_name: "blackbox-icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 10.0.0.1  # default gateway
          - 8.8.8.8   # external DNS
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
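The three relabel rules amount to a mechanical URL rewrite. A sketch of the net effect in Python (the helper is hypothetical and only mirrors the config above; it is not how Prometheus is implemented):

```python
from urllib.parse import urlencode

def blackbox_scrape_url(target: str, module: str = "http_2xx",
                        exporter: str = "blackbox-exporter:9115") -> tuple[str, str]:
    """Mimic the relabel_configs: the listed target becomes a ?target= query
    parameter, the scrape address becomes the exporter, and the instance
    label keeps the original target URL."""
    query = urlencode({"module": module, "target": target})
    scrape_url = f"http://{exporter}/probe?{query}"
    instance_label = target
    return scrape_url, instance_label

url, instance = blackbox_scrape_url("https://example.com")
print(url)       # http://blackbox-exporter:9115/probe?module=http_2xx&target=https%3A%2F%2Fexample.com
print(instance)  # https://example.com
```

This is why dashboards show the probed URL in the `instance` label even though every scrape physically goes to the exporter.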
5. Key Blackbox Exporter Metrics¶
probe_success # 1 = probe succeeded, 0 = failed (most important)
probe_duration_seconds # total probe duration (useful for SLO)
probe_http_status_code # HTTP status code returned
probe_http_duration_seconds # HTTP phase breakdown (DNS, TCP, TLS, processing)
probe_http_ssl # 1 = connection used SSL
probe_ssl_earliest_cert_expiry # Unix timestamp of nearest expiring cert
probe_dns_lookup_time_seconds # DNS resolution time
probe_tcp_connect_duration_seconds # TCP connect time
probe_failed_due_to_regex # 1 if body validation regex failed
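Note that probe_ssl_earliest_cert_expiry is a raw Unix timestamp, so dashboards and alerts convert it to time remaining. The arithmetic behind the `(probe_ssl_earliest_cert_expiry - time()) / 86400` expression, sketched in Python with made-up timestamps:

```python
def days_until_expiry(cert_expiry_ts: float, now_ts: float) -> float:
    """Equivalent of (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    return (cert_expiry_ts - now_ts) / 86400

# Hypothetical values: the certificate expires 21 days from "now"
now = 1_700_000_000
print(days_until_expiry(now + 21 * 86400, now))  # 21.0
```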
6. SSL Certificate Expiry Alerting¶
One of the most valuable uses of the Blackbox Exporter:
# Prometheus alerting rules for SSL certificates and probe failures
groups:
  - name: ssl.certificate
    rules:
      # Certificate expires in less than 7 days: page
      - alert: SSLCertificateExpiryCritical
        expr: |
          probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in < 7 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

      # Certificate expires in less than 30 days: warn
      - alert: SSLCertificateExpiryWarning
        expr: |
          probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires in < 30 days"
          description: "Certificate for {{ $labels.instance }} expires {{ $value | humanizeDuration }} from now. Renew before it expires."

      # Endpoint is down
      - alert: EndpointDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes"

      # Slow response time
      - alert: SlowEndpoint
        expr: probe_duration_seconds{job="blackbox-http"} > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint response time > 2 seconds"
          description: "{{ $labels.instance }} is responding slowly: {{ $value }}s"

      # Dependency TCP port unreachable
      - alert: DependencyUnreachable
        expr: probe_success{job="blackbox-tcp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Dependency port unreachable"
          description: "TCP probe to {{ $labels.instance }} ({{ $labels.probe_type }}) is failing"
7. Grafana Dashboard for Blackbox Exporter¶
Key panels for a Blackbox dashboard:
# Availability heatmap (for each endpoint, 0 or 1 over time)
probe_success{job="blackbox-http"}
# HTTP response time by endpoint
probe_duration_seconds{job="blackbox-http"}
# HTTP response breakdown (DNS, TCP, TLS, processing)
# DNS resolution time
probe_http_duration_seconds{phase="resolve"}
# TCP connect time
probe_http_duration_seconds{phase="connect"}
# TLS handshake time
probe_http_duration_seconds{phase="tls"}
# Server processing time
probe_http_duration_seconds{phase="processing"}
# SSL certificate expiry (days remaining)
(probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) / 86400
# 30-day availability percentage per endpoint
avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100
# Current status of all endpoints (useful for status page panel)
min by (instance) (probe_success{job="blackbox-http"})
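Since probe_success is a 0/1 series, avg_over_time is just the mean, so the 30-day availability query reduces to simple arithmetic. A sketch over hypothetical probe samples:

```python
def availability_percent(samples: list[int]) -> float:
    """Equivalent of avg_over_time(probe_success[...]) * 100 over raw 0/1 samples."""
    return 100 * sum(samples) / len(samples)

# Hypothetical day of 30-second probes: 2878 successes, 2 failures out of 2880
samples = [1] * 2878 + [0] * 2
print(round(availability_percent(samples), 3))  # 99.931
```

Two failed probes in a day already cost about 0.07% availability, which is why probe interval matters for tight SLOs.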
8. Checkly — Browser and API Checks¶
Checkly provides managed synthetic monitoring with:
- API checks: HTTP request monitoring with assertions
- Browser checks: Playwright scripts running in real browsers
- Alert channels: Slack, PagerDuty, email, webhooks
- Multi-region: run checks from multiple geographic locations simultaneously
Checkly API check (via Checkly CLI):
// checkly.config.ts
import { defineConfig } from "@checkly/cli";

export default defineConfig({
  projectName: "order-service-checks",
  logicalId: "order-service-checks",
  checks: {
    frequency: 1, // minutes between checks
    locations: ["us-east-1", "eu-west-1", "ap-southeast-1"],
    tags: ["production"],
    runtimeId: "2024.02",
  },
  cli: {
    runLocation: "us-east-1",
  },
});

// checks/api-health.check.ts
import { ApiCheck, AssertionBuilder } from "@checkly/cli/constructs";

new ApiCheck("order-service-health", {
  name: "Order Service Health Check",
  request: {
    url: "https://api.example.com/health",
    method: "GET",
    assertions: [
      AssertionBuilder.statusCode().equals(200),
      AssertionBuilder.jsonBody("$.status").equals("ok"),
      AssertionBuilder.responseTime().lessThan(500),
    ],
  },
  frequency: 1,
  locations: ["us-east-1", "eu-west-1", "ap-southeast-1"],
  alertChannels: [pagerduty], // assumes an AlertChannel named `pagerduty` is defined and imported elsewhere in the project
});
// checks/checkout-flow.check.ts: browser check with Playwright
import { BrowserCheck } from "@checkly/cli/constructs";

new BrowserCheck("checkout-flow", {
  name: "Checkout Flow",
  code: {
    entrypoint: "./scripts/checkout.spec.ts",
  },
  frequency: 10, // every 10 minutes
  locations: ["us-east-1", "eu-west-1"],
});

// scripts/checkout.spec.ts
import { test, expect } from "@playwright/test";

test("complete checkout flow", async ({ page }) => {
  await page.goto("https://shop.example.com");

  // Add item to cart
  await page.click('[data-testid="add-to-cart-button"]');
  await expect(page.locator('[data-testid="cart-count"]')).toHaveText("1");

  // Proceed to checkout
  await page.click('[data-testid="checkout-button"]');
  await expect(page).toHaveURL(/checkout/);

  // Verify the payment form loads
  await expect(page.locator('[data-testid="payment-form"]')).toBeVisible();
});
9. k6 as Synthetic Runner¶
k6 can run as a scheduled synthetic check using k6 Cloud or Grafana Cloud k6:
// k6/api-probe.js
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  // Synthetic monitoring: 1 VU, 1 iteration, run on a schedule
  vus: 1,
  iterations: 1,
  thresholds: {
    http_req_failed: ["rate<0.01"], // <1% errors
    http_req_duration: ["p(95)<500"], // 95th percentile < 500ms
    "http_req_duration{endpoint:health}": ["max<200"], // health < 200ms always
  },
};

export default function () {
  // Health check
  const healthRes = http.get("https://api.example.com/health", {
    tags: { endpoint: "health" },
  });
  check(healthRes, {
    "health returns 200": (r) => r.status === 200,
    "health body has ok status": (r) => r.json("status") === "ok",
  });

  // API endpoint check
  const apiRes = http.post(
    "https://api.example.com/v1/orders",
    JSON.stringify({ test: true }),
    { headers: { "Content-Type": "application/json" } }
  );
  check(apiRes, {
    "orders API returns 200 or 422": (r) =>
      r.status === 200 || r.status === 422,
  });

  sleep(1);
}
10. Monitoring From Multiple Geographic Regions¶
Regional monitoring catches CDN issues, routing problems, and regional outages:
# Prometheus with multiple Blackbox Exporter instances in different regions
scrape_configs:
  - job_name: "blackbox-us-east"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
        labels:
          probe_region: us-east-1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-us-east:9115

  - job_name: "blackbox-eu-west"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
        labels:
          probe_region: eu-west-1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-eu-west:9115
Alert only when multiple regions fail (avoids single-region flap):
# Alert: endpoint down from 2+ regions
- alert: EndpointDownMultiRegion
  expr: |
    count by (instance) (
      probe_success{job=~"blackbox-.*"} == 0
    ) >= 2
  for: 2m
  labels:
    severity: critical
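The count-by-instance expression reduces to: fire only when at least two regional probes report failure for the same instance. The same decision logic in Python (function name and region results are hypothetical):

```python
def should_page(region_results: dict[str, int], min_failing: int = 2) -> bool:
    """Mirror the multi-region rule: region_results maps region name to the
    latest probe_success sample (1 = up, 0 = down); page only when probes
    from at least min_failing regions report failure."""
    failing = sum(1 for success in region_results.values() if success == 0)
    return failing >= min_failing

print(should_page({"us-east-1": 0, "eu-west-1": 1, "ap-southeast-1": 1}))  # False: single-region flap
print(should_page({"us-east-1": 0, "eu-west-1": 0, "ap-southeast-1": 1}))  # True: likely a real outage
```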
11. Alerting Thresholds for Availability SLOs¶
Calibrate the probe interval and the alert's for: duration to match your SLO:
| SLO Target | Max Downtime/Month | Probe Interval | Alert for |
|---|---|---|---|
| 99.0% | 7.2 hours | 60s | 5m |
| 99.5% | 3.6 hours | 60s | 3m |
| 99.9% | 43 minutes | 30s | 2m |
| 99.99% | 4.3 minutes | 10s | 1m |
For a 99.9% SLO: probing every 30 seconds with for: 2m means you detect an outage within roughly 2.5 minutes (up to one 30-second probe interval before the first failed probe, plus the 2-minute for: window). That consumes about 6% of your monthly 43-minute budget per incident, which is acceptable for most services.
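That arithmetic can be made explicit. A small sketch (hypothetical helper) of worst-case detection time and the fraction of monthly error budget it consumes:

```python
def detection_budget_fraction(probe_interval_s: float, for_duration_s: float,
                              monthly_budget_s: float) -> tuple[float, float]:
    """Worst-case detection time = one probe interval (waiting for the first
    failed probe) + the alert's `for:` duration. Returns (detection time in
    seconds, fraction of the monthly error budget spent before anyone is paged)."""
    detection = probe_interval_s + for_duration_s
    return detection, detection / monthly_budget_s

# 99.9% SLO: 43-minute monthly budget, 30s probes, for: 2m
detection, fraction = detection_budget_fraction(30, 120, 43 * 60)
print(detection / 60)         # 2.5 minutes to detect
print(round(fraction * 100))  # 6 (percent of budget consumed per incident)
```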
12. Common False Positive Patterns¶
Synthetic monitoring is prone to false positives that cause alert fatigue:
Network flaps: A single probe failure followed by success. Always use for: 2m minimum on availability alerts.
Probe timeouts under load: your endpoint is slow but reachable during traffic peaks, the probe times out, and a false alert fires. Solution: set the probe timeout longer than your SLO latency threshold:
modules:
  http_2xx:
    prober: http
    timeout: 15s  # generous timeout; this probe only cares about total unavailability
    http:
      valid_status_codes: []
TLS certificate renewal race: Let's Encrypt renews within 30 days of expiry. Your probe fires a warning, cert renews the next day, alert resolves. Expected behavior — but ensure your 30-day warning is routing to a ticket, not paging.
War story: a common outage pattern is a team setting up Blackbox probes against their /health endpoint, which returns 200 as long as the web server process is alive. The database goes down, the app returns 500 on every real request, but /health still returns 200 because it only answers "am I running?". Lesson: always probe a path that exercises the real request path, or use fail_if_body_not_matches_regexp to validate the response content.
Probes aimed at a load balancer's health-check endpoint: your probe hits /health, which returns 200 regardless of backend state. You think the service is up; in fact the backends are all down and the load balancer is serving cached health responses. Solution: probe a real user-visible endpoint, not a health-check endpoint that bypasses the stack.
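The for: clause that suppresses network flaps is equivalent to requiring an unbroken run of failed probes before firing. A sketch of that logic (hypothetical function, sample streams made up):

```python
def alert_fires(samples: list[int], probe_interval_s: int = 30,
                for_duration_s: int = 120) -> bool:
    """True if probe_success stays 0 for an unbroken run of probes at least
    as long as the `for:` window (here: 120s / 30s = 4 consecutive failures)."""
    needed = for_duration_s // probe_interval_s
    run = 0
    for s in samples:
        run = run + 1 if s == 0 else 0
        if run >= needed:
            return True
    return False

print(alert_fires([1, 0, 1, 1, 0, 1]))  # False: isolated flaps never accumulate
print(alert_fires([1, 0, 0, 0, 0, 1]))  # True: 4 consecutive failures span the 2m window
```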
Quick Reference¶
| Task | Command / Config |
|---|---|
| Test a blackbox probe manually | curl "http://blackbox:9115/probe?target=https://example.com&module=http_2xx" |
| View available modules | curl http://blackbox:9115/ |
| Check probe_success for all endpoints | PromQL: probe_success |
| Days until cert expiry | PromQL: (probe_ssl_earliest_cert_expiry - time()) / 86400 |
| 30-day availability % | PromQL: avg_over_time(probe_success[30d]) * 100 |
| Install Blackbox Helm | helm install blackbox-exporter prometheus-community/prometheus-blackbox-exporter -n monitoring |
| Checkly CLI deploy | npx checkly deploy |
| k6 run synthetic probe | k6 run k6/api-probe.js |
Wiki Navigation¶
Prerequisites¶
- Observability Deep Dive (Topic Pack, L2)