Quiz: Monitoring Fundamentals
8 questions
L0 (2 questions)
1. What are the Four Golden Signals from Google SRE and what does each measure?
Show answer
Latency: how long requests take. Traffic: how many requests per second. Errors: how many requests fail. Saturation: how full your resources are. These four signals tell you if a service is available, fast, correct, and has capacity. They apply to any service regardless of monitoring tool.

2. What is the difference between blackbox and whitebox monitoring?
Show answer
Blackbox monitoring tests externally visible behavior (HTTP health checks, ping, synthetic transactions) — it sees what users see. Whitebox monitoring uses internal instrumentation (application metrics, logs, traces) — it sees internal state. You need both: blackbox catches issues whitebox misses (DNS failures, network path problems), and whitebox catches internal problems before they become user-visible (queue depth growing, cache hit rate dropping).

L1 (3 questions)
1. What are the three pillars of observability?
Show answer
Metrics (numeric time series), logs (discrete events), and traces (request flow across services). Each answers different questions; you need all three.

2. What is the difference between Prometheus counter and gauge metric types, and when do you use rate() vs a direct query?
Show answer
A counter only goes up (and resets to zero on restart) — use rate() to get per-second values (e.g., rate(http_requests_total[5m])). A gauge goes up and down — query it directly for the current value (e.g., node_memory_MemFree_bytes). Never use rate() on a gauge. Counters are for things you count (requests, errors, bytes); gauges are for things you measure (temperature, queue depth, memory).

3. What are the Four Golden Signals and which services does each apply to?
Show answer
From Google SRE:
1. Latency — time to service a request (distinguish success vs error latency).
2. Traffic — demand on the system (requests/sec, sessions).
3. Errors — rate of failed requests (explicit 5xx and implicit: wrong content, slow response).
4. Saturation — how full the system is (CPU, memory, disk, queue depth).
All four apply to every user-facing service. For storage and batch systems, saturation and errors are the most critical.
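The counter/gauge distinction from the previous answer can be made concrete with a small sketch. This is our own illustration, not Prometheus code: per_second_rate is a hypothetical helper that approximates what PromQL's rate() computes over a window of counter samples, including the counter-reset handling that makes rate() safe across restarts.

```python
# Why rate() applies to counters but not gauges: a counter's absolute value is
# meaningless on its own; its per-second increase over a window is what matters.

def per_second_rate(samples):
    """Approximate rate() over (timestamp, value) counter samples.

    A decrease between samples is treated as a counter reset (process
    restart), so the whole post-reset value counts as new increase.
    """
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# http_requests_total scraped every 15 s; 300 requests arrive per interval.
counter_samples = [(0, 0), (15, 300), (30, 600), (45, 900)]
print(per_second_rate(counter_samples))  # 900 requests / 45 s -> 20.0/s

# A gauge (e.g., queue depth) is simply read directly; rate() of it is noise.
queue_depth = 42
print(queue_depth)
```

The reset branch is the reason you never hand-compute `(last - first) / elapsed` on raw counter values: one restart in the window would produce a negative rate.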
L2 (2 questions)
1. What makes a good alert vs a bad alert? Give specific characteristics of each.
Show answer
Good alerts are: actionable (someone needs to do something), timely (fires before customer impact), relevant (affects users or an SLO), and symptom-based (e.g., error rate > 5%). Bad alerts are: noisy (fires constantly, gets ignored), unactionable ('CPU at 60%' — so what?), stale (a threshold from 2018 on 2026 hardware), and cause-based without business context. Focus on symptoms over causes.

2. How do you instrument a new service for observability from scratch?
Show answer
1. Add a metrics library (prometheus-client, OpenTelemetry SDK).
2. Instrument RED metrics: request counter with status label, histogram for latency, error counter.
3. Add structured logging (JSON) with request ID, trace ID, and key fields.
4. Add distributed tracing (propagate trace context, create spans for external calls).
5. Expose /metrics endpoint.
6. Create ServiceMonitor for Prometheus scraping.
7. Build a dashboard with RED panels.
8. Define SLO-based alerts (error budget burn rate).
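Steps 1–5 above can be sketched in miniature. This is a standard-library-only toy, not how a real service would do it (you would use prometheus-client or the OpenTelemetry SDK as step 1 says); every function and metric name here is our own illustration of the shape: RED counters and a latency histogram, structured JSON logs with a request ID, and the text a /metrics endpoint would serve.

```python
import json
import time
import uuid
from collections import defaultdict

METRICS = defaultdict(float)  # (metric name, sorted label tuple) -> value

def inc(name, labels=(), amount=1.0):
    METRICS[(name, tuple(sorted(labels)))] += amount

def observe_latency(seconds):
    # Toy histogram: cumulative le-buckets, as the Prometheus format expects.
    for le in (0.1, 0.5, 1.0, float("inf")):
        if seconds <= le:
            inc("http_request_duration_seconds_bucket", (("le", str(le)),))
    inc("http_request_duration_seconds_sum", amount=seconds)
    inc("http_request_duration_seconds_count")

def log_json(level, msg, **fields):
    # Step 3: structured JSON log line carrying a request ID and key fields.
    print(json.dumps({"level": level, "msg": msg, **fields}))

def handle_request():
    # Step 2: RED metrics around each request.
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    status = "200"
    inc("http_requests_total", (("status", status),))
    observe_latency(time.monotonic() - start)
    log_json("info", "request handled", request_id=request_id, status=status)

def render_metrics():
    # Step 5: what a /metrics endpoint would return, in the text format.
    lines = []
    for (name, labels), value in sorted(METRICS.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}" if label_str
                     else f"{name} {value}")
    return "\n".join(lines)

handle_request()
print(render_metrics())
```

Tracing (step 4) is omitted here; in practice it means propagating the incoming trace context and opening a span around each external call, which the OpenTelemetry SDK handles for you.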
L3 (1 question)
1. Why should you monitor your monitoring system, and how would you implement this?