Observability¶
25 cards — 🟢 5 easy | 🟡 15 medium | 🔴 5 hard
🟢 Easy (5)¶
1. What are the three pillars of observability?
Show answer
Metrics (numeric time-series), Logs (discrete events with context), Traces (request flow across services). Each answers a different question: metrics show what is broken, logs show why, traces show where in the call chain. Remember: Three pillars: Metrics, Logs, Traces. "MLT."
2. When should you use a metric vs a log?
Show answer
Use metrics for aggregatable numeric data you alert on (request rate, error rate, latency). Use logs for discrete events with rich context (stack traces, request payloads). Metrics are cheap to query at scale; logs are expensive.
3. How does Prometheus collect metrics?
Show answer
Prometheus uses a pull (scrape) model. It periodically HTTP GETs the /metrics endpoint of each target. Targets expose metrics in the Prometheus exposition format. This is the opposite of push-based systems like StatsD.
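What a scrape actually returns: plain text in the exposition format, one sample per line, with optional HELP/TYPE metadata. A hypothetical counter for illustration:

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="get",status="200"} 1027
http_requests_total{method="get",status="500"} 3
```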
4. What is the difference between SLI, SLO, and SLA?
Show answer
SLI (indicator): a measured metric like request latency p99. SLO (objective): target value for the SLI, e.g., p99 < 300ms. SLA (agreement): contractual commitment with consequences if the SLO is breached. SLI measures, SLO sets the goal, SLA is the contract. Remember: Observability ≠ monitoring. Monitoring = known questions; observability = ask new questions.
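As a sketch, the p99-latency SLI from the answer can be written in PromQL, assuming a standard request-duration histogram (the metric name is illustrative):

```promql
# SLI: p99 request latency over 5m, aggregated server-side across instances
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# SLO check: this value should stay below 0.3 (300ms)
```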
5. When should you use a dashboard vs an alert?
Show answer
Dashboards are for investigation and trend analysis (pull). Alerts are for immediate notification of problems (push). Good rule: if it needs human attention right now, alert. If it provides context during investigation, dashboard.
🟡 Medium (15)¶
1. What problem do distributed traces solve that metrics and logs alone cannot?
Show answer
Traces show causality across service boundaries. When a request touches 5 services, traces reveal which hop is slow or failing, even when each service's metrics look fine individually. Remember: Distributed tracing = follow request across services. Span = segment.
Example: Jaeger, Zipkin, OpenTelemetry. Trace ID in HTTP headers.
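The "trace ID in HTTP headers" part is standardized by the W3C Trace Context `traceparent` header. A minimal stdlib-only sketch of what travels between services (`parse_traceparent` is a hypothetical helper, not an OpenTelemetry API):

```python
# Parse a W3C `traceparent` header: version-traceid-parentid-flags,
# all fields in lowercase hex. The sampled bit is the lowest flag bit.
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.strip().split("-")
    return {
        "version": version,
        "trace_id": trace_id,    # 16-byte ID shared by every span in the trace
        "parent_id": parent_id,  # 8-byte ID of the calling span
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Every service extracts this header, creates its own span as a child of `parent_id`, and forwards the header downstream, which is how the trace is stitched together.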
2. What is a Prometheus exporter and when do you need one?
Show answer
An exporter is a sidecar or standalone process that translates metrics from a non-Prometheus system into the Prometheus exposition format. You need one when the application cannot natively expose /metrics (e.g., node_exporter for OS metrics, blackbox_exporter for probing).
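A minimal scrape config for node_exporter might look like this (the target address is an assumption for illustration; 9100 is node_exporter's default port):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']
```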
3. What is label cardinality and why is it dangerous in Prometheus?
Show answer
Cardinality is the number of unique time-series created by label combinations. High-cardinality labels (user IDs, request paths) create millions of series, consuming memory and slowing queries. Fix with relabeling, recording rules, or dropping unbounded labels at the source. Gotcha: High cardinality is the #1 performance killer. Never use user IDs as metric labels.
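Dropping an unbounded label at scrape time can be sketched with `metric_relabel_configs` (the `user_id` label name and target address are hypothetical):

```yaml
scrape_configs:
  - job_name: my-app
    static_configs:
      - targets: ['my-app:8080']
    metric_relabel_configs:
      - action: labeldrop
        regex: user_id   # drop the high-cardinality label before ingestion
```

Note this only stops new samples; fixing it at the instrumentation source is still the better long-term cure.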
4. What is the difference between a counter and a gauge?
Show answer
A counter only goes up (resets on restart): total requests, total errors. A gauge goes up and down: current memory, queue depth. Always use rate() on counters; use gauge values directly. Remember: Counter(cumulative), Gauge(current), Histogram(distribution).
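In PromQL that distinction looks like this (metric names are illustrative):

```promql
# Counter: always query the per-second rate, never the raw cumulative value
rate(http_requests_total[5m])

# Gauge: use the current value directly
queue_depth
```

rate() also handles counter resets on restart, which is why raw counter values are almost never graphed directly.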
5. When do you use a histogram vs a summary?
Show answer
Use histograms for latency percentiles when you need server-side aggregation across instances (histogram_quantile). Summaries compute quantiles client-side so they cannot be aggregated. Prefer histograms in almost all cases.
6. What is an error budget and how does it guide decision-making?
Show answer
Error budget = 1 minus SLO target (e.g., 99.9% SLO gives 0.1% budget). When budget is healthy, ship features fast. When budget is nearly spent, freeze risky changes and focus on reliability. It aligns dev velocity with reliability.
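A quick arithmetic sketch of what a 99.9% SLO allows over a 30-day window:

```python
# Error budget for a 99.9% SLO over a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in the window
budget_minutes = (1 - slo) * window_minutes

# 0.1% of 43,200 minutes = about 43.2 minutes of allowed full downtime
# (or the equivalent fraction of failed requests) per 30 days
```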
7. What is alert fatigue and how do you combat it?
Show answer
Alert fatigue occurs when too many non-actionable alerts cause responders to ignore or mute them. Fix: alert on symptoms not causes, require every alert to be actionable and have a runbook, tune thresholds, deduplicate via grouping, and regularly prune alerts that nobody acts on.
8. How does alert routing work in Alertmanager?
Show answer
Alertmanager receives alerts from Prometheus and routes them based on label matchers. Routes form a tree: alerts match the most specific route. Each route can set a receiver (Slack, PagerDuty), group_by labels, and timing (group_wait, group_interval, repeat_interval).
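A sketch of such a routing tree (receiver names and timings are illustrative; the receivers would need real Slack/PagerDuty settings):

```yaml
route:
  receiver: slack-default          # fallback receiver for everything
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty          # more specific child route wins
receivers:
  - name: slack-default
  - name: pagerduty
```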
9. What is the difference between blackbox and whitebox monitoring?
Show answer
Blackbox monitoring tests externally visible behavior (HTTP probe returns 200, TCP connect succeeds). Whitebox monitoring uses internal instrumentation (request latency histogram, error counters). Use blackbox for user-facing SLIs; whitebox for debugging and capacity planning.
10. How does blackbox_exporter work?
Show answer
Prometheus scrapes blackbox_exporter with a target parameter. The exporter probes the target (HTTP, TCP, ICMP, DNS) and returns success/failure, latency, TLS expiry, etc. as metrics. Useful for monitoring endpoints you don't control.
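The standard scrape-config pattern rewrites each target into the exporter's `target` URL parameter (exporter address and probed URL are illustrative):

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]                    # probe module defined in blackbox.yml
    static_configs:
      - targets: ['https://example.com']    # what to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target        # becomes ?target=https://example.com
      - source_labels: [__param_target]
        target_label: instance              # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115 # actually scrape the exporter
```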
11. Grafana dashboard shows No Data. What do you check?
Show answer
1) Data source connection: Settings > Data Sources > Test. 2) Time range: too narrow or shifted. 3) Query syntax: run in Explore tab. 4) Metric name changed or labels don't match. 5) Prometheus retention: data may have expired.
12. A Prometheus target shows as DOWN but the app is healthy. What do you check?
Show answer
1) ServiceMonitor selector vs Service labels. 2) Port name in ServiceMonitor matches Service port name. 3) Metrics endpoint path is correct (/metrics). 4) Network policy blocking Prometheus from scraping. 5) Prometheus namespace selector config.
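Checks 1–3 come down to this label and port-name matching, sketched for a hypothetical my-app Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app         # must match the Service's labels, not the Pod's
  endpoints:
    - port: http-metrics  # must match the Service port *name*, not the number
      path: /metrics
```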
13. What are recording rules and why use them?
Show answer
Recording rules pre-compute frequently used or expensive PromQL expressions and store the result as a new time-series. Use them to speed up dashboard queries, reduce query-time load, and create stable metric names for alerting rules.
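A minimal rule file, using the conventional `level:metric:operations` naming (metric name is illustrative):

```yaml
groups:
  - name: http
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alerts then query `job:http_requests:rate5m` instead of re-evaluating the expensive expression on every refresh.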
14. Why use structured logging over plain text?
Show answer
Structured logs (JSON) have consistent fields that can be parsed, indexed, and queried without regex. Enables filtering by severity, request ID, user, service. Plain text requires fragile pattern matching and breaks when log format changes.
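A minimal stdlib-only sketch of emitting one JSON object per log line (field names are a common choice, not a standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted")  # emits {"level": "INFO", "logger": "checkout", ...}
```

Real setups usually add a timestamp and a request/trace ID field so log lines can be joined with traces.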
15. Explain the RED and USE methods for monitoring.
Show answer
RED (for services): Rate, Errors, Duration per service endpoint. USE (for resources): Utilization, Saturation, Errors per resource (CPU, disk, network). RED tells you about user experience; USE tells you about infrastructure health.
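The three RED signals for one service, sketched in PromQL (metric and label names are illustrative):

```promql
# R — Rate: requests per second
sum(rate(http_requests_total[5m]))

# E — Errors: 5xx responses per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# D — Duration: p99 latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```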
🔴 Hard (5)¶
1. What makes a good production alert?
Show answer
1) Actionable: someone must do something now. 2) Urgent: cannot wait until business hours. 3) Real: low false-positive rate. 4) Symptom-based: alert on error rate, not CPU. 5) Includes runbook link. 6) Has clear ownership.
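An alerting rule that follows these properties might be sketched like this (threshold, names, and runbook URL are illustrative):

```yaml
groups:
  - name: slo
    rules:
      - alert: HighErrorRate
        # Symptom-based: ratio of 5xx to all requests, not CPU
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                      # avoid paging on brief blips
        labels:
          severity: page
          team: checkout              # clear ownership
        annotations:
          summary: "Error rate above 5% for 10m"
          runbook_url: https://runbooks.example.com/high-error-rate
```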
2. Prometheus memory is growing fast. How do you diagnose?
Show answer
Check the TSDB status page (Status > TSDB Stats in the web UI) for high-cardinality metrics. Look for labels with unbounded values. Use promtool tsdb analyze on the data directory. Fix: drop labels via metric_relabel_configs, use recording rules to pre-aggregate, or increase the scrape interval for noisy targets.
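One commonly used instant query to surface the biggest offenders directly in PromQL:

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))
```

This matches every series, so run it as a one-off instant query during diagnosis, not on a dashboard.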
3. What is head-based vs tail-based sampling in distributed tracing?
Show answer
Head-based: decide at trace start whether to sample (simple, but may miss interesting traces). Tail-based: decide after all spans arrive (can keep error/slow traces, discard boring ones). Tail-based is better for debugging but needs a collector buffer. Remember: Distributed tracing = follow request across services. Span = segment.
Example: Jaeger, Zipkin, OpenTelemetry. Trace ID in HTTP headers.
4. When do you need Prometheus federation or remote write?
Show answer
When you have multiple Prometheus servers (per-cluster) and need a global view. Federation pulls selected series from leaf Prometheus into a global one. Remote write pushes to long-term storage (Thanos, Cortex, Mimir). Use remote write for durability; federation for lightweight aggregation.
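Both options sketched as config fragments (hostnames and the match[] selector are illustrative):

```yaml
# Remote write: push samples to long-term storage as they are ingested
remote_write:
  - url: https://mimir.example.com/api/v1/push

# Federation: a global Prometheus scrapes selected series from a leaf
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="node"}'        # only pull the series you need
    static_configs:
      - targets: ['prometheus-leaf:9090']
```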
5. How do you structure on-call alerting to reduce burnout?
Show answer
1) Symptom-based alerts only (no cause-based noise). 2) Every alert has a runbook. 3) Track alert-to-action ratio; prune noisy alerts quarterly. 4) Use escalation policies with timeout. 5) Follow-the-sun rotation across time zones if possible. 6) Blameless postmortems improve signal over time.