Tracing

20 cards: 🟢 4 easy | 🟡 9 medium | 🔴 7 hard

🟢 Easy (4)

1. What is a trace in distributed tracing?

A trace represents the full journey of a request through a distributed system. It is a directed acyclic graph of spans (in practice usually a tree rooted at the entry span), showing how the request flows across services.

Remember: "Trace = end-to-end request journey. Span = one unit of work within a trace." In most traces the spans form a tree.

2. What is a span, and what data does it contain?

A span represents a single unit of work within a trace. It carries the trace ID and its own span ID, plus a name, start time, duration, parent span reference, key-value attributes (tags), and timestamped events.
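As a rough illustration, the fields of a span could be modeled as below. The `Span` class and its field names are hypothetical, chosen for this sketch; they are not the types of any tracing SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical minimal span record (illustrative, not an SDK type)."""
    trace_id: str                          # shared by every span in the trace
    span_id: str                           # unique within the trace
    name: str
    start_ns: int                          # start time in nanoseconds
    duration_ns: int
    parent_span_id: Optional[str] = None   # None marks the root span
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)   # (timestamp_ns, name) pairs

    def add_event(self, timestamp_ns: int, name: str) -> None:
        """Record a timestamped event that happened during the span."""
        self.events.append((timestamp_ns, name))


root = Span("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7",
            "GET /checkout", start_ns=1_000, duration_ns=50_000)
child = Span(root.trace_id, "b7ad6b7169203331", "SELECT orders",
             start_ns=5_000, duration_ns=20_000,
             parent_span_id=root.span_id,
             attributes={"db.system": "postgresql"})
child.add_event(12_000, "rows fetched")
```

The child shares the root's trace ID and names the root as its parent, which is what lets a backend reassemble the spans into one trace.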

Remember: "Trace context propagation = passing trace IDs across service boundaries." Without propagation, you get disconnected spans.

3. Why must trace context be propagated between services?

Without propagation, each service creates isolated traces with no way to connect them. Propagating trace and span IDs in request headers (e.g., W3C traceparent) links all spans into a single end-to-end trace.
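A minimal sketch of that hand-off, assuming plain dictionaries stand in for HTTP headers. The `inject` and `extract` helpers are hypothetical names for this example; a real service would normally rely on its tracing library's propagator instead.

```python
import secrets


def new_span_id() -> str:
    """Random 16-hex-character span ID."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the caller's context into the outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Return (trace_id, parent_span_id), or None if no context arrived."""
    value = headers.get("traceparent")
    if value is None:
        return None
    _version, trace_id, parent_id, _flags = value.split("-")
    return trace_id, parent_id


# Service A starts the trace and calls service B.
trace_id = secrets.token_hex(16)          # 32 hex characters
a_span_id = new_span_id()
outgoing_headers = {}
inject(outgoing_headers, trace_id, a_span_id)

# Service B continues the same trace instead of starting a new one.
b_trace_id, b_parent_id = extract(outgoing_headers)
b_span_id = new_span_id()
```

If `extract` returns `None`, service B has nothing to attach to and would start an isolated trace, which is the failure mode described above.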

Remember: "OpenTelemetry (OTel) = vendor-neutral observability framework." It handles traces, metrics, and logs with one SDK.

4. How do traces complement metrics and logs in the three pillars of observability?

Metrics show that something is wrong (high latency, error rate spike). Logs show what happened in a single service. Traces show why by revealing the full request path across services, exposing which service introduced latency or errors. Use metrics to detect, traces to diagnose, logs for detail.

🟡 Medium (9)

1. What is the difference between head-based and tail-based sampling?

Head-based sampling decides whether to trace at request start (e.g., keep 10% of requests). Tail-based sampling decides after the trace completes, allowing you to keep errors and slow traces. Head sampling is simpler and cheaper; tail sampling catches anomalies but has to buffer complete traces before deciding.
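The two decision points can be sketched as follows. This is an illustration under stated assumptions: the span dictionaries and their keys (`trace_id`, `error`, `duration_ms`) are hypothetical, and hashing the trace ID into a bucket is one common way to make the head decision consistent across services, not the only one.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at request start, using only the trace ID."""
    # Map the low 64 bits of the trace ID onto the unit interval so that
    # every service seeing the same trace ID reaches the same decision.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < ratio


def tail_sample(spans: list, latency_threshold_ms: float, baseline_ratio: float) -> bool:
    """Decide after the trace completes, with every span available."""
    if any(span["error"] for span in spans):
        return True                                   # always keep errors
    if max(span["duration_ms"] for span in spans) >= latency_threshold_ms:
        return True                                   # always keep slow traces
    # Routine traffic falls back to a probabilistic baseline.
    return head_sample(spans[0]["trace_id"], baseline_ratio)


failed = [{"trace_id": "f" * 32, "error": True, "duration_ms": 12}]
slow = [{"trace_id": "f" * 32, "error": False, "duration_ms": 900}]
routine = [{"trace_id": "f" * 32, "error": False, "duration_ms": 12}]
```

With a 500 ms threshold and a 10% baseline, `failed` and `slow` are kept while `routine` (whose trace ID lands at the top of the bucket range) is dropped. The head sampler could never make that distinction, because errors and latency are unknown when the request starts.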

2. What information does the W3C traceparent header contain?

It contains version (2 hex), trace-id (32 hex), parent span-id (16 hex), and trace flags (2 hex, where 01 = sampled). Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01.
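The layout above can be checked with a small parser. This is a sketch that handles only the four-field version `00` layout shown in the example; `parse_traceparent` and `TraceParent` are hypothetical names, not part of any library.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)   # lowest flag bit = sampled
    return TraceParent(version, trace_id, parent_id, sampled)


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Here `parsed.sampled` is `True` because the flags field is `01`; a header ending in `-00` would parse with `sampled=False`.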

Remember: "Sampling controls trace volume." Head sampling = decide at start, Tail sampling = decide at end (keeps errors/slow). 100% sampling crushes storage.

3. What is the role of the OpenTelemetry Collector?

The OTel Collector receives telemetry data (traces, metrics, logs) from instrumented applications, processes it (batching, filtering, enrichment), and exports it to one or more backends like Jaeger, Tempo, or Datadog.

Remember: "Span attributes = structured metadata." Add business context: user_id, order_id, region.

4. How do you correlate traces with logs?

Inject the trace_id and span_id into log lines. This allows jumping from a log entry to the full trace in the tracing UI, connecting structured logs to the request's end-to-end journey.
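One way to do this with only the standard library is a logging filter that stamps every record with the active IDs. The sketch assumes the current context lives in a `contextvars.ContextVar` named `current_trace` (a hypothetical holder for this example); tracing SDKs typically provide their own accessor for the active span.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active (trace_id, span_id) pair.
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True   # never drop the record, only enrich it


stream = io.StringIO()   # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("payment authorized")
log_line = stream.getvalue()
```

The emitted line carries `trace_id=4bf92f3577b34da6a3ce929d0e0e4736`, which is the value you would paste into the tracing UI to open the matching trace.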

5. What is the difference between automatic and manual instrumentation in OpenTelemetry?

Automatic instrumentation uses agents or library hooks to create spans for known frameworks (HTTP, DB, gRPC) with zero code changes. Manual instrumentation uses the SDK to create custom spans for business logic.
Best practice: use auto-instrumentation as a baseline, add manual spans for critical business operations.

6. What is OpenTelemetry Baggage, and how does it differ from span attributes?

Baggage is key-value data propagated across service boundaries via headers (like trace context). Unlike span attributes (local to one span), baggage travels with the request. Use it for cross-cutting concerns like tenant-id or feature-flag, but keep it small — every downstream service receives it.
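To make the "travels with the request" part concrete, here is a sketch of encoding and decoding a `baggage` header as comma-separated `key=value` pairs with percent-encoded values. The helper names are hypothetical, and the decoder simply discards any optional `;property` suffixes rather than interpreting them.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialize entries into a single header value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> dict:
    """Parse a header value back into a dictionary."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue                          # skip malformed members
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]        # drop optional properties
        entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({"tenant-id": "acme", "feature-flag": "new checkout"})
received = decode_baggage(header)
```

Every downstream hop pays for these bytes on every request, which is why the entries should stay few and short.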

7. What causes orphaned spans, and how do you debug them?

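Orphaned spans point to a parent that is missing from the trace, or start a new trace because no context arrived, so they cannot be joined to the rest of the request. Common causes: context not propagated (missing middleware), an async boundary that drops context, a service restart that loses in-flight spans before export, or spans dropped in the collector pipeline. Clock skew does not break parent links, but it can make correctly linked spans look out of order. Debug by checking:
1) propagation headers in requests,
2) instrumentation gaps,
3) collector pipeline for dropped spans.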
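When exported span data is available, a quick scan can show where the gaps are. The sketch below uses a working definition that is an assumption of this example: a span counts as orphaned when it records a parent ID that matches no span in the same trace. The dictionary keys are hypothetical.

```python
def find_orphans(spans: list) -> list:
    """Return spans whose recorded parent is absent from their trace."""
    ids_by_trace = {}
    for span in spans:
        ids_by_trace.setdefault(span["trace_id"], set()).add(span["span_id"])
    return [
        span for span in spans
        if span["parent_id"] is not None
        and span["parent_id"] not in ids_by_trace[span["trace_id"]]
    ]


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "orders"},
    # Parent "c" was never exported, so this span cannot be attached.
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "payments"},
]
orphans = find_orphans(spans)
```

Only `payments` is reported here. Grouping the results by service name usually points at the hop whose spans are missing.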

8. When should you use span links instead of parent-child relationships?

Span links connect causally related spans that are not in a direct parent-child hierarchy — for example, a batch job processing items from a queue where each item originated in a different trace. The link preserves the connection without forcing all items into one trace tree.

9. What is critical path analysis in distributed tracing and how does it identify bottlenecks?

The critical path is the longest chain of sequential spans from trace start to end. Spans on the critical path directly add to total latency; parallel spans off the critical path do not. Optimizing the longest span on the critical path yields the biggest latency reduction.
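A simplified way to extract the critical path from a span tree is to walk backwards from the end of each span and pick the child that finished last before the current cursor. This is a sketch of that heuristic with hypothetical span dictionaries; it assumes children lie within their parent's time range and is not a production algorithm.

```python
def critical_path(span: dict, children_of: dict) -> list:
    """Return the names of spans on the critical path, latest first."""
    path = [span["name"]]
    cursor = span["end"]
    children = sorted(children_of.get(span["id"], []),
                      key=lambda child: child["end"], reverse=True)
    for child in children:
        # A child that ends after the cursor ran in parallel with a
        # span already chosen, so it did not gate the total latency.
        if child["end"] <= cursor:
            path.extend(critical_path(child, children_of))
            cursor = child["start"]
    return path


spans = [
    {"id": "1", "parent": None, "name": "root", "start": 0, "end": 100},
    {"id": "2", "parent": "1", "name": "auth", "start": 0, "end": 20},
    {"id": "3", "parent": "1", "name": "db", "start": 20, "end": 90},
    {"id": "4", "parent": "1", "name": "cache", "start": 25, "end": 40},
]
children_of = {}
for s in spans:
    children_of.setdefault(s["parent"], []).append(s)

root = children_of[None][0]
path = critical_path(root, children_of)
```

In this example `cache` runs in parallel with `db` and is left off the path, so speeding it up would not change the total. `db` is the longest span on the path and the first place to look.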

🔴 Hard (7)

1. How does Grafana Tempo differ from Jaeger in its storage approach?

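Tempo stores traces in object storage (S3, GCS) without a search index over spans. Traces are retrieved directly by trace ID, with IDs typically discovered through logs or metrics (newer Tempo versions also offer TraceQL search). Jaeger indexes spans in its storage backend (e.g., Elasticsearch or Cassandra) for searchability. Tempo is cheaper at scale but leans on other signals for trace discovery.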

2. What happens when one service in a call chain is not instrumented for tracing?

The trace chain breaks at that service if it neither creates spans nor forwards the incoming trace headers. Downstream spans cannot be linked to upstream spans, resulting in disconnected trace fragments. This is one of the most common failures in tracing rollouts.

Remember: "Three pillars of observability: Logs, Metrics, Traces." Logs = events, Metrics = aggregates, Traces = request flows.

3. What should you never include in span attributes, and why?

Never include PII (personal data) or secrets in span attributes. Trace data is stored in backends accessible to many engineers and may be retained for weeks. Sensitive data in spans creates compliance and security exposure.
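A simple guard is to scrub attributes before they are attached to a span. The sketch below uses a hypothetical denylist of key names; a real deployment would need its own list and would probably also inspect values, which this example does not attempt.

```python
# Hypothetical denylist; adjust to your own data model.
SENSITIVE_KEYS = {"password", "authorization", "email", "ssn", "credit_card"}


def scrub_attributes(attributes: dict) -> dict:
    """Return a copy with values of sensitive keys replaced."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in attributes.items()
    }


safe = scrub_attributes({
    "order_id": "A-1001",
    "password": "hunter2",
    "Email": "alice@example.com",
})
```

Here `order_id` passes through unchanged while `password` and `Email` become `[REDACTED]`. Key-based matching only catches fields you already know about, so it complements review of what gets instrumented rather than replacing it.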

4. How do you maintain trace context across asynchronous message queues?

Inject the trace context (traceparent header) into the message metadata/headers when producing. On the consumer side, extract the context and create a new span linked to the producer span. This creates a causal link even though the spans are not parent-child (use Span Links in OTel).
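The producer and consumer sides might look like this, with `queue.Queue` standing in for a real broker. The message shape, helper names, and the choice to start a fresh trace on the consumer are all assumptions of this sketch; the point is only that the consumer span records a link back to the producer span.

```python
import queue
import secrets


def produce(q: queue.Queue, body: str, trace_id: str, span_id: str) -> None:
    """Publish a message with the producer's context in its headers."""
    q.put({
        "headers": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "body": body,
    })


def consume(q: queue.Queue) -> dict:
    """Start a consumer span that links back to the producer span."""
    message = q.get()
    _version, linked_trace_id, linked_span_id, _flags = (
        message["headers"]["traceparent"].split("-")
    )
    return {
        "trace_id": secrets.token_hex(16),   # assumption: new trace per message
        "span_id": secrets.token_hex(8),
        "name": "process " + message["body"],
        "links": [{"trace_id": linked_trace_id, "span_id": linked_span_id}],
    }


broker = queue.Queue()
producer_trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
produce(broker, "order-created", producer_trace_id, producer_span_id)
consumer_span = consume(broker)
```

A batch consumer would collect one link per message it processes, which keeps each originating trace reachable without merging them into one tree.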

5. How do you optimize tracing costs without losing visibility into errors?

Use tail-based sampling to keep 100% of error and high-latency traces while sampling routine traffic at 1-10%. Alternatively, use head-based sampling with a rule engine: always sample traces with specific headers (debug, canary), sample the rest probabilistically. Monitor the traces-per-second rate to stay within budget.
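The head-based rule engine and the budget check can be sketched together. The header names `x-debug-trace` and `x-canary` are hypothetical, as are the traffic numbers in the estimate.

```python
def should_sample(headers: dict, trace_id: str, ratio: float) -> bool:
    """Head-based decision: forced rules first, then probabilistic."""
    if headers.get("x-debug-trace") == "1":    # hypothetical debug header
        return True
    if headers.get("x-canary") == "true":      # hypothetical canary header
        return True
    # Derive a stable bucket from the low 64 bits of the trace ID.
    return int(trace_id[-16:], 16) / 2**64 < ratio


def kept_traces_per_second(requests_per_second: float,
                           forced_fraction: float,
                           ratio: float) -> float:
    """Estimate stored trace volume for a given policy."""
    return requests_per_second * (forced_fraction + (1 - forced_fraction) * ratio)


# 2,000 req/s, 2% always kept, the remainder sampled at 5%.
estimate = kept_traces_per_second(2000, 0.02, 0.05)
```

The estimate works out to 2000 × (0.02 + 0.98 × 0.05) = 138 traces per second, the figure to compare against the storage budget before changing the ratio.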

6. What are the three pipeline stages in an OpenTelemetry Collector and why separate them?

Receivers (accept data in various formats: OTLP, Jaeger, Zipkin), Processors (transform, filter, batch, sample), Exporters (send to backends: Tempo, Jaeger, OTLP). Separation lets you receive in one format, enrich/filter centrally, and export to multiple backends without changing instrumentation.
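The separation can be pictured as three lists of interchangeable functions. This is a toy model in plain Python with hypothetical stage functions, not the Collector's actual configuration or API; it only shows why swapping a stage leaves the others untouched.

```python
def run_pipeline(receivers: list, processors: list, exporters: list) -> list:
    """Pull from every receiver, apply processors in order, fan out."""
    batch = []
    for receive in receivers:
        batch.extend(receive())
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


def otlp_receiver() -> list:
    """Hypothetical receiver returning already-decoded spans."""
    return [
        {"name": "GET /health", "service": "api"},
        {"name": "POST /orders", "service": "api"},
    ]


def drop_health_checks(batch: list) -> list:
    """Filtering processor: discard noisy health-check spans."""
    return [span for span in batch if span["name"] != "GET /health"]


def add_environment(batch: list) -> list:
    """Enrichment processor: tag every span with its environment."""
    return [{**span, "deployment.environment": "prod"} for span in batch]


backend_a, backend_b = [], []   # lists standing in for two backends
result = run_pipeline(
    [otlp_receiver],
    [drop_health_checks, add_environment],
    [backend_a.extend, backend_b.extend],
)
```

Both backends receive the same filtered, enriched span, and adding a third exporter would not require touching the receiver or the instrumented application.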

7. How does a service mesh like Istio inject trace context without application changes?

Envoy sidecars automatically generate spans for inbound/outbound requests and propagate trace headers (W3C traceparent or B3). The application only needs to forward the incoming trace headers on outbound calls. Without header forwarding, traces break into disconnected segments per hop.
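The application's part is just copying a known set of headers from the inbound request onto each outbound call. A minimal sketch, assuming dictionaries for headers and a hypothetical helper name; the list covers the W3C and B3 header names plus `x-request-id`, and should be checked against what your mesh is configured to use.

```python
# Header names commonly forwarded for W3C and B3 propagation.
TRACE_HEADERS = (
    "traceparent", "tracestate",
    "x-request-id",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "b3",
)


def forward_trace_headers(incoming: dict) -> dict:
    """Pick the trace headers to copy onto an outbound request."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


inbound = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "X-Request-Id": "req-42",
    "Content-Type": "application/json",
}
outbound = forward_trace_headers(inbound)
```

`outbound` contains only `traceparent` and `x-request-id`; unrelated headers such as `Content-Type` are left for the HTTP client to set.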