OpenTelemetry - Street-Level Ops¶
Quick Diagnosis Commands¶
# Check if the collector is running and healthy
curl -s http://localhost:13133/ | jq .
# View collector metrics (self-monitoring)
curl -s http://localhost:8888/metrics | grep otelcol
# Check how many spans are being received/exported
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent_spans
# Check for dropped data
curl -s http://localhost:8888/metrics | grep -E "(dropped|refused|failed)"
# Verify OTLP endpoint is listening
ss -tlnp | grep -E '(4317|4318)'
# Test OTLP endpoint with grpcurl
grpcurl -plaintext localhost:4317 list
# Send a test span via HTTP
curl -X POST http://localhost:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{"resourceSpans":[]}'
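An empty resourceSpans array only confirms the endpoint parses JSON; it produces no span. For an end-to-end check, the same curl can POST a minimal real span body. The service name, span name, IDs, and timestamps below are arbitrary placeholder values (trace ID is 32 hex chars, span ID is 16):

```json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [{
        "key": "service.name",
        "value": {"stringValue": "curl-smoke-test"}
      }]
    },
    "scopeSpans": [{
      "spans": [{
        "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
        "spanId": "051581bf3cb55c13",
        "name": "smoke-test-span",
        "kind": 1,
        "startTimeUnixNano": "1700000000000000000",
        "endTimeUnixNano": "1700000001000000000"
      }]
    }]
  }]
}
```

If the debug exporter is enabled, this span should appear in the collector's stdout within a few seconds.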
# Check collector logs in Kubernetes
kubectl logs -n monitoring -l app=otel-collector --tail=100 -f
# View zpages for live pipeline debugging
# (requires zpages extension enabled)
curl -s http://localhost:55679/debug/tracez | head -50
curl -s http://localhost:55679/debug/pipelinez
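The accepted/sent counters above can be compared programmatically to spot a pipeline gap. A sketch with a sample inlined; in practice, replace the variable with `metrics=$(curl -s http://localhost:8888/metrics)`:

```shell
# Sample self-monitoring metrics (stand-in for the live curl output).
metrics='otelcol_receiver_accepted_spans{receiver="otlp"} 1500
otelcol_exporter_sent_spans{exporter="otlp"} 1400'

# Sum the counter values (last field of each matching line).
accepted=$(printf '%s\n' "$metrics" | awk '/^otelcol_receiver_accepted_spans/ {s+=$NF} END {print s+0}')
sent=$(printf '%s\n' "$metrics" | awk '/^otelcol_exporter_sent_spans/ {s+=$NF} END {print s+0}')

# A persistently growing gap means spans are queued or dropped inside the pipeline.
echo "accepted=$accepted sent=$sent gap=$((accepted - sent))"
```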
Gotcha: Collector Reports Healthy But No Data Flows¶
The health check extension only confirms the process is running. It does not validate pipeline connectivity.
Diagnosis:
# Check exporter errors specifically
curl -s http://localhost:8888/metrics | grep otelcol_exporter_send_failed_spans
# Enable debug exporter temporarily
# Add to your config:
# exporters:
# debug:
# verbosity: detailed
# Then add 'debug' to your pipeline's exporters list
# Check if receivers are actually getting data
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans
# If this is 0, the problem is upstream (apps not sending)
The debug exporter is your best friend. It dumps telemetry to stdout. Use it to verify data is flowing through the pipeline before blaming the backend.
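Spelled out as a config fragment, the temporary debug wiring looks like this (keep the real exporter in place and remove debug when done):

```yaml
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp, debug]   # remove 'debug' after diagnosis
```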
Pattern: Minimal Production Collector Config¶
Start here. Add complexity only when needed.
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
batch:
send_batch_size: 1024
timeout: 5s
exporters:
otlp:
endpoint: your-backend:4317
tls:
insecure: false
debug:
verbosity: basic
extensions:
health_check:
endpoint: 0.0.0.0:13133
zpages:
endpoint: 0.0.0.0:55679
service:
extensions: [health_check, zpages]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp]
Key ordering: memory_limiter MUST come before batch in processors. If batch fills up and memory_limiter is after it, the collector OOMs before the limiter can act.
Gotcha: SDK Environment Variables Silently Ignored¶
OTel SDKs read from environment variables, but typos produce zero errors:
# Correct
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=order-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
# Wrong — and you will get no error, just no data
OTEL_EXPORTER_ENDPOINT=http://collector:4317 # Missing OTLP
OTEL_SERVICE=order-service # Missing _NAME
OTEL_RESOURCE_ATTRS=deployment.environment=production # Wrong suffix
Validate with the debug exporter. If your service name shows as "unknown_service", the env var is not being read.
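A preflight check in the service's entrypoint can catch the typo class entirely. A sketch that verifies the canonical variable names are set before the process starts (extend `required` with whatever your stack needs):

```shell
# Fail fast if the canonical OTel env vars are absent. A misspelled
# variable simply won't appear here, which is exactly the point.
required="OTEL_EXPORTER_OTLP_ENDPOINT OTEL_SERVICE_NAME"
missing=""
for v in $required; do
  eval "val=\${$v:-}"
  if [ -z "$val" ]; then
    missing="$missing $v"
  fi
done
if [ -n "$missing" ]; then
  echo "Missing:$missing"
else
  echo "All OTel env vars set"
fi
```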
Pattern: Tail Sampling Without Losing Your Mind¶
Tail sampling requires the collector to hold complete traces in memory. This is where people get burned.
processors:
tail_sampling:
decision_wait: 30s # How long to wait for spans
num_traces: 50000 # Max traces in memory
expected_new_traces_per_sec: 1000
policies:
# Always keep errors
- name: keep-errors
type: status_code
status_code:
status_codes: [ERROR]
# Always keep slow traces
- name: keep-slow
type: latency
latency:
threshold_ms: 3000
# Sample 10% of the rest
- name: probabilistic-sample
type: probabilistic
probabilistic:
sampling_percentage: 10
# Always keep specific operations
- name: keep-payments
type: string_attribute
string_attribute:
key: service.name
values: [payment-service]
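A back-of-envelope check for the values above: traces held in memory at steady state are roughly decision_wait times the new-trace rate, and that product must fit under num_traces or traces get evicted before a sampling decision is made:

```shell
# Capacity sanity check for tail sampling (values from the config above).
decision_wait=30            # seconds a trace is held before deciding
new_traces_per_sec=1000
num_traces=50000            # hard cap on traces held in memory

in_flight=$((decision_wait * new_traces_per_sec))
echo "in-flight=$in_flight limit=$num_traces"
if [ "$in_flight" -ge "$num_traces" ]; then
  echo "WARNING: raise num_traces or lower decision_wait"
fi
```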
Critical constraint: tail sampling must run on a gateway collector, not on per-node agents. All spans for a single trace must hit the same collector instance. Use a load balancing exporter on agents:
# Agent config
exporters:
loadbalancing:
protocol:
otlp:
endpoint: gateway-collector:4317
resolver:
dns:
hostname: gateway-collector-headless
port: 4317
Gotcha: Prometheus Receiver vs OTLP Metrics¶
If you are migrating from Prometheus to OTel, be aware of metric naming:
Prometheus: http_requests_total (counter)
OTel: http.server.request.duration (histogram with different semantics)
The Prometheus receiver scrapes Prometheus endpoints and converts to OTel format. But the names and types do not magically align with OTel semantic conventions. You will have metrics with both naming schemes in your backend.
Use the transform processor to rename if needed:
processors:
transform:
metric_statements:
- context: metric
statements:
- set(name, "http.server.request.count") where name == "http_requests_total"
Pattern: Resource Detection in Kubernetes¶
Auto-populate pod, node, and namespace attributes:
processors:
k8sattributes:
auth_type: "serviceAccount"
passthrough: false
extract:
metadata:
- k8s.pod.name
- k8s.pod.uid
- k8s.namespace.name
- k8s.node.name
- k8s.deployment.name
labels:
- tag_name: app.label.team
key: team
from: pod
pod_association:
- sources:
- from: resource_attribute
name: k8s.pod.ip
The collector's service account needs RBAC to read pods:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otel-collector
rules:
- apiGroups: [""]
resources: ["pods", "namespaces", "nodes"]
verbs: ["get", "list", "watch"]
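The ClusterRole does nothing until it is bound to the collector's identity. A sketch of the matching binding, assuming the collector runs as ServiceAccount `otel-collector` in the `monitoring` namespace (adjust both to your deployment):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: monitoring
```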
Gotcha: Collector Contrib vs Core¶
There are two collector distributions:
- Core (otelcol): minimal set of components; missing most receivers and exporters.
- Contrib (otelcol-contrib): batteries included; Loki, Kafka, AWS, and GCP exporters plus dozens of receivers.
If your config references the loki exporter and you deployed core, the collector exits immediately with "unknown exporter." Check the image tag in your deployment.
# Check which distribution you're running
kubectl get deployment otel-collector -n monitoring -o jsonpath='{.spec.template.spec.containers[0].image}'
# Core: otel/opentelemetry-collector:0.96.0
# Contrib: otel/opentelemetry-collector-contrib:0.96.0
Pattern: Graceful Collector Shutdown¶
When rolling the collector, in-flight data can be lost. Configure shutdown properly:
service:
telemetry:
logs:
level: info
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp]
In your Kubernetes deployment:
spec:
template:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: collector
lifecycle:
preStop:
exec:
command: ["sh", "-c", "sleep 5"]
The batch processor flushes on shutdown, but only if it gets the termination signal and has time to drain. Set terminationGracePeriodSeconds higher than your batch timeout.
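For data that must survive a restart rather than merely drain on shutdown, the exporter's sending queue can be backed by persistent storage. A sketch, assuming the contrib distribution (the file_storage extension is contrib-only) and a writable volume at the path shown:

```yaml
exporters:
  otlp:
    endpoint: your-backend:4317
    sending_queue:
      enabled: true
      storage: file_storage   # queue survives collector restarts
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue
service:
  extensions: [health_check, zpages, file_storage]
```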
Gotcha: Context Propagation Breaks at Boundaries¶
Debug clue: If your trace shows service A calling service B, but service B appears as a separate root trace (no parent span), context propagation is broken at that boundary. Search the backend for traces with root_span=true and service.name=B; these orphaned roots are the smoking gun.
Distributed tracing requires context propagation: the trace ID must cross service boundaries. This fails silently at:
- HTTP: missing traceparent header.
- Message queues: trace context must be injected into message headers. Most auto-instrumentation does NOT do this for Kafka/RabbitMQ.
- gRPC: usually works with auto-instrumentation, but custom interceptors can strip metadata.
- Cross-language: ensure all services use the same propagator. The default is W3C TraceContext; if one service uses B3 (Zipkin), traces break.
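To inspect the HTTP case, log the incoming request headers at service B and look for traceparent. A well-formed W3C header is version-traceid-spanid-flags, all lowercase hex; a quick shape check (the sample value is the standard W3C example):

```shell
# Validate the shape of a captured traceparent header.
tp="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
if echo "$tp" | grep -Eq '^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$'; then
  echo "well-formed traceparent"
else
  echo "malformed or missing traceparent"
fi
```

An all-zero trace ID or span ID is also invalid even though it matches the regex; some SDKs emit it when sampling is misconfigured.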
Pattern: Filtering High-Volume Noise¶
Health checks, readiness probes, and internal metrics chatter can drown real signals:
processors:
filter:
error_mode: ignore
traces:
span:
- 'attributes["http.target"] == "/healthz"'
- 'attributes["http.target"] == "/readyz"'
- 'attributes["http.target"] == "/metrics"'
metrics:
metric:
- 'name == "up"'
logs:
log_record:
- 'severity_number < 9' # Drop DEBUG and below
Place the filter processor BEFORE the batch processor. No point batching data you are about to drop.
Pattern: Debugging Pipeline Issues with zpages¶
Enable zpages for live introspection without touching logs:
# Pipeline status
curl http://localhost:55679/debug/pipelinez
# Recent traces through the collector
curl http://localhost:55679/debug/tracez
# gRPC call stats into and out of the collector
curl http://localhost:55679/debug/rpcz
zpages are for debugging only. Do not expose them outside the cluster.