# Anti-Primer: Cilium
Everything that can go wrong, will — and in this story, it does.
## The Setup
A platform team is deploying Cilium to manage traffic between 50 microservices. The team chose the technology based on a conference talk and is learning it in production. The rollout must complete before the Q4 feature freeze.
## The Timeline
### Hour 0: Sidecar Resource Overhead
The team deploys sidecars with default resource settings, without accounting for the per-pod overhead. The deadline was looming, and this seemed like the fastest path forward. The result: cluster resource utilization jumps 30%, pods are evicted, and the node autoscaler cannot keep up.
Footgun #1: Sidecar Resource Overhead — default sidecar resource settings ignore the per-pod overhead; utilization jumps 30%, pods are evicted, and the autoscaler falls behind.
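The prevention is to budget for the sidecar explicitly. A minimal sketch of a pod spec that does so — the container names, images, and numbers are illustrative, not Cilium defaults:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout              # hypothetical service
spec:
  containers:
  - name: app
    image: example/checkout:1.0
    resources:
      requests: {cpu: 250m, memory: 256Mi}
      limits:   {cpu: 500m, memory: 512Mi}
  # The proxy sidecar gets its own explicit budget instead of defaults.
  # Across 50 services, even 100m/128Mi per pod adds up fast -- size
  # node pools and autoscaler limits with this overhead included.
  - name: proxy-sidecar       # illustrative name
    image: example/proxy:1.0
    resources:
      requests: {cpu: 100m, memory: 128Mi}
      limits:   {cpu: 200m, memory: 256Mi}
```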
Nobody notices yet. The engineer moves on to the next task.
### Hour 1: mTLS Breaks Existing Traffic
The team enables strict mTLS globally without verifying that every service has a valid certificate. Under time pressure, the team chose speed over caution. The result: services without certificates lose all connectivity, and 15 microservices go down simultaneously.
Footgun #2: mTLS Breaks Existing Traffic — strict mTLS is enabled globally before certificates are verified; 15 microservices lose connectivity at once.
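The safer sequence is permissive first, strict later. In Cilium specifically, mutual authentication is opt-in per network policy rather than a single global switch, which makes an incremental rollout natural. A sketch of requiring it for one verified service pair at a time (Cilium 1.14+ mutual authentication; the policy name and labels are hypothetical):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend-mtls   # hypothetical policy
spec:
  endpointSelector:
    matchLabels:
      app: backend                 # hypothetical label
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    # Require mutually authenticated peers for this flow only.
    # Rolling out policy by policy, verifying connectivity each time,
    # avoids the global flag-flip that took down 15 services.
    authentication:
      mode: "required"
```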
The first mistake is still invisible, making the next shortcut feel justified.
### Hour 2: Retry Storm
The team configures aggressive retry policies (5 retries, no backoff) on every service. Nobody pushed back because the shortcut looked harmless in the moment. The result: a slow downstream service triggers exponential retry amplification, and a 10x traffic spike causes a cascading failure.
Footgun #3: Retry Storm — aggressive retries with no backoff amplify load on a slow dependency until a 10x traffic spike cascades into failure.
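The primer's alternative — a small retry budget with exponential backoff and jitter — can be sketched in a few lines of Python (the retry count, base, and cap are illustrative):

```python
import random

def backoff_delays(max_retries=2, base=0.1, cap=5.0, jitter=True):
    """Yield sleep intervals for a bounded, exponentially backed-off retry.

    Unlike 5 immediate retries, this caps the extra load per request at
    max_retries attempts and spreads them out in time, so a slow
    downstream service sees a trickle of retries, not a 10x spike.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # Full jitter de-synchronizes clients that fail together.
            delay = random.uniform(0, delay)
        yield delay

# Example: at most 2 retries, delays bounded by 0.1s and then 0.2s.
delays = list(backoff_delays())
```

A circuit breaker in front of this loop stops retries entirely once the dependency is known to be unhealthy, which is the other half of the primer's advice.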
Pressure is mounting. The team is behind schedule and cutting more corners.
### Hour 3: Observability Data Explosion
The team enables full distributed tracing on every request at 100% sampling. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: trace storage fills in two days, and the tracing backend becomes more expensive to run than the application it observes.
Footgun #4: Observability Data Explosion — 100% trace sampling fills storage in two days and makes the tracing backend cost more than the application.
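Head sampling at 1-10% is the cheapest fix. The standard trick — in the spirit of OpenTelemetry's trace-ID ratio sampler, though this is a sketch rather than that implementation — is to make the decision a deterministic function of the trace ID, so every service in a request chain keeps or drops the same trace:

```python
import random

def sample_trace(trace_id: int, ratio: float = 0.01) -> bool:
    """Keep roughly `ratio` of traces, decided only by the trace ID.

    Because the decision is a pure function of the ID, every hop in the
    request path makes the same keep/drop call, so sampled traces stay
    complete end-to-end.
    """
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound  # compare low 64 bits

# Simulated workload: 100k random trace IDs at 1% sampling keeps about
# 1k traces -- roughly 100x less volume than the 100%-sampling default.
rng = random.Random(0)
kept = sum(sample_trace(rng.getrandbits(64)) for _ in range(100_000))
```

Tail-based sampling (deciding after the trace completes, keeping errors and slow requests) preserves more signal per stored trace, at the cost of a buffering collector.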
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
## The Postmortem
### Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Sidecar Resource Overhead | Cluster resource utilization jumps 30%; pods are evicted; node autoscaler cannot keep up | Primer: Budget for sidecar resource overhead; adjust pod resource requests accordingly |
| 2 | mTLS Breaks Existing Traffic | Services without certificates lose all connectivity; 15 microservices go down simultaneously | Primer: Enable permissive mTLS first; verify all services can communicate; then switch to strict |
| 3 | Retry Storm | A slow downstream service triggers exponential retry amplification; 10x traffic spike causes cascading failure | Primer: Conservative retry budgets with exponential backoff; circuit breakers before retries |
| 4 | Observability Data Explosion | Trace storage fills in 2 days; tracing backend becomes more expensive than the application | Primer: Sample traces (1-10%); use tail-based sampling; set retention policies from day one |
### Damage Report
- Downtime: 2-4 hours of degraded or unavailable service
- Data loss: Potential, depending on the failure mode and backup state
- Customer impact: Visible errors, degraded performance, or complete outage for affected users
- Engineering time to remediate: 8-16 engineer-hours across incident response and follow-up
- Reputation cost: Internal trust erosion; possible external customer-facing apology
## What the Primer Teaches
- Footgun #1: The primer's section on sidecar resource overhead teaches: budget for the sidecar's footprint and adjust pod resource requests accordingly.
- Footgun #2: The primer's section on mTLS rollout teaches: enable permissive mTLS first, verify all services can communicate, then switch to strict.
- Footgun #3: The primer's section on retry policy teaches: use conservative retry budgets with exponential backoff, and put circuit breakers in front of retries.
- Footgun #4: The primer's section on observability teaches: sample traces at 1-10%, prefer tail-based sampling, and set retention policies from day one.
## Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice