
Anti-Primer: Kubernetes Ecosystem

Everything that can go wrong, will — and in this story, it does.

The Setup

A team is building a new Kubernetes platform, installing 15 ecosystem tools (Istio, ArgoCD, Prometheus, Cert-Manager, etc.) in a single week. Each engineer picks up a tool and installs it independently, with no coordination on versions or resource allocation.

The Timeline

Hour 0: CRD Version Conflicts

Two tools install different versions of the same CRD (e.g., certificates.cert-manager.io). The deadline was looming, and this seemed like the fastest path forward. The result: one tool silently overwrites the other's CRD, and resources created by the first tool become invalid.

Footgun #1: CRD Version Conflicts — two tools install different versions of the same CRD (e.g., certificates.cert-manager.io); one silently overwrites the other's definition, and resources created by the first tool become invalid.

Nobody notices yet. The engineer moves on to the next task.
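What the collision looks like on the wire: CRDs are cluster-scoped, so two charts shipping the same CRD name fight over one definition, and the last `kubectl apply` wins. A hedged sketch (the version names and schema are illustrative, not the actual cert-manager chart contents):

```yaml
# Illustrative sketch of the conflict, not real chart contents.
# Tool A shipped this CRD serving only an older version; tool B later
# re-applies the same CRD name serving v1. The apply silently replaces
# the cluster-wide definition, orphaning tool A's resources.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: certificates.cert-manager.io   # cluster-scoped: one name, one definition
spec:
  group: cert-manager.io
  names:
    kind: Certificate
    plural: certificates
  scope: Namespaced
  versions:
    - name: v1            # tool B's copy serves only v1...
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
    # ...while tool A's copy served an older version; objects written
    # against it no longer match the definition now stored in etcd.
```

Before installing any operator, `kubectl get crd <name>` tells you whether another tool already owns the definition.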

Hour 1: Resource Starvation

The team installs 15 operators without setting resource limits; each one requests its 'recommended' defaults. Under time pressure, the team chose speed over caution. The result: system pods are evicted because the ecosystem tools consume all available memory.

Footgun #2: Resource Starvation — 15 operators are installed without resource limits, each requesting its 'recommended' defaults; system pods are evicted because the ecosystem tools consume all available memory.

The first mistake is still invisible, making the next shortcut feel justified.
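The fix the primer prescribes is mechanical: give every platform tool explicit requests and limits, and back them with a namespace-level guardrail so an unconfigured operator can't starve the node. A minimal sketch (the `platform` namespace and all sizes are illustrative, not tuned recommendations):

```yaml
# Illustrative defaults; actual sizes depend on your cluster and tools.
apiVersion: v1
kind: LimitRange
metadata:
  name: platform-tool-defaults
  namespace: platform          # hypothetical namespace holding the ecosystem tools
spec:
  limits:
    - type: Container
      default:                 # applied when a container omits limits
        memory: 256Mi
        cpu: 200m
      defaultRequest:          # applied when a container omits requests
        memory: 128Mi
        cpu: 100m
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: platform-tool-budget
  namespace: platform
spec:
  hard:
    requests.memory: 8Gi       # hard ceiling for all tools in the namespace combined
    limits.memory: 12Gi
```

With both in place, a tool that ships without resource settings gets sane defaults, and the namespace as a whole cannot exceed the budget the team planned for.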

Hour 2: No Upgrade Path Planned

The team installs tools via kubectl apply of raw YAML from GitHub releases. Nobody pushed back because the shortcut looked harmless in the moment. The result: six months later, upgrading any tool requires manually diffing hundreds of lines of YAML.

Footgun #3: No Upgrade Path Planned — tools are installed via kubectl apply of raw YAML from GitHub releases; six months later, upgrading any tool requires manually diffing hundreds of lines of YAML.

Pressure is mounting. The team is behind schedule and cutting more corners.
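Since Argo CD is already on the team's install list, one way out is to manage each tool as an Argo CD Application pointing at a version-pinned Helm chart, so an upgrade becomes a one-line revision bump instead of a YAML diff. A hedged sketch (the chart version and values shown are illustrative, not recommendations):

```yaml
# Sketch: one Argo CD Application per ecosystem tool, chart version pinned.
# Upgrading is a change to targetRevision, reviewed and rolled back like any commit.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.jetstack.io   # upstream Helm repository
    chart: cert-manager
    targetRevision: v1.14.4               # illustrative pinned version
    helm:
      values: |
        installCRDs: true                 # illustrative value
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Plain `helm upgrade --install` with a pinned `--version` gets the same property if the team isn't ready for GitOps yet.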

Hour 3: Webhook Timeout Cascade

Multiple admission webhooks are installed without failure policies. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: one webhook's pod crashes and blocks all API server requests, effectively freezing the cluster.

Footgun #4: Webhook Timeout Cascade — multiple admission webhooks are installed without failure policies; one webhook's pod crashes and blocks all API server requests, effectively freezing the cluster.
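For a non-critical webhook, the defense is two fields on the webhook configuration: fail open, and bound the wait. A minimal sketch (the webhook and Service names are hypothetical):

```yaml
# Sketch: a non-critical validating webhook that fails open with a short
# timeout, so a crashed webhook pod cannot block API server requests.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-check        # hypothetical webhook name
webhooks:
  - name: policy.example.com
    failurePolicy: Ignore           # fail open: admit the request if the webhook is down
    timeoutSeconds: 5               # bound the wait (default 10s, maximum 30s)
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: policy-webhook        # hypothetical Service backing the webhook
        namespace: platform
        path: /validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
```

Security-enforcing webhooks may genuinely need `failurePolicy: Fail`; the point is that the choice should be deliberate per webhook, with latency and availability monitored either way.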

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | CRD Version Conflicts | One tool silently overwrites the other's CRD; resources created by the first tool become invalid | Primer: Coordinate CRD installations; check for conflicts before installing new operators |
| 2 | Resource Starvation | System pods are evicted because ecosystem tools consume all available memory | Primer: Budget cluster resources for platform tools before installing; set resource limits |
| 3 | No Upgrade Path Planned | Six months later, upgrading any tool requires manually diffing hundreds of lines of YAML | Primer: Use Helm charts or operators with proper release management |
| 4 | Webhook Timeout Cascade | One webhook's pod crashes and blocks all API server requests; cluster is effectively frozen | Primer: Set failurePolicy: Ignore on non-critical webhooks; monitor webhook latency |

Damage Report

  • Downtime: 2-4 hours of pod-level or cluster-wide disruption
  • Data loss: Risk of volume data loss if StatefulSets were affected
  • Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
  • Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
  • Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on CRD version conflicts, they would have learned to coordinate CRD installations and check for conflicts before installing new operators.
  • Footgun #2: Had the engineer read the primer's section on resource starvation, they would have learned to budget cluster resources for platform tools before installing and to set resource limits.
  • Footgun #3: Had the engineer read the primer's section on upgrade planning, they would have learned to use Helm charts or operators with proper release management.
  • Footgun #4: Had the engineer read the primer's section on webhook timeout cascades, they would have learned to set failurePolicy: Ignore on non-critical webhooks and monitor webhook latency.

Cross-References