Platform Engineering Footguns

Mistakes that make your platform a liability instead of an accelerator.


1. Building the platform before talking to developers

You spend 6 months building an elaborate self-service portal. Developers wanted a simple CLI. Nobody uses the portal. You've built a product for an audience that doesn't exist.

Fix: Interview your users. Start with their top 3 pain points. Build the minimum viable platform that solves those. Iterate based on adoption and feedback.

War story: Spotify's Backstage started as an internal tool to solve a specific pain point: developers couldn't find which team owned which service. They didn't build a portal — they built a service catalog. The platform grew from there based on actual developer needs, not architectural ambition.


2. Abstracting away debuggability

Your platform hides Kubernetes so well that when a pod crashes, the developer can't figure out why. They can't see logs, can't describe pods, can't check events. They file a ticket and wait.

Fix: Hide complexity for the happy path. Expose diagnostics for the unhappy path. Commands like platform logs my-service and platform debug my-service should work out of the box. Developers need to see what's happening when things break.
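A minimal sketch of such a CLI, assuming a hypothetical platform wrapper that maps friendly subcommands onto kubectl diagnostics. It is shown in dry-run form (it prints the kubectl commands it would run), and it assumes the convention that each service lives in a namespace of the same name:

```shell
# Hypothetical "platform" wrapper exposing diagnostics for the unhappy path.
# Dry-run form: prints the kubectl commands it would execute.
# Assumes a convention where a service deploys into a namespace of its own name.
platform() {
  cmd="$1"; svc="$2"
  case "$cmd" in
    logs)
      echo "kubectl logs -n $svc -l app=$svc --tail=200 --all-containers" ;;
    debug)
      echo "kubectl describe pods -n $svc -l app=$svc"
      echo "kubectl get events -n $svc --sort-by=.lastTimestamp" ;;
    *)
      echo "usage: platform {logs|debug} <service>" >&2; return 1 ;;
  esac
}

platform logs my-service
```

The point is not the wrapper itself but the contract: the golden path stays simple, while the escape hatch to raw diagnostics is always one command away.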

Remember: The "pit of success" principle says to make the right thing easy and the wrong thing hard. Your platform should make deploying correctly trivial (golden path), make debugging possible (escape hatch), and make breaking things difficult (guardrails). If developers can't debug without filing a ticket, the platform is a wall, not a path.


3. No escape hatch

Every service must use the golden path. A team has a legitimate edge case (GPU workloads, non-HTTP protocols, external SaaS integration). They can't deploy it through the platform. They work around it with shadow infrastructure.

Fix: Support off-ramps. Let teams use raw Kubernetes manifests when needed. Track non-standard deployments so you know where your platform doesn't fit.
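One lightweight way to track those off-ramps is a labeling convention on the raw manifests themselves. The label keys and values below are illustrative, not any standard:

```yaml
# Illustrative convention: raw-manifest deployments carry labels so the
# platform team can inventory where the golden path doesn't fit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-trainer
  labels:
    platform.example.com/managed: "false"
    platform.example.com/exception: "gpu-workload"
```

A periodic kubectl get deployments -A -l platform.example.com/managed=false then gives you a live inventory of non-standard deployments, which is exactly the data you need to plan the platform's next features.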


4. Platform changes that break existing services

You update the shared Helm template to add a mandatory network policy. Every service re-deploys. Half of them break because they relied on cross-namespace traffic that's now blocked.

Fix: Version your platform components with semver: breaking changes go in major releases only, never in minors or patches. Provide migration guides. Roll out gradually with opt-in before mandatory.

Gotcha: Shared Helm templates are the most common source of platform-caused outages. When you change a shared template, every helm upgrade on every service picks up the change. Pin shared chart versions per-service and let teams opt in to upgrades.
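Pinning might look like this in a service-owned Chart.yaml, assuming the shared template is consumed as a Helm dependency (chart name and repository URL are placeholders):

```yaml
# Service-owned Chart.yaml: the shared platform chart is pinned to an exact
# version, so a platform-side template change only lands when this team
# deliberately bumps the pin.
apiVersion: v2
name: my-service
version: 1.4.2
dependencies:
  - name: platform-base
    version: 2.3.1   # exact pin; avoid ranges like ">=2.3.0" or "2.x"
    repository: https://charts.example.com/platform
```

With a range like "2.x", every helm dependency update silently pulls the latest shared template; an exact pin turns platform upgrades into an explicit, reviewable diff in the service's own repo.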


5. Single cluster, no DR

Your entire platform runs on one Kubernetes cluster. The cloud provider has a regional outage. Everything is down. There's no failover because you never built one.

Fix: At minimum, have a documented procedure for standing up a new cluster. Better: run active-passive or active-active across regions. Test failover quarterly.


6. Shared CI secrets across all teams

Your shared CI workflow uses org-level secrets for Docker registry, AWS credentials, and deploy keys. Every repo can access every secret. One compromised repo exposes everything.

Fix: Scope secrets to environments and repos. Use OIDC federation instead of long-lived credentials. Give each team their own service accounts with least-privilege access.
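As one concrete shape of this, GitHub Actions supports OIDC federation to AWS via the official aws-actions/configure-aws-credentials action. The workflow fragment below assumes a per-team IAM role (the ARN is a placeholder) and stores no long-lived secret anywhere:

```yaml
# Per-repo workflow job using OIDC: GitHub issues a short-lived identity
# token, which is exchanged for temporary AWS credentials scoped to one
# team's role. No shared org-level secret is involved.
permissions:
  id-token: write   # required so the job can request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/team-a-deploy
          aws-region: us-east-1
```

The blast radius of a compromised repo shrinks to whatever that one role can do, and the IAM trust policy can further restrict which repo and branch may assume it.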


7. No observability for the platform itself

You built monitoring for application teams but not for your own platform services. ArgoCD runs out of memory and starts failing syncs. Nobody notices for 2 hours because there are no alerts on ArgoCD itself.

Fix: Your platform services are production services. They need the same monitoring, alerting, and on-call coverage as any application. Eat your own dog food.
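For ArgoCD specifically, the application controller exports Prometheus metrics such as argocd_app_info, which can drive an alert on stuck syncs. The threshold and severity below are illustrative:

```yaml
# Prometheus alerting rule on ArgoCD's own metrics: page if any application
# has been reporting a non-Synced status for 15 minutes.
groups:
  - name: platform-self-monitoring
    rules:
      - alert: ArgoCDAppOutOfSync
        expr: count(argocd_app_info{sync_status!="Synced"}) > 0
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "{{ $value }} ArgoCD applications are not Synced"
```

The same pattern applies to every platform component: cert-manager, ingress controllers, and the CI system all export metrics, and each deserves at least one paging alert.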


8. Documentation that's always out of date

Your platform docs describe v1 behavior. The platform is on v3. Developers follow the docs, hit errors, and lose trust in the platform. They start asking colleagues instead of reading docs.

Fix: Treat docs as code. Test documentation steps in CI. Link docs to the version they describe. Make updating docs part of every platform change PR.
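A minimal version of "test documentation steps in CI" is a script that extracts fenced shell blocks from a markdown page and executes them, so a stale quickstart fails the build instead of a developer. This is a sketch; purpose-built doc-testing tools handle this more robustly:

````shell
# Sketch of a docs-as-code CI step: pull fenced shell blocks out of a
# markdown file and run them, failing if any command errors.
extract_shell_blocks() {
  awk '/^```shell/{inblock=1; next} /^```/{inblock=0} inblock' "$1"
}

# Demo doc with one runnable step.
cat > /tmp/quickstart.md <<'EOF'
# Quickstart
```shell
echo deploy-ok
```
EOF

extract_shell_blocks /tmp/quickstart.md | sh
````

Run under set -e in CI, any command that no longer works in the current platform version breaks the build, which is exactly the feedback loop that keeps docs honest.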


9. Mandating the platform without supporting it

Management decides all teams must use the platform by Q3. The platform team has 3 engineers supporting 20 teams. Response time on support requests is 3 days. Teams are forced onto a platform that can't support them.

Fix: Scale support with the mandate. Define clear SLOs for platform support (e.g., P1 response in 30 minutes). Automate common support requests. Build a knowledge base from repeated questions.

Under the hood: The Team Topologies framework suggests roughly 1 platform engineer per 5-10 product teams for a mature platform, or 1:3-5 during active development. Below that ratio, the platform team becomes a bottleneck and developers route around the platform — defeating its purpose. Track "time to first deploy" for new teams as your key adoption metric.


10. Coupling platform upgrades to application deploys

Upgrading the platform (new Kubernetes version, new Istio, new cert-manager) triggers redeployment of all applications. An application team's Friday deploy fails because a platform upgrade changed an API version they depend on.

Fix: Decouple platform upgrades from application deploys. Platform infrastructure changes should not trigger application redeployments. Test platform changes against existing running workloads, not freshly deployed ones.

Default trap: With automated sync enabled, ArgoCD re-syncs an application whenever its Application spec changes. A platform team updating an ApplicationSet template rewrites every generated Application at once, which can trigger simultaneous redeployments of every service in the cluster. During platform upgrade windows, disable automated sync (or at least syncPolicy.automated.selfHeal) on critical applications.
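For reference, the relevant knobs live under syncPolicy on the Application resource. The sketch below leaves automated self-healing off, so drift during a platform upgrade doesn't trigger an immediate resync (repo URL and names are placeholders):

```yaml
# ArgoCD Application with conservative sync settings for platform-upgrade
# windows: no pruning, no automatic drift correction.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs
    path: my-service
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: false
      selfHeal: false   # don't auto-correct drift during platform upgrades
```

Flipping selfHeal and prune back on after the upgrade window restores normal GitOps behavior without having exposed every service to a simultaneous rollout.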