Anti-Primer: Platform Engineering
Everything that can go wrong, will — and in this story, it does.
The Setup
A platform team is building Platform Engineering capabilities for a growing engineering organization. The team of three must support 80 developers across 12 teams. They are building under pressure to show value before the next budget cycle.
The Timeline
Hour 0: Building Without User Research
The team builds the platform features it thinks developers need without actually asking them. The deadline was looming, and this seemed like the fastest path forward. The result: developers ignore the platform and build their own tooling, and the effort is wasted.
Footgun #1: Building Without User Research. The team builds features it assumes developers need without asking them, so developers ignore the platform and build their own tooling; the effort is wasted.
Nobody notices yet. The engineer moves on to the next task.
Hour 1: No Self-Service Documentation
The platform requires the platform team to onboard every new service manually. Under time pressure, the team chose speed over caution. The result: the platform team becomes a bottleneck, developers wait days for access, and resentment builds.
Footgun #2: No Self-Service Documentation. The platform team must onboard every new service manually, so it becomes a bottleneck; developers wait days for access and resentment builds.
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: Breaking Changes Without Migration Path
The team pushes a breaking API change to the internal platform without a deprecation period. Nobody pushed back because the shortcut looked harmless in the moment. The result: 12 CI/CD pipelines break simultaneously on Monday morning, and the platform team's credibility is damaged.
Footgun #3: Breaking Changes Without Migration Path. The team pushes a breaking API change to the internal platform without a deprecation period, so 12 CI/CD pipelines break simultaneously on Monday morning; the platform team's credibility is damaged.
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Over-Engineering the Abstraction
The team builds a complex abstraction layer that covers every edge case but is impossible to debug. It had gotten away with similar shortcuts before, so nobody raised a flag. The result: developers cannot troubleshoot issues, and every platform problem escalates to the platform team.
Footgun #4: Over-Engineering the Abstraction. The team builds a complex abstraction layer that covers every edge case but is impossible to debug, so developers cannot troubleshoot issues; every platform problem escalates to the platform team.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Building Without User Research | Developers ignore the platform and build their own tooling; effort is wasted | Primer: Interview internal users before building; treat platform features like product features |
| 2 | No Self-Service Documentation | Platform team becomes a bottleneck; developers wait days for access; resentment builds | Primer: Self-service onboarding with documentation; automate the golden path |
| 3 | Breaking Changes Without Migration Path | 12 CI/CD pipelines break simultaneously on Monday morning; platform team credibility is damaged | Primer: Versioned APIs, deprecation notices, and migration guides before any breaking change |
| 4 | Over-Engineering the Abstraction | Developers cannot troubleshoot issues; every platform problem escalates to the platform team | Primer: Thin abstractions that are transparent and debuggable; escape hatches for advanced users |
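As a rough illustration of row 3, a deprecation period can be as simple as keeping the old entry point alive as a warning shim that forwards to the new one. The sketch below is hypothetical throughout: `deploy_v1`, `deploy_v2`, the `environment` parameter, and the sunset date are invented for the example and do not describe any real platform API.

```python
import warnings
from datetime import date

# Hypothetical removal date for the old call; the story does not specify one.
V1_SUNSET = date(2030, 1, 1)


def deploy_v2(service: str, *, environment: str) -> dict:
    """New API (hypothetical): the target environment must be named explicitly."""
    return {"service": service, "environment": environment, "api_version": 2}


def deploy_v1(service: str) -> dict:
    """Old API, kept alive for the deprecation period.

    It warns the caller and forwards to v2, so existing pipelines keep working.
    """
    warnings.warn(
        f"deploy_v1 is deprecated and will be removed on {V1_SUNSET.isoformat()}; "
        "call deploy_v2(service, environment=...) instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this shim
    )
    # Assumption for this sketch: v1 always targeted production implicitly.
    return deploy_v2(service, environment="production")
```

Pipelines that still call `deploy_v1("billing")` keep working and emit a `DeprecationWarning` that names the replacement, which gives teams time to migrate before the old call is removed. Python hides `DeprecationWarning` by default outside `__main__` and most test runners, so a real platform would probably also log the notice or announce it through its usual channels.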
Damage Report
- Downtime: 2-4 hours of degraded or unavailable service
- Data loss: Potential, depending on the failure mode and backup state
- Customer impact: Visible errors, degraded performance, or complete outage for affected users
- Engineering time to remediate: 8-16 engineer-hours across incident response and follow-up
- Reputation cost: Internal trust erosion; possible external customer-facing apology
What the Primer Teaches
- Footgun #1: Had the engineer read the primer's section on building without user research, they would have learned to interview internal users before building and to treat platform features like product features.
- Footgun #2: Had the engineer read the primer's section on self-service documentation, they would have learned to provide self-service onboarding with documentation and to automate the golden path.
- Footgun #3: Had the engineer read the primer's section on breaking changes without a migration path, they would have learned to ship versioned APIs, deprecation notices, and migration guides before any breaking change.
- Footgun #4: Had the engineer read the primer's section on over-engineering the abstraction, they would have learned to build thin abstractions that are transparent and debuggable, with escape hatches for advanced users.
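One way to picture the "thin, transparent, with escape hatches" lesson is a wrapper that maps a small request object one to one onto the underlying command and can print exactly what it would run. This is a minimal sketch under stated assumptions: `DeployRequest`, `render_command`, `explain`, and the `deployctl` tool name are all made up for illustration.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class DeployRequest:
    """Hypothetical golden-path request: a handful of well-known fields."""
    service: str
    image: str
    replicas: int = 2
    # Escape hatch: extra flags are passed through verbatim for advanced users.
    extra_args: list[str] = field(default_factory=list)


def render_command(req: DeployRequest) -> list[str]:
    """Translate the request into the underlying CLI call, field by field.

    `deployctl` is an invented name standing in for whatever tool the platform wraps.
    """
    return [
        "deployctl", "apply",
        "--service", req.service,
        "--image", req.image,
        "--replicas", str(req.replicas),
        *req.extra_args,
    ]


def explain(req: DeployRequest) -> str:
    """Return the exact command line, so developers can inspect and rerun it themselves."""
    return shlex.join(render_command(req))
```

For example, `explain(DeployRequest("billing", "billing:1.4.2", extra_args=["--timeout", "90s"]))` returns `deployctl apply --service billing --image billing:1.4.2 --replicas 2 --timeout 90s`. Because the mapping is visible, a developer can troubleshoot a failed deploy without escalating to the platform team.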
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice