Portal | Level: L2: Operations | Topics: Platform Engineering, Kubernetes Core, CI/CD | Domain: DevOps & Tooling
Platform Engineering Patterns - Primer¶
Why This Matters¶
Platform engineering is the discipline of building and maintaining internal platforms that make developers self-sufficient. Instead of every team reinventing deployment pipelines, observability stacks, and infrastructure provisioning, a platform team builds standardized, self-service tooling. The result: developers ship faster, operations teams aren't bottlenecks, and infrastructure stays consistent.
This is where deep ops experience meets developer productivity. Platform engineers aren't writing application code — they're building the systems that make application code reliably deployable.
Timeline: Platform engineering as a discipline emerged around 2020-2022, though the practices are older. Gartner named it a Top 10 Strategic Technology Trend for 2023. The term crystallized the shift from "DevOps as a culture" to "DevOps as a product" — instead of asking every developer to be an ops expert, build a platform team that provides ops capabilities as self-service products. The Platform Engineering community (platformengineering.org) launched in 2022 and grew rapidly.
Core Concepts¶
What a Platform Is (and Isn't)¶
A platform is not a set of mandates or tickets. It's a product — with users (developers), features (self-service capabilities), and quality standards (reliability, UX).
WITHOUT a platform:
Developer → files Jira ticket → Ops engineer provisions infra → 3-5 days later → Developer deploys
WITH a platform:
Developer → runs CLI/clicks UI → infrastructure provisioned in minutes → deploys via GitOps
The Platform Stack¶
┌──────────────────────────────────────────────────┐
│ Developer Experience │
│ (CLI tools, web portal, docs, templates) │
├──────────────────────────────────────────────────┤
│ Platform Services │
│ (CI/CD, secrets, DNS, certs, databases) │
├──────────────────────────────────────────────────┤
│ Infrastructure Orchestration │
│ (Kubernetes, Terraform, service mesh) │
├──────────────────────────────────────────────────┤
│ Infrastructure │
│ (Cloud providers, bare metal, networking) │
└──────────────────────────────────────────────────┘
Each layer abstracts the complexity below it. Developers interact with the top layer. Platform engineers build and maintain all layers.
Golden Paths¶
A golden path is an opinionated, well-supported way to accomplish a common task. It's not the only path — it's the recommended one.
Name origin: Netflix popularized the "paved road" metaphor in talks around 2017, and Spotify popularized "golden path." The metaphor: instead of forcing developers onto a single road, pave one road so well that everyone chooses it voluntarily. Spotify calls them "golden paths," Netflix calls them "paved roads," Google calls them "blessed paths": same concept, different names.
Example: Deploy a New Service¶
Golden Path:
1. Run `platform create service --name my-app --lang python`
→ Generates repo from template (Dockerfile, CI pipeline, Helm chart, tests)
2. Push code to main branch
→ CI builds, tests, pushes image, updates Helm values
3. ArgoCD detects change
→ Deploys to dev automatically, staging via PR, prod via approval
4. Observability auto-configured
→ Prometheus scraping, Grafana dashboards, Loki log aggregation, alerts
What the developer DOESN'T have to do:
- Write a Dockerfile
- Configure CI/CD
- Create Kubernetes manifests
- Set up monitoring
- Request DNS entries
- Provision certificates
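Under the hood, a scaffolding CLI like the `platform create service` command above mostly renders a template repo with the service's name and language filled in. A minimal sketch (the `{{service_name}}`/`{{language}}` placeholder syntax and file paths are illustrative assumptions, not the real tool):

```python
def render_template(template_files: dict[str, str], service_name: str, language: str) -> dict[str, str]:
    """Fill {{service_name}} / {{language}} placeholders in a template repo
    represented as {path: content}. Placeholders in paths are filled too."""
    rendered = {}
    for path, content in template_files.items():
        out = content.replace("{{service_name}}", service_name)
        out = out.replace("{{language}}", language)
        rendered[path.replace("{{service_name}}", service_name)] = out
    return rendered

# Tiny example template: two files, both parameterized.
template = {
    "deploy/helm/Chart.yaml": "name: {{service_name}}\nversion: 0.1.0\n",
    ".github/workflows/ci.yml": "# CI pipeline for a {{language}} service\n",
}
files = render_template(template, "my-app", "python")
print(files["deploy/helm/Chart.yaml"])
```

Real scaffolders (cookiecutter, Backstage software templates) add prompts, validation, and git/CI registration on top of this same render step.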
Template Repositories¶
Golden paths start with templates:
service-template/
├── .github/workflows/
│ └── ci.yml # Pre-configured CI pipeline
├── deploy/
│ ├── Dockerfile # Multi-stage, security-scanned
│ ├── helm/
│ │ ├── Chart.yaml
│ │ ├── values.yaml
│ │ └── templates/
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ ├── servicemonitor.yaml # Auto-configured Prometheus
│ │ └── ingress.yaml
│ └── argocd-app.yaml
├── src/
│ └── main.py # Hello-world with /health endpoint
├── tests/
│ └── test_main.py
├── Makefile
└── README.md
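The template's `src/main.py` is a hello-world with the `/health` endpoint the platform's probes and scorecards expect. A stdlib-only sketch (the real template might use FastAPI, as the Backstage example later suggests; this version avoids external dependencies):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def health() -> dict:
    """Liveness payload consumed by the Kubernetes health probe."""
    return {"status": "ok"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps(health()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def make_server(port: int = 8080) -> HTTPServer:
    # Call .serve_forever() on the returned server to run it.
    return HTTPServer(("0.0.0.0", port), Handler)
```

Because `/health` ships in the template, every service starts life passing the production-readiness checks described later.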
Self-Service Infrastructure¶
Infrastructure as Catalog Items¶
Instead of Terraform modules that developers need to understand, expose infrastructure as catalog items:
# platform-catalog.yaml
services:
  postgres:
    description: "Managed PostgreSQL database"
    sizes:
      small:  { cpu: 1, memory: 2Gi, storage: 20Gi }
      medium: { cpu: 2, memory: 4Gi, storage: 100Gi }
      large:  { cpu: 4, memory: 16Gi, storage: 500Gi }
    features:
      - automated-backups
      - point-in-time-recovery
      - read-replicas (medium, large only)
  redis:
    description: "Redis cache cluster"
    sizes:
      small:  { memory: 512Mi }
      medium: { memory: 2Gi }
      large:  { memory: 8Gi, replicas: 3 }
  s3-bucket:
    description: "Object storage bucket"
    options:
      versioning: true
      encryption: AES-256
      lifecycle_days: 90
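The platform API behind the catalog only has to validate a request (service name + size) and expand it into a concrete spec. A hedged sketch, with the catalog inlined as a dict mirroring `platform-catalog.yaml`:

```python
# Catalog mirrors platform-catalog.yaml above (subset, for illustration).
CATALOG = {
    "postgres": {"sizes": {
        "small":  {"cpu": 1, "memory": "2Gi", "storage": "20Gi"},
        "medium": {"cpu": 2, "memory": "4Gi", "storage": "100Gi"},
        "large":  {"cpu": 4, "memory": "16Gi", "storage": "500Gi"},
    }},
    "redis": {"sizes": {
        "small":  {"memory": "512Mi"},
        "medium": {"memory": "2Gi"},
        "large":  {"memory": "8Gi", "replicas": 3},
    }},
}

def resolve_request(service: str, size: str) -> dict:
    """Validate a self-service request against the catalog and return the
    concrete spec the provisioner would act on."""
    if service not in CATALOG:
        raise ValueError(f"unknown service {service!r}")
    sizes = CATALOG[service]["sizes"]
    if size not in sizes:
        raise ValueError(f"{service} has no size {size!r}; choose from {sorted(sizes)}")
    return {"service": service, "size": size, **sizes[size]}
```

The point of the size presets: developers pick t-shirt sizes, and the platform team controls (and can later retune) what each size actually means.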
Kubernetes Operators for Platform Services¶
Custom operators provision platform services via CRDs:
# Developer creates this:
apiVersion: platform.internal/v1
kind: Database
metadata:
  name: myapp-db
  namespace: myapp
spec:
  engine: postgres
  version: "15"
  size: medium
  backup:
    schedule: "0 2 * * *"
    retention: 30d

# The platform operator:
# 1. Provisions a PostgreSQL StatefulSet
# 2. Creates a Secret with credentials
# 3. Sets up automated backups
# 4. Configures monitoring & alerts
# 5. Registers in service catalog
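The heart of such an operator is a reconcile step that translates the custom resource into child manifests. A sketch of just that translation (real operators use a framework like kopf or controller-runtime and actually apply the manifests; only the StatefulSet and Secret from steps 1-2 are shown, and field values are illustrative):

```python
def reconcile_database(cr: dict) -> list[dict]:
    """Translate a Database custom resource (as a parsed dict) into the
    child manifests the operator would apply: StatefulSet + credentials Secret."""
    name = cr["metadata"]["name"]
    ns = cr["metadata"]["namespace"]
    spec = cr["spec"]
    statefulset = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "namespace": ns},
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "template": {"spec": {"containers": [{
                "name": spec["engine"],
                "image": f"{spec['engine']}:{spec['version']}",
            }]}},
        },
    }
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{name}-credentials", "namespace": ns},
        "stringData": {"username": name, "password": "<generated>"},
    }
    return [statefulset, secret]

# The CR from the example above, as the operator would receive it:
cr = {
    "metadata": {"name": "myapp-db", "namespace": "myapp"},
    "spec": {"engine": "postgres", "version": "15", "size": "medium"},
}
children = reconcile_database(cr)
```

Reconciliation runs repeatedly and idempotently: the operator keeps comparing desired state (the CR) with actual state and converging the two.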
Internal Developer Portal¶
What It Contains¶
| Section | Content |
|---|---|
| Service catalog | All running services, owners, health status |
| Documentation | API docs, runbooks, architecture diagrams |
| Templates | Golden path templates for new services |
| Self-service | Provisioning, DNS, certs, secrets |
| Scorecards | Production readiness, security compliance |
Backstage (Popular Open-Source Portal)¶
# catalog-info.yaml — registered in each service repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: grokdevops
  description: FastAPI service with Prometheus metrics
  annotations:
    github.com/project-slug: your-org/grokdevops
    prometheus.io/rule: 'rate(http_requests_total{service="grokdevops"}[5m])'
    argocd/app-name: grokdevops
  tags:
    - python
    - fastapi
spec:
  type: service
  lifecycle: production
  owner: platform-team
  system: core-infrastructure
  dependsOn:
    - resource:default/postgres
  providesApis:
    - grokdevops-api
CI/CD as a Platform Service¶
Standardized Pipelines¶
Instead of every team writing their own CI, provide reusable workflow templates:
# .github/workflows/ci.yml — in each service repo
name: CI
on: [push, pull_request]

jobs:
  build:
    uses: your-org/platform-workflows/.github/workflows/build-test-deploy.yml@v2
    with:
      language: python
      python-version: "3.11"
      helm-chart-path: deploy/helm
    secrets: inherit
The shared workflow handles:
- Lint, test, build
- Security scanning (Trivy, SAST)
- Container image build + push
- Helm chart validation
- Image tag promotion to GitOps repo
Environment Promotion¶
main branch push
→ Build + Test
→ Push image (sha-abc123)
→ Update dev overlay (auto-deploy)
→ Run smoke tests against dev
→ Create staging PR (manual merge)
→ Run integration tests against staging
→ Create prod PR (requires approval)
→ Deploy to prod (canary → full)
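The promotion steps above boil down to bumping the image tag in an environment overlay of the GitOps repo; Argo CD notices the change and syncs it. A minimal sketch of that values update (the values structure is an assumption; a real pipeline would commit this change or open the staging/prod PR):

```python
def promote_image(overlay_values: dict, new_tag: str) -> dict:
    """Return a copy of an environment overlay's Helm values with
    image.tag bumped to the freshly built tag (e.g. "sha-abc123")."""
    image = {**overlay_values.get("image", {}), "tag": new_tag}
    return {**overlay_values, "image": image}

dev_values = {
    "image": {"repository": "registry.internal/my-app", "tag": "sha-old111"},
    "replicas": 2,
}
promoted = promote_image(dev_values, "sha-abc123")
```

Because every environment is just a file in git, the same function serves dev auto-deploys and the PR-gated staging/prod promotions; only the review policy differs.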
Platform Metrics¶
What to measure to know if your platform is working:
| Metric | What It Tells You |
|---|---|
| Lead time for changes | How long from commit to production |
| Deployment frequency | How often teams can deploy |
| Time to onboard | How long for a new service to reach production |
| Self-service ratio | % of infra requests handled without a ticket |
| Mean time to recover | How quickly teams can fix production issues |
| Platform adoption | % of teams using golden paths vs custom |
| Developer satisfaction | NPS or survey scores from platform users |
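Lead time for changes, the first metric in the table, is straightforward to compute once you record commit and deploy timestamps. A sketch using the median to resist outliers (reporting the mean or a percentile instead is a reasonable design choice):

```python
from datetime import datetime, timedelta
from statistics import median

def lead_time_for_changes(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Median commit-to-production duration over (commit_time, deploy_time)
    pairs; timestamps would come from git history and the CD system."""
    return median(deployed - committed for committed, deployed in pairs)

t0 = datetime(2024, 1, 1, 9, 0)
samples = [
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=4)),
    (t0, t0 + timedelta(hours=30)),  # one slow outlier barely moves the median
]
print(lead_time_for_changes(samples))
```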
Production Readiness¶
Scorecards¶
Automated checks that verify a service meets production standards:
production-readiness:
  required:
    - health-endpoint: "Service exposes /health"
    - readiness-endpoint: "Service exposes /ready"
    - resource-limits: "CPU and memory limits set"
    - replicas: "At least 2 replicas in prod"
    - monitoring: "ServiceMonitor exists"
    - alerts: "At least 1 alert rule defined"
    - runbook: "Runbook linked in docs"
    - owner: "Team ownership declared"
  recommended:
    - pdb: "PodDisruptionBudget configured"
    - hpa: "Autoscaling configured"
    - network-policy: "Network policies in place"
    - security-scan: "No critical vulnerabilities"
    - slo: "SLO defined and tracked"
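A scorecard engine is just a set of named predicates run against facts about each service. A sketch covering a few of the required checks above (the service descriptor's field names are illustrative; a real engine, e.g. a Backstage scorecard plugin, would pull these facts from the cluster and CI):

```python
# Each check maps a scorecard name to a predicate over a service descriptor.
REQUIRED_CHECKS = {
    "health-endpoint": lambda s: "/health" in s.get("endpoints", []),
    "resource-limits": lambda s: bool(s.get("resources", {}).get("limits")),
    "replicas":        lambda s: s.get("replicas", 0) >= 2,
    "owner":           lambda s: bool(s.get("owner")),
}

def evaluate_scorecard(service: dict, checks: dict) -> dict:
    """Run each named check against a service descriptor; returns
    {check-name: passed} for display in the portal."""
    return {name: bool(check(service)) for name, check in checks.items()}

svc = {
    "endpoints": ["/health", "/ready"],
    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
    "replicas": 2,
    "owner": "team-payments",
}
print(evaluate_scorecard(svc, REQUIRED_CHECKS))
```

Keeping checks as data rather than hard-coded logic lets the platform team version them, and lets teams see exactly which gate they're failing and why.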
Common Pitfalls¶
Interview tip: When asked about platform engineering in interviews, the key insight is: "Treat your platform as a product, your developers as customers." This means user research (talk to devs), product roadmap (prioritize by pain), adoption metrics (track usage), and iterating based on feedback — not mandating from above.
Analogy: A good internal platform is like a well-designed airport. Travelers (developers) should be able to check in, board, and fly (ship code) without understanding air traffic control, fuel logistics, or runway maintenance. But when something goes wrong, the control tower (platform team) needs full visibility into every layer.
- Building a platform nobody asked for — Talk to your developers first. Solve their actual pain points, not what you think they need.
- Mandating without value — If the golden path is harder than the cowpath, developers will route around it.
- Platform as bottleneck — If every request still needs a platform team member, you've just renamed "ops" to "platform."
- Over-abstracting — Hide complexity, but don't hide it so deeply that debugging becomes impossible.
- Ignoring the escape hatch — Some teams have legitimate reasons to go off the golden path. Make that possible (but tracked).
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
- CI/CD Pipelines & Patterns (Topic Pack, L1)
Next Steps¶
- Multi-Tenancy Patterns (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — CI/CD, Kubernetes Core
- Mental Models (Core Concepts) (Topic Pack, L0) — CI/CD, Kubernetes Core
- Backstage & Developer Portals (Topic Pack, L2) — Platform Engineering
- CI Pipeline Documentation (Reference, L1) — CI/CD
- CI/CD Drills (Drill, L1) — CI/CD
- CI/CD Flashcards (CLI) (flashcard_deck, L1) — CI/CD
- CI/CD Pipelines & Patterns (Topic Pack, L1) — CI/CD
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core