Portal | Level: L2: Operations | Topics: Platform Engineering, Kubernetes Core, CI/CD | Domain: DevOps & Tooling
Platform Engineering Patterns - Primer¶
Why This Matters¶
Platform engineering is the discipline of building and maintaining internal platforms that make developers self-sufficient. Instead of every team reinventing deployment pipelines, observability stacks, and infrastructure provisioning, a platform team builds standardized, self-service tooling. The result: developers ship faster, operations teams aren't bottlenecks, and infrastructure stays consistent.
This is where deep ops experience meets developer productivity. Platform engineers aren't writing application code — they're building the systems that make application code reliably deployable.
Timeline: Platform engineering as a discipline emerged around 2020-2022, though the practices are older. Gartner named it a Top 10 Strategic Technology Trend for 2023. The term crystallized the shift from "DevOps as a culture" to "DevOps as a product" — instead of asking every developer to be an ops expert, build a platform team that provides ops capabilities as self-service products. The Platform Engineering community (platformengineering.org) launched in 2022 and grew rapidly.
Core Concepts¶
What a Platform Is (and Isn't)¶
A platform is not a set of mandates or tickets. It's a product — with users (developers), features (self-service capabilities), and quality standards (reliability, UX).
WITHOUT a platform:
Developer → files Jira ticket → Ops engineer provisions infra → 3-5 days later → Developer deploys
WITH a platform:
Developer → runs CLI/clicks UI → infrastructure provisioned in minutes → deploys via GitOps
The Platform Stack¶
┌──────────────────────────────────────────────────┐
│ Developer Experience │
│ (CLI tools, web portal, docs, templates) │
├──────────────────────────────────────────────────┤
│ Platform Services │
│ (CI/CD, secrets, DNS, certs, databases) │
├──────────────────────────────────────────────────┤
│ Infrastructure Orchestration │
│ (Kubernetes, Terraform, service mesh) │
├──────────────────────────────────────────────────┤
│ Infrastructure │
│ (Cloud providers, bare metal, networking) │
└──────────────────────────────────────────────────┘
Each layer abstracts the complexity below it. Developers interact with the top layer. Platform engineers build and maintain all layers.
Golden Paths¶
A golden path is an opinionated, well-supported way to accomplish a common task. It's not the only path — it's the recommended one.
Name origin: Netflix popularized the "paved road" metaphor in talks around 2017, and Spotify popularized "golden path." The metaphor: instead of forcing developers onto a single road, pave one road so well that everyone chooses it voluntarily. Spotify calls them "golden paths," Netflix calls them "paved roads," Google calls them "blessed paths": same concept, different names.
Example: Deploy a New Service¶
Golden Path:
1. Run `platform create service --name my-app --lang python`
→ Generates repo from template (Dockerfile, CI pipeline, Helm chart, tests)
2. Push code to main branch
→ CI builds, tests, pushes image, updates Helm values
3. ArgoCD detects change
→ Deploys to dev automatically, staging via PR, prod via approval
4. Observability auto-configured
→ Prometheus scraping, Grafana dashboards, Loki log aggregation, alerts
What the developer DOESN'T have to do:
- Write a Dockerfile
- Configure CI/CD
- Create Kubernetes manifests
- Set up monitoring
- Request DNS entries
- Provision certificates
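Under the hood, a scaffolding CLI like the `platform create service` command above mostly renders a template repo with the service's name and language filled in. A minimal sketch (the `{{service_name}}`/`{{language}}` placeholder syntax and file paths are illustrative assumptions, not the real tool):

```python
def render_template(template_files: dict[str, str], service_name: str, language: str) -> dict[str, str]:
    """Fill {{service_name}} / {{language}} placeholders in a template repo
    represented as {path: content}. Placeholders in paths are filled too."""
    rendered = {}
    for path, content in template_files.items():
        out = content.replace("{{service_name}}", service_name)
        out = out.replace("{{language}}", language)
        rendered[path.replace("{{service_name}}", service_name)] = out
    return rendered

# Tiny example template: two files, both parameterized.
template = {
    "deploy/helm/Chart.yaml": "name: {{service_name}}\nversion: 0.1.0\n",
    ".github/workflows/ci.yml": "# CI pipeline for a {{language}} service\n",
}
files = render_template(template, "my-app", "python")
print(files["deploy/helm/Chart.yaml"])
```

Real scaffolders (cookiecutter, Backstage software templates) add prompts, validation, and git/CI registration on top of this same render step.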
Template Repositories¶
Golden paths start with templates:
service-template/
├── .github/workflows/
│ └── ci.yml # Pre-configured CI pipeline
├── deploy/
│ ├── Dockerfile # Multi-stage, security-scanned
│ ├── helm/
│ │ ├── Chart.yaml
│ │ ├── values.yaml
│ │ └── templates/
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ ├── servicemonitor.yaml # Auto-configured Prometheus
│ │ └── ingress.yaml
│ └── argocd-app.yaml
├── src/
│ └── main.py # Hello-world with /health endpoint
├── tests/
│ └── test_main.py
├── Makefile
└── README.md
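The template's `src/main.py` is a hello-world with the `/health` endpoint the platform's probes and scorecards expect. A stdlib-only sketch (the real template might use FastAPI, as the Backstage example later suggests; this version avoids external dependencies):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def health() -> dict:
    """Liveness payload consumed by the Kubernetes health probe."""
    return {"status": "ok"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps(health()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def make_server(port: int = 8080) -> HTTPServer:
    # Call .serve_forever() on the returned server to run it.
    return HTTPServer(("0.0.0.0", port), Handler)
```

Because `/health` ships in the template, every service starts life passing the production-readiness checks described later.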
Self-Service Infrastructure¶
Infrastructure as Catalog Items¶
Instead of Terraform modules that developers need to understand, expose infrastructure as catalog items:
# platform-catalog.yaml
services:
  postgres:
    description: "Managed PostgreSQL database"
    sizes:
      small:  { cpu: 1, memory: 2Gi, storage: 20Gi }
      medium: { cpu: 2, memory: 4Gi, storage: 100Gi }
      large:  { cpu: 4, memory: 16Gi, storage: 500Gi }
    features:
      - automated-backups
      - point-in-time-recovery
      - read-replicas (medium, large only)
  redis:
    description: "Redis cache cluster"
    sizes:
      small:  { memory: 512Mi }
      medium: { memory: 2Gi }
      large:  { memory: 8Gi, replicas: 3 }
  s3-bucket:
    description: "Object storage bucket"
    options:
      versioning: true
      encryption: AES-256
      lifecycle_days: 90
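The platform API behind the catalog only has to validate a request (service name + size) and expand it into a concrete spec. A hedged sketch, with the catalog inlined as a dict mirroring `platform-catalog.yaml`:

```python
# Catalog mirrors platform-catalog.yaml above (subset, for illustration).
CATALOG = {
    "postgres": {"sizes": {
        "small":  {"cpu": 1, "memory": "2Gi", "storage": "20Gi"},
        "medium": {"cpu": 2, "memory": "4Gi", "storage": "100Gi"},
        "large":  {"cpu": 4, "memory": "16Gi", "storage": "500Gi"},
    }},
    "redis": {"sizes": {
        "small":  {"memory": "512Mi"},
        "medium": {"memory": "2Gi"},
        "large":  {"memory": "8Gi", "replicas": 3},
    }},
}

def resolve_request(service: str, size: str) -> dict:
    """Validate a self-service request against the catalog and return the
    concrete spec the provisioner would act on."""
    if service not in CATALOG:
        raise ValueError(f"unknown service {service!r}")
    sizes = CATALOG[service]["sizes"]
    if size not in sizes:
        raise ValueError(f"{service} has no size {size!r}; choose from {sorted(sizes)}")
    return {"service": service, "size": size, **sizes[size]}
```

The point of the size presets: developers pick t-shirt sizes, and the platform team controls (and can later retune) what each size actually means.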
Kubernetes Operators for Platform Services¶
Custom operators provision platform services via CRDs:
# Developer creates this:
apiVersion: platform.internal/v1
kind: Database
metadata:
  name: myapp-db
  namespace: myapp
spec:
  engine: postgres
  version: "15"
  size: medium
  backup:
    schedule: "0 2 * * *"
    retention: 30d

# The platform operator:
# 1. Provisions a PostgreSQL StatefulSet
# 2. Creates a Secret with credentials
# 3. Sets up automated backups
# 4. Configures monitoring & alerts
# 5. Registers in service catalog
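The heart of such an operator is a reconcile step that translates the custom resource into child manifests. A sketch of just that translation (real operators use a framework like kopf or controller-runtime and actually apply the manifests; only the StatefulSet and Secret from steps 1-2 are shown, and field values are illustrative):

```python
def reconcile_database(cr: dict) -> list[dict]:
    """Translate a Database custom resource (as a parsed dict) into the
    child manifests the operator would apply: StatefulSet + credentials Secret."""
    name = cr["metadata"]["name"]
    ns = cr["metadata"]["namespace"]
    spec = cr["spec"]
    statefulset = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "namespace": ns},
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "template": {"spec": {"containers": [{
                "name": spec["engine"],
                "image": f"{spec['engine']}:{spec['version']}",
            }]}},
        },
    }
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{name}-credentials", "namespace": ns},
        "stringData": {"username": name, "password": "<generated>"},
    }
    return [statefulset, secret]

# The CR from the example above, as the operator would receive it:
cr = {
    "metadata": {"name": "myapp-db", "namespace": "myapp"},
    "spec": {"engine": "postgres", "version": "15", "size": "medium"},
}
children = reconcile_database(cr)
```

Reconciliation runs repeatedly and idempotently: the operator keeps comparing desired state (the CR) with actual state and converging the two.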
Internal Developer Portal¶
What It Contains¶
| Section | Content |
|---|---|
| Service catalog | All running services, owners, health status |
| Documentation | API docs, runbooks, architecture diagrams |
| Templates | Golden path templates for new services |
| Self-service | Provisioning, DNS, certs, secrets |
| Scorecards | Production readiness, security compliance |
Backstage (Popular Open-Source Portal)¶
# catalog-info.yaml — registered in each service repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: grokdevops
  description: FastAPI service with Prometheus metrics
  annotations:
    github.com/project-slug: your-org/grokdevops
    prometheus.io/rule: 'rate(http_requests_total{service="grokdevops"}[5m])'
    argocd/app-name: grokdevops
  tags:
    - python
    - fastapi
spec:
  type: service
  lifecycle: production
  owner: platform-team
  system: core-infrastructure
  dependsOn:
    - resource:default/postgres
  providesApis:
    - grokdevops-api
CI/CD as a Platform Service¶
Standardized Pipelines¶
Instead of every team writing their own CI, provide reusable workflow templates:
# .github/workflows/ci.yml — in each service repo
name: CI
on: [push, pull_request]

jobs:
  build:
    uses: your-org/platform-workflows/.github/workflows/build-test-deploy.yml@v2
    with:
      language: python
      python-version: "3.11"
      helm-chart-path: deploy/helm
    secrets: inherit
The shared workflow handles:
- Lint, test, build
- Security scanning (Trivy, SAST)
- Container image build + push
- Helm chart validation
- Image tag promotion to GitOps repo
Environment Promotion¶
main branch push
→ Build + Test
→ Push image (sha-abc123)
→ Update dev overlay (auto-deploy)
→ Run smoke tests against dev
→ Create staging PR (manual merge)
→ Run integration tests against staging
→ Create prod PR (requires approval)
→ Deploy to prod (canary → full)
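The promotion steps above boil down to bumping the image tag in an environment overlay of the GitOps repo; Argo CD notices the change and syncs it. A minimal sketch of that values update (the values structure is an assumption; a real pipeline would commit this change or open the staging/prod PR):

```python
def promote_image(overlay_values: dict, new_tag: str) -> dict:
    """Return a copy of an environment overlay's Helm values with
    image.tag bumped to the freshly built tag (e.g. "sha-abc123")."""
    image = {**overlay_values.get("image", {}), "tag": new_tag}
    return {**overlay_values, "image": image}

dev_values = {
    "image": {"repository": "registry.internal/my-app", "tag": "sha-old111"},
    "replicas": 2,
}
promoted = promote_image(dev_values, "sha-abc123")
```

Because every environment is just a file in git, the same function serves dev auto-deploys and the PR-gated staging/prod promotions; only the review policy differs.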
Platform Metrics¶
What to measure to know if your platform is working:
| Metric | What It Tells You |
|---|---|
| Lead time for changes | How long from commit to production |
| Deployment frequency | How often teams can deploy |
| Time to onboard | How long for a new service to reach production |
| Self-service ratio | % of infra requests handled without a ticket |
| Mean time to recover | How quickly teams can fix production issues |
| Platform adoption | % of teams using golden paths vs custom |
| Developer satisfaction | NPS or survey scores from platform users |
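Lead time for changes, the first metric in the table, is straightforward to compute once you record commit and deploy timestamps. A sketch using the median to resist outliers (reporting the mean or a percentile instead is a reasonable design choice):

```python
from datetime import datetime, timedelta
from statistics import median

def lead_time_for_changes(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Median commit-to-production duration over (commit_time, deploy_time)
    pairs; timestamps would come from git history and the CD system."""
    return median(deployed - committed for committed, deployed in pairs)

t0 = datetime(2024, 1, 1, 9, 0)
samples = [
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=4)),
    (t0, t0 + timedelta(hours=30)),  # one slow outlier barely moves the median
]
print(lead_time_for_changes(samples))
```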
Production Readiness¶
Scorecards¶
Automated checks that verify a service meets production standards:
production-readiness:
  required:
    - health-endpoint: "Service exposes /health"
    - readiness-endpoint: "Service exposes /ready"
    - resource-limits: "CPU and memory limits set"
    - replicas: "At least 2 replicas in prod"
    - monitoring: "ServiceMonitor exists"
    - alerts: "At least 1 alert rule defined"
    - runbook: "Runbook linked in docs"
    - owner: "Team ownership declared"
  recommended:
    - pdb: "PodDisruptionBudget configured"
    - hpa: "Autoscaling configured"
    - network-policy: "Network policies in place"
    - security-scan: "No critical vulnerabilities"
    - slo: "SLO defined and tracked"
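A scorecard engine is just a set of named predicates run against facts about each service. A sketch covering a few of the required checks above (the service descriptor's field names are illustrative; a real engine, e.g. a Backstage scorecard plugin, would pull these facts from the cluster and CI):

```python
# Each check maps a scorecard name to a predicate over a service descriptor.
REQUIRED_CHECKS = {
    "health-endpoint": lambda s: "/health" in s.get("endpoints", []),
    "resource-limits": lambda s: bool(s.get("resources", {}).get("limits")),
    "replicas":        lambda s: s.get("replicas", 0) >= 2,
    "owner":           lambda s: bool(s.get("owner")),
}

def evaluate_scorecard(service: dict, checks: dict) -> dict:
    """Run each named check against a service descriptor; returns
    {check-name: passed} for display in the portal."""
    return {name: bool(check(service)) for name, check in checks.items()}

svc = {
    "endpoints": ["/health", "/ready"],
    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
    "replicas": 2,
    "owner": "team-payments",
}
print(evaluate_scorecard(svc, REQUIRED_CHECKS))
```

Keeping checks as data rather than hard-coded logic lets the platform team version them, and lets teams see exactly which gate they're failing and why.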
Common Pitfalls¶
Interview tip: When asked about platform engineering in interviews, the key insight is: "Treat your platform as a product, your developers as customers." This means user research (talk to devs), product roadmap (prioritize by pain), adoption metrics (track usage), and iterating based on feedback — not mandating from above.
Analogy: A good internal platform is like a well-designed airport. Travelers (developers) should be able to check in, board, and fly (ship code) without understanding air traffic control, fuel logistics, or runway maintenance. But when something goes wrong, the control tower (platform team) needs full visibility into every layer.
- Building a platform nobody asked for — Talk to your developers first. Solve their actual pain points, not what you think they need.
- Mandating without value — If the golden path is harder than the cowpath, developers will route around it.
- Platform as bottleneck — If every request still needs a platform team member, you've just renamed "ops" to "platform."
- Over-abstracting — Hide complexity, but don't hide it so deeply that debugging becomes impossible.
- Ignoring the escape hatch — Some teams have legitimate reasons to go off the golden path. Make that possible (but tracked).
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
- CI/CD Pipelines & Patterns (Topic Pack, L1)
Next Steps¶
- Multi-Tenancy Patterns (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — CI/CD, Kubernetes Core
- Mental Models (Core Concepts) (Topic Pack, L0) — CI/CD, Kubernetes Core
- Backstage & Developer Portals (Topic Pack, L2) — Platform Engineering
- CI Pipeline Documentation (Reference, L1) — CI/CD
- CI/CD Drills (Drill, L1) — CI/CD
- CI/CD Flashcards (CLI) (flashcard_deck, L1) — CI/CD
- CI/CD Pipelines & Patterns (Topic Pack, L1) — CI/CD
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core