Platform Engineering Patterns - Primer

Why This Matters

Platform engineering is the discipline of building and maintaining internal platforms that make developers self-sufficient. Instead of every team reinventing deployment pipelines, observability stacks, and infrastructure provisioning, a platform team builds standardized, self-service tooling. The result: developers ship faster, operations teams aren't bottlenecks, and infrastructure stays consistent.

This is where deep ops experience meets developer productivity. Platform engineers aren't writing application code — they're building the systems that make application code reliably deployable.

Timeline: Platform engineering as a discipline emerged around 2020-2022, though the practices are older. Gartner named it a Top 10 Strategic Technology Trend for 2023. The term crystallized the shift from "DevOps as a culture" to "DevOps as a product" — instead of asking every developer to be an ops expert, build a platform team that provides ops capabilities as self-service products. The Platform Engineering community (platformengineering.org) launched in 2022 and grew rapidly.

Core Concepts

What a Platform Is (and Isn't)

A platform is not a set of mandates or tickets. It's a product — with users (developers), features (self-service capabilities), and quality standards (reliability, UX).

WITHOUT a platform:
  Developer → files Jira ticket → Ops engineer provisions infra → 3-5 days later → Developer deploys

WITH a platform:
  Developer → runs CLI/clicks UI → infrastructure provisioned in minutes → deploys via GitOps

The Platform Stack

┌──────────────────────────────────────────────────┐
│              Developer Experience                 │
│   (CLI tools, web portal, docs, templates)       │
├──────────────────────────────────────────────────┤
│              Platform Services                    │
│   (CI/CD, secrets, DNS, certs, databases)        │
├──────────────────────────────────────────────────┤
│           Infrastructure Orchestration            │
│   (Kubernetes, Terraform, service mesh)          │
├──────────────────────────────────────────────────┤
│              Infrastructure                       │
│   (Cloud providers, bare metal, networking)      │
└──────────────────────────────────────────────────┘

Each layer abstracts the complexity below it. Developers interact with the top layer. Platform engineers build and maintain all layers.

Golden Paths

A golden path is an opinionated, well-supported way to accomplish a common task. It's not the only path — it's the recommended one.

Name origin: Spotify calls them "golden paths," Netflix calls them "paved roads" (a term popularized in its platform engineering talks around 2017), and Google calls them "blessed paths." Same concept, different names. The metaphor: instead of forcing developers onto a single road, pave one road so well that everyone chooses it voluntarily.

Example: Deploy a New Service

Golden Path:
1. Run `platform create service --name my-app --lang python`
   → Generates repo from template (Dockerfile, CI pipeline, Helm chart, tests)
2. Push code to main branch
   → CI builds, tests, pushes image, updates Helm values
3. ArgoCD detects change
   → Deploys to dev automatically, staging via PR, prod via approval
4. Observability auto-configured
   → Prometheus scraping, Grafana dashboards, Loki log aggregation, alerts

What the developer DOESN'T have to do:
- Write a Dockerfile
- Configure CI/CD
- Create Kubernetes manifests
- Set up monitoring
- Request DNS entries
- Provision certificates

Template Repositories

Golden paths start with templates:

service-template/
  ├── .github/workflows/
  │   └── ci.yml                 # Pre-configured CI pipeline
  ├── deploy/
  │   ├── Dockerfile             # Multi-stage, security-scanned
  │   ├── helm/
  │   │   ├── Chart.yaml
  │   │   ├── values.yaml
  │   │   └── templates/
  │   │       ├── deployment.yaml
  │   │       ├── service.yaml
  │   │       ├── servicemonitor.yaml  # Auto-configured Prometheus
  │   │       └── ingress.yaml
  │   └── argocd-app.yaml
  ├── src/
  │   └── main.py                # Hello-world with /health endpoint
  ├── tests/
  │   └── test_main.py
  ├── Makefile
  └── README.md
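
One way a scaffolding command such as `platform create service` could turn a template into a repo is plain placeholder substitution. The sketch below is hypothetical: the `render_template` helper, the `$service_name` and `$language` placeholders, and the in-memory file map are assumptions for illustration, not the behavior of any real CLI.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical template: relative path -> file body with $placeholders.
SERVICE_TEMPLATE = {
    "src/main.py": 'SERVICE = "$service_name"\n',
    "deploy/helm/Chart.yaml": "apiVersion: v2\nname: $service_name\nversion: 0.1.0\n",
    "README.md": "# $service_name\n\nGenerated from the $language service template.\n",
}


def render_template(dest: Path, service_name: str, language: str) -> list[Path]:
    """Write every template file under dest, filling in the placeholders."""
    written = []
    for rel_path, body in SERVICE_TEMPLATE.items():
        target = dest / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # substitute() raises KeyError on an unknown placeholder, so a typo in
        # the template fails loudly instead of producing a broken repo.
        rendered = Template(body).substitute(service_name=service_name, language=language)
        target.write_text(rendered)
        written.append(target)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in render_template(Path(tmp), "my-app", "python"):
            print(path.relative_to(tmp))
```

Keeping the template as ordinary files with placeholders means the platform team can test the generated repo in CI like any other service.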

Self-Service Infrastructure

Infrastructure as Catalog Items

Instead of Terraform modules that developers need to understand, expose infrastructure as catalog items:

# platform-catalog.yaml
services:
  postgres:
    description: "Managed PostgreSQL database"
    sizes:
      small:  { cpu: 1, memory: 2Gi, storage: 20Gi }
      medium: { cpu: 2, memory: 4Gi, storage: 100Gi }
      large:  { cpu: 4, memory: 16Gi, storage: 500Gi }
    features:
      - automated-backups
      - point-in-time-recovery
      - read-replicas (medium, large only)

  redis:
    description: "Redis cache cluster"
    sizes:
      small:  { memory: 512Mi }
      medium: { memory: 2Gi }
      large:  { memory: 8Gi, replicas: 3 }

  s3-bucket:
    description: "Object storage bucket"
    options:
      versioning: true
      encryption: AES-256
      lifecycle_days: 90
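
To make the catalog idea concrete, here is a sketch of how a platform API might resolve a developer's request against the catalog. The postgres entry is inlined as a dict for self-containment; `resolve_request` and the per-feature size restrictions are hypothetical names and structures, not part of any real tool.

```python
# Inlined copy of the postgres catalog entry; a real platform would load the YAML file.
CATALOG = {
    "postgres": {
        "sizes": {
            "small": {"cpu": 1, "memory": "2Gi", "storage": "20Gi"},
            "medium": {"cpu": 2, "memory": "4Gi", "storage": "100Gi"},
            "large": {"cpu": 4, "memory": "16Gi", "storage": "500Gi"},
        },
        # Feature -> sizes that may use it (None means every size).
        "features": {
            "automated-backups": None,
            "point-in-time-recovery": None,
            "read-replicas": {"medium", "large"},
        },
    }
}


def resolve_request(service: str, size: str, features: list[str]) -> dict:
    """Turn a catalog request into concrete resources, or raise ValueError."""
    entry = CATALOG.get(service)
    if entry is None:
        raise ValueError(f"unknown catalog item: {service}")
    if size not in entry["sizes"]:
        raise ValueError(f"{service} has no size {size!r}")
    for feature in features:
        if feature not in entry["features"]:
            raise ValueError(f"{service} has no feature {feature!r}")
        allowed = entry["features"][feature]
        if allowed is not None and size not in allowed:
            raise ValueError(f"{feature} is not available on size {size!r}")
    return {"service": service, **entry["sizes"][size], "features": sorted(features)}


if __name__ == "__main__":
    print(resolve_request("postgres", "medium", ["read-replicas"]))
    try:
        resolve_request("postgres", "small", ["read-replicas"])
    except ValueError as err:
        print(f"rejected: {err}")
```

The developer only ever names a service, a size, and features; the mapping to CPU, memory, and storage stays a platform-team decision.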

Kubernetes Operators for Platform Services

Custom operators provision platform services via CRDs:

# Developer creates this:
apiVersion: platform.internal/v1
kind: Database
metadata:
  name: myapp-db
  namespace: myapp
spec:
  engine: postgres
  version: "15"
  size: medium
  backup:
    schedule: "0 2 * * *"
    retention: 30d

# The platform operator:
# 1. Provisions a PostgreSQL StatefulSet
# 2. Creates a Secret with credentials
# 3. Sets up automated backups
# 4. Configures monitoring & alerts
# 5. Registers in service catalog
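
At its core, an operator like this runs a reconcile loop: compare what the `Database` resource declares with what exists, then create or remove the difference. The sketch below models only that decision logic in plain Python, with no Kubernetes client; `desired_children`, `reconcile`, and the child resource names are hypothetical.

```python
def desired_children(db: dict) -> set[tuple[str, str]]:
    """Return the (kind, name) pairs that should exist for a Database resource."""
    name = db["metadata"]["name"]
    children = {
        ("StatefulSet", name),
        ("Secret", f"{name}-credentials"),
        ("ServiceMonitor", name),
    }
    # Only schedule backups when the spec asks for them.
    if "backup" in db["spec"]:
        children.add(("CronJob", f"{name}-backup"))
    return children


def reconcile(db: dict, existing: set[tuple[str, str]]) -> dict:
    """Diff desired vs. existing children owned by this Database.

    The caller is expected to apply the returned plan.
    """
    desired = desired_children(db)
    return {
        "create": sorted(desired - existing),
        "delete": sorted(existing - desired),
    }


if __name__ == "__main__":
    database = {
        "metadata": {"name": "myapp-db", "namespace": "myapp"},
        "spec": {"engine": "postgres", "size": "medium",
                 "backup": {"schedule": "0 2 * * *", "retention": "30d"}},
    }
    # First pass: only the StatefulSet exists, so the rest gets created.
    print(reconcile(database, {("StatefulSet", "myapp-db")}))
    # With everything in place the plan is empty: reconciling is idempotent.
    print(reconcile(database, desired_children(database)))
```

Because the plan is derived from the declared spec every time, rerunning it after a partial failure is safe, which is the property that makes the operator pattern a good fit for self-service provisioning.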

Internal Developer Portal

What It Contains

Section          | Content
-----------------|---------------------------------------------
Service catalog  | All running services, owners, health status
Documentation    | API docs, runbooks, architecture diagrams
Templates        | Golden path templates for new services
Self-service     | Provisioning, DNS, certs, secrets
Scorecards       | Production readiness, security compliance

Each service registers itself in the portal with a Backstage-style descriptor:

# catalog-info.yaml: registered in each service repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: grokdevops
  description: FastAPI service with Prometheus metrics
  annotations:
    github.com/project-slug: your-org/grokdevops
    prometheus.io/rule: 'rate(http_requests_total{service="grokdevops"}[5m])'
    argocd/app-name: grokdevops
  tags:
    - python
    - fastapi
spec:
  type: service
  lifecycle: production
  owner: platform-team
  system: core-infrastructure
  dependsOn:
    - resource:default/postgres
  providesApis:
    - grokdevops-api
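
References such as `resource:default/postgres` follow Backstage's `[kind:][namespace/]name` entity reference form. A platform script that audits dependencies might split them as sketched below; `parse_entity_ref` and its default arguments are hypothetical and do not reproduce Backstage's own parser.

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(ref: str, default_kind: str = "component",
                     default_namespace: str = "default") -> EntityRef:
    """Parse '[kind:][namespace/]name' into its three parts."""
    kind, sep, rest = ref.partition(":")
    if not sep:
        # No 'kind:' prefix, so the whole string is '[namespace/]name'.
        kind, rest = default_kind, ref
    namespace, sep, name = rest.partition("/")
    if not sep:
        # No 'namespace/' prefix, so the remainder is just the name.
        namespace, name = default_namespace, rest
    if not (kind and namespace and name):
        raise ValueError(f"malformed entity reference: {ref!r}")
    return EntityRef(kind, namespace, name)


if __name__ == "__main__":
    print(parse_entity_ref("resource:default/postgres"))
    print(parse_entity_ref("grokdevops-api", default_kind="api"))
```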

CI/CD as a Platform Service

Standardized Pipelines

Instead of every team writing their own CI, provide reusable workflow templates:

# .github/workflows/ci.yml — in each service repo
name: CI
on: [push, pull_request]
jobs:
  build:
    uses: your-org/platform-workflows/.github/workflows/build-test-deploy.yml@v2
    with:
      language: python
      python-version: "3.11"
      helm-chart-path: deploy/helm
    secrets: inherit

The shared workflow handles:
- Lint, test, build
- Security scanning (Trivy, SAST)
- Container image build + push
- Helm chart validation
- Image tag promotion to GitOps repo

Environment Promotion

main branch push
  → Build + Test
  → Push image (sha-abc123)
  → Update dev overlay (auto-deploy)
  → Run smoke tests against dev
  → Create staging PR (manual merge)
  → Run integration tests against staging
  → Create prod PR (requires approval)
  → Deploy to prod (canary → full)
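
The flow above is essentially a sequence of gated stages. As a rough model, the sketch below walks the stages in order and stops at the first closed gate or failed test suite. The `Stage` type, the gate labels, and `promote` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    gate: str  # "auto", "manual-merge", or "approval"
    test_suite: Optional[str] = None  # tests run after deploying to this stage


# Mirrors the promotion flow; the gate labels are illustrative.
PIPELINE = [
    Stage("dev", "auto", "smoke"),
    Stage("staging", "manual-merge", "integration"),
    Stage("prod", "approval"),
]


def promote(approvals: set[str], test_results: dict[str, bool]) -> list[str]:
    """Return the stages an image reaches, in order.

    An image enters a stage only if that stage's gate is open, and moves on
    only if the stage's test suite passed.
    """
    reached = []
    for stage in PIPELINE:
        if stage.gate != "auto" and stage.name not in approvals:
            break  # waiting on a human: PR not merged or approval missing
        reached.append(stage.name)
        if stage.test_suite and not test_results.get(stage.test_suite, False):
            break  # deployed here, but tests failed, so stop promoting
    return reached


if __name__ == "__main__":
    print(promote(set(), {"smoke": True}))
    print(promote({"staging", "prod"}, {"smoke": True, "integration": True}))
    print(promote({"staging", "prod"}, {"smoke": False}))
```

The same image tag moves through every stage; only the gates and tests differ, which is what keeps dev, staging, and prod comparable.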

Platform Metrics

What to measure to know if your platform is working:

Metric                  | What It Tells You
------------------------|------------------------------------------------
Lead time for changes   | How long from commit to production
Deployment frequency    | How often teams can deploy
Time to onboard         | How long for a new service to reach production
Self-service ratio      | % of infra requests handled without a ticket
Mean time to recover    | How quickly teams can fix production issues
Platform adoption       | % of teams using golden paths vs custom
Developer satisfaction  | NPS or survey scores from platform users
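
Several of these metrics reduce to simple arithmetic once the raw events are collected. The sketch below computes three of them from made-up sample data; the record shapes and function names are assumptions, and a real platform would pull the events from its CI/CD and ticketing systems.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical sample data: (commit time, production deploy time) per change.
DEPLOYS = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 11, 0)),
    (datetime(2024, 3, 5, 14, 0), datetime(2024, 3, 5, 15, 0)),
    (datetime(2024, 3, 7, 10, 0), datetime(2024, 3, 7, 16, 0)),
]
# Hypothetical infra requests: True if fulfilled without a ticket.
REQUESTS_SELF_SERVED = [True, True, False, True]


def lead_time(deploys) -> timedelta:
    """Median commit-to-production time (the median resists outliers)."""
    return median(deployed - committed for committed, deployed in deploys)


def deployment_frequency(deploys, window_days: int) -> float:
    """Deployments per day over the reporting window."""
    return len(deploys) / window_days


def self_service_ratio(requests) -> float:
    """Share of infra requests handled without a ticket."""
    return sum(requests) / len(requests)


if __name__ == "__main__":
    print("lead time:", lead_time(DEPLOYS))
    print("deploys/day:", round(deployment_frequency(DEPLOYS, window_days=7), 2))
    print("self-service ratio:", self_service_ratio(REQUESTS_SELF_SERVED))
```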

Production Readiness

Scorecards

Automated checks that verify a service meets production standards:

production-readiness:
  required:
    - health-endpoint:     "Service exposes /health"
    - readiness-endpoint:  "Service exposes /ready"
    - resource-limits:     "CPU and memory limits set"
    - replicas:            "At least 2 replicas in prod"
    - monitoring:          "ServiceMonitor exists"
    - alerts:              "At least 1 alert rule defined"
    - runbook:             "Runbook linked in docs"
    - owner:               "Team ownership declared"

  recommended:
    - pdb:                 "PodDisruptionBudget configured"
    - hpa:                 "Autoscaling configured"
    - network-policy:      "Network policies in place"
    - security-scan:       "No critical vulnerabilities"
    - slo:                 "SLO defined and tracked"
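
A scorecard like this can be evaluated as a set of predicates over a description of the service. The sketch below implements a handful of the checks against a plain dict; the field names (`endpoints`, `limits`, `pdb`, `hpa`) and the `score` function are hypothetical, and a real implementation would read this data from the cluster and the service catalog.

```python
# Each check is a predicate over a service description (a plain dict here).
REQUIRED_CHECKS = {
    "health-endpoint": lambda svc: "/health" in svc.get("endpoints", []),
    "readiness-endpoint": lambda svc: "/ready" in svc.get("endpoints", []),
    "resource-limits": lambda svc: {"cpu", "memory"} <= set(svc.get("limits", {})),
    "replicas": lambda svc: svc.get("replicas", 0) >= 2,
    "owner": lambda svc: bool(svc.get("owner")),
}
RECOMMENDED_CHECKS = {
    "pdb": lambda svc: bool(svc.get("pdb")),
    "hpa": lambda svc: bool(svc.get("hpa")),
}


def score(service: dict) -> dict:
    """Run all checks; only failed required checks block production readiness."""
    failed_required = sorted(
        name for name, check in REQUIRED_CHECKS.items() if not check(service)
    )
    failed_recommended = sorted(
        name for name, check in RECOMMENDED_CHECKS.items() if not check(service)
    )
    return {
        "production_ready": not failed_required,
        "failed_required": failed_required,
        "failed_recommended": failed_recommended,
    }


if __name__ == "__main__":
    service = {
        "endpoints": ["/health"],
        "limits": {"cpu": "500m", "memory": "256Mi"},
        "replicas": 1,
        "owner": "platform-team",
        "hpa": True,
    }
    print(score(service))
```

Reporting the failed check names, rather than a single pass/fail, tells the owning team exactly what to fix.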

Common Pitfalls

Interview tip: When asked about platform engineering, lead with the key insight: "Treat your platform as a product and your developers as customers." In practice that means user research (talk to devs), a product roadmap (prioritize by pain), adoption metrics (track usage), and iteration based on feedback, rather than mandates from above.

Analogy: A good internal platform is like a well-designed airport. Travelers (developers) should be able to check in, board, and fly (ship code) without understanding air traffic control, fuel logistics, or runway maintenance. But when something goes wrong, the control tower (platform team) needs full visibility into every layer.

  1. Building a platform nobody asked for — Talk to your developers first. Solve their actual pain points, not what you think they need.
  2. Mandating without value — If the golden path is harder than the cowpath, developers will route around it.
  3. Platform as bottleneck — If every request still needs a platform team member, you've just renamed "ops" to "platform."
  4. Over-abstracting — Hide complexity, but don't hide it so deeply that debugging becomes impossible.
  5. Ignoring the escape hatch — Some teams have legitimate reasons to go off the golden path. Make that possible (but tracked).
