Platform Engineering Patterns - Street-Level Ops¶
Real-world patterns and gotchas from building and operating internal platforms.
Quick Diagnosis Commands¶
# Platform health — are all platform services running?
kubectl get pods -n platform-system
kubectl get pods -n argocd
kubectl get pods -n cert-manager
kubectl get pods -n monitoring
# Check template rendering works
helm template my-service ./service-template/deploy/helm \
  -f values-dev.yaml --debug
# Verify golden path CI works end-to-end
gh workflow run ci.yml --ref test-branch
gh run list --workflow=ci.yml --limit 5
# Check which teams are using platform templates
gh search repos --owner your-org --topic platform-managed --json name | jq '.[].name'
# ArgoCD app health across all services
argocd app list --output json | jq '.[] | {name, health: .status.health.status, sync: .status.sync.status}'
# Certificate expiry check
kubectl get certificates -A -o json | jq '.items[] | {name: .metadata.name, ns: .metadata.namespace, expiry: .status.notAfter, ready: .status.conditions[0].status}'
One-liner: A platform without self-service is just a ticket queue with extra infrastructure.
Gotcha: Template Versioning Without Upgrades¶
You ship template v1. Fifty teams adopt it. You ship v2 with security fixes. The fifty existing repos are still on v1. There's no upgrade mechanism. You now maintain two versions forever.
Fix: Build upgrade tooling from day one. Options:
- Renovate/Dependabot for shared workflow versions
- A platform upgrade CLI command that patches existing repos
- Pin templates to semver tags and alert teams on outdated versions
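Whichever option you pick, the core of the tooling is a version comparison across repos. A minimal Python sketch (the repo-to-tag mapping is hypothetical; in practice you'd pull it from the GitHub API, e.g. repos carrying the platform-managed topic):

```python
# Sketch: flag repos pinned to an outdated template version.
# The repo->tag mapping is hypothetical; a real job would query the GitHub API.

def parse_semver(tag: str) -> tuple[int, int, int]:
    """Turn 'v1.2.3' into (1, 2, 3) for ordering."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)

def outdated_repos(pinned: dict[str, str], latest: str) -> list[str]:
    """Repos whose pinned template tag is behind the latest release."""
    latest_v = parse_semver(latest)
    return sorted(repo for repo, tag in pinned.items()
                  if parse_semver(tag) < latest_v)

if __name__ == "__main__":
    pinned = {"payment-api": "v1.4.0", "search-api": "v2.1.0",
              "billing-api": "v1.0.2"}
    for repo in outdated_repos(pinned, latest="v2.1.0"):
        print(f"{repo}: behind template v2.1.0, open an upgrade PR")
```

Run it on a schedule and feed the output into upgrade PRs or team notifications.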
Gotcha: Shared CI Workflows That Can't Be Debugged¶
Your reusable workflow abstracts away the build. When it fails, the developer sees Build failed in build-test-deploy.yml@v2 with a stack trace that references the shared workflow, not their code. They can't reproduce it locally.
Fix: Make shared workflows transparent:
- Print resolved commands before executing
- Provide a make ci-local target that runs the same steps locally
- Include a troubleshooting section in the workflow's README
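The first two fixes can share one thin runner: print each resolved command, then execute it, and wire the same runner into both the shared workflow and the `make ci-local` target. A minimal Python sketch (the step list is illustrative):

```python
# Sketch: run CI steps while echoing each resolved command first,
# so a failing shared workflow shows exactly what it executed.
import subprocess
import sys

STEPS = [  # illustrative; real steps come from the shared workflow config
    ["echo", "linting"],
    ["echo", "running tests"],
]

def run_steps(steps: list[list[str]]) -> None:
    for cmd in steps:
        print("+ " + " ".join(cmd), flush=True)  # print before executing
        result = subprocess.run(cmd)
        if result.returncode != 0:
            sys.exit(result.returncode)  # fail fast, preserving the exit code

if __name__ == "__main__":
    run_steps(STEPS)
```

Because the `+ command` lines appear in both CI logs and local runs, a developer can copy-paste the exact failing command.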
Gotcha: Platform Team as Ticket Queue¶
You built self-service tooling, but the docs are thin. Developers don't know it exists. They file tickets. You resolve tickets manually. You're now the bottleneck you set out to eliminate.
Fix: Invest in documentation and onboarding as much as in the platform itself. Track self-service adoption rate. If < 80% of requests are self-served, your docs or UX need work.
Remember: The best platform metric is not uptime or deploy speed — it is ticket deflection rate. If developers can get from idea to production without filing a ticket, your platform is working. Track the ratio of self-served actions to support tickets weekly.
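That ratio is trivial to compute but easy to never instrument. A minimal sketch of the weekly metric (the event counts are illustrative; real numbers come from your CLI/portal audit log and ticket tracker):

```python
# Sketch: weekly ticket-deflection metric.
# self_served = platform actions completed without a ticket (CLI/portal events);
# tickets = support tickets filed for things the platform should self-serve.

def deflection_rate(self_served: int, tickets: int) -> float:
    """Fraction of requests handled without human intervention."""
    total = self_served + tickets
    return self_served / total if total else 1.0

if __name__ == "__main__":
    rate = deflection_rate(self_served=412, tickets=58)  # illustrative counts
    print(f"deflection rate: {rate:.0%}")
    if rate < 0.80:
        print("below 80% target: docs or UX need work")
```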
Gotcha: One Cluster, Many Tenants, No Isolation¶
All teams deploy to one Kubernetes cluster. No namespace quotas, no network policies, no resource limits enforced. One team's memory leak OOMs the node, taking down three other teams' services.
Fix: Enforce isolation at the platform level:
# Namespace template applied by platform controller
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
Pattern: Platform CLI¶
Build a CLI that wraps platform operations:
# Create a new service from template
platform create service \
  --name payment-api \
  --lang go \
  --team payments \
  --tier critical
# Provision infrastructure
platform provision postgres \
  --name payment-db \
  --size medium \
  --namespace payments
# Check production readiness
platform scorecard payment-api
# ✓ Health endpoint
# ✓ Resource limits
# ✓ Monitoring
# ✗ PDB not configured
# ✗ No runbook linked
# Score: 6/8 (75%) — Minimum: 80%
# Promote to production
platform promote payment-api --from staging --to prod
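Under the hood, the scorecard is just a pass-rate gate. A Python sketch (check names and the 80% threshold mirror the example output above; how each check is evaluated against the cluster is left out):

```python
# Sketch: production-readiness scorecard with a minimum pass rate.

MINIMUM = 0.80  # matches the 80% gate in the example output

def score(checks: dict[str, bool]) -> tuple[int, int, bool]:
    """Return (passed, total, meets_minimum) for a set of readiness checks."""
    passed = sum(checks.values())
    total = len(checks)
    return passed, total, passed / total >= MINIMUM

if __name__ == "__main__":
    checks = {  # illustrative results; real checks query the cluster
        "health-endpoint": True, "resource-limits": True, "monitoring": True,
        "structured-logging": True, "readiness-probe": True, "slo-defined": True,
        "pdb-configured": False, "runbook-linked": False,
    }
    passed, total, ok = score(checks)
    print(f"Score: {passed}/{total} ({passed / total:.0%})")
    print("PASS" if ok else "FAIL")
```

With 6 of 8 checks passing the gate fails, matching the 75% example above.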
Pattern: Namespace-per-Team with Guardrails¶
# Platform controller watches for Team CRDs and provisions namespaces
apiVersion: platform.internal/v1
kind: Team
metadata:
  name: payments
spec:
  members:
    - alice@company.com
    - bob@company.com
  tier: critical
  quotas:
    cpu: "16"
    memory: 32Gi
    pods: 100
  features:
    - postgres
    - redis
    - external-ingress
# Controller creates:
# 1. Namespace 'payments' with resource quotas
# 2. RBAC roles for team members
# 3. Network policies (deny by default, allow intra-namespace + platform services)
# 4. LimitRange (default resource requests/limits)
# 5. ServiceAccount with pull secrets
# 6. Monitoring namespace label for Prometheus
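The reconcile step is essentially a pure function from Team spec to manifests. A Python sketch of that mapping (resource names and labels follow the list above; the dict shapes are assumptions, and a real controller would be built on an operator framework such as kopf or client-go):

```python
# Sketch: derive namespace-scoped manifests from a Team spec,
# mirroring a subset of what the platform controller would apply.

def reconcile(team: dict) -> list[dict]:
    name = team["metadata"]["name"]
    spec = team["spec"]
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": name, "labels": {"monitoring": "enabled"}}},
        {"apiVersion": "v1", "kind": "ResourceQuota",
         "metadata": {"name": "team-quota", "namespace": name},
         "spec": {"hard": {"requests.cpu": spec["quotas"]["cpu"],
                           "requests.memory": spec["quotas"]["memory"],
                           "pods": str(spec["quotas"]["pods"])}}},
        {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
         "metadata": {"name": "default-deny", "namespace": name},
         "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}},
    ]

if __name__ == "__main__":
    team = {"metadata": {"name": "payments"},
            "spec": {"quotas": {"cpu": "16", "memory": "32Gi", "pods": 100}}}
    for manifest in reconcile(team):
        print(manifest["kind"], manifest["metadata"]["name"])
```

Keeping this mapping pure makes the controller trivially unit-testable: spec in, manifests out, no cluster required.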
Pattern: GitOps-Driven Platform Config¶
platform-config/
├── clusters/
│   ├── prod-us-east/
│   │   ├── kustomization.yaml
│   │   └── cluster-values.yaml
│   └── prod-eu-west/
│       ├── kustomization.yaml
│       └── cluster-values.yaml
├── teams/
│   ├── payments/
│   │   ├── namespace.yaml
│   │   ├── quotas.yaml
│   │   ├── rbac.yaml
│   │   └── databases.yaml
│   └── search/
│       ├── namespace.yaml
│       ├── quotas.yaml
│       └── rbac.yaml
└── platform-services/
    ├── cert-manager/
    ├── external-dns/
    ├── prometheus-stack/
    └── argocd/
Every change goes through a PR. ArgoCD applies it. Full audit trail.
Under the hood: ArgoCD polls the git repo every 3 minutes by default (configurable via timeout.reconciliation in the argocd-cm ConfigMap). For faster deployments, configure a GitHub webhook to trigger sync on push. For slower, safer rollouts, disable syncPolicy.automated for production apps so changes require a manual argocd app sync.
Pattern: Progressive Rollout of Platform Changes¶
When you update a platform component (e.g., new Ingress controller version):
1. Deploy to platform-dev cluster
   → Run platform integration tests
2. Deploy to one production namespace (canary team)
   → Monitor for 24 hours
3. Deploy to 10% of production namespaces
   → Monitor for 48 hours
4. Full rollout
   → Announce to all teams
# Label namespaces for progressive rollout
kubectl label namespace payments platform-rollout=canary
kubectl label namespace search platform-rollout=early-adopter
# ArgoCD ApplicationSet targets by label
# generators:
#   - clusters:
#       selector:
#         matchLabels:
#           platform-rollout: canary
Emergency: Platform Service Down¶
# 1. Check which platform components are unhealthy
for ns in platform-system argocd cert-manager monitoring ingress-nginx; do
  echo "=== ${ns} ==="
  kubectl get pods -n "${ns}" --no-headers | grep -v Running
done
# 2. Check if the issue is cascading
# (e.g., cert-manager down → certs expire → ingress fails → all services 503)
kubectl get certificates -A | grep -v True
# 3. Prioritize by blast radius
# Ingress down → all external traffic affected → fix first
# ArgoCD down → no deploys, but existing services keep running
# Cert-manager → no new certs, existing certs work until expiry
# 4. Check recent changes (what did we deploy?)
argocd app history platform-apps | head -10
kubectl get events -n platform-system --sort-by='.lastTimestamp' | tail -20
Gotcha: When cert-manager goes down, existing TLS certificates continue to work until they expire (typically 90 days for Let's Encrypt). This creates a false sense of safety — the real outage hits 60-90 days later when certs start expiring in a cascade. Monitor certmanager_certificate_expiration_timestamp_seconds in Prometheus and alert at 14 days remaining.
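The alert boils down to a days-until-expiry comparison. A Python sketch of the same check the Prometheus rule performs (certificate names and dates are illustrative):

```python
# Sketch: alert when a certificate is within 14 days of expiry,
# mirroring the Prometheus rule on the expiration timestamp metric.
from datetime import datetime, timedelta, timezone

ALERT_WINDOW = timedelta(days=14)

def expiring_soon(not_after: datetime, now: datetime) -> bool:
    """True when the cert expires within the alert window (or already has)."""
    return not_after - now <= ALERT_WINDOW

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    certs = {  # illustrative; real data comes from the cert-manager metric
        "payments-tls": now + timedelta(days=60),
        "search-tls": now + timedelta(days=9),
    }
    for name, not_after in certs.items():
        if expiring_soon(not_after, now):
            print(f"ALERT: {name} expires in {(not_after - now).days} days")
```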