Portal | Level: L0: Entry | Topics: K8s Ecosystem, Kubernetes Operators | Domain: Kubernetes
Kubernetes Ecosystem - Primer¶
Why This Matters¶
Kubernetes is a platform for building platforms. The core project provides container orchestration, but production clusters depend on a constellation of adjacent tools for packaging, networking, observability, security, and developer experience. Knowing what these tools are, when to reach for them, and how they fit together is what separates "I can deploy a pod" from "I can run production." This primer maps the ecosystem so you can navigate it without drowning.
Core Concepts¶
1. Package and Config Management¶
Helm is the dominant package manager — charts bundle K8s manifests with templating and versioning. See the dedicated Helm primer for depth.
Kustomize takes a different approach: overlay-based patching of plain YAML with no templating language. It ships built into kubectl (kubectl apply -k). Kustomize is simpler for small customizations; Helm is better when you need parameterized reuse across teams or environments.
```shell
# Kustomize: apply overlays without a template engine
kubectl apply -k overlays/production/
```

The conventional directory layout:

```
base/
  deployment.yaml
  service.yaml
  kustomization.yaml
overlays/
  production/
    kustomization.yaml      # patches, replicas, image tags
    increase-replicas.yaml  # strategic merge patch
```
When to use which: Helm for distributing charts to others or managing complex multi-resource apps. Kustomize for environment-specific overrides on top of plain manifests. Many teams use both — Helm to install third-party charts, Kustomize to patch them.
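For the layout above, the production overlay's kustomization.yaml might look like this (a minimal sketch; the image name and tag are illustrative):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- path: increase-replicas.yaml   # strategic merge patch against the base Deployment
images:
- name: myapp
  newTag: v1.4.2                 # pin the production image tag
```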
Jsonnet and CUE are data languages that generate YAML/JSON. Jsonnet is popular in the Prometheus/Grafana ecosystem (mixins, dashboards-as-code). CUE adds type constraints and validation. Both are niche compared to Helm/Kustomize but appear in large-scale operations.
2. GitOps and Continuous Delivery¶
GitOps treats git as the single source of truth for cluster state. A controller running in-cluster watches a git repo and reconciles the cluster to match.
ArgoCD is the most widely adopted GitOps tool. It provides a UI, RBAC, multi-cluster support, and application-of-applications patterns. It watches git repos (or Helm repos, or OCI artifacts) and syncs declared state to clusters.
Flux (Flux CD v2) is the CNCF alternative. It is more modular, with separate controllers for source, Kustomize, Helm, notifications, and image automation. It has no built-in UI; third-party frontends such as Capacitor fill the gap (Weave GitOps, the original frontend, is no longer maintained). Flux excels when you want fine-grained controller composition.
```shell
# ArgoCD: create an application pointing at a git repo
argocd app create myapp \
  --repo https://github.com/org/k8s-manifests.git \
  --path environments/production \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace production \
  --sync-policy automated --self-heal --auto-prune

# Flux: bootstrap and define a Kustomization
flux bootstrap github --owner=org --repository=fleet --path=clusters/production
flux create kustomization myapp \
  --source=GitRepository/fleet \
  --path=./apps/production \
  --prune=true \
  --interval=5m
```
Key difference: ArgoCD is application-centric (you define apps, it syncs them). Flux is source-centric (you define sources and reconcilers). ArgoCD has a richer UI; Flux has tighter Helm integration and image update automation built in.
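In a full GitOps workflow the ArgoCD application itself is declared as a resource and committed to git rather than created via the CLI. A sketch equivalent to the `argocd app create` command above (repo URL and paths are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    path: environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      selfHeal: true   # revert manual drift
      prune: true      # delete resources removed from git
```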
3. Service Mesh¶
A service mesh handles service-to-service communication: mTLS, traffic routing, retries, circuit breaking, and observability — without application code changes.
Istio is the most feature-rich mesh. It uses Envoy sidecars injected into every pod. Powerful but operationally heavy — the control plane (istiod) and per-pod sidecars add CPU, memory, and latency. Istio is the right choice when you need fine-grained traffic management (canary routing, fault injection, rate limiting) or strict mTLS policy.
Linkerd is lighter weight. It uses a Rust-based micro-proxy instead of Envoy. Simpler to operate, lower resource overhead, faster to install. Linkerd is the right choice when you mainly need mTLS and golden metrics (latency, throughput, success rate) without the full Istio feature set.
Cilium is primarily a CNI (networking) but has grown mesh capabilities via eBPF. It can provide mTLS and L7 policy without sidecars. Cilium is increasingly chosen as a "mesh-lite" option that avoids the sidecar overhead entirely.
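As a taste of the fine-grained traffic management a mesh enables, an Istio VirtualService can split traffic between a stable and a canary subset (a sketch; the service and subset names are illustrative, and a matching DestinationRule defining the subsets is assumed):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
  namespace: production
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 90      # 90% of traffic to the stable version
    - destination:
        host: myapp
        subset: canary
      weight: 10      # 10% to the canary
```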
See the dedicated Service Mesh primer for operational depth.
4. Networking and Ingress¶
Kubernetes networking has multiple layers, each with ecosystem choices:
CNI (Container Network Interface) plugins provide pod-to-pod networking:
- Calico — BGP-based, strong network policy support, widely deployed
- Cilium — eBPF-based, fastest dataplane, rich L7 policy, growing fast
- Flannel — simple VXLAN overlay, minimal features, good for dev/learning clusters
- Weave Net — mesh overlay, easy setup, less common in production now
Ingress controllers route external traffic to services:
- Ingress-NGINX — the default choice, mature, well-documented
- Traefik — auto-discovery, Let's Encrypt integration, middleware chains
- HAProxy Ingress — high-performance, config-driven
- Contour — Envoy-based, a CNCF project (originally from Heptio/VMware)
Gateway API is the successor to the Ingress resource. It provides richer routing (HTTP routing, traffic splitting, header matching) with a role-oriented model (infrastructure provider, cluster operator, application developer). Most ingress controllers and meshes are adding Gateway API support. New projects should prefer Gateway API over Ingress resources.
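A minimal Gateway API HTTPRoute, to contrast with the older Ingress resource (a sketch; the gateway, hostname, and service names are illustrative, and a Gateway resource named `shared-gateway` is assumed to exist):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp
  namespace: production
spec:
  parentRefs:
  - name: shared-gateway   # the Gateway this route attaches to
  hostnames:
  - myapp.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: myapp          # backing Service
      port: 80
```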
CoreDNS is the default cluster DNS. It resolves service names to ClusterIPs. You rarely configure it directly, but understanding it matters when debugging DNS resolution failures (the most common "networking" issue that isn't actually networking).
5. Observability¶
The observability stack is one of the most developed parts of the K8s ecosystem.
Prometheus is the standard for metrics. It scrapes /metrics endpoints, stores time-series data, and evaluates alerting rules. Nearly every K8s component and operator exposes Prometheus metrics. The kube-prometheus-stack Helm chart bundles Prometheus, Alertmanager, Grafana, and node-exporter in one install.
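With the Prometheus Operator (included in kube-prometheus-stack), scrape targets are declared as ServiceMonitor resources instead of static scrape config. A sketch, assuming a Service labeled `app: myapp` with a named `metrics` port:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp       # select the Service to scrape
  endpoints:
  - port: metrics      # named port on the Service
    path: /metrics
    interval: 30s
```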
Grafana is the visualization layer. Pre-built dashboards exist for virtually every K8s component. Grafana also serves as the frontend for Loki (logs) and Tempo (traces).
Loki is Grafana's log aggregation system — like Prometheus but for logs. It indexes metadata (labels) rather than full-text, making it cheaper to run than Elasticsearch. Paired with Promtail or Grafana Alloy (formerly Grafana Agent) as the log shipper.
OpenTelemetry (OTel) is the vendor-neutral standard for traces, metrics, and logs. The OpenTelemetry Collector receives, processes, and exports telemetry data. OTel is replacing vendor-specific instrumentation SDKs. The OTel Operator can auto-instrument workloads.
Jaeger provides distributed tracing — tracking requests across microservices. It can consume OTel trace data. Useful for latency investigation and dependency mapping.
See the dedicated Observability primer and OpenTelemetry primer for depth.
6. Secrets and Policy¶
External Secrets Operator (ESO) syncs secrets from external stores (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Azure Key Vault) into Kubernetes Secrets. This is the modern pattern — secrets live in a purpose-built store, ESO keeps K8s in sync.
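An ExternalSecret that mirrors a value from AWS Secrets Manager into a Kubernetes Secret might look like this (a sketch; the store name and remote key are illustrative, and a ClusterSecretStore named `aws-secrets-manager` is assumed to be configured):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h            # re-sync from the external store hourly
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials         # the K8s Secret ESO creates and maintains
  data:
  - secretKey: password          # key inside the K8s Secret
    remoteRef:
      key: prod/db               # path in AWS Secrets Manager
      property: password
```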
Sealed Secrets (Bitnami) lets you encrypt secrets and commit them to git. A controller in the cluster decrypts them. Good for GitOps workflows where you want secrets in the repo but encrypted.
HashiCorp Vault is the heavyweight secrets management platform. The Vault Agent Injector or Vault CSI Provider integrates it with K8s. Vault provides dynamic secrets, PKI, encryption-as-a-service, and lease management. Operationally complex but powerful for large organizations.
OPA/Gatekeeper enforces policies on K8s resources via admission webhooks. You write policies in Rego (OPA's language) that gate what can be created — e.g., no privileged containers, required labels, image registry allowlists. Gatekeeper is the K8s-native wrapper around OPA.
Kyverno is a K8s-native policy engine that uses YAML instead of Rego. Policies are Kubernetes resources. Kyverno can validate, mutate, and generate resources. Lower learning curve than OPA/Gatekeeper, increasingly popular.
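For example, a Kyverno policy requiring a `team` label on every Deployment could be sketched as:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # block non-compliant resources (use Audit to only report)
  rules:
  - name: check-team-label
    match:
      any:
      - resources:
          kinds:
          - Deployment
    validate:
      message: "The label 'team' is required."
      pattern:
        metadata:
          labels:
            team: "?*"               # any non-empty value
```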
See the dedicated Secrets Management primer and Policy Engines primer.
7. Storage¶
Kubernetes storage is plugin-based via the Container Storage Interface (CSI).
CSI drivers are provided by cloud vendors (EBS CSI, GCE PD CSI, Azure Disk CSI) and storage vendors. They implement dynamic provisioning, snapshots, and resize. Every production cluster needs at least one CSI driver configured.
Longhorn (Rancher/SUSE) provides distributed block storage built on top of local disks. Good for bare-metal and edge clusters without cloud-provider storage.
Rook deploys and manages Ceph on Kubernetes. Ceph provides block, file, and object storage. Rook is operationally heavy but provides a full storage platform for on-prem clusters.
Key concepts: StorageClasses define provisioner + parameters. PersistentVolumeClaims (PVCs) request storage. PersistentVolumes (PVs) are the actual backing storage. Dynamic provisioning means the CSI driver creates PVs automatically when PVCs are created.
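These pieces fit together like this (a sketch using the AWS EBS CSI driver; the class name and parameters are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com           # the CSI driver that provisions PVs
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer  # provision in the zone where the pod lands
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: fast-ssd           # triggers dynamic provisioning via the class above
  resources:
    requests:
      storage: 20Gi
```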
See the dedicated K8s Storage primer.
8. Developer and Local Tools¶
k3s is a lightweight K8s distribution from Rancher. Single binary, low resource usage, includes Traefik and local-path-provisioner. Ideal for edge, IoT, CI, and homelab clusters. Production-grade despite the small footprint.
kind (Kubernetes in Docker) runs K8s clusters using Docker containers as nodes. Fast to create and destroy. Primary use: CI/CD pipelines and local testing.
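kind clusters are configured declaratively; a multi-node cluster for testing might be defined as (a minimal sketch):

```yaml
# kind-config.yaml — create with: kind create cluster --config kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```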
minikube runs a single-node cluster locally (VM or container). More features than kind (addons, dashboard, LoadBalancer emulation) but heavier.
Tilt and Skaffold are dev-loop tools. They watch source code, build images, and deploy to a local cluster automatically. Tilt has a UI and is more opinionated; Skaffold is a CLI tool with pipeline definitions. Both eliminate the manual build-push-deploy cycle during development.
Telepresence connects your local machine to a remote cluster's network. You can run a service locally while it receives traffic from the cluster. Useful for debugging services that depend on cluster resources.
9. Operators and CRDs¶
Custom Resource Definitions (CRDs) extend the Kubernetes API with new resource types. An Operator is a controller that watches CRDs and reconciles the desired state — essentially encoding operational knowledge into software.
Operator SDK (from Red Hat) and kubebuilder (from SIG API Machinery) are the two main frameworks for building operators. Both generate scaffolding, handle boilerplate, and provide testing utilities. Kubebuilder is lower-level; Operator SDK wraps it with additional tooling (Ansible/Helm-based operators, OLM integration).
Operators are everywhere in the ecosystem: the Prometheus Operator manages Prometheus instances via ServiceMonitor CRDs, cert-manager manages TLS certificates via Certificate CRDs, and database operators (CloudNativePG, Zalando Postgres Operator) manage database clusters.
See the K8s Operators drills for hands-on practice.
10. Multi-Cluster and Platform Engineering¶
Crossplane turns your Kubernetes cluster into a universal control plane. You define cloud infrastructure (RDS databases, S3 buckets, VPCs) as Kubernetes resources, and Crossplane provisions them via cloud provider APIs. It bridges the gap between Terraform (imperative, CLI-driven) and Kubernetes (declarative, controller-driven).
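With a Crossplane AWS provider installed, cloud infrastructure becomes an ordinary Kubernetes resource. A sketch using the Upbound AWS provider's API group (the exact group and version vary by provider and release):

```yaml
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: my-app-assets
spec:
  forProvider:
    region: us-east-1   # Crossplane calls the AWS API to create the bucket here
```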
Cluster API (CAPI) manages the lifecycle of Kubernetes clusters themselves — provisioning, upgrading, and scaling clusters using Kubernetes resources. "Kubernetes managing Kubernetes."
vcluster creates virtual clusters inside a host cluster. Each vcluster has its own API server and control plane but shares the underlying worker nodes. Useful for multi-tenancy, CI environments, and dev sandboxes without the cost of separate clusters.
See the dedicated Platform Engineering primer.
11. CLI Tools¶
Beyond kubectl, several CLI tools improve the K8s operator experience:
| Tool | Purpose |
|---|---|
| k9s | Terminal UI for cluster management — navigate resources, view logs, exec into pods |
| kubectx / kubens | Fast context and namespace switching |
| stern | Multi-pod log tailing with color-coded output |
| kubectl-neat | Clean up kubectl get -o yaml output (remove managed fields, status) |
| kubecolor | Colorized kubectl output |
| helm-diff | Preview Helm upgrade changes before applying |
| kubectl tree | Visualize resource ownership hierarchy |
```shell
# k9s: terminal UI
k9s -n production

# kubectx: switch cluster context
kubectx staging

# kubens: switch default namespace
kubens kube-system

# stern: tail logs from multiple pods
stern myapp -n production --since 15m

# kubectl neat: clean output for review
kubectl get deploy myapp -o yaml | kubectl neat
```
12. Certificate Management¶
cert-manager is the de facto standard for automating TLS certificate lifecycle in Kubernetes. It integrates with Let's Encrypt, Vault, Venafi, and other CAs. Define Certificate and Issuer resources, and cert-manager handles issuance, renewal, and secret creation.
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          class: nginx
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - myapp.example.com
  - api.example.com
```
How These Pieces Fit Together¶
A typical production stack might look like:
| Layer | Tool choices |
|---|---|
| Cluster provision | Cluster API / EKS / GKE / k3s |
| CNI | Cilium or Calico |
| Ingress | Ingress-NGINX + Gateway API |
| Certificates | cert-manager + Let's Encrypt |
| Package mgmt | Helm + Kustomize overlays |
| GitOps | ArgoCD or Flux |
| Secrets | External Secrets Operator + Vault |
| Policy | Kyverno or OPA/Gatekeeper |
| Observability | Prometheus + Grafana + Loki + OTel |
| Service mesh | Istio or Linkerd (if needed) |
| Storage | Cloud CSI driver + Longhorn (bare metal) |
| Dev experience | Tilt or Skaffold + k9s + stern |
You do not need all of these on day one. Start with: Helm, a CNI, an ingress controller, cert-manager, and Prometheus. Add GitOps, policy, and mesh as the cluster and team grow.
Common Gotchas¶
CRD version skew. Operators install CRDs, and upgrading an operator may change CRD versions. If multiple tools depend on the same CRDs (e.g., the Prometheus CRDs used by both kube-prometheus-stack and the VictoriaMetrics operator), version conflicts can break things. Pin CRD versions and coordinate upgrades.
Resource overhead. Each ecosystem tool adds CPU, memory, and API server load. A "full stack" cluster with mesh, GitOps, policy engine, and observability can consume 4-8 GB of memory in system workloads before you run a single application pod. Size your nodes accordingly.
YAML sprawl. The ecosystem loves YAML. Without structure (Helm, Kustomize, directory conventions), manifests multiply uncontrollably. Establish a repo layout early and enforce it.
Operator conflicts. Two operators managing the same resource type will fight. Only one controller should own each resource kind in a cluster.
The CNCF landscape is not a shopping list. The CNCF Cloud Native Landscape lists 1000+ projects. Most are niche, immature, or abandoned. Stick to graduated and incubating projects unless you have a specific need and the team to support it.
Deep Dive: Kubernetes Operators & CRDs¶
Why Operators Matter¶
Operators encode operational knowledge into software. Instead of a runbook that says "when the database needs scaling, do X, Y, Z," an operator watches the cluster and does X, Y, Z automatically. Understanding operators is essential for running stateful workloads and for extending Kubernetes with domain-specific automation.
Custom Resource Definitions (CRDs)¶
A CRD extends the Kubernetes API with your own resource types. After creating a CRD, you can kubectl get your custom resources just like pods or services.
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: [postgres, mysql]
              version:
                type: string
              replicas:
                type: integer
                minimum: 1
              storage:
                type: string
            required: [engine, version]
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
    - db
```
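With the CRD installed, an instance is created like any other resource (values are illustrative but conform to the schema above):

```yaml
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres   # must be one of the enum values
  version: "16"
  replicas: 3
  storage: 50Gi
```

Thanks to the `db` shortName, `kubectl get db -n production` lists it alongside built-in resources.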
The Reconciliation Loop¶
1. Watch: Controller watches for changes to CRs
2. Compare: Desired state (CR spec) vs actual state (cluster)
3. Act: Create/update/delete resources to match desired state
4. Status: Update CR status with current state
5. Repeat
Operator Maturity Model¶
| Level | Capability | Example |
|---|---|---|
| 1 | Basic install | Helm chart wrapper |
| 2 | Seamless upgrades | Rolling updates, version migration |
| 3 | Full lifecycle | Backup, restore, scaling |
| 4 | Deep insights | Metrics, alerts, log analysis |
| 5 | Auto-pilot | Auto-scaling, auto-tuning, self-healing |
Building Operators — Frameworks¶
| Framework | Language | Complexity | Best for |
|---|---|---|---|
| Kubebuilder | Go | Medium | Production operators |
| Operator SDK | Go/Ansible/Helm | Medium | Red Hat ecosystem |
| Kopf | Python | Low | Quick prototypes, Python shops |
| Metacontroller | Any (webhooks) | Low | Simple use cases |
| Shell-operator | Bash/Python | Low | Quick automation scripts |
Controller Logic Pattern (Go)¶
```go
import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	examplev1 "example.com/api/v1" // generated types for the Database CRD
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the custom resource
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Check if the backing StatefulSet exists
	var sts appsv1.StatefulSet
	err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, &sts)
	if errors.IsNotFound(err) {
		// 3. Create it
		sts = r.buildStatefulSet(&db)
		if err := r.Create(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
		logger.Info("Created StatefulSet", "name", db.Name)
	} else if err != nil {
		return ctrl.Result{}, err
	}

	// 4. Update status via the status subresource
	db.Status.ReadyReplicas = sts.Status.ReadyReplicas
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```
Well-Known Operators¶
| Operator | What it manages | Maturity |
|---|---|---|
| Prometheus Operator | Prometheus, Alertmanager, ServiceMonitor | Level 5 |
| cert-manager | TLS certificates, issuers | Level 4 |
| Strimzi | Apache Kafka clusters | Level 4 |
| CloudNativePG | PostgreSQL clusters | Level 4 |
| Zalando Postgres Operator | PostgreSQL with patroni | Level 4 |
| Rook/Ceph | Distributed storage | Level 5 |
Owner References & Garbage Collection¶
When an operator creates child resources, it should set owner references so Kubernetes garbage collects them when the CR is deleted.
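With controller-runtime, `ctrl.SetControllerReference` sets this before the child is created. The resulting metadata on a child StatefulSet looks roughly like (a sketch; the names and UID are illustrative):

```yaml
metadata:
  name: orders-db
  ownerReferences:
  - apiVersion: example.com/v1
    kind: Database
    name: orders-db
    uid: 9f6c0a1e-0000-0000-0000-000000000000  # UID of the owning Database CR
    controller: true            # exactly one controller owner per object
    blockOwnerDeletion: true    # owner cannot finish deleting until this child is gone
```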
Finalizers¶
Finalizers let your operator run cleanup logic before a CR is deleted. The operator detects the deletion timestamp, runs cleanup (e.g., take final backup), removes the finalizer, and Kubernetes completes the deletion.
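The finalizer itself is just a string on the CR's metadata; the operator adds it on first reconcile and removes it once cleanup is done (the finalizer name here is illustrative):

```yaml
metadata:
  name: orders-db
  finalizers:
  - databases.example.com/final-backup   # deletion blocks until the operator removes this
```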
Common Operator Pitfalls¶
- No owner references — Child resources become orphaned when CR is deleted
- Missing RBAC — Operator needs ClusterRole with permissions for all resources it manages
- No idempotent reconciliation — Reconcile must handle being called multiple times for the same state
- Unbounded requeueing — Always set a maximum requeue delay to avoid hammering the API
- CRD versioning — Plan for schema evolution. Use conversion webhooks for breaking changes
- Status subresource — Always use the status subresource for status updates (not spec)
Wiki Navigation¶
Next Steps¶
- Kubernetes Operators Drills (Drill, L3)
- Skillcheck: Kubernetes Operators (Assessment, L3)
Related Content¶
- Kubernetes Operators Drills (Drill, L3) — K8s Ecosystem
- Kubernetes Operators Flashcards (CLI) (flashcard_deck, L1) — Kubernetes Operators
- Skillcheck: Kubernetes Operators (Assessment, L3) — K8s Ecosystem
Pages that link here¶
- Anti-Primer: Kubernetes Ecosystem
- Certification Prep: CKAD — Certified Kubernetes Application Developer
- Comparison: Kubernetes Templating
- Comparison: Local Dev for Kubernetes
- Comparison: Local Kubernetes Clusters
- Helm - Primer
- K8S Ecosystem
- Kubernetes Operators & CRDs - Skill Check
- Kubernetes Operators & CRDs Drills
- Kubernetes Storage - Primer
- Level 6: Advanced Platform Engineering
- Master Curriculum: 40 Weeks
- OpenTelemetry - Primer
- Platform Engineering Patterns - Primer
- Policy Engines (OPA/Kyverno) - Primer