Portal | Level: L0: Entry | Topics: K8s Ecosystem, Kubernetes Operators | Domain: Kubernetes
Kubernetes Ecosystem - Primer¶
Why This Matters¶
Kubernetes is a platform for building platforms. The core project provides container orchestration, but production clusters depend on a constellation of adjacent tools for packaging, networking, observability, security, and developer experience. Knowing what these tools are, when to reach for them, and how they fit together is what separates "I can deploy a pod" from "I can run production." This primer maps the ecosystem so you can navigate it without drowning.
Core Concepts¶
1. Package and Config Management¶
Helm is the dominant package manager — charts bundle K8s manifests with templating and versioning. See the dedicated Helm primer for depth.
Kustomize takes a different approach: overlay-based patching of plain YAML with no templating language. It ships built into kubectl (kubectl apply -k). Kustomize is simpler for small customizations; Helm is better when you need parameterized reuse across teams or environments.
```shell
# Kustomize: apply overlays without a template engine
kubectl apply -k overlays/production/
```

The conventional directory layout:

```
base/
  deployment.yaml
  service.yaml
  kustomization.yaml
overlays/
  production/
    kustomization.yaml      # patches, replicas, image tags
    increase-replicas.yaml  # strategic merge patch
```
When to use which: Helm for distributing charts to others or managing complex multi-resource apps. Kustomize for environment-specific overrides on top of plain manifests. Many teams use both — Helm to install third-party charts, Kustomize to patch them.
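For the layout above, the production overlay's kustomization.yaml might look like this (a minimal sketch; the image name and tag are illustrative):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- path: increase-replicas.yaml   # strategic merge patch against the base Deployment
images:
- name: myapp
  newTag: v1.4.2                 # pin the production image tag
```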
Jsonnet and CUE are data languages that generate YAML/JSON. Jsonnet is popular in the Prometheus/Grafana ecosystem (mixins, dashboards-as-code). CUE adds type constraints and validation. Both are niche compared to Helm/Kustomize but appear in large-scale operations.
2. GitOps and Continuous Delivery¶
GitOps treats git as the single source of truth for cluster state. A controller running in-cluster watches a git repo and reconciles the cluster to match.
ArgoCD is the most widely adopted GitOps tool. It provides a UI, RBAC, multi-cluster support, and application-of-applications patterns. It watches git repos (or Helm repos, or OCI artifacts) and syncs declared state to clusters.
Flux (Flux CD v2) is the CNCF alternative. It is more modular, with separate controllers for source, Kustomize, Helm, notifications, and image automation. It has no built-in UI; third-party frontends such as Capacitor fill the gap (Weave GitOps, the original frontend, is no longer maintained). Flux excels when you want fine-grained controller composition.
```shell
# ArgoCD: create an application pointing at a git repo
argocd app create myapp \
  --repo https://github.com/org/k8s-manifests.git \
  --path environments/production \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace production \
  --sync-policy automated --self-heal --auto-prune

# Flux: bootstrap and define a Kustomization
flux bootstrap github --owner=org --repository=fleet --path=clusters/production
flux create kustomization myapp \
  --source=GitRepository/fleet \
  --path=./apps/production \
  --prune=true \
  --interval=5m
```
Key difference: ArgoCD is application-centric (you define apps, it syncs them). Flux is source-centric (you define sources and reconcilers). ArgoCD has a richer UI; Flux has tighter Helm integration and image update automation built in.
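In a full GitOps workflow the ArgoCD application itself is declared as a resource and committed to git rather than created via the CLI. A sketch equivalent to the `argocd app create` command above (repo URL and paths are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    path: environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      selfHeal: true   # revert manual drift
      prune: true      # delete resources removed from git
```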
3. Service Mesh¶
A service mesh handles service-to-service communication: mTLS, traffic routing, retries, circuit breaking, and observability — without application code changes.
Istio is the most feature-rich mesh. It uses Envoy sidecars injected into every pod. Powerful but operationally heavy — the control plane (istiod) and per-pod sidecars add CPU, memory, and latency. Istio is the right choice when you need fine-grained traffic management (canary routing, fault injection, rate limiting) or strict mTLS policy.
Linkerd is lighter weight. It uses a Rust-based micro-proxy instead of Envoy. Simpler to operate, lower resource overhead, faster to install. Linkerd is the right choice when you mainly need mTLS and golden metrics (latency, throughput, success rate) without the full Istio feature set.
Cilium is primarily a CNI (networking) but has grown mesh capabilities via eBPF. It can provide mTLS and L7 policy without sidecars. Cilium is increasingly chosen as a "mesh-lite" option that avoids the sidecar overhead entirely.
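As a taste of the fine-grained traffic management a mesh enables, an Istio VirtualService can split traffic between a stable and a canary subset (a sketch; the service and subset names are illustrative, and a matching DestinationRule defining the subsets is assumed):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
  namespace: production
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 90      # 90% of traffic to the stable version
    - destination:
        host: myapp
        subset: canary
      weight: 10      # 10% to the canary
```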
See the dedicated Service Mesh primer for operational depth.
4. Networking and Ingress¶
Kubernetes networking has multiple layers, each with ecosystem choices:
CNI (Container Network Interface) plugins provide pod-to-pod networking:
- Calico — BGP-based, strong network policy support, widely deployed
- Cilium — eBPF-based, fastest dataplane, rich L7 policy, growing fast
- Flannel — simple VXLAN overlay, minimal features, good for dev/learning clusters
- Weave Net — mesh overlay, easy setup, less common in production now
Ingress controllers route external traffic to services:
- Ingress-NGINX — the default choice, mature, well-documented
- Traefik — auto-discovery, Let's Encrypt integration, middleware chains
- HAProxy Ingress — high-performance, config-driven
- Contour — Envoy-based, a CNCF project (originally from Heptio/VMware)
Gateway API is the successor to the Ingress resource. It provides richer routing (HTTP routing, traffic splitting, header matching) with a role-oriented model (infrastructure provider, cluster operator, application developer). Most ingress controllers and meshes are adding Gateway API support. New projects should prefer Gateway API over Ingress resources.
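A minimal Gateway API HTTPRoute, to contrast with the older Ingress resource (a sketch; the gateway, hostname, and service names are illustrative, and a Gateway resource named `shared-gateway` is assumed to exist):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp
  namespace: production
spec:
  parentRefs:
  - name: shared-gateway   # the Gateway this route attaches to
  hostnames:
  - myapp.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: myapp          # backing Service
      port: 80
```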
CoreDNS is the default cluster DNS. It resolves service names to ClusterIPs. You rarely configure it directly, but understanding it matters when debugging DNS resolution failures (the most common "networking" issue that isn't actually networking).
5. Observability¶
The observability stack is one of the most developed parts of the K8s ecosystem.
Prometheus is the standard for metrics. It scrapes /metrics endpoints, stores time-series data, and evaluates alerting rules. Nearly every K8s component and operator exposes Prometheus metrics. The kube-prometheus-stack Helm chart bundles Prometheus, Alertmanager, Grafana, and node-exporter in one install.
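With the Prometheus Operator (included in kube-prometheus-stack), scrape targets are declared as ServiceMonitor resources instead of static scrape config. A sketch, assuming a Service labeled `app: myapp` with a named `metrics` port:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp       # select the Service to scrape
  endpoints:
  - port: metrics      # named port on the Service
    path: /metrics
    interval: 30s
```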
Grafana is the visualization layer. Pre-built dashboards exist for virtually every K8s component. Grafana also serves as the frontend for Loki (logs) and Tempo (traces).
Loki is Grafana's log aggregation system — like Prometheus but for logs. It indexes metadata (labels) rather than full-text, making it cheaper to run than Elasticsearch. Paired with Promtail or Grafana Alloy (formerly Grafana Agent) as the log shipper.
OpenTelemetry (OTel) is the vendor-neutral standard for traces, metrics, and logs. The OpenTelemetry Collector receives, processes, and exports telemetry data. OTel is replacing vendor-specific instrumentation SDKs. The OTel Operator can auto-instrument workloads.
Jaeger provides distributed tracing — tracking requests across microservices. It can consume OTel trace data. Useful for latency investigation and dependency mapping.
See the dedicated Observability primer and OpenTelemetry primer for depth.
6. Secrets and Policy¶
External Secrets Operator (ESO) syncs secrets from external stores (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Azure Key Vault) into Kubernetes Secrets. This is the modern pattern — secrets live in a purpose-built store, ESO keeps K8s in sync.
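An ExternalSecret that mirrors a value from AWS Secrets Manager into a Kubernetes Secret might look like this (a sketch; the store name and remote key are illustrative, and a ClusterSecretStore named `aws-secrets-manager` is assumed to be configured):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h            # re-sync from the external store hourly
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials         # the K8s Secret ESO creates and maintains
  data:
  - secretKey: password          # key inside the K8s Secret
    remoteRef:
      key: prod/db               # path in AWS Secrets Manager
      property: password
```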
Sealed Secrets (Bitnami) lets you encrypt secrets and commit them to git. A controller in the cluster decrypts them. Good for GitOps workflows where you want secrets in the repo but encrypted.
HashiCorp Vault is the heavyweight secrets management platform. The Vault Agent Injector or Vault CSI Provider integrates it with K8s. Vault provides dynamic secrets, PKI, encryption-as-a-service, and lease management. Operationally complex but powerful for large organizations.
OPA/Gatekeeper enforces policies on K8s resources via admission webhooks. You write policies in Rego (OPA's language) that gate what can be created — e.g., no privileged containers, required labels, image registry allowlists. Gatekeeper is the K8s-native wrapper around OPA.
Kyverno is a K8s-native policy engine that uses YAML instead of Rego. Policies are Kubernetes resources. Kyverno can validate, mutate, and generate resources. Lower learning curve than OPA/Gatekeeper, increasingly popular.
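For example, a Kyverno policy requiring a `team` label on every Deployment could be sketched as:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # block non-compliant resources (use Audit to only report)
  rules:
  - name: check-team-label
    match:
      any:
      - resources:
          kinds:
          - Deployment
    validate:
      message: "The label 'team' is required."
      pattern:
        metadata:
          labels:
            team: "?*"               # any non-empty value
```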
See the dedicated Secrets Management primer and Policy Engines primer.
7. Storage¶
Kubernetes storage is plugin-based via the Container Storage Interface (CSI).
CSI drivers are provided by cloud vendors (EBS CSI, GCE PD CSI, Azure Disk CSI) and storage vendors. They implement dynamic provisioning, snapshots, and resize. Every production cluster needs at least one CSI driver configured.
Longhorn (Rancher/SUSE) provides distributed block storage built on top of local disks. Good for bare-metal and edge clusters without cloud-provider storage.
Rook deploys and manages Ceph on Kubernetes. Ceph provides block, file, and object storage. Rook is operationally heavy but provides a full storage platform for on-prem clusters.
Key concepts: StorageClasses define provisioner + parameters. PersistentVolumeClaims (PVCs) request storage. PersistentVolumes (PVs) are the actual backing storage. Dynamic provisioning means the CSI driver creates PVs automatically when PVCs are created.
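These pieces fit together like this (a sketch using the AWS EBS CSI driver; the class name and parameters are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com           # the CSI driver that provisions PVs
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer  # provision in the zone where the pod lands
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: fast-ssd           # triggers dynamic provisioning via the class above
  resources:
    requests:
      storage: 20Gi
```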
See the dedicated K8s Storage primer.
8. Developer and Local Tools¶
k3s is a lightweight K8s distribution from Rancher. Single binary, low resource usage, includes Traefik and local-path-provisioner. Ideal for edge, IoT, CI, and homelab clusters. Production-grade despite the small footprint.
kind (Kubernetes in Docker) runs K8s clusters using Docker containers as nodes. Fast to create and destroy. Primary use: CI/CD pipelines and local testing.
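kind clusters are configured declaratively; a multi-node cluster for testing might be defined as (a minimal sketch):

```yaml
# kind-config.yaml — create with: kind create cluster --config kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```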
minikube runs a single-node cluster locally (VM or container). More features than kind (addons, dashboard, LoadBalancer emulation) but heavier.
Tilt and Skaffold are dev-loop tools. They watch source code, build images, and deploy to a local cluster automatically. Tilt has a UI and is more opinionated; Skaffold is a CLI tool with pipeline definitions. Both eliminate the manual build-push-deploy cycle during development.
Telepresence connects your local machine to a remote cluster's network. You can run a service locally while it receives traffic from the cluster. Useful for debugging services that depend on cluster resources.
9. Operators and CRDs¶
Custom Resource Definitions (CRDs) extend the Kubernetes API with new resource types. An Operator is a controller that watches CRDs and reconciles the desired state — essentially encoding operational knowledge into software.
Operator SDK (from Red Hat) and kubebuilder (from SIG API Machinery) are the two main frameworks for building operators. Both generate scaffolding, handle boilerplate, and provide testing utilities. Kubebuilder is lower-level; Operator SDK wraps it with additional tooling (Ansible/Helm-based operators, OLM integration).
Operators are everywhere in the ecosystem: the Prometheus Operator manages Prometheus instances via ServiceMonitor CRDs, cert-manager manages TLS certificates via Certificate CRDs, and database operators (CloudNativePG, Zalando Postgres Operator) manage database clusters.
See the K8s Operators drills for hands-on practice.
10. Multi-Cluster and Platform Engineering¶
Crossplane turns your Kubernetes cluster into a universal control plane. You define cloud infrastructure (RDS databases, S3 buckets, VPCs) as Kubernetes resources, and Crossplane provisions them via cloud provider APIs. It bridges the gap between Terraform (imperative, CLI-driven) and Kubernetes (declarative, controller-driven).
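With a Crossplane AWS provider installed, cloud infrastructure becomes an ordinary Kubernetes resource. A sketch using the Upbound AWS provider's API group (the exact group and version vary by provider and release):

```yaml
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: my-app-assets
spec:
  forProvider:
    region: us-east-1   # Crossplane calls the AWS API to create the bucket here
```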
Cluster API (CAPI) manages the lifecycle of Kubernetes clusters themselves — provisioning, upgrading, and scaling clusters using Kubernetes resources. "Kubernetes managing Kubernetes."
vcluster creates virtual clusters inside a host cluster. Each vcluster has its own API server and control plane but shares the underlying worker nodes. Useful for multi-tenancy, CI environments, and dev sandboxes without the cost of separate clusters.
See the dedicated Platform Engineering primer.
11. CLI Tools¶
Beyond kubectl, several CLI tools improve the K8s operator experience:
| Tool | Purpose |
|---|---|
| k9s | Terminal UI for cluster management — navigate resources, view logs, exec into pods |
| kubectx / kubens | Fast context and namespace switching |
| stern | Multi-pod log tailing with color-coded output |
| kubectl-neat | Clean up kubectl get -o yaml output (remove managed fields, status) |
| kubecolor | Colorized kubectl output |
| helm-diff | Preview Helm upgrade changes before applying |
| kubectl tree | Visualize resource ownership hierarchy |
```shell
# k9s: terminal UI
k9s -n production

# kubectx: switch cluster context
kubectx staging

# kubens: switch default namespace
kubens kube-system

# stern: tail logs from multiple pods
stern myapp -n production --since 15m

# kubectl neat: clean output for review
kubectl get deploy myapp -o yaml | kubectl neat
```
12. Certificate Management¶
cert-manager is the de facto standard for automating TLS certificate lifecycle in Kubernetes. It integrates with Let's Encrypt, Vault, Venafi, and other CAs. Define Certificate and Issuer resources, and cert-manager handles issuance, renewal, and secret creation.
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          class: nginx
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - myapp.example.com
  - api.example.com
```
How These Pieces Fit Together¶
A typical production stack might look like:
| Layer | Tool choices |
|---|---|
| Cluster provision | Cluster API / EKS / GKE / k3s |
| CNI | Cilium or Calico |
| Ingress | Ingress-NGINX + Gateway API |
| Certificates | cert-manager + Let's Encrypt |
| Package mgmt | Helm + Kustomize overlays |
| GitOps | ArgoCD or Flux |
| Secrets | External Secrets Operator + Vault |
| Policy | Kyverno or OPA/Gatekeeper |
| Observability | Prometheus + Grafana + Loki + OTel |
| Service mesh | Istio or Linkerd (if needed) |
| Storage | Cloud CSI driver + Longhorn (bare metal) |
| Dev experience | Tilt or Skaffold + k9s + stern |
You do not need all of these on day one. Start with: Helm, a CNI, an ingress controller, cert-manager, and Prometheus. Add GitOps, policy, and mesh as the cluster and team grow.
Common Gotchas¶
CRD version skew. Operators install CRDs, and upgrading an operator may change CRD versions. If multiple tools depend on the same CRDs (e.g., the Prometheus CRDs used by both kube-prometheus-stack and the VictoriaMetrics operator), version conflicts can break things. Pin CRD versions and coordinate upgrades.
Resource overhead. Each ecosystem tool adds CPU, memory, and API server load. A "full stack" cluster with mesh, GitOps, policy engine, and observability can consume 4-8 GB of memory in system workloads before you run a single application pod. Size your nodes accordingly.
YAML sprawl. The ecosystem loves YAML. Without structure (Helm, Kustomize, directory conventions), manifests multiply uncontrollably. Establish a repo layout early and enforce it.
Operator conflicts. Two operators managing the same resource type will fight. Only one controller should own each resource kind in a cluster.
The CNCF landscape is not a shopping list. The CNCF Cloud Native Landscape lists 1000+ projects. Most are niche, immature, or abandoned. Stick to graduated and incubating projects unless you have a specific need and the team to support it.
Deep Dive: Kubernetes Operators & CRDs¶
Why Operators Matter¶
Operators encode operational knowledge into software. Instead of a runbook that says "when the database needs scaling, do X, Y, Z," an operator watches the cluster and does X, Y, Z automatically. Understanding operators is essential for running stateful workloads and for extending Kubernetes with domain-specific automation.
Custom Resource Definitions (CRDs)¶
A CRD extends the Kubernetes API with your own resource types. After creating a CRD, you can kubectl get your custom resources just like pods or services.
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: [postgres, mysql]
              version:
                type: string
              replicas:
                type: integer
                minimum: 1
              storage:
                type: string
            required: [engine, version]
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
    - db
```
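With the CRD installed, an instance is created like any other resource (values are illustrative but conform to the schema above):

```yaml
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres   # must be one of the enum values
  version: "16"
  replicas: 3
  storage: 50Gi
```

Thanks to the `db` shortName, `kubectl get db -n production` lists it alongside built-in resources.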
The Reconciliation Loop¶
1. Watch: Controller watches for changes to CRs
2. Compare: Desired state (CR spec) vs actual state (cluster)
3. Act: Create/update/delete resources to match desired state
4. Status: Update CR status with current state
5. Repeat
Operator Maturity Model¶
| Level | Capability | Example |
|---|---|---|
| 1 | Basic install | Helm chart wrapper |
| 2 | Seamless upgrades | Rolling updates, version migration |
| 3 | Full lifecycle | Backup, restore, scaling |
| 4 | Deep insights | Metrics, alerts, log analysis |
| 5 | Auto-pilot | Auto-scaling, auto-tuning, self-healing |
Building Operators — Frameworks¶
| Framework | Language | Complexity | Best for |
|---|---|---|---|
| Kubebuilder | Go | Medium | Production operators |
| Operator SDK | Go/Ansible/Helm | Medium | Red Hat ecosystem |
| Kopf | Python | Low | Quick prototypes, Python shops |
| Metacontroller | Any (webhooks) | Low | Simple use cases |
| Shell-operator | Bash/Python | Low | Quick automation scripts |
Controller Logic Pattern (Go)¶
```go
import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	examplev1 "example.com/api/v1" // generated types for the Database CRD
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the custom resource
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Check if the backing StatefulSet exists
	var sts appsv1.StatefulSet
	err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, &sts)
	if errors.IsNotFound(err) {
		// 3. Create it
		sts = r.buildStatefulSet(&db)
		if err := r.Create(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
		logger.Info("Created StatefulSet", "name", db.Name)
	} else if err != nil {
		return ctrl.Result{}, err
	}

	// 4. Update status via the status subresource
	db.Status.ReadyReplicas = sts.Status.ReadyReplicas
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```
Well-Known Operators¶
| Operator | What it manages | Maturity |
|---|---|---|
| Prometheus Operator | Prometheus, Alertmanager, ServiceMonitor | Level 5 |
| cert-manager | TLS certificates, issuers | Level 4 |
| Strimzi | Apache Kafka clusters | Level 4 |
| CloudNativePG | PostgreSQL clusters | Level 4 |
| Zalando Postgres Operator | PostgreSQL with patroni | Level 4 |
| Rook/Ceph | Distributed storage | Level 5 |
Owner References & Garbage Collection¶
When an operator creates child resources, it should set owner references so Kubernetes garbage collects them when the CR is deleted.
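With controller-runtime, `ctrl.SetControllerReference` sets this before the child is created. The resulting metadata on a child StatefulSet looks roughly like (a sketch; the names and UID are illustrative):

```yaml
metadata:
  name: orders-db
  ownerReferences:
  - apiVersion: example.com/v1
    kind: Database
    name: orders-db
    uid: 9f6c0a1e-0000-0000-0000-000000000000  # UID of the owning Database CR
    controller: true            # exactly one controller owner per object
    blockOwnerDeletion: true    # owner cannot finish deleting until this child is gone
```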
Finalizers¶
Finalizers let your operator run cleanup logic before a CR is deleted. The operator detects the deletion timestamp, runs cleanup (e.g., take final backup), removes the finalizer, and Kubernetes completes the deletion.
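The finalizer itself is just a string on the CR's metadata; the operator adds it on first reconcile and removes it once cleanup is done (the finalizer name here is illustrative):

```yaml
metadata:
  name: orders-db
  finalizers:
  - databases.example.com/final-backup   # deletion blocks until the operator removes this
```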
Common Operator Pitfalls¶
- No owner references — Child resources become orphaned when CR is deleted
- Missing RBAC — Operator needs ClusterRole with permissions for all resources it manages
- No idempotent reconciliation — Reconcile must handle being called multiple times for the same state
- Unbounded requeueing — Always set a maximum requeue delay to avoid hammering the API
- CRD versioning — Plan for schema evolution. Use conversion webhooks for breaking changes
- Status subresource — Always use the status subresource for status updates (not spec)
Wiki Navigation¶
Next Steps¶
- Kubernetes Operators Drills (Drill, L3)
- Skillcheck: Kubernetes Operators (Assessment, L3)
Related Content¶
- Kubernetes Operators Drills (Drill, L3) — K8s Ecosystem
- Kubernetes Operators Flashcards (CLI) (flashcard_deck, L1) — Kubernetes Operators
- Skillcheck: Kubernetes Operators (Assessment, L3) — K8s Ecosystem
Pages that link here¶
- Anti-Primer: Kubernetes Ecosystem
- Certification Prep: CKAD — Certified Kubernetes Application Developer
- Comparison: Kubernetes Templating
- Comparison: Local Dev for Kubernetes
- Comparison: Local Kubernetes Clusters
- Helm - Primer
- K8S Ecosystem
- Kubernetes Operators & CRDs - Skill Check
- Kubernetes Operators & CRDs Drills
- Kubernetes Storage - Primer
- Level 6: Advanced Platform Engineering
- Master Curriculum: 40 Weeks
- OpenTelemetry - Primer
- Platform Engineering Patterns - Primer
- Policy Engines (OPA/Kyverno) - Primer