Kubernetes Operators & CRDs Drills

Remember: An operator is a CRD + a controller. The CRD defines "what you want" (desired state), the controller does "how to get there" (reconciliation loop). The controller watches for changes to the custom resource and takes action to make reality match the spec. This is the same pattern Kubernetes itself uses — Deployments have a controller that ensures the right number of pods exist.

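Gotcha: Deleting a CRD deletes ALL custom resources of that type, in every namespace, and garbage collection then removes whatever those resources own. If you uninstall an operator that manages databases, deleting the CRD can cascade-delete all your database resources. Before removing a CRD, list what would be lost with `kubectl get <plural>.<group> --all-namespaces` (for example, `kubectl get databases.example.com --all-namespaces`). Finalizers on the custom resources hold each deletion until the operator has finished its cleanup, so use them wherever removal must not happen silently.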

Drill 1: What Is a CRD?

Difficulty: Easy

Q: Explain what a CRD is and how it extends the Kubernetes API. Give an example.

Answer

A **Custom Resource Definition (CRD)** extends the Kubernetes API with new resource types. Once a CRD is created, you can `kubectl get`, `create`, and `delete` custom resources of that type just like built-in resources.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: ["postgres", "mysql"]
              version:
                type: string
              replicas:
                type: integer
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames: ["db"]
kubectl get databases
kubectl describe database my-app-db
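A CRD alone is just a data store: the API server stores and validates the objects, but nothing acts on them. An **operator** (controller) watches the custom resources defined by the CRD and takes action.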
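To make the schema more concrete, here is a minimal sketch of Go types that mirror the `spec` fields of the CRD above, using only the standard library. The names `DatabaseSpec`, `Database`, `validateEngine`, and `parseDatabase` are illustrative assumptions, not part of any Kubernetes library; in a real cluster the API server enforces the `enum` from the OpenAPI schema, and this code only imitates that check.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DatabaseSpec mirrors the spec properties declared in the CRD schema.
type DatabaseSpec struct {
	Engine   string `json:"engine"`
	Version  string `json:"version"`
	Replicas int    `json:"replicas"`
}

// Database is a simplified stand-in for the custom resource (no metadata).
type Database struct {
	APIVersion string       `json:"apiVersion"`
	Kind       string       `json:"kind"`
	Spec       DatabaseSpec `json:"spec"`
}

// validateEngine imitates the enum rule the schema declares for spec.engine.
func validateEngine(engine string) error {
	switch engine {
	case "postgres", "mysql":
		return nil
	}
	return fmt.Errorf("unsupported engine %q: must be postgres or mysql", engine)
}

// parseDatabase decodes a JSON manifest and applies the enum check.
func parseDatabase(manifest []byte) (Database, error) {
	var db Database
	if err := json.Unmarshal(manifest, &db); err != nil {
		return Database{}, err
	}
	if err := validateEngine(db.Spec.Engine); err != nil {
		return Database{}, err
	}
	return db, nil
}

func main() {
	manifest := []byte(`{"apiVersion":"example.com/v1","kind":"Database",` +
		`"spec":{"engine":"postgres","version":"16","replicas":3}}`)
	db, err := parseDatabase(manifest)
	fmt.Println(db.Spec.Engine, db.Spec.Replicas, err)
}
```

A Kubebuilder project would generate comparable structs, with Kubernetes metadata types embedded, in its `*_types.go` file.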

Drill 2: Reconciliation Loop

Difficulty: Medium

Q: Explain the reconciliation loop pattern that all Kubernetes operators follow. What triggers reconciliation?

Answer
                    ┌──────────────┐
                    │    Watch     │
                    │  (informer)  │
                    └──────┬───────┘
                           │ event
                    ┌──────▼───────┐
                    │    Queue     │
                    │ (work queue) │
                    └──────┬───────┘
                           │ dequeue
                    ┌──────▼───────┐
          ┌────────►│  Reconcile   │
          │         │  (your code) │
          │         └──────┬───────┘
          │                │
          │         ┌──────▼───────┐
          │         │  Desired ==  │
          │    No   │  Actual?     │
          ├─────────┤              │
          │         └──────┬───────┘
          │                │ Yes
          │         ┌──────▼───────┐
          └─────────│   Requeue    │
                    │  (periodic)  │
                    └──────────────┘
Triggers:

1. **Create/Update/Delete** of the watched resource
2. **Changes to owned resources** (e.g., a Pod owned by the operator crashes)
3. **Periodic re-sync** (configurable interval)
4. **Manual requeue** from within the reconcile function

Key principle: Reconcile is **idempotent**. It compares desired state (spec) with actual state and makes adjustments. It should be safe to call repeatedly.
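The idempotency principle can be illustrated with a toy model that uses no Kubernetes libraries. `Cluster` and `Reconcile` are hypothetical names for this sketch; a real reconcile function would read and write objects through the API server instead of a map.

```go
package main

import "fmt"

// Cluster is a toy stand-in for observed state: running replicas per Database name.
type Cluster struct {
	Running map[string]int
}

// Reconcile moves actual state toward the desired replica count and
// reports whether it had to change anything.
func Reconcile(c *Cluster, name string, desired int) (changed bool) {
	if c.Running[name] == desired {
		return false // already converged: calling again is harmless
	}
	c.Running[name] = desired
	return true
}

func main() {
	c := &Cluster{Running: map[string]int{}}
	fmt.Println(Reconcile(c, "my-db", 3)) // first pass creates the replicas
	fmt.Println(Reconcile(c, "my-db", 3)) // second pass finds nothing to do
	c.Running["my-db"] = 2                // simulate a crashed pod
	fmt.Println(Reconcile(c, "my-db", 3)) // drift is repaired
}
```

Because the function only looks at the difference between desired and actual state, it does not matter which trigger caused the call or how many times it runs.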

Drill 3: Owner References

Difficulty: Medium

Q: What are owner references and why are they critical for operators?

Answer

Owner references create a parent-child relationship between resources. When the parent is deleted, Kubernetes garbage-collects all children.
apiVersion: v1
kind: Pod
metadata:
  name: db-pod-0
  ownerReferences:
  - apiVersion: example.com/v1
    kind: Database
    name: my-db
    uid: abc-123
    controller: true
    blockOwnerDeletion: true
Why they matter:

1. **Garbage collection**: Delete the Database CR → all Pods, Services, PVCs that list it as owner are cleaned up
2. **Event propagation**: Changes to owned resources trigger reconciliation of the owner (provided the controller also watches the owned types)
3. **Prevents orphans**: No leaked resources when CRs are deleted
// In an operator, set the owner reference on the child before creating it:
if err := ctrl.SetControllerReference(database, pod, r.Scheme); err != nil {
    return ctrl.Result{}, err
}
Without owner references, deleting a CR would leave orphaned pods, services, and PVCs.
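The cascade can be sketched as a toy garbage collector in plain Go. `Object` and `CollectGarbage` are hypothetical names, and the loop is a simplified picture of the behaviour described here, not the actual Kubernetes garbage collector.

```go
package main

import "fmt"

// Object is a toy resource: its own UID plus the UID of its owner ("" if none).
type Object struct {
	Name     string
	UID      string
	OwnerUID string
}

// CollectGarbage drops every object whose owner no longer exists, repeating
// until nothing else is removed so that grandchildren are collected too.
func CollectGarbage(objs []Object) []Object {
	for {
		live := make(map[string]bool, len(objs))
		for _, o := range objs {
			live[o.UID] = true
		}
		var kept []Object
		for _, o := range objs {
			if o.OwnerUID == "" || live[o.OwnerUID] {
				kept = append(kept, o)
			}
		}
		if len(kept) == len(objs) {
			return kept
		}
		objs = kept
	}
}

func main() {
	// The Database with UID "abc-123" has already been deleted, so it is absent.
	objs := []Object{
		{Name: "db-pod-0", UID: "pod-1", OwnerUID: "abc-123"},
		{Name: "db-pvc-0", UID: "pvc-1", OwnerUID: "pod-1"},
		{Name: "unrelated", UID: "pod-2"},
	}
	for _, o := range CollectGarbage(objs) {
		fmt.Println(o.Name)
	}
}
```

Objects without an owner reference are never touched, which is exactly why a child created without one is left behind when its CR goes away.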

Drill 4: Finalizers

Difficulty: Medium

Q: What is a finalizer and when would you use one in an operator?

Answer

A finalizer is a string on a resource's `metadata.finalizers` list that prevents deletion until the operator removes it.
metadata:
  finalizers:
  - databases.example.com/cleanup
Deletion flow:

1. User runs `kubectl delete database my-db`
2. Kubernetes sets `deletionTimestamp` but does NOT delete the resource
3. Operator's reconcile function is called
4. Operator performs cleanup (e.g., drop database, remove external resources, revoke credentials)
5. Operator removes the finalizer from the list
6. Kubernetes deletes the resource

Use finalizers when:

- You create external resources (cloud databases, DNS records, IAM roles)
- You need to run cleanup logic before deletion
- You need to coordinate with external systems
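const cleanupFinalizer = "databases.example.com/cleanup"
if !database.DeletionTimestamp.IsZero() {
    // Being deleted: run cleanup (assumed here to return an error), then remove the finalizer
    if controllerutil.ContainsFinalizer(database, cleanupFinalizer) {
        if err := cleanupExternalResources(database); err != nil {
            return ctrl.Result{}, err // finalizer stays, so the next reconcile retries
        }
        controllerutil.RemoveFinalizer(database, cleanupFinalizer)
        if err := r.Update(ctx, database); err != nil { // persist the removal
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}
// Not being deleted: add the finalizer and persist it (AddFinalizer reports whether the list changed)
if controllerutil.AddFinalizer(database, cleanupFinalizer) {
    if err := r.Update(ctx, database); err != nil {
        return ctrl.Result{}, err
    }
}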
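The API-server side of the deletion flow can be modelled in a few lines of plain Go. `Store`, `Resource`, and the `Deleting` flag (standing in for a non-zero `deletionTimestamp`) are hypothetical names for this sketch; it only imitates the rule that an object disappears once it is marked for deletion and its finalizer list is empty.

```go
package main

import "fmt"

// Resource is a toy object with a finalizer list and a deletion marker.
type Resource struct {
	Finalizers []string
	Deleting   bool // stands in for a non-zero deletionTimestamp
}

// Store is a toy stand-in for the API server's object storage.
type Store struct {
	items map[string]*Resource
}

func NewStore() *Store { return &Store{items: map[string]*Resource{}} }

func (s *Store) Add(name string, finalizers ...string) {
	s.items[name] = &Resource{Finalizers: finalizers}
}

func (s *Store) Exists(name string) bool { _, ok := s.items[name]; return ok }

// Delete marks the resource for deletion; it is removed only if no finalizers remain.
func (s *Store) Delete(name string) {
	if r, ok := s.items[name]; ok {
		r.Deleting = true
		s.sweep(name)
	}
}

// RemoveFinalizer drops one finalizer and lets a pending deletion complete.
func (s *Store) RemoveFinalizer(name, finalizer string) {
	r, ok := s.items[name]
	if !ok {
		return
	}
	kept := r.Finalizers[:0]
	for _, f := range r.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
	s.sweep(name)
}

// sweep removes the object once it is marked for deletion and has no finalizers.
func (s *Store) sweep(name string) {
	if r := s.items[name]; r != nil && r.Deleting && len(r.Finalizers) == 0 {
		delete(s.items, name)
	}
}

func main() {
	s := NewStore()
	s.Add("my-db", "databases.example.com/cleanup")
	s.Delete("my-db")
	fmt.Println(s.Exists("my-db")) // still present: the finalizer blocks removal
	s.RemoveFinalizer("my-db", "databases.example.com/cleanup")
	fmt.Println(s.Exists("my-db")) // gone once the operator removed its finalizer
}
```

The window between the two calls in `main` is where the operator runs its cleanup of external resources.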

Drill 5: Status Subresource

Difficulty: Medium

Q: Why should operators use the status subresource instead of updating the entire CR?

Answer

The status subresource allows separate RBAC and update semantics for `.spec` (desired state) vs `.status` (observed state).
# In the CRD definition
versions:
- name: v1
  subresources:
    status: {}
Benefits:

1. **RBAC separation**: Users can update spec, only the operator updates status
2. **Isolated writes**: Writes to the `/status` endpoint ignore spec changes, and writes to the main resource ignore status changes, so a status update cannot overwrite user intent
3. **Convention**: spec = user intent, status = operator observations
# Status example
status:
  phase: Running
  replicas: 3
  readyReplicas: 3
  conditions:
  - type: Ready
    status: "True"
    lastTransitionTime: "2024-01-15T10:00:00Z"
  - type: BackupComplete
    status: "True"
    lastTransitionTime: "2024-01-15T06:00:00Z"
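// Update only .status through the /status subresource (a separate API call)
if err := r.Status().Update(ctx, database); err != nil {
    return ctrl.Result{}, err
}
// vs. updating the main resource (with the subresource enabled, .status changes are ignored here)
if err := r.Update(ctx, database); err != nil {
    return ctrl.Result{}, err
}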
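Conditions like the ones in the status example are usually maintained so that `lastTransitionTime` changes only when the condition's status actually flips. The sketch below shows that rule in plain Go; `Condition` and `SetCondition` are hypothetical names, and timestamps are kept as RFC 3339 strings to match the YAML above. The apimachinery library offers a helper with similar behaviour, `meta.SetStatusCondition`, which a real operator would more likely use.

```go
package main

import "fmt"

// Condition is a simplified version of a status condition entry.
type Condition struct {
	Type               string
	Status             string
	LastTransitionTime string // RFC 3339 timestamp, as in the YAML example
}

// SetCondition adds the condition if it is missing. If it exists, the status
// and LastTransitionTime are updated only when the status value changes.
func SetCondition(conds []Condition, condType, status, now string) []Condition {
	for i := range conds {
		if conds[i].Type != condType {
			continue
		}
		if conds[i].Status != status {
			conds[i].Status = status
			conds[i].LastTransitionTime = now
		}
		return conds
	}
	return append(conds, Condition{Type: condType, Status: status, LastTransitionTime: now})
}

func main() {
	conds := SetCondition(nil, "Ready", "True", "2024-01-15T10:00:00Z")
	// Same status reported an hour later: the transition time is left alone.
	conds = SetCondition(conds, "Ready", "True", "2024-01-15T11:00:00Z")
	fmt.Println(conds[0].Status, conds[0].LastTransitionTime)
	// Status flips: the transition time moves forward.
	conds = SetCondition(conds, "Ready", "False", "2024-01-15T12:00:00Z")
	fmt.Println(conds[0].Status, conds[0].LastTransitionTime)
}
```

Keeping the timestamp stable across repeated reconciles makes it possible to tell how long a resource has been in its current state.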

Drill 6: Kubebuilder Scaffold

Difficulty: Easy

Q: How do you scaffold a new operator project using Kubebuilder?

Answer
# Initialize project
kubebuilder init --domain example.com --repo github.com/org/db-operator

# Create API (CRD + Controller)
kubebuilder create api --group database --version v1 --kind Database

# Key files created (exact paths depend on the Kubebuilder version):
# api/v1/database_types.go    : CRD spec/status structs
# controllers/database_controller.go : Reconcile logic (internal/controller/ in newer layouts)
# config/crd/bases/           : Generated CRD YAML

# Edit the types
vi api/v1/database_types.go

# Regenerate deepcopy code and manifests after type changes
make generate manifests

# Install the CRDs, then run locally (against current kubeconfig)
make install
make run

# Build and push image
make docker-build docker-push IMG=registry.example.com/db-operator:v1

# Deploy to cluster
make deploy IMG=registry.example.com/db-operator:v1

Drill 7: Operator Maturity Levels

Difficulty: Easy

Q: What are the 5 operator capability levels? Give an example of each.

Answer

| Level | Name | Capabilities | Example |
|-------|------|-------------|---------|
| 1 | Basic Install | Automated install, CRD, operator lifecycle | Operator deploys app via CR |
| 2 | Seamless Upgrades | Version upgrades, patch management | Operator upgrades Postgres 15→16 |
| 3 | Full Lifecycle | Backup, restore, failure recovery | Automated backup to S3, PITR |
| 4 | Deep Insights | Metrics, alerts, log processing | Custom Grafana dashboards, SLO monitoring |
| 5 | Auto Pilot | Auto-scaling, auto-tuning, anomaly detection | Auto-adjusts buffer pool, auto-failover |

Most open-source operators are Level 2-3. Fully automated (Level 5) is rare.

Drill 8: Debug a Stuck Operator

Difficulty: Hard

Q: Your custom operator is running but CRs stay in "Pending" state and never transition to "Running". How do you debug?

Answer
# 1. Check operator pod logs
kubectl logs -n operator-system deploy/my-operator-controller-manager -f
# Look for errors, panics, or RBAC denied messages

# 2. Check if operator is watching the right namespace
kubectl get deploy -n operator-system -o yaml | grep -A5 WATCH_NAMESPACE

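# 3. Check RBAC: does the operator SA have the right permissions?
# (informers need list and watch, not just get; add -n <namespace> to match where the CRs live)
kubectl auth can-i list databases --as=system:serviceaccount:operator-system:my-operator-sa
kubectl auth can-i watch databases --as=system:serviceaccount:operator-system:my-operator-sa
kubectl auth can-i create pods --as=system:serviceaccount:operator-system:my-operator-sa
kubectl auth can-i update databases/status --as=system:serviceaccount:operator-system:my-operator-sa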

# 4. Check events on the CR
kubectl describe database my-db
# Look at Events section

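# 5. Check if the reconcile function is being called
# Add debug logging in the reconcile function
# Or check controller-runtime metrics (service name, port, and auth depend on how the operator is deployed):
kubectl port-forward -n operator-system svc/my-operator-metrics 8443:8443
curl -k https://localhost:8443/metrics | grep controller_runtime_reconcile
# controller_runtime_reconcile_total should keep increasing; controller_runtime_reconcile_errors_total should not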

# 6. Common issues:
# - Missing RBAC for status subresource updates
# - Reconcile swallowing an error (returning nil without logging it, so nothing is retried)
# - Watching wrong group/version/kind
# - Leader election stuck (previous pod still holds lease)
