Kubernetes Ecosystem Footguns¶
Mistakes that cause outages or security incidents.
1. Two Controllers Managing the Same Resource¶
You install both the Prometheus Operator (from kube-prometheus-stack) and the Victoria Metrics Operator, both of which manage ServiceMonitor CRDs. Or you have Flux and ArgoCD both configured to reconcile the same namespace. Two controllers race to set the resource to their desired state — each undoes the other's changes. Resources flicker between states, reconciliation loops spike CPU on the control plane, and neither tool achieves its desired state. The failure is confusing because both controllers report they're working.
Fix: One controller per resource kind per namespace. Before adding a new tool to a cluster, audit what CRDs and resource types it manages: kubectl get crds | grep monitoring.coreos.com. If two tools overlap, choose one and disable or remove the conflicting component in the other. For GitOps, designate one system (ArgoCD or Flux, not both) as authoritative per namespace.
Debug clue: The classic symptom of two controllers fighting is a resource whose .metadata.resourceVersion increments rapidly even when no human is making changes. Run kubectl get <resource> -w and watch for rapid updates with alternating values in .spec or .status. Also check controller logs for "conflict" or "object has been modified" errors — these indicate optimistic-locking failures from concurrent writes.
2. Helm + ArgoCD Without argocd.argoproj.io/managed-by — Constant OutOfSync¶
You install a tool with Helm, then create an ArgoCD Application pointing at the same Helm chart. ArgoCD takes over management. On the next manual helm upgrade, ArgoCD detects drift and immediately reverts your change. Operators are left wondering why their Helm upgrades "don't stick." Worse, they start running helm upgrade and ArgoCD sync in parallel, with unpredictable results depending on timing.
Fix: Pick one: Helm CLI or ArgoCD — not both — as the deployment mechanism for a given release. If you want GitOps (ArgoCD/Flux), stop using helm upgrade manually and commit values changes to git instead. If ArgoCD manages a Helm chart, set syncPolicy.automated appropriately and use argocd app set for value overrides, not direct Helm CLI commands. Run helm list -A and argocd app list and confirm no release appears in both.
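As a sketch, a minimal Application that makes ArgoCD the sole owner of a Helm release might look like this (repo URL, chart, version, and namespaces are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.jetstack.io   # illustrative chart repository
    chart: cert-manager
    targetRevision: v1.14.4               # pin a version; never a floating tag
    helm:
      values: |
        installCRDs: true
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      selfHeal: true    # ArgoCD reverts out-of-band helm upgrades
```

Once this Application exists, changes go through git or argocd app set; the Helm CLI is out of the loop for this release.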
3. CRD Version Skew After Operator Upgrade¶
You upgrade the cert-manager Helm chart from v1.10 to v1.14. The new version changes the CRD schema for Certificate resources. Your existing Certificate objects were valid in v1.10 but now fail validation under v1.14's stricter schema. Kubernetes rejects attempts to update them, and cert-manager's controller can't reconcile them, so TLS renewals silently stop working until certs expire.
Fix: Read the migration guide for every operator upgrade — CRD changes are common and breaking. Before upgrading, run kubectl get crds and check the operator's changelog for CRD version bumps. Use kubectl diff or helm diff upgrade to preview changes. For cert-manager specifically, run cmctl check api --wait=2m before and after upgrade. For any operator managing critical resources, test upgrades in staging with the exact same manifests used in production.
4. ArgoCD Auto-Sync + --prune Deletes Manually Created Resources¶
You enable syncPolicy.automated.prune: true on an ArgoCD Application. A developer creates a temporary debugging Pod or ConfigMap in the managed namespace using kubectl apply. ArgoCD's next sync cycle — which runs every 3 minutes by default — detects the resource is not in git and prunes (deletes) it. The developer loses their debugging state with no warning, and if the "temporary" resource was actually important (e.g., a manually created Secret), it's gone.
Fix: Understand that --prune is a powerful but aggressive setting. Resources in ArgoCD-managed namespaces should be managed by ArgoCD — do not kubectl apply things manually in those namespaces. For temporary debugging, use a separate non-managed namespace, or use kubectl run --dry-run patterns. If a resource should be permanent but isn't in git yet, add it to git before enabling prune. Mark resources you want to protect with argocd.argoproj.io/sync-options: Prune=false.
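For a resource that must survive in a pruning Application, the annotation from the fix above looks like this (resource name and namespace are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: keep-me                  # illustrative
  namespace: production
  annotations:
    argocd.argoproj.io/sync-options: Prune=false   # ArgoCD will not delete this resource
data:
  mode: "debug"
```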
5. cert-manager Let's Encrypt Rate Limits Exhausted¶
You're testing cert-manager with the production Let's Encrypt issuer and repeatedly create and delete Certificate resources to troubleshoot a DNS-01 challenge. Let's Encrypt's rate limit is 5 duplicate certificates per domain per week. After a few iterations, all certificate issuance fails with 429 Too Many Requests — affecting not just your test but all other certs for the same domain. Production TLS renewals may be blocked for days.
Fix: Always use the Let's Encrypt staging issuer during development and troubleshooting:
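A staging ClusterIssuer is identical to the production one except for the ACME server URL (the email and solver below are illustrative):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Staging endpoint: generous rate limits, untrusted roots
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com               # illustrative contact address
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```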
Staging issues fake certs (not trusted by browsers but functionally identical for testing). Switch to the production issuer only when everything is verified to work. Never use the production issuer in CI or dev clusters.
6. Full-Stack Installation Without Resource Sizing — Control Plane OOM¶
You install the "full observability stack" — Prometheus, Grafana, Loki, Tempo, OTel Collector, ArgoCD, cert-manager, and External Secrets Operator — on a 3-node cluster with 4GB RAM per node. Within 24 hours, the Prometheus pod is OOMKilled because it's scraping thousands of metrics from every component on every node. The Loki pod fills its PVC. ArgoCD's Redis runs out of memory. Node pressure triggers evictions and the cluster becomes unstable.
Fix: Size your system namespace workloads before installing the stack. The Prometheus community recommends 2-8GB RAM for production Prometheus depending on target count and retention. Add explicit resources.requests and resources.limits to every system component. Use kubectl top pods -n monitoring after 24 hours to baseline actual consumption. Start with a smaller retention period (15d vs 90d) and scale up. Use Prometheus remote write to ship metrics off-cluster if nodes are resource-constrained.
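With the kube-prometheus-stack chart, sizing and retention live under prometheusSpec; the numbers below are a starting point to baseline against, not a recommendation:

```yaml
# values.yaml excerpt for kube-prometheus-stack (illustrative sizes)
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi    # OOMKill boundary; raise it before Prometheus hits it
```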
7. External Secrets Operator SecretStore Credentials in Plain Kubernetes Secret¶
ESO uses a Kubernetes Secret to store credentials for the external secret backend (Vault token, AWS access key, etc.). That Kubernetes Secret is often created with kubectl create secret generic vault-token --from-literal=token=s.abc123 — which means the plaintext token is in your shell history, and the secret lives in etcd without envelope encryption. If etcd backups are stored plaintext, the credential is exposed in every backup.
Fix: Enable etcd encryption at rest in your cluster (EncryptionConfiguration with AES-GCM or KMS provider). For ESO with Vault specifically, use the Kubernetes auth method (pod's service account JWT) instead of a long-lived Vault token — ESO supports this natively, and it eliminates the long-lived credential entirely. For AWS, use IRSA (IAM Roles for Service Accounts) instead of static access keys.
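With ESO's Vault provider, a SecretStore using the Kubernetes auth method carries no long-lived credential at all (server URL, role, and service account name are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: production
spec:
  provider:
    vault:
      server: https://vault.example.com:8200   # illustrative
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: eso-role                       # illustrative Vault role
          serviceAccountRef:
            name: external-secrets             # pod's SA JWT authenticates to Vault
```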
8. Ingress-NGINX Default Backend Serves Confidential Error Details¶
Your Ingress-NGINX installation uses the default configuration. When a backend pod is unhealthy or the path doesn't exist, NGINX returns 502/404 error pages that include internal server names, upstream IP addresses, and sometimes stack traces proxied from the backend. These details help attackers map your internal network topology.
Fix: Configure a custom default backend for Ingress-NGINX that returns generic error pages:
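With the ingress-nginx Helm chart, a custom default backend and the server-tokens setting can both go in values (the error-pages image is illustrative — point it at whatever generic error-page image you build or trust):

```yaml
# ingress-nginx values.yaml excerpt (illustrative image)
controller:
  config:
    server-tokens: "false"      # strip the NGINX version from headers
defaultBackend:
  enabled: true
  image:
    repository: registry.example.com/generic-error-pages   # illustrative
    tag: "1.0.0"
```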
Disable server tokens (server-tokens: "false") to remove the NGINX version from response headers. Ensure your application backends return generic error messages that don't expose internal details — NGINX proxies whatever the application returns, so this is an application-level concern as well.
9. Flux Image Update Automation Pushes to Production on Every Commit¶
You enable Flux's Image Update Automation to automatically bump image tags in git when new images are pushed to the registry. The automation is configured to watch :latest or a semver pattern that includes pre-release tags. Every developer push to main triggers an image build, which triggers Flux to update the git manifest, which triggers a production deploy — with no human review, staging gate, or approval step.
Fix: Scope Image Update Automation narrowly. Use a specific tag pattern that only matches deliberately promoted images (e.g., semver:>=1.0.0 to match v1.2.3 but not latest or main-abc123). Configure automation to update a staging branch only, with a PR-based promotion to production. Add a policy object that restricts which tags trigger updates:
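A Flux ImagePolicy that only matches stable semver tags might look like this (names are illustrative):

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: app-policy              # illustrative
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: app-repo              # illustrative ImageRepository
  policy:
    semver:
      range: ">=1.0.0"          # matches v1.2.3; ignores latest, main-abc123
```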
10. Installing Every CNCF Project That Looks Interesting¶
A new team member reads the CNCF landscape and installs Falco, Tetragon, Kyverno, OPA/Gatekeeper, Polaris, Trivy Operator, and RBAC Manager — all in the same cluster. Each tool adds webhooks, DaemonSets, and CRDs. Admission webhook timeouts from any one of them can block all pod creation. Multiple policy engines conflict on the same resources. The cluster becomes fragile and slow, and nobody fully understands what any single tool is doing.
Fix: The CNCF landscape is not a shopping list. Install one tool per concern: one policy engine (Kyverno or Gatekeeper, not both), one security scanner (Trivy Operator or Falco, not all of them). Before installing any new ecosystem tool, ask: what problem does this solve that existing tools don't? Who owns the operational responsibility for this tool? What happens when it breaks? Prefer CNCF graduated projects over incubating ones for production workloads.
Operator-Specific Footguns¶
Mistakes that crash your operator, corrupt custom resources, or create reconciliation loops from hell.
11. Reconcile loop that hammers the API server¶
Your reconciler doesn't use exponential backoff on errors. A transient failure causes the reconciler to retry immediately, thousands of times per second. The API server gets overloaded. Every operator in the cluster slows down.
Fix: Use ctrl.Result{RequeueAfter: time.Second * 30} for retries. Implement exponential backoff. Set rate limiters on the controller manager.
12. Finalizer that blocks deletion forever¶
Your operator adds a finalizer to clean up external resources. The cleanup code has a bug and always fails. Now the CR can't be deleted — ever. kubectl delete hangs. --force doesn't work because finalizers aren't removed by force delete.
Fix: Implement robust cleanup with error handling. Add a timeout for cleanup operations. As a last resort, patch out the finalizer: kubectl patch mycr/name --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'.
13. Status update triggering reconciliation¶
Your reconciler updates the CR's .status field. This triggers a watch event. The reconciler runs again, updates status again, triggering another watch. Infinite loop.
Fix: Update status through the status subresource client (client.StatusClient, i.e. r.Status().Update(ctx, &cr) in controller-runtime) so spec and status writes stay separate. Add GenerationChangedPredicate so the reconciler only runs on .spec changes (which bump .metadata.generation), not on .status changes.
14. CRD schema too permissive¶
Your CRD has x-kubernetes-preserve-unknown-fields: true and no validation. Users create CRs with typos in field names that silently do nothing. replcias: 5 (typo for replicas) is accepted without error.
Fix: Define a strict OpenAPI schema. Validate all fields. Use enum for fixed choices. Use defaulting webhooks for optional fields.
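A stricter schema for the example above might look like this excerpt (field names are illustrative):

```yaml
# excerpt from spec.versions[].schema in a CRD
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      required: ["replicas"]   # a CR carrying only the "replcias" typo now fails validation
      properties:
        replicas:
          type: integer
          minimum: 1
        logLevel:
          type: string
          enum: ["debug", "info", "warn", "error"]   # fixed choices
```

Note that apiextensions.k8s.io/v1 prunes unknown fields by default, so the typo itself is silently dropped; making the real field required is what turns the typo into a hard error.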
15. Upgrading CRD without migration¶
You add a required field to the CRD v2 schema. Existing v1 CRs don't have this field. The operator crashes trying to access the field, or worse, operates on existing CRs with zero-value defaults.
Fix: Use CRD versioning (v1alpha1 -> v1beta1 -> v1). Implement conversion webhooks between versions. Make new fields optional with defaults. Backfill existing CRs before making fields required.
16. Operator with cluster-admin permissions¶
Your operator's RBAC gives it ClusterRole with */* permissions because figuring out the exact permissions was hard. A bug in the operator or a compromise of the operator pod now has full cluster access.
Fix: Follow least privilege. Grant only the specific resources and verbs the operator needs. Use Role (namespaced) instead of ClusterRole when possible.
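A least-privilege Role for a namespaced operator might look like this (the API group, resource names, and namespace are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myoperator                # illustrative
  namespace: production
rules:
  - apiGroups: ["example.com"]    # illustrative CRD group
    resources: ["mycrs", "mycrs/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
```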
17. No leader election in HA setup¶
You run 2 replicas of your operator for high availability. Both replicas reconcile the same CRs simultaneously. They fight — one creates a resource, the other tries to create the same resource and fails. Resources get duplicated or corrupted.
Fix: Enable leader election in the controller manager. Only the leader reconciles. The standby takes over if the leader pod dies.
18. Owning resources you don't create¶
Your operator sets ownerReferences on resources it watches but doesn't create. When the parent CR is deleted, Kubernetes garbage-collects the owned resources — resources that belong to another team's workload.
Fix: Only set ownerReferences on resources your operator creates. Use labels for association without ownership. Understand the difference between ownerReferences (GC) and labels (querying).
19. Watch all namespaces when you mean one¶
Your operator watches CRs in all namespaces. Someone creates a CR in a namespace your operator wasn't designed for. The operator tries to create resources there, fails because of RBAC, and fills logs with errors.
Fix: Scope the operator to specific namespaces using cache.Options{Namespaces: []string{"production"}}. Or handle multi-namespace gracefully with per-namespace config.
20. No readiness/liveness probes on the operator¶
Your operator pod has no probes. It panics and enters a death loop. Kubernetes thinks it's healthy because there's no probe to say otherwise. CRs stop being reconciled but no alert fires because the pod is "Running."
Fix: Add health and readiness probes to the operator. Kubebuilder scaffolds these by default (/healthz, /readyz). Monitor reconciliation lag, not just pod status.