Policy Engine Footguns

Mistakes that block all deployments, create security gaps, or make policies impossible to maintain.


1. Cluster-wide policy that blocks everything

You deploy a Kyverno ClusterPolicy that requires all pods to have resource limits. You forget about system pods in kube-system that don't have limits set. CoreDNS can't restart. DNS fails. Everything breaks.

Fix: Always exclude system namespaces (kube-system, cert-manager, istio-system) from the policy's match scope. Test policies in Audit mode before switching to Enforce.
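A minimal Kyverno sketch of this pattern (policy and rule names are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits   # illustrative name
spec:
  validationFailureAction: Audit  # start in Audit; flip to Enforce after review
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      exclude:
        any:
          - resources:
              namespaces: [kube-system, cert-manager, istio-system]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
```

The exclude block is what keeps CoreDNS and other system pods out of the blast radius.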

War story: A widely-reported 2023 incident involved a team deploying a Kyverno policy requiring labels on all pods cluster-wide. The policy blocked CoreDNS from restarting after a node drain, causing a cascading DNS failure across the entire cluster.


2. Deploying in Enforce mode without testing

You write a policy and set it to Enforce immediately. It has a bug — the match expression is too broad. Half the cluster's pods can't be created or updated. Deploys fail across all teams.

Fix: Always deploy policies in Audit mode first. Review violations for a week. Fix false positives. Then switch to Enforce.
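In Kyverno the rollout can be staged through a single field. A sketch — validationFailureActionOverrides is an assumption to verify against your installed Kyverno version:

```yaml
spec:
  validationFailureAction: Audit   # cluster-wide default: log violations only
  # Assumption: your Kyverno version supports per-namespace overrides,
  # letting you enforce in a canary namespace while auditing everywhere else.
  validationFailureActionOverrides:
    - action: Enforce
      namespaces: [policy-canary]   # illustrative canary namespace
```

Review the Audit-mode violation reports, then widen the Enforce override once false positives are gone.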


3. Regex that matches too much (or too little)

Your image allowlist policy: registry.example.com/*. It's supposed to only allow images from your registry. But it also matches registry.example.com.evil.com/backdoor. Or it doesn't match registry.example.com:5000/app.

Fix: Use anchored regexes with escaped dots and an optional port: ^registry\.example\.com(:\d+)?/. Test your regex against both valid and invalid inputs, including port numbers and subpaths.

Gotcha: OPA Rego's re_match (and its current name, regex.match) does partial matching by default — regex.match("example.com", "example.com.evil.com") returns true, and the unescaped dots match any character too. Always anchor with ^ and $ and escape literal dots in Rego regex patterns.
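Sketched as a Gatekeeper ConstraintTemplate (names are illustrative), the anchoring and escaping look like this:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos   # illustrative
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          # Anchored, dots escaped, optional port — rejects
          # registry.example.com.evil.com/backdoor but still
          # allows registry.example.com:5000/app
          not regex.match(`^registry\.example\.com(:\d+)?/`, container.image)
          msg := sprintf("image %v is not from the trusted registry", [container.image])
        }
```

The backtick raw string avoids double-escaping the backslashes inside YAML.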


4. Policy that prevents emergency rollbacks

Your policy requires all images to be signed. During an incident, you need to rollback to an old unsigned image. The policy blocks the rollback. You're stuck — the broken version is running and you can't deploy the working version.

Fix: Have a break-glass procedure. Use a policy exception annotation that requires approval. Or maintain an "emergency bypass" namespace. Document and practice the procedure.
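Kyverno ships a PolicyException CRD for exactly this; a hedged sketch (the API version varies by Kyverno release — verify against yours — and all names below are illustrative):

```yaml
apiVersion: kyverno.io/v2   # v2beta1 on older Kyverno releases
kind: PolicyException
metadata:
  name: incident-rollback-exception   # illustrative
  namespace: payments                 # illustrative
  annotations:
    expires-on: "2024-06-01"          # illustrative bookkeeping convention
    approved-by: "security-oncall"    # illustrative
spec:
  exceptions:
    - policyName: require-signed-images   # illustrative policy name
      ruleNames: [check-signature]
  match:
    any:
      - resources:
          kinds: [Pod]
          namespaces: [payments]
```

Scoping the exception to one namespace and one rule keeps the break-glass narrow, and the expiry annotation gives you something to revoke after the incident.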


5. Mutation policies with unexpected ordering

You have two mutating policies: one adds a sidecar, another sets resource limits. The order they run determines whether the sidecar gets resource limits. If the limit policy runs first, the sidecar (added second) has no limits.

Fix: Understand webhook execution order — mutating webhooks are sorted by name, which is why some projects prefix theirs with digits to run first — and set reinvocationPolicy: IfNeeded so earlier webhooks re-run after later ones mutate the object. Test with kubectl apply --dry-run=server to inspect the final mutated resource.
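A sketch of the relevant fields on the limits webhook (service and webhook names are illustrative):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: resource-defaults          # illustrative
webhooks:
  - name: set-limits.example.com   # illustrative
    reinvocationPolicy: IfNeeded   # re-run this webhook if a later one
                                   # (e.g. a sidecar injector) mutates the pod
    failurePolicy: Ignore
    sideEffects: None
    admissionReviewVersions: [v1]
    clientConfig:
      service:
        name: limit-setter         # illustrative service
        namespace: policy-system   # illustrative
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: [v1]
        operations: [CREATE]
        resources: [pods]
```

With IfNeeded, the limits webhook gets a second pass over the pod after the sidecar is injected, so the sidecar container picks up limits too.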


6. OPA Gatekeeper constraint with no constraint template

You deploy a Constraint but the ConstraintTemplate hasn't synced yet. The constraint silently does nothing. You think you're protected but the policy isn't enforcing. Months later, an audit reveals the gap.

Fix: Always deploy the ConstraintTemplate first, verify it's available, then deploy the Constraint. Monitor each constraint's violation count — if it's always zero, the constraint might not be working.
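The ordering can be made explicit; a sketch assuming a template named k8srequiredlabels (Gatekeeper's status.created field is the readiness signal to check — verify the field name against your Gatekeeper version):

```yaml
# 1. Apply the ConstraintTemplate and wait for Gatekeeper to report it created:
#      kubectl get constrainttemplate k8srequiredlabels \
#        -o jsonpath='{.status.created}'
#    This should print "true" before you proceed.
# 2. Only then apply the Constraint that instantiates it:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label   # illustrative
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: [Namespace]
  parameters:
    labels: ["team"]
```

If the Constraint is applied before the template's CRD exists, the apply simply fails loudly — which is far better than the silent no-op you get when the template syncs late.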


7. Audit mode data overload

You enable audit on all policies. Thousands of existing violations flood in. The audit results overwhelm Gatekeeper's memory. It OOMs. Now you have no policy enforcement at all.

Fix: Enable audit incrementally — one policy at a time. Tune the audit interval and page size (--audit-interval, --audit-chunk-size) and cap stored violations per constraint (--constraint-violations-limit). Use --audit-from-cache to reduce API server load.
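On the Gatekeeper audit Deployment these land as container args; a sketch (flag names per the Gatekeeper docs — verify against your version, values are illustrative):

```yaml
spec:
  containers:
    - name: manager
      args:
        - --operation=audit
        - --audit-interval=120               # seconds between audit runs
        - --audit-chunk-size=500             # page API list calls to bound memory
        - --constraint-violations-limit=100  # cap violations stored per constraint
        - --audit-from-cache=true            # audit the OPA cache, not the API server
```

The violations limit is the one that prevents the OOM described above: it bounds how much violation detail each constraint's status can accumulate.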


8. Policy exception sprawl

Teams can't deploy because of policies. Instead of fixing their manifests, they request exceptions. Six months later, you have 200 exceptions and the policy is effectively meaningless. You're maintaining a complex system that enforces nothing.

Fix: Track exceptions with expiry dates. Require justification and security review. Dashboard the exception count per team. Review and revoke expired exceptions monthly.
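One lightweight tracking scheme — the label and annotation names here are conventions you'd pick, not policy-engine built-ins, and the values are illustrative:

```yaml
metadata:
  name: payments-unsigned-image-exception   # illustrative
  labels:
    exception-owner: payments     # enables per-team counts in dashboards
  annotations:
    expires-on: "2024-09-30"      # checked in the monthly review
    justification: "legacy base image; migration ticket tracked"   # illustrative
    security-review: "approved by appsec"                          # illustrative
```

A label selector query over exception-owner gives you the per-team dashboard; a date comparison against expires-on gives you the monthly revocation list.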


9. Webhook timeout causing cluster instability

Your policy engine's webhook takes 2 seconds to respond. Kubernetes has a 10-second timeout. Under load, webhook latency spikes to 12 seconds. API server requests start timing out. kubectl commands fail. Deploys fail. Even kubectl get pods is slow.

Fix: Set failurePolicy: Ignore on non-critical policies (allows requests if webhook is down). Monitor webhook latency. Size the policy engine pods appropriately. Use timeoutSeconds: 5 on the webhook.
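Both settings live on the webhook configuration; a sketch (names are illustrative):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: label-checks              # illustrative
webhooks:
  - name: validate.example.com    # illustrative
    failurePolicy: Ignore   # fail open: admit requests if the webhook is down
                            # (use Fail only for policies worth an outage)
    timeoutSeconds: 5       # well under the 10s default and 30s maximum
    sideEffects: None
    admissionReviewVersions: [v1]
    clientConfig:
      service:
        name: policy-engine       # illustrative
        namespace: policy-system  # illustrative
        path: /validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: [v1]
        operations: [CREATE, UPDATE]
        resources: [deployments]
```

With timeoutSeconds: 5, a slow policy engine degrades into rejected-or-ignored admission calls instead of a cluster-wide API slowdown.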

Debug clue: If kubectl commands suddenly become slow or timeout, check admission webhook latency first: kubectl get validatingwebhookconfigurations -o yaml | grep timeout. Webhook-induced latency looks like API server slowness but the API server itself is healthy.


10. Forgetting that policies don't apply retroactively

You add a policy requiring all pods to run as non-root. Existing pods that run as root keep running. The policy only blocks new pods. You think you're compliant, but 30% of running pods violate the policy.

Fix: Run audit/dry-run to find existing violations. Fix them manually. Use admission controllers for new resources AND periodic scanning for existing ones.

Remember: Admission controllers only fire on CREATE and UPDATE operations. Running pods, existing Secrets, and deployed ConfigMaps are invisible to admission policies until something triggers a new API call for that resource.
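Gatekeeper's periodic audit fills exactly this gap: it scans existing objects and records violations on each Constraint's status, readable with kubectl get on the constraint. The status shape, sketched with illustrative values:

```yaml
status:
  auditTimestamp: "2024-03-12T09:00:00Z"
  totalViolations: 42
  violations:   # pre-existing resources that would be denied if re-created
    - kind: Pod
      name: legacy-worker         # illustrative
      namespace: batch            # illustrative
      message: "container must set runAsNonRoot"
```

If totalViolations is non-zero while admissions are clean, those are your retroactive fixes — the 30% of running pods the admission controller never saw.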