# Policy Engines - Street-Level Ops

## Quick Diagnosis
```bash
# List all Kyverno policies
kubectl get clusterpolicy
kubectl get policy -A

# Check policy violations
kubectl get policyreport -A
kubectl get clusterpolicyreport

# Detailed violation report
kubectl get policyreport -n grokdevops -o yaml | grep -A5 "result: fail"

# Kyverno webhook health
kubectl get pods -n kyverno
kubectl logs -n kyverno deploy/kyverno-admission-controller --tail=50

# Gatekeeper violations
kubectl get constraint -o yaml | grep -A10 "totalViolations"

# Check what's blocking a resource
kubectl get events --sort-by='.lastTimestamp' | grep -i "admission webhook"
```
Debug clue: If `kubectl apply` hangs for 30s and then fails, the admission webhook is timing out. Check `kubectl get validatingwebhookconfigurations -o yaml | grep timeoutSeconds`; the default is 10s, but a sick controller can stall every API call.

One-liner: Count all policy violations across the cluster:

```bash
kubectl get policyreport -A -o json | jq '[.items[].results[]? | select(.result=="fail")] | length'
```
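To see which policies are failing most, the same `jq` approach extends into a per-policy tally. A sketch with an inline sample standing in for `kubectl get policyreport -A -o json` (the `.policy` field follows the wgpolicyk8s.io PolicyReport result schema):

```shell
# Tally "fail" results per policy, worst offenders first.
# The sample stands in for live output; pipe the real kubectl output the same way.
sample='{"items":[{"results":[
  {"policy":"require-limits","result":"fail"},
  {"policy":"require-limits","result":"fail"},
  {"policy":"disallow-privileged","result":"fail"},
  {"policy":"disallow-privileged","result":"pass"}]}]}'
echo "$sample" | jq -r '
  [.items[].results[]? | select(.result=="fail")]
  | group_by(.policy)
  | sort_by(-length)
  | .[] | "\(length)\t\(.[0].policy)"'
```

The sorted tally tells you which policy to triage (or flip back to Audit) first.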
## Gotcha: Policy Blocking Everything After Deploy
You deployed in Enforce mode without auditing first.
```bash
# Quick fix: switch the Kyverno policy to Audit
kubectl patch clusterpolicy <name> --type=json \
  -p='[{"op":"replace","path":"/spec/validationFailureAction","value":"Audit"}]'

# Or, if it's a Gatekeeper Constraint:
kubectl patch <kind> <name> --type=json \
  -p='[{"op":"replace","path":"/spec/enforcementAction","value":"dryrun"}]'
```
## Gotcha: Kyverno Webhook Down = Cluster Stuck
War story: A cluster froze completely at 3 AM because a Kyverno upgrade OOM-killed the admission controller. With `failurePolicy: Fail`, even the `kubectl apply` for the fix was blocked. The only escape was patching the webhook config directly.

If the Kyverno pods crash while `failurePolicy: Fail` is set, no resources can be created anywhere in the cluster.
```bash
# Check webhook configuration
kubectl get validatingwebhookconfigurations | grep kyverno

# Emergency: temporarily set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Fix the Kyverno pods
kubectl get pods -n kyverno
kubectl describe pod -n kyverno -l app.kubernetes.io/component=admission-controller
kubectl rollout restart deployment kyverno-admission-controller -n kyverno

# Restore failurePolicy once Kyverno is healthy
kubectl patch validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'
```
## Pattern: Policy as Code in Git
Store policies in Git alongside your application manifests:
```
platform/
  policies/
    require-limits.yaml
    disallow-privileged.yaml
    restrict-registries.yaml
    require-labels.yaml
  exceptions/
    monitoring-privileged.yaml
```
Deploy with ArgoCD for GitOps-managed policy enforcement.
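One way to wire this up is an Argo CD `Application` that syncs the `policies/` directory. A sketch; the repo URL, project, and revision below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-policies
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform.git   # placeholder repo
    path: policies
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # remove policies deleted from Git
      selfHeal: true   # revert manual drift, e.g. an ad-hoc Audit patch
```

Note that `selfHeal: true` will also revert an emergency `kubectl patch` to Audit mode, so during an incident pause the sync or fix the mode in Git.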
Default trap: Kyverno's `failurePolicy` defaults to `Fail` for validating webhooks, meaning that if Kyverno is unavailable, ALL resource creation/updates are blocked. Gatekeeper defaults to `Ignore`. Know which your cluster uses: a Kyverno crash at 3 AM with a `Fail` policy means nobody can deploy anything until it is restored.

Gotcha: If you deploy policies via ArgoCD and a policy blocks its own sync, you get a deadlock. Use `argocd.argoproj.io/sync-wave: "-1"` on exception resources so they apply before enforce policies.
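A sketch of such an exception resource, assuming Kyverno's `PolicyException` CRD is enabled in your cluster (the API version varies by Kyverno release, and the policy, rule, and namespace names here are illustrative):

```yaml
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: monitoring-privileged
  namespace: kyverno
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # apply before the enforcing policies
spec:
  exceptions:
    - policyName: disallow-privileged    # illustrative policy name
      ruleNames:
        - privileged-containers          # illustrative rule name
  match:
    any:
      - resources:
          kinds:
            - Pod
          namespaces:
            - monitoring
```

With wave `-1`, Argo CD creates the exception before the enforce-mode policies in wave 0, so the sync cannot deadlock on its own exemptions.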
## Pattern: Namespace-Scoped Policies
Use a namespaced Kyverno `Policy` (not `ClusterPolicy`) for team-specific rules:

```yaml
apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: team-specific-rules
  namespace: team-a
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All pods must have team=team-a label"
        pattern:
          metadata:
            labels:
              team: team-a
```
Remember the Kyverno policy-mode mnemonic: Audit = Alert only, Enforce = Eject the bad resource. Always start with A before moving to E.
## Pattern: Progressive Policy Enforcement
- Week 1: Deploy all policies in Audit mode
- Week 2: Review reports, fix violations in existing workloads
- Week 3: Switch security policies to Enforce (privileged, registries)
- Week 4: Switch operational policies to Enforce (limits, labels)
- Ongoing: New policies always start in Audit
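A week-1 starting point might look like the following: a resource-limits policy deployed in Audit mode, so violations land in PolicyReports without blocking anyone. The rule details are a sketch; `"?*"` is Kyverno's pattern wildcard for "any non-empty value":

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit   # flip to Enforce in week 4
  background: true                 # also scan existing workloads for the reports
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

The week-4 switch is then a one-line Git change to `validationFailureAction`, with the audit reports as evidence that existing workloads already comply.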
Under the hood: Kyverno and Gatekeeper both use Kubernetes admission webhooks. Every resource creation/update passes through the webhook, so policy-engine latency directly impacts `kubectl apply` response time. Monitor `kyverno_admission_review_duration_seconds` or `gatekeeper_validation_request_duration_seconds`.

Interview tip: "How would you enforce container security policies across 50 teams?" Strong answer: admission webhooks (Kyverno or Gatekeeper) in a progressive rollout; audit first, enforce after teams fix violations. Mention that you would store policies in Git and deploy via GitOps for traceability.
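For the latency metric, a Prometheus query along these lines charts p99 admission time (the metric name comes from Kyverno's exported histograms; label names can vary by version, so this is a sketch):

```promql
# p99 Kyverno admission latency over the last 5 minutes
histogram_quantile(0.99,
  sum(rate(kyverno_admission_review_duration_seconds_bucket[5m])) by (le))
```

Alert if this approaches the webhook `timeoutSeconds` (10s by default), since that is the point where every `kubectl apply` starts stalling.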
## Essential Policy Set
```
# Minimum viable policy set for any production cluster:
1. require-resource-limits      (Enforce)
2. disallow-privileged          (Enforce)
3. restrict-image-registries    (Enforce)
4. disallow-latest-tag          (Enforce)
5. require-run-as-non-root      (Enforce)
6. require-labels               (Audit -> Enforce)
7. disallow-host-path           (Enforce)
8. generate-default-deny-netpol (Generate)
```
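Policy 8 is a Generate rule rather than a validation: it stamps a default-deny NetworkPolicy into every new namespace. A sketch following the pattern in Kyverno's documentation (rule and resource names are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-deny-netpol
spec:
  rules:
    - name: default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"  # the new namespace
        synchronize: true   # re-create the NetworkPolicy if someone deletes it
        data:
          spec:
            podSelector: {}        # applies to all pods in the namespace
            policyTypes:
              - Ingress
              - Egress
```

Teams then open traffic deliberately with their own NetworkPolicies instead of starting from allow-all.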
Scale note: At 100+ namespaces, Kyverno's background scan can spike CPU. Set `spec.background: false` on high-frequency policies (like label checks) and rely on admission-time enforcement only.