Policy Engines - Street-Level Ops

Quick Diagnosis

# List all Kyverno policies
kubectl get clusterpolicy
kubectl get policy -A

# Check policy violations
kubectl get policyreport -A
kubectl get clusterpolicyreport

# Detailed violation report
kubectl get policyreport -n grokdevops -o yaml | grep -A5 "result: fail"

# Kyverno webhook health
kubectl get pods -n kyverno
kubectl logs -n kyverno deploy/kyverno-admission-controller --tail=50

# Gatekeeper violations
kubectl get constraint -o yaml | grep -A10 "totalViolations"

# Check what's blocking a resource
kubectl get events --sort-by='.lastTimestamp' | grep -i "admission webhook"

Debug clue: If kubectl apply hangs for 30s then fails, the admission webhook is timing out. Check kubectl get validatingwebhookconfigurations -o yaml | grep timeoutSeconds — the default is 10s but a sick controller can stall every API call.

One-liner: Count all policy violations across the cluster: kubectl get policyreport -A -o json | jq '[.items[].results[]? | select(.result=="fail")] | length'
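You can sanity-check that jq filter offline against a mocked report list. The sample JSON below is an assumption, shaped like `kubectl get policyreport -A -o json` output:

```shell
# Mock PolicyReport list: two reports with results, one empty, two failures total
cat <<'EOF' > /tmp/reports.json
{"items":[
  {"results":[{"result":"fail"},{"result":"pass"}]},
  {"results":[{"result":"fail"}]},
  {}
]}
EOF

# Same filter as the one-liner above; the trailing `?` tolerates reports with no results
jq '[.items[].results[]? | select(.result=="fail")] | length' /tmp/reports.json   # prints 2
```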

Gotcha: Policy Blocking Everything After Deploy

You deployed in Enforce mode without auditing first.

# Quick fix: switch to Audit
kubectl patch clusterpolicy <name> --type=json \
  -p='[{"op":"replace","path":"/spec/validationFailureAction","value":"Audit"}]'

# Or if it's a Gatekeeper Constraint:
kubectl patch <kind> <name> --type=json \
  -p='[{"op":"replace","path":"/spec/enforcementAction","value":"dryrun"}]'

Gotcha: Kyverno Webhook Down = Cluster Stuck

War story: A cluster froze completely at 3 AM because a Kyverno upgrade OOM-killed the admission controller. With failurePolicy: Fail, even the kubectl apply that would have fixed it was blocked. The only escape was patching the webhook configuration directly.

If Kyverno pods crash while failurePolicy: Fail is set, no resources can be created or updated.

# Check webhook configuration
kubectl get validatingwebhookconfigurations | grep kyverno

# Emergency: temporarily set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Fix Kyverno pods
kubectl get pods -n kyverno
kubectl describe pod -n kyverno -l app.kubernetes.io/component=admission-controller
kubectl rollout restart deployment kyverno-admission-controller -n kyverno

# Restore failurePolicy after Kyverno is healthy
kubectl patch validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'

Pattern: Policy as Code in Git

Store policies in Git alongside your application manifests:

platform/
  policies/
    require-limits.yaml
    disallow-privileged.yaml
    restrict-registries.yaml
    require-labels.yaml
  exceptions/
    monitoring-privileged.yaml

Deploy with ArgoCD for GitOps-managed policy enforcement.
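A minimal ArgoCD Application covering that tree might look like this (repo URL and names are hypothetical; adapt paths to your layout):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-policies
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform.git   # hypothetical repo
    targetRevision: main
    path: platform
    directory:
      recurse: true          # picks up both policies/ and exceptions/
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true            # deleted policy files are removed from the cluster
      selfHeal: true
```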

Default trap: Kyverno's failurePolicy defaults to Fail for validating webhooks, meaning if Kyverno is unavailable, ALL resource creation/updates are blocked. Gatekeeper defaults to Ignore. Know which your cluster uses — a Kyverno crash at 3 AM with Fail policy means nobody can deploy anything until it is restored.

Gotcha: If you deploy policies via ArgoCD and a policy blocks its own sync, you get a deadlock. Use argocd.argoproj.io/sync-wave: "-1" on exception resources so they are applied before the Enforce-mode policies.
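A sketch of such an exception resource, assuming Kyverno's PolicyException feature is enabled (API version is kyverno.io/v2 on recent releases, v2beta1 on older ones; the rule name is hypothetical):

```yaml
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: monitoring-privileged
  namespace: monitoring
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # synced before the wave-0 policies
spec:
  exceptions:
    - policyName: disallow-privileged
      ruleNames:
        - privileged-containers          # hypothetical rule name inside the policy
  match:
    any:
      - resources:
          kinds:
            - Pod
          namespaces:
            - monitoring
```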

Pattern: Namespace-Scoped Policies

Use Kyverno Policy (not ClusterPolicy) for team-specific rules:

apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: team-specific-rules
  namespace: team-a
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All pods must have team=team-a label"
        pattern:
          metadata:
            labels:
              team: team-a

Remember: Kyverno policy mode mnemonic: Audit = Alert only, Enforce = Eject the bad resource. Always start with A before moving to E.

Pattern: Progressive Policy Enforcement

Week 1: Deploy all policies in Audit mode
Week 2: Review reports, fix violations in existing workloads
Week 3: Switch security policies to Enforce (privileged, registries)
Week 4: Switch operational policies to Enforce (limits, labels)
Ongoing: New policies always start in Audit
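Kyverno supports this rollout per namespace via validationFailureActionOverrides: the cluster-wide default stays Audit while cleaned-up namespaces graduate to Enforce. A sketch (policy and namespace names are hypothetical):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit     # cluster-wide default: report only
  validationFailureActionOverrides:
    - action: Enforce                # enforce only where teams fixed violations
      namespaces:
        - team-a
        - team-b
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
```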

Under the hood: Kyverno and Gatekeeper both use Kubernetes admission webhooks. Every resource creation/update passes through the webhook, so policy engine latency directly impacts kubectl apply response time. Monitor kyverno_admission_review_duration_seconds or gatekeeper_validation_request_duration_seconds.
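A sketch of a Prometheus alerting rule on that latency, assuming the metric is exposed as a histogram (so the `_bucket` series exists); the 1s threshold and labels are assumptions to tune:

```yaml
groups:
  - name: policy-engine-latency
    rules:
      - alert: KyvernoAdmissionSlow
        expr: |
          histogram_quantile(0.99,
            sum(rate(kyverno_admission_review_duration_seconds_bucket[5m])) by (le))
          > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kyverno p99 admission latency above 1s; every kubectl apply feels it"
```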

Interview tip: "How would you enforce container security policies across 50 teams?" Strong answer: admission webhooks (Kyverno or Gatekeeper) in a progressive rollout — audit first, enforce after teams fix violations. Mention that you would store policies in Git and deploy via GitOps for traceability.

Essential Policy Set

# Minimum viable policy set for any production cluster:
1. require-resource-limits      (Enforce)
2. disallow-privileged          (Enforce)
3. restrict-image-registries    (Enforce)
4. disallow-latest-tag          (Enforce)
5. require-run-as-non-root      (Enforce)
6. require-labels               (Audit -> Enforce)
7. disallow-host-path           (Enforce)
8. generate-default-deny-netpol (Generate)
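Item 8 uses Kyverno's generate rule type rather than validate. A sketch that stamps a default-deny NetworkPolicy into every new namespace (you would typically add an exclude for kube-system and friends):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-deny-netpol
spec:
  rules:
    - name: default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"   # the newly created namespace
        synchronize: true    # re-create the NetworkPolicy if someone deletes it
        data:
          spec:
            podSelector: {}  # selects all pods in the namespace
            policyTypes:
              - Ingress
              - Egress
```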

Scale note: At 100+ namespaces, Kyverno's background scan can spike CPU. Set spec.background: false on high-frequency policies (like label checks) and rely on admission-time enforcement only.
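The flag in question is a one-line change on the policy spec, e.g. (policy name hypothetical):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  background: false                # admission-time only; no periodic report scan
  validationFailureAction: Enforce
```

Trade-off: with background: false, existing resources that already violate the rule never show up in policy reports; only new creates and updates are checked.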