Portal | Level: L2 | Domain: Security
Open Policy Agent Footguns¶
Mistakes that silently pass all requests, lock down your cluster, or make policies impossible to audit.
1. Writing imperative Rego (if/else chains instead of declarative rules)¶
You come from Python or Go and write nested conditions, else branches, and assignment chains expecting sequential evaluation. Your policy works on the happy path but fails on edge cases you didn't anticipate, because Rego evaluates all rule bodies independently — not as a flow.
Fix: Embrace multiple rule bodies as OR. Every condition inside a body is AND'd. Write one body per case, not one body with branching. Use opa test -v to confirm each branch is exercised by a test.
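A minimal sketch of the declarative style (package and field names are illustrative):

```rego
package authz

default allow = false

# Bodies with the same rule name are OR'd together;
# expressions inside one body are AND'd.
allow {
    input.user.role == "admin"          # case 1: admins can do anything
}

allow {
    input.method == "GET"               # case 2: anyone can read
    input.resource.visibility == "public"  # ...public resources
}
```

Each body stands alone as one case, so there is no hidden evaluation order to reason about.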
2. Not testing policies before deploying (opa test absent from CI)¶
You write Rego and push directly to the bundle server. The policy has a typo in a variable name that makes the allow rule never true. Gatekeeper now blocks all new pod admissions in the namespace — silently, from the developer's perspective.
Fix: Run opa check --strict (syntax + types) and opa test -v (unit tests) in CI before any policy change merges. A test file with at least one allow case and one deny case is the minimum bar for every policy.
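A sketch of that minimum bar, assuming the `authz` policy above (names illustrative):

```rego
package authz_test

import data.authz

# One allow case...
test_admin_allowed {
    authz.allow with input as {"user": {"role": "admin"}}
}

# ...and one deny case, so a broken rule can't pass both.
test_guest_denied {
    not authz.allow with input as {"user": {"role": "guest"}}
}
```

Run it with opa test -v . in CI; a never-true allow rule fails the first test immediately.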
3. Gatekeeper webhook failurePolicy: Fail with no HA¶
You install Gatekeeper with a single replica and failurePolicy: Fail (the security-hardened setting). The Gatekeeper pod crashes during a node drain. The Kubernetes API server cannot reach the webhook. Every admission request fails. Your deployment pipeline is dead until someone manually deletes the webhook configuration.
Fix: Run Gatekeeper with at least two replicas and a PodDisruptionBudget. Apply namespaceSelector to exclude kube-system, gatekeeper-system, and any namespace that must stay available for cluster recovery. In non-security-critical clusters, use failurePolicy: Ignore and compensate with aggressive audit monitoring.
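A sketch of the exclusion mechanism (Gatekeeper manages this webhook itself; label and resource names here follow Gatekeeper's conventions but treat the exact keys as an assumption to verify against your installed version):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
  - name: validation.gatekeeper.sh
    failurePolicy: Ignore          # or Fail, with >=2 replicas and a PDB
    namespaceSelector:
      matchExpressions:
        - key: admission.gatekeeper.sh/ignore
          operator: DoesNotExist   # skip any namespace carrying this label
```

Then label kube-system, gatekeeper-system, and other recovery-critical namespaces with admission.gatekeeper.sh/ignore so the webhook never evaluates them.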
War story: Multiple production clusters have experienced "admission webhook deadlock": Gatekeeper goes down and then blocks its own re-deployment, because the webhook rejects the Gatekeeper pods themselves. The fix is the namespaceSelector exclusion — Gatekeeper's own namespace must be excluded, or the webhook becomes a self-referencing failure loop.
4. Bundle download failure causes stale policies silently¶
OPA can't reach the bundle server (S3 outage, network policy change). It silently continues serving the last-loaded bundle. No errors are surfaced to calling services. Your "latest" policy is actually three weeks old and missing a critical security rule you shipped last week.
Fix: Set /health?bundles=true as your OPA liveness probe — it returns 500 until all configured bundles have been activated successfully. Alert when the bundle plugin's last-success timestamp (exposed through OPA's Status API and Prometheus metrics) goes stale. Set a reasonable polling max_delay_seconds in the bundle config and treat bundle lag as an incident.
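A sketch of the probe on an OPA container (port 8181 is OPA's default API port; timings are illustrative):

```yaml
livenessProbe:
  httpGet:
    # Fails until every configured bundle has activated successfully.
    path: /health?bundles=true
    port: 8181
  initialDelaySeconds: 10
  periodSeconds: 15
```

Kubernetes restarts the pod if the probe fails, which surfaces a bundle-loading problem instead of letting OPA serve stale policy forever.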
5. Not using partial rules — one giant rule instead¶
You write a single allow rule with 15 conditions chained together. When it fails, you have no idea which condition was responsible. You also can't reuse any logic across other policies. The rule is untestable in isolation.
Fix: Break complex logic into partial rules (for sets/collections of results) and helper rules (named boolean expressions). Name them semantically: user_is_admin, request_is_read_only, image_from_approved_registry. Each can be tested independently and reused across packages.
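A sketch using the names from above (input fields and registry are illustrative):

```rego
package authz

# Each helper is a named boolean expression, testable on its own.
user_is_admin {
    input.user.role == "admin"
}

request_is_read_only {
    input.method == "GET"
}

image_from_approved_registry {
    startswith(input.image, "registry.corp.example/")
}

allow {
    user_is_admin
}

allow {
    request_is_read_only
    image_from_approved_registry
}
```

A failed decision now points at a named helper rather than one opaque 15-condition body.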
6. Default allow without explicit deny (security gap)¶
You write:
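The pattern looks something like this (endpoints illustrative — a sketch of the anti-pattern, not anyone's real policy):

```rego
package authz

default allow = true        # everything passes unless explicitly denied

deny {
    input.method == "POST"
    input.path == "/admin/users"
}
# DELETE /admin was never enumerated, so it sails through.
```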
This means everything is allowed by default, and you deny only what you explicitly enumerate. You forgot the DELETE /admin endpoint. Guests can delete production data.
Fix: Default deny. Always start with default allow = false. Then enumerate the conditions under which allow is true. Allowlisting is safer than denylisting — if you miss a case, access is blocked rather than opened.
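The same policy inverted to default deny (conditions illustrative):

```rego
package authz

default allow = false       # deny unless a rule proves otherwise

allow {
    input.user.role == "admin"
}

allow {
    input.method == "GET"
    input.path == "/public"
}
```

Now a forgotten case fails closed: the unlisted DELETE /admin request is rejected instead of silently permitted.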
7. Ignoring decision logs (no audit trail)¶
You deploy OPA but never configure decision logging. Six months later, there's a security incident. The question is: did the policy allow or deny the request that caused the breach? You have no answer. The logs were never collected.
Fix: Configure decision logging from day one, even in dev. Ship logs to your SIEM or object storage. Log input, result, query, and timestamp at minimum. Redact sensitive fields with a masking policy (a mask rule in package system.log) rather than disabling logging entirely.
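A sketch of an OPA configuration that ships decision logs to a remote sink (service name, URL, and timings are illustrative):

```yaml
services:
  logsink:
    url: https://logs.example.com   # your SIEM / collector endpoint
decision_logs:
  service: logsink
  reporting:
    min_delay_seconds: 5            # batch upload window
    max_delay_seconds: 10
```

Pair this with a mask rule in package system.log to strip fields such as credentials from input before the log leaves the process.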
8. ConstraintTemplate Rego errors silently passing all requests¶
You have a syntax error or a runtime panic in your ConstraintTemplate Rego. Gatekeeper marks the template as errored but — depending on version and configuration — may pass all requests that would have been evaluated by the broken template. You think your policy is enforcing; it is not.
Fix: Check ConstraintTemplate status after every deploy:
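A sketch of the check (template name is illustrative; the status layout assumes Gatekeeper's per-pod status reporting):

```shell
# Non-empty output means the template's Rego failed to compile
# on that Gatekeeper pod — constraints from it are NOT enforcing.
kubectl get constrainttemplate k8srequiredlabels \
  -o jsonpath='{.status.byPod[*].errors}'
```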
Run gator test in CI against your ConstraintTemplates before pushing. Alert if any ConstraintTemplate enters an error state.
9. Not scoping Gatekeeper constraints — applying to kube-system¶
Your new constraint requires all pods to have a team label. You forget to exclude kube-system. The next time kube-dns or the metrics server pod is evicted and re-scheduled, Gatekeeper blocks it. Your cluster's DNS stops resolving.
Fix: Always add a namespaceSelector exclusion for kube-system and gatekeeper-system in every Constraint. Use Gatekeeper's Config resource to define a global namespace exclusion list, then apply it consistently.
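A sketch of the global exclusion list. The Config resource must be named config and live in Gatekeeper's namespace; the namespace list is illustrative:

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces: ["kube-system", "gatekeeper-system"]
      processes: ["*"]    # exempt from webhook, audit, and sync
```

This keeps cluster-critical workloads schedulable even when an individual Constraint forgets its own exclusion.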
10. Rego performance: nested loops over large data sets¶
You write a policy that iterates over a list of 10,000 approved users for every admission request:
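A sketch of the slow pattern, assuming data.users is a list of objects (shape illustrative):

```rego
package authz

# data.users: [{"name": "alice"}, {"name": "bob"}, ...]
# OPA must scan all 10,000 entries on every decision.
allow {
    some i
    data.users[i].name == input.user
}
```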
At scale, this becomes O(n) per request. With 100 admission requests per second, you're doing millions of comparisons. OPA latency climbs past the webhook timeout and requests start failing.
Fix: Use OPA's set membership check instead of iteration. Pre-index data as a set keyed by the lookup field:
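The fast equivalent, assuming the data is re-shaped into an object keyed by username (name and shape illustrative):

```rego
package authz

# data.approved_users: {"alice": true, "bob": true, ...}
# One hash lookup per decision instead of a 10,000-element scan.
allow {
    data.approved_users[input.user]
}
```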
Or load the data as a set and test membership directly. Profile with opa bench before deploying policies that touch large data sets.
Remember: OPA evaluates Rego by unification, not iteration. When you write data.users[i].name == input.user, OPA scans every element. When you write data.users[input.user], OPA does a single hash lookup. The difference is O(n) vs O(1) — at 10,000 entries and 100 requests/second, that is the difference between sub-millisecond and 50ms+ latency per decision.