Data Governance for Agents

C5
Operation · Governance & Compliance

Data governance for agents — an agent is a data-flow machine first.

An agent's most under-governed property is that it moves data: it pulls records from stores, embeds them, passes them to a third-party model, writes results back, and routes them to tools, often crossing trust and jurisdictional boundaries at every step, invisibly. Classical data governance assumes data moves through paths a human designed; an agent invents the path at runtime. This essay covers what changes for agents specifically: lineage through the loop, consent and purpose limitation, PII handling, training and eval data governance, and cross-border flow.

STEP 1

Lineage through the loop, not just into the model.

Data lineage normally tracks data from a source to a known destination. An agent breaks that: a record retrieved at step 3 can land in the model's context at step 4, be summarized at step 6, be written to a tool at step 9, and surface in the final answer, all along a path no schema declared. Agent data governance needs lineage that follows data through the loop: for any output or side effect, you must be able to say which sources contributed. This is the C1 provenance chain applied to a data question. Without per-step source tracking, you cannot answer "did customer A's data reach output B", which is exactly the question a breach or a data-subject request forces.
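A minimal sketch of per-step source tracking, assuming each loop step records the store records it read and the earlier steps it consumed. The `Step` class, the `reached` helper, and the source ids are hypothetical names for illustration, not part of any framework. The tracking is deliberately conservative: a step inherits every source of every step it consumed, even if only part of that data was actually used.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of the agent loop, with the inputs that fed it."""
    name: str
    direct_sources: set[str] = field(default_factory=set)  # records read from stores
    parents: list[Step] = field(default_factory=list)       # earlier steps consumed

    def lineage(self) -> set[str]:
        """Every source id that contributed, directly or via earlier steps."""
        seen = set(self.direct_sources)
        for parent in self.parents:
            seen |= parent.lineage()
        return seen


def reached(source_id: str, step: Step) -> bool:
    """Answer: did this source contribute to this output or side effect?"""
    return source_id in step.lineage()


# Hypothetical run: retrieve -> summarize -> (tool write, final answer)
retrieve = Step("retrieve", direct_sources={"crm:customer_a", "kb:article_17"})
summarize = Step("summarize", parents=[retrieve])
tool_write = Step("write_ticket", parents=[summarize])
answer = Step("final_answer", parents=[summarize])

print(reached("crm:customer_a", answer))
print(reached("crm:customer_b", answer))
```

Over-approximating is the safer failure mode here: a false "yes" costs an unnecessary review, while a false "no" hides a real flow.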

STEP 2

Consent and purpose limitation survive into the loop.

Data collected for one purpose does not become free for any purpose once an agent can reach it. Purpose limitation and the consent basis attach to the data and must be enforced at retrieval and tool-call time, not assumed at collection. The practical mechanism is to carry the purpose/consent basis as metadata and let the policy engine (C2) gate use against it: support-consented data must not flow into a marketing action, regardless of how the agent got there. The agent's ability to creatively recombine data is precisely why the consent check has to live at the point of use, not the point of intake.

# purpose binds the data; checked at point of use, not intake
rec = store.get(record_id)     # e.g. rec.allowed_purposes == {"support"}
if action.purpose not in rec.allowed_purposes:
    # deny before the data reaches the action
    raise PurposeViolation(rec.id, action.purpose)
# support-consented data cannot flow into a marketing action

Tag data with purpose and consent basis at ingestion so the runtime check is a lookup, not a guess. Untagged data should default to the most restrictive purpose, not the most permissive.
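The tagging rule can be sketched as follows. This is an illustration under stated assumptions, not a reference design: `Record`, `ingest`, `check_use`, and the empty-set default are hypothetical, and a real deployment would hold the purposes in whatever metadata layer the policy engine already reads.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: the most restrictive default is "no purpose allowed".
MOST_RESTRICTIVE: frozenset[str] = frozenset()


class PurposeViolation(Exception):
    def __init__(self, record_id: str, purpose: str):
        super().__init__(f"record {record_id} not allowed for purpose {purpose!r}")
        self.record_id = record_id
        self.purpose = purpose


@dataclass
class Record:
    id: str
    payload: dict
    allowed_purposes: frozenset[str] = MOST_RESTRICTIVE


def ingest(record_id: str, payload: dict, purposes=None) -> Record:
    """Tag at ingestion; missing or empty tags fall back to the restrictive default."""
    allowed = frozenset(purposes) if purposes else MOST_RESTRICTIVE
    return Record(record_id, payload, allowed)


def check_use(rec: Record, action_purpose: str) -> Record:
    """Point-of-use gate: a set lookup, run before the data reaches the action."""
    if action_purpose not in rec.allowed_purposes:
        raise PurposeViolation(rec.id, action_purpose)
    return rec


tagged = ingest("r1", {"email": "a@example.com"}, purposes={"support"})
untagged = ingest("r2", {"email": "b@example.com"})

check_use(tagged, "support")        # allowed
try:
    check_use(untagged, "support")  # untagged data is denied, not guessed at
except PurposeViolation as err:
    print(err)
```

With the default set to "nothing allowed", forgetting to tag a dataset produces a visible denial rather than a silent over-permission.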

STEP 3

PII handling: minimize before the boundary, not after.

The riskiest moment is the one that is easiest to ignore: PII crossing into the model's context, especially a third-party model outside your trust boundary. Govern it with data minimization at the boundary — only the fields the task needs reach the model; redact or tokenize identifiers the reasoning does not require; prefer references over raw values where the agent only needs to act, not read. The default of "send the whole record because it's convenient" is how PII ends up in a vendor's logs, an embedding index, or a prompt-injection exfiltration sink. Minimization is cheaper and more defensible than every control you would otherwise need downstream.
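One way to sketch boundary minimization is a per-task field allowlist plus keyed tokenization for identifiers the agent must pass along but need not read. The task name, field names, and `TASK_FIELDS` / `TOKENIZED_FIELDS` tables below are assumptions made up for the example; only `hmac` and `hashlib` from the Python standard library are real APIs.

```python
import hashlib
import hmac

# Hypothetical allowlist: fields each task may send to the model in the clear.
TASK_FIELDS = {"refund_request": {"order_id", "order_total", "issue"}}
# Hypothetical identifiers that are replaced by a stable token instead of dropped.
TOKENIZED_FIELDS = {"customer_id"}


def tokenize(value: str, key: bytes) -> str:
    """Keyed, deterministic token: same input and key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def minimize(record: dict, task: str, key: bytes) -> dict:
    """Build the only view of the record that crosses the model boundary."""
    allowed = TASK_FIELDS.get(task, set())  # unknown task: no fields in the clear
    out = {}
    for name, value in record.items():
        if name in allowed:
            out[name] = value
        elif name in TOKENIZED_FIELDS:
            out[name] = tokenize(str(value), key)
        # any other field is dropped before the boundary
    return out


record = {
    "customer_id": "C-1001",
    "email": "a@example.com",
    "order_id": "O-9",
    "order_total": 42.5,
    "issue": "item arrived damaged",
}
print(minimize(record, "refund_request", key=b"demo-key-not-for-production"))
```

A keyed token is pseudonymization, not anonymization: whoever holds the key can recompute and match tokens, so the key belongs inside the trust boundary with the raw values.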

Embeddings are not anonymization. A vector derived from personal data is still personal data — a vector store inherits the same consent, retention, and erasure obligations as the source.
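To make the erasure obligation concrete, here is a hedged in-memory sketch in which every vector carries the id of the record it was derived from and the person it concerns, so an erasure request can reach the index as well as the source. `VectorEntry`, `VectorIndex`, and `erase_subject` are hypothetical; a real vector database would expose its own metadata-filtered delete.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VectorEntry:
    vector: list[float]
    source_id: str    # record the vector was derived from
    subject_id: str   # person the source record is about


@dataclass
class VectorIndex:
    entries: dict[str, VectorEntry] = field(default_factory=dict)

    def add(self, entry_id: str, entry: VectorEntry) -> None:
        self.entries[entry_id] = entry

    def erase_subject(self, subject_id: str) -> int:
        """Delete every vector derived from this person's data; return the count."""
        doomed = [k for k, e in self.entries.items() if e.subject_id == subject_id]
        for k in doomed:
            del self.entries[k]
        return len(doomed)


index = VectorIndex()
index.add("v1", VectorEntry([0.1, 0.2], source_id="crm:1", subject_id="alice"))
index.add("v2", VectorEntry([0.3, 0.4], source_id="crm:2", subject_id="bob"))
index.add("v3", VectorEntry([0.5, 0.6], source_id="ticket:9", subject_id="alice"))

print(index.erase_subject("alice"), sorted(index.entries))
```

The point of the sketch is the metadata, not the storage: without `subject_id` on each vector, the erasure request has nothing to filter on.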

STEP 4

Training and eval data is governed data too.

If you fine-tune or build eval sets from real interactions, that pipeline is a data-governance surface, not an internal convenience. Production traces carry PII and customer content; copying them into a training or eval corpus is a new processing purpose that needs its own basis, minimization, and retention. Two failures dominate: consent leakage — data collected for service delivery silently reused for model training without a basis — and memorization risk — a model trained on un-minimized records can regurgitate them. Govern the trace-to-training flywheel (familiar from the Evaluation material) with the same controls as any other personal-data processing, and record its provenance as deliberately as the model's.
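A rough sketch of gating the trace-to-training step: a trace enters the corpus only if it carries a training basis, it is minimized on the way in, and the source trace id is kept as provenance. The `Trace` shape, the `"model_training"` purpose label, and the email regex are assumptions for illustration; a single regex is a stand-in for real minimization, not a sufficient PII scrubber.

```python
import re
from dataclasses import dataclass

# Placeholder redaction only: matches simple email addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Trace:
    trace_id: str
    text: str
    allowed_purposes: frozenset


def build_training_corpus(traces):
    """Return (corpus rows, ids of traces excluded for lacking a training basis)."""
    corpus, excluded = [], []
    for trace in traces:
        if "model_training" not in trace.allowed_purposes:
            excluded.append(trace.trace_id)  # no basis: never copied
            continue
        corpus.append({
            "source_trace": trace.trace_id,              # provenance of the row
            "text": EMAIL.sub("[EMAIL]", trace.text),    # minimized before storage
        })
    return corpus, excluded


traces = [
    Trace("t1", "Contact me at ana@example.com please",
          frozenset({"support", "model_training"})),
    Trace("t2", "My order is late", frozenset({"support"})),
]
corpus, excluded = build_training_corpus(traces)
print(corpus)
print(excluded)
```

Keeping the list of excluded trace ids is a design choice: it documents that the filter ran and what it dropped, which is the evidence an audit would ask for.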

STEP 5

Cross-border flow is often invisible and always governed.

Calling a hosted model can move personal data across a jurisdictional boundary in a single function call, with no UI, no log line, and no human aware it happened. Data-residency and cross-border-transfer rules do not care that the transfer was implicit in an API call. Governance here means knowing, per data class, where it is allowed to be processed, and constraining model and tool routing accordingly — region-pinned endpoints, in-region inference, or boundary minimization so what crosses is no longer personal data. An agent that picks a tool or model endpoint at runtime can route regulated data offshore unless the routing layer is itself governed.
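A governed routing layer can be sketched as a filter applied before the agent is allowed to pick an endpoint. Everything named here is hypothetical: the data classes, the `ALLOWED_REGIONS` policy table, and the endpoint names are invented for the example, and the real residency rules for a given data class are a legal determination, not something this code decides.

```python
from dataclasses import dataclass

# Hypothetical policy: regions where each data class may be processed.
ALLOWED_REGIONS = {
    "public": {"eu", "us"},
    "personal_eu": {"eu"},
}


@dataclass(frozen=True)
class Endpoint:
    name: str
    region: str


class NoCompliantEndpoint(Exception):
    pass


def compliant_endpoints(data_classes, endpoints):
    """Keep only endpoints whose region is allowed for every data class present."""
    permitted = {e.region for e in endpoints}
    for data_class in data_classes:
        # An unknown data class maps to no regions, so it blocks routing.
        permitted &= ALLOWED_REGIONS.get(data_class, set())
    return [e for e in endpoints if e.region in permitted]


def route(data_classes, endpoints):
    """Pick the first compliant endpoint, or fail closed."""
    candidates = compliant_endpoints(data_classes, endpoints)
    if not candidates:
        raise NoCompliantEndpoint(sorted(data_classes))
    return candidates[0]


ENDPOINTS = [Endpoint("model-us", "us"), Endpoint("model-eu", "eu")]

print(route({"public"}, ENDPOINTS).name)
print(route({"public", "personal_eu"}, ENDPOINTS).name)
```

The agent still chooses at runtime, but only among endpoints the policy has already cleared for the data in the request; unclassified data fails closed instead of defaulting to the nearest or cheapest endpoint.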

STEP 6

The honest tradeoff.

Real data governance constrains the agent: minimization and purpose checks remove context the model might have used, region pinning narrows model and tool choice, and consent metadata is engineering most teams skip until an incident or a data-subject request forces it — expensively, retroactively, across an undocumented data flow. Govern the data flows that carry personal or regulated data with point-of-use enforcement and boundary minimization; the convenience of "just pass the whole record to the model" is a liability you are deferring, not avoiding.