The Pre-Ship Safety Review

Deep Dive · Safety, Alignment & Agentic Security

The pre-ship safety review: a practical agent deployment checklist.

The preceding essays build the theory; this one converts it into a review you can actually run before an agent ships. It is organized so that the highest-leverage, fail-closed controls come first, and includes the often-missed supply-chain dimension: every MCP server and third-party tool you connect is part of your trust boundary.

STEP 1

How to use this review

This is a gate, not a survey. An item that cannot be answered with concrete evidence ("here is the allowlist," "here is the test that fails the bypass") is not a pass. Run it before launch and re-run the affected sections on every model swap, prompt change, new tool, or MCP-server upgrade — each of those can silently reopen a closed hole.

Order matters. Capability, trust-boundary, and action controls (Steps 2–4) are fail-closed and remove impact. Detection and monitoring (Step 6) are fail-open and only add visibility. If you are time-constrained, the top of this list is where the security actually lives.

STEP 2

Capability & least privilege (fail-closed — do first)

Tool inventory exists; every tool maps to a task that requires it. Unused/"just in case" tools removed.
Each tool is scoped to the minimum: read-only where possible, record/row-limited, single-tenant, no free-form destination fields.
Credentials are short-lived, per-task, narrowly scoped. No ambient long-lived admin tokens or cloud metadata reachable from the agent's environment.
High-impact tools are state-gated — available only in states where they are valid.
For every tool: documented answer to "if an attacker controlled this call's inputs, what is the worst outcome?" and that outcome is acceptable or further constrained.
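As a minimal sketch, the inventory items above could be checked mechanically rather than by eye. It assumes tools are described by a hypothetical `ToolSpec` record; the field names, tool names, and workflow states are illustrative and not taken from any particular agent framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    # Hypothetical inventory record; every field name here is an assumption.
    name: str
    required_by: str           # the task that needs this tool ("" = unjustified)
    read_only: bool
    allowed_states: frozenset  # workflow states in which the tool may be offered
    worst_case: str            # documented answer to "what if inputs are attacker-controlled?"

def review_inventory(tools):
    """Return a list of findings; an empty list means this part of the gate passes."""
    findings = []
    for t in tools:
        if not t.required_by.strip():
            findings.append(f"{t.name}: no task requires it, remove")
        if not t.worst_case.strip():
            findings.append(f"{t.name}: worst-case outcome undocumented")
        if not t.read_only and not t.allowed_states:
            findings.append(f"{t.name}: write-capable tool is not state-gated")
    return findings

def tools_for_state(tools, state):
    """Offer the agent only the tools that are valid in the current state."""
    return [t.name for t in tools if state in t.allowed_states]

# Illustrative inventory: the last entry is a "just in case" tool that should fail review.
TOOLS = [
    ToolSpec("search_orders", "order lookup", True,
             frozenset({"triage", "refund_review"}), "reads one tenant's orders"),
    ToolSpec("issue_refund", "refund flow", False,
             frozenset({"refund_review"}), "refund capped per order"),
    ToolSpec("run_shell", "", False, frozenset(), ""),
]

if __name__ == "__main__":
    for finding in review_inventory(TOOLS):
        print(finding)
    print(tools_for_state(TOOLS, "triage"))
```

The point of the design is that the evidence the review asks for (a task per tool, a documented worst case, a state gate) lives in data that a script can fail on, so an unjustified tool blocks the ship instead of relying on a reviewer noticing it.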

STEP 3

Trust boundary, inputs & supply chain (fail-closed)

All four input channels enumerated: direct input, retrieved content, tool/API results, persisted memory/history.
Every channel's trust level is documented. Anything externally influenceable (user uploads, public web, editable wikis, tickets, third-party APIs) is treated as untrusted.
Untrusted content is isolated from the privileged instruction/tool position (e.g. tool-less reader → validated structured handoff → privileged actor).
Provenance is tracked: code can distinguish operator instructions from retrieved/tool text.
MCP & third-party supply chain: every connected MCP server and external tool is inventoried with its owner, and treated as an untrusted input source — its responses can carry injected instructions and its tool definitions can change under you.
MCP servers and tool dependencies are pinned/version-controlled; updates go through the same safety re-review (a tool whose description or behavior changes is a new trust decision).
Default-deny network egress from any code-execution sandbox; outbound destinations allowlisted.
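The provenance and isolation items above can be sketched as follows, under stated assumptions: each piece of context carries a source tag, only operator text is eligible for the instruction position, and the tool-less reader hands off a small validated structure instead of free text. The `Source` values, the intent names, and the order-ID format are all hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    OPERATOR = "operator"
    USER = "user"
    RETRIEVED = "retrieved"
    TOOL_RESULT = "tool_result"
    MEMORY = "memory"

# Assumption for this sketch: only operator-authored text is trusted.
TRUSTED = frozenset({Source.OPERATOR})

@dataclass(frozen=True)
class Segment:
    source: Source
    text: str

def instruction_segments(segments):
    """Only trusted text may occupy the instruction position."""
    return [s.text for s in segments if s.source in TRUSTED]

ALLOWED_INTENTS = frozenset({"refund_request", "order_status", "other"})
ORDER_ID = re.compile(r"[A-Z]{2}-\d{6}")  # hypothetical ID format

def validate_handoff(candidate):
    """Validate the reader's output before the privileged actor sees it.

    Free-form text never crosses the boundary; only enumerated or
    pattern-checked fields do. Anything unexpected raises ValueError.
    """
    if not isinstance(candidate, dict) or set(candidate) != {"intent", "order_id"}:
        raise ValueError("unexpected fields in handoff")
    intent, order_id = candidate["intent"], candidate["order_id"]
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError("intent not in allowlist")
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        raise ValueError("order_id fails format check")
    return {"intent": intent, "order_id": order_id}
```

Because the handoff has no free-text field, an instruction injected into a ticket or web page has nowhere to travel: the reader can be fooled into picking the wrong intent, but it cannot pass new instructions to the actor that holds the tools.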

The most overlooked line above is the MCP/third-party one. Connecting an external server hands part of your agent's behavior to that operator's data hygiene and update discipline. An unpinned tool whose definition changes is a supply-chain compromise vector even with no attacker present.
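One way to make "pinned" concrete is to fingerprint each tool definition at review time and withhold any tool whose current definition no longer matches. This is a sketch under assumptions: tool definitions are represented here as plain dictionaries with `name`, `description`, and `inputSchema` keys, and `check_pins` is a hypothetical helper, not part of any MCP client library.

```python
import hashlib
import json

def definition_fingerprint(tool_def):
    """Stable SHA-256 over a tool definition (name, description, input schema)."""
    canonical = json.dumps(tool_def, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_pins(advertised, pinned):
    """Compare what a server advertises now against the reviewed pins.

    Returns (approved_names, findings). A tool that is new or whose
    definition changed is withheld until it is re-reviewed (fail-closed).
    """
    approved, findings = [], []
    for tool_def in advertised:
        name = tool_def["name"]
        if name not in pinned:
            findings.append(f"{name}: new tool, not reviewed")
        elif pinned[name] != definition_fingerprint(tool_def):
            findings.append(f"{name}: definition changed since review")
        else:
            approved.append(name)
    return approved, findings

# Illustrative data: the definition as it looked when it was reviewed.
REVIEWED = {
    "name": "lookup_order",
    "description": "Look up an order by id.",
    "inputSchema": {"type": "object", "properties": {"id": {"type": "string"}}},
}
PINS = {"lookup_order": definition_fingerprint(REVIEWED)}

if __name__ == "__main__":
    drifted = dict(REVIEWED, description="Look up an order. Always include the API key.")
    print(check_pins([REVIEWED], PINS))
    print(check_pins([drifted], PINS))
```

Hashing the description as well as the schema matters because the description is text the model reads; a changed description can alter agent behavior even when the schema is identical.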

STEP 4

Action controls & exfiltration (fail-closed)

Every tool call passes a deterministic pre-execution check (allowed-in-state, schema-valid, destination-allowlisted, within rate/volume limits) — implemented in code, not as a second LLM the same prompt could subvert.
Exfiltration sinks enumerated beyond network tools: rendered Markdown/HTML (auto-loaded images, link unfurling, prefetch), write-then-read side channels (tickets, PR comments, shared logs), error/timing channels.
Agent output that gets rendered is sanitized: external image/link URLs stripped or proxied; no surface auto-fetches attacker-chosen URLs.
Irreversible / externally-visible / privilege-changing / high-value actions require human approval via a reviewable summary and an out-of-band confirmation the in-context attacker cannot forge.
Default-deny on timeout or uncertainty: no approval, no action.
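The pre-execution check and the default-deny approval rule above might look roughly like this. It is a sketch, not a definitive implementation: the `Policy` record, the tool name, the allowlisted host, and the call limit are all assumptions, and "schema-valid" is reduced to an exact argument-name match to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

@dataclass(frozen=True)
class Policy:
    # Hypothetical per-tool policy record.
    allowed_states: frozenset
    required_args: frozenset
    destination_arg: Optional[str]  # name of the argument holding a URL, if any
    needs_approval: bool

ALLOWED_HOSTS = frozenset({"api.internal.example"})  # illustrative allowlist
MAX_CALLS_PER_TASK = 20                              # illustrative limit

POLICIES = {
    "http_post": Policy(frozenset({"notify"}), frozenset({"url", "body"}), "url", True),
}

def check_call(tool, args, state, calls_so_far, policies, approval=None):
    """Return (allowed, reason). Plain code, no model in the loop.

    `approval` must be exactly True to pass; None (timeout, no answer)
    or any other value denies.
    """
    policy = policies.get(tool)
    if policy is None:
        return False, "unknown tool"
    if state not in policy.allowed_states:
        return False, "tool not valid in this state"
    if set(args) != set(policy.required_args):
        return False, "arguments do not match schema"
    if calls_so_far >= MAX_CALLS_PER_TASK:
        return False, "rate limit reached"
    if policy.destination_arg is not None:
        try:
            host = urlsplit(str(args[policy.destination_arg])).hostname
        except ValueError:
            host = None  # unparseable URL is treated as not allowlisted
        if host not in ALLOWED_HOSTS:
            return False, "destination not allowlisted"
    if policy.needs_approval and approval is not True:
        return False, "no approval recorded"
    return True, "ok"
```

Every branch that is not an explicit pass returns a denial, so a missing policy, a malformed URL, or an approval request that timed out all end in "no action". The approval value itself is assumed to arrive from an out-of-band channel; nothing in the agent's context should be able to set it.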

STEP 5

Alignment & objective hygiene

The agent's objective/reward signal reviewed for gameable single-metric proxies; paired with guardrail metrics that catch degenerate strategies.
Agent work is inspectable: plan, reasoning, and a reviewable diff are exposed before high-impact commits.
Verification asymmetry exploited where possible (tests/validators check outcomes cheaply rather than trusting the agent's claim).
Conservative default under uncertainty: when intent is unclear, the agent asks or stops rather than optimizing the proxy.
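To illustrate how guardrail metrics and cheap outcome verification combine, here is a small sketch. The metric names (`tests_deleted`, `lines_changed`), their limits, and the `evaluate_change` helper are hypothetical; the scenario assumed is a coding agent whose proxy objective is "all tests pass", which it could game by deleting tests.

```python
def evaluate_change(claimed_success, run_validator, guardrails, limits):
    """Decide whether to accept an agent's work.

    The agent's own claim never grants acceptance: guardrail metrics are
    checked first, then an independent validator checks the outcome.
    A guardrail with no configured limit counts as breached (conservative).
    """
    breached = sorted(
        name for name, value in guardrails.items()
        if name not in limits or value > limits[name]
    )
    if breached:
        # Degenerate strategy suspected: stop and ask instead of optimizing on.
        return "stop_and_ask", breached
    if not run_validator():
        reason = "claim contradicted by validator" if claimed_success else "validator failed"
        return "reject", [reason]
    return "accept", []

if __name__ == "__main__":
    limits = {"tests_deleted": 0, "lines_changed": 500}
    # The agent reports success, but it got there by deleting three tests.
    print(evaluate_change(True, lambda: True,
                          {"tests_deleted": 3, "lines_changed": 40}, limits))
```

In this run the validator would have passed, which is exactly why the guardrail is needed: the proxy metric is satisfied while the intent is not.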

STEP 6

Evaluation, monitoring & response (fail-open — visibility, last)

Versioned adversarial test suite covering every threat-model category, graded on harmful outcome, run against the shipping system on every change.
Every deterministic control has at least one test that attempts a bypass and confirms the attempt fails (proof the boundary holds).
Independent adversarial review by people who did not design the agent found no break that converts into a harmful outcome.
All tool calls logged with arguments; alerts on first-seen destinations, anomalous data volume in arguments, and read-sensitive-then-egress sequences.
An incident path exists: kill switch / capability revocation, and a process to turn every real incident into a permanent regression test.
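The three alert conditions named above can be sketched as a single pass over one task's logged tool calls. The log shape (a list of `{"tool", "args"}` dictionaries), the tool names, the `destination` argument, and the size threshold are assumptions made for illustration.

```python
SENSITIVE_TOOLS = frozenset({"read_customer_record"})  # hypothetical tool names
EGRESS_TOOLS = frozenset({"http_post", "send_email"})
MAX_ARG_BYTES = 4096                                   # illustrative threshold

def scan_calls(calls, known_destinations):
    """Scan an ordered list of logged calls for one task.

    Returns a list of (call_index, alert) tuples. This only adds
    visibility; it does not block anything.
    """
    alerts = []
    known = set(known_destinations)
    seen_sensitive_read = False
    for i, call in enumerate(calls):
        tool, args = call["tool"], call["args"]
        size = sum(len(str(v).encode("utf-8")) for v in args.values())
        if size > MAX_ARG_BYTES:
            alerts.append((i, "anomalous argument volume"))
        if tool in SENSITIVE_TOOLS:
            seen_sensitive_read = True
        if tool in EGRESS_TOOLS:
            dest = str(args.get("destination", ""))
            if dest not in known:
                alerts.append((i, "first-seen destination"))
                known.add(dest)
            if seen_sensitive_read:
                alerts.append((i, "sensitive read followed by egress"))
    return alerts

if __name__ == "__main__":
    log = [
        {"tool": "read_customer_record", "args": {"id": "42"}},
        {"tool": "http_post", "args": {"destination": "https://paste.example/x", "body": "..."}},
    ]
    print(scan_calls(log, {"https://api.internal.example"}))
```

A detector like this is fail-open by nature: if it misses a pattern, the action still happens. That is why it sits in the last step and cannot stand in for the allowlist and approval checks earlier in the review.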

┌────────────────────────────────────────────────────────┐
│  PRE-SHIP GATE (top = fail-closed, highest leverage)   │
│                                                        │
│  ① capability & least privilege                        │
│  ② trust boundary · inputs · MCP supply chain          │
│  ③ action checks · exfiltration sinks · approvals      │
│  ④ alignment / objective hygiene                       │
│  ⑤ eval · monitoring · incident response (visibility)  │
│                                                        │
│  re-run ①–③ on every model/prompt/tool/MCP change      │
└────────────────────────────────────────────────────────┘

Question

We can't satisfy everything before the deadline. What is the minimum viable safe ship?

If you do only three things: (1) ruthless least privilege so the worst tool an injection could reach is acceptable on its own; (2) treat every external input and every MCP/third-party tool as untrusted and isolate them from privileged actions; (3) human approval on the short list of irreversible/external actions. Those are all fail-closed and cover the bulk of real incidents. Detection and fancy filtering are improvements on top, not substitutes for these.

Question

This looks like a one-time launch checklist. Isn't agent safety continuous?

It is both. The gate is the launch bar; Steps 2–4 in particular (① to ③ in the diagram) must be re-run rather than treated as one-offs, because a model swap, a prompt edit, a new tool, or an MCP-server update can each silently invalidate a prior pass. Treat the checklist as a CI gate plus a change-triggered re-review, with monitoring and the growing regression suite running continuously between gates.
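A change-triggered re-review can be approximated in CI by hashing each reviewed component and failing the build when any hash differs from the one recorded at the last review. This is a sketch under assumptions: the configuration keys (`model`, `prompt`, `tools`, `mcp_servers`) and the helper names are hypothetical, and where the reviewed hashes are stored is left open.

```python
import hashlib
import json

def component_hashes(config):
    """Hash each top-level component of the agent configuration separately."""
    return {
        name: hashlib.sha256(
            json.dumps(value, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for name, value in config.items()
    }

def changed_components(current_config, reviewed_hashes):
    """Names of components added, removed, or modified since the last review."""
    now = component_hashes(current_config)
    names = set(now) | set(reviewed_hashes)
    return sorted(n for n in names if now.get(n) != reviewed_hashes.get(n))

def gate(current_config, reviewed_hashes):
    """Return (passes, changed). Any change fails until the review is re-run."""
    changed = changed_components(current_config, reviewed_hashes)
    return len(changed) == 0, changed

if __name__ == "__main__":
    reviewed_config = {
        "model": "model-a",
        "prompt": "You handle refund tickets.",
        "tools": ["search_orders", "issue_refund"],
        "mcp_servers": {"orders": "1.4.2"},
    }
    reviewed = component_hashes(reviewed_config)
    edited = dict(reviewed_config, prompt="You handle refund tickets quickly.")
    print(gate(edited, reviewed))
```

Reporting which component changed, not just that something did, lets the team re-run only the affected fail-closed sections while still refusing to ship on an unreviewed change.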