The Pre-Ship Safety Review

S8
Deep Dive · Safety, Alignment & Agentic Security

The pre-ship safety review: a practical agent deployment checklist.

The preceding essays build the theory; this one converts it into a review you can actually run before an agent ships. It is organized so that the highest-leverage, fail-closed controls come first, and includes the often-missed supply-chain dimension: every MCP server and third-party tool you connect is part of your trust boundary.

STEP 1

How to use this review

This is a gate, not a survey. An item that cannot be answered with concrete evidence ("here is the allowlist," "here is the test that fails the bypass") is not a pass. Run it before launch and re-run the affected sections on every model swap, prompt change, new tool, or MCP-server upgrade — each of those can silently reopen a closed hole.

Order matters. Capability, trust-boundary, and action controls (Steps 2–4) are fail-closed and remove impact. Detection and monitoring (Step 6) are fail-open and only add visibility. If you are time-constrained, the top of this list is where the security actually lives.
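
One way to make "a gate, not a survey" mechanical is to record each item with a pointer to its evidence and refuse to pass when any pointer is missing. The sketch below is illustrative only: `GateItem`, `gate_passes`, and the evidence paths are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateItem:
    step: int            # checklist step the item belongs to
    description: str
    evidence: str = ""   # path or link to the allowlist, test, or document

def blocking_items(items):
    """Items with no concrete evidence attached; each one blocks the gate."""
    return [item for item in items if not item.evidence.strip()]

def gate_passes(items):
    """An empty checklist or any unevidenced item is a fail, never a pass."""
    return bool(items) and not blocking_items(items)

REVIEW = [
    GateItem(2, "tool inventory exists", "docs/tool-inventory.md"),
    GateItem(3, "egress allowlist in place", "infra/egress-allowlist.yaml"),
    GateItem(4, "approval required for refunds"),  # claimed, but no evidence yet
]
```

Here `gate_passes(REVIEW)` is false until the third item points at something a reviewer can open.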

STEP 2

Capability & least privilege (fail-closed — do first)

  • Tool inventory exists; every tool maps to a task that requires it. Unused/"just in case" tools removed.
  • Each tool is scoped to the minimum: read-only where possible, record/row-limited, single-tenant, no free-form destination fields.
  • Credentials are short-lived, per-task, narrowly scoped. No ambient long-lived admin tokens or cloud metadata reachable from the agent's environment.
  • High-impact tools are state-gated — available only in states where they are valid.
  • For every tool: documented answer to "if an attacker controlled this call's inputs, what is the worst outcome?" and that outcome is acceptable or further constrained.
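
The inventory, state-gating, and worst-case items above can be kept as data and checked in code. This is a minimal sketch under assumed names: `ToolSpec`, the tool names, and the workflow states are hypothetical, and a real agent framework will have its own way of registering tools.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    required_by: str            # task that needs the tool; empty means "just in case"
    read_only: bool
    allowed_states: frozenset   # workflow states in which the tool may be offered
    worst_case: str             # documented worst outcome under attacker-controlled inputs

def validate_inventory(tools):
    """Return a list of problems; an empty list means the inventory passes."""
    problems = []
    for tool in tools:
        if not tool.required_by:
            problems.append(f"{tool.name}: no task requires it")
        if not tool.worst_case:
            problems.append(f"{tool.name}: worst-case outcome undocumented")
        if not tool.allowed_states:
            problems.append(f"{tool.name}: not valid in any state")
    return problems

def tools_for_state(tools, state):
    """Expose only the tools that are valid in the current state."""
    return sorted(tool.name for tool in tools if state in tool.allowed_states)

INVENTORY = [
    ToolSpec("lookup_order", "answer order-status questions", True,
             frozenset({"triage", "refund_review"}), "discloses one order record"),
    ToolSpec("issue_refund", "process approved refunds", False,
             frozenset({"refund_approved"}), "refund up to the per-order cap"),
]
```

In this arrangement the model never sees `issue_refund` during triage, so an injection at that stage has nothing high-impact to call.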

STEP 3

Trust boundary, inputs & supply chain (fail-closed)

  • All four input channels enumerated: direct input, retrieved content, tool/API results, persisted memory/history.
  • Every channel's trust level is documented. Anything externally influenceable (user uploads, public web, editable wikis, tickets, third-party APIs) is treated as untrusted.
  • Untrusted content is isolated from the privileged instruction/tool position (e.g. tool-less reader → validated structured handoff → privileged actor).
  • Provenance is tracked: code can distinguish operator instructions from retrieved/tool text.
  • MCP & third-party supply chain: every connected MCP server and external tool is inventoried with its owner, and treated as an untrusted input source — its responses can carry injected instructions and its tool definitions can change under you.
  • MCP servers and tool dependencies are pinned/version-controlled; updates go through the same safety re-review (a tool whose description or behavior changes is a new trust decision).
  • Default-deny network egress from any code-execution sandbox; outbound destinations allowlisted.
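
The "validated structured handoff" in the isolation item can be sketched as a strict parser between the tool-less reader and the privileged actor. The field names, intents, and limits below are assumptions chosen for illustration; the point is that only fixed, typed fields cross the boundary and anything unexpected is rejected.

```python
import json

ALLOWED_INTENTS = {"order_status", "refund_request", "other"}
ALLOWED_KEYS = {"intent", "order_id"}

def validate_handoff(raw):
    """Parse the reader's output and return only strictly validated fields.

    Raises ValueError on anything unexpected, so the caller fails closed.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("handoff is not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != ALLOWED_KEYS:
        raise ValueError("unexpected handoff shape")
    intent, order_id = data["intent"], data["order_id"]
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError("unknown intent")
    if not (isinstance(order_id, str) and order_id.isascii()
            and order_id.isdigit() and len(order_id) <= 12):
        raise ValueError("order_id must be a short numeric string")
    # Rebuild the dict so nothing unvalidated is carried along.
    return {"intent": intent, "order_id": order_id}
```

Free text from the untrusted document never reaches the privileged side; an extra key such as a "note" carrying injected instructions fails the shape check.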

The most overlooked line above is the MCP/third-party one. Connecting an external server hands part of your agent's behavior to that operator's data hygiene and update discipline. An unpinned tool whose definition can change is a supply-chain risk even with no attacker present: a well-meant upstream edit still alters what your agent does without passing your review.
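
Pinning can be enforced by hashing each tool definition at review time and comparing on every start-up. The following is a sketch, not an MCP client feature: it assumes you can obtain each tool definition as a JSON-compatible dict with a `name` key, and the other field names shown are illustrative.

```python
import hashlib
import json

def definition_digest(tool_def):
    """SHA-256 over a canonical JSON encoding of one tool definition."""
    canonical = json.dumps(tool_def, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_tools(pinned, current_defs):
    """Names of tools that are new or whose definition differs from the pin.

    pinned maps tool name -> digest recorded at the last safety review.
    """
    return sorted(
        tool_def["name"]
        for tool_def in current_defs
        if pinned.get(tool_def["name"]) != definition_digest(tool_def)
    )

reviewed = {"name": "search_docs", "description": "Search the wiki.",
            "inputSchema": {"type": "object"}}
PINNED = {"search_docs": definition_digest(reviewed)}
```

A non-empty result from `changed_tools` would block the agent from loading those tools until they are re-reviewed and re-pinned.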

STEP 4

Action controls & exfiltration (fail-closed)

  • Every tool call passes a deterministic pre-execution check (allowed-in-state, schema-valid, destination-allowlisted, within rate/volume limits) — implemented in code, not as a second LLM the same prompt could subvert.
  • Exfiltration sinks enumerated beyond network tools: rendered Markdown/HTML (auto-loaded images, link unfurling, prefetch), write-then-read side channels (tickets, PR comments, shared logs), error/timing channels.
  • Agent output that gets rendered is sanitized: external image/link URLs stripped or proxied; no surface auto-fetches attacker-chosen URLs.
  • Irreversible / externally-visible / privilege-changing / high-value actions require human approval via a reviewable summary and an out-of-band confirmation the in-context attacker cannot forge.
  • Default-deny on timeout or uncertainty: no approval, no action.
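
The deterministic pre-execution check in the first item above might look like the sketch below. The states, tool name, argument schema, host, and limit are all hypothetical; what matters is that the check is plain code and that every unknown or failed condition denies the call.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

ALLOWED_IN_STATE = {"refund_approved": {"send_receipt"}}    # state -> callable tools
REQUIRED_ARGS = {"send_receipt": {"url": str, "order_id": str}}
ALLOWED_HOSTS = {"mail.example.com"}                         # destination allowlist
MAX_CALLS_PER_TASK = 5

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

def check_call(state, tool, args, calls_so_far):
    """Deterministic gate run before every tool call; denies by default."""
    if tool not in ALLOWED_IN_STATE.get(state, set()):
        return Decision(False, "tool not allowed in this state")
    schema = REQUIRED_ARGS.get(tool)
    if schema is None or not isinstance(args, dict) or set(args) != set(schema):
        return Decision(False, "arguments do not match schema")
    if any(not isinstance(args[key], kind) for key, kind in schema.items()):
        return Decision(False, "argument has wrong type")
    try:
        parts = urlsplit(args["url"])
    except ValueError:
        return Decision(False, "destination is not a valid URL")
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        return Decision(False, "destination not allowlisted")
    if calls_so_far >= MAX_CALLS_PER_TASK:
        return Decision(False, "rate limit exceeded")
    return Decision(True, "ok")
```

Comparing the parsed hostname rather than searching the URL string matters: `https://mail.example.com@evil.example/x` contains the allowlisted name but resolves to a different host, and is denied here.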

STEP 5

Alignment & objective hygiene

  • The agent's objective/reward signal reviewed for gameable single-metric proxies; paired with guardrail metrics that catch degenerate strategies.
  • Agent work is inspectable: plan, reasoning, and a reviewable diff are exposed before high-impact commits.
  • Verification asymmetry exploited where possible (tests/validators check outcomes cheaply rather than trusting the agent's claim).
  • Conservative default under uncertainty: when intent is unclear, the agent asks or stops rather than optimizing the proxy.
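
Pairing a primary metric with guardrail metrics can be expressed as a simple acceptance rule. This sketch assumes every metric is "higher is better" and uses made-up metric names and tolerances; it checks measured outcomes instead of trusting the agent's own claim of success.

```python
def accept_change(before, after, primary, guardrails):
    """Accept only if the primary metric improves and no guardrail regresses
    by more than its tolerance.

    before/after map metric name -> measured value (higher is better).
    guardrails maps metric name -> allowed drop.
    """
    if after[primary] <= before[primary]:
        return False
    return all(
        after[name] >= before[name] - tolerance
        for name, tolerance in guardrails.items()
    )

BEFORE = {"tickets_closed": 100, "csat": 0.90, "not_reopened": 0.95}
GUARDRAILS = {"csat": 0.02, "not_reopened": 0.01}

# A degenerate strategy: close tickets quickly without resolving them.
GAMED = {"tickets_closed": 180, "csat": 0.70, "not_reopened": 0.60}
# A genuine improvement: slightly more closed, guardrails steady.
HONEST = {"tickets_closed": 110, "csat": 0.91, "not_reopened": 0.95}
```

`GAMED` wins on the single proxy but is rejected because both guardrails collapse; `HONEST` is accepted.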

STEP 6

Evaluation, monitoring & response (fail-open — visibility, last)

  • Versioned adversarial test suite covering every threat-model category, graded on harmful outcome, run against the shipping system on every change.
  • Every deterministic control has at least one bypass test: a test that attempts to circumvent the control and confirms the attempt is blocked (proof the boundary holds).
  • Independent adversarial review by people who did not design the agent found no break that converts into a harmful outcome.
  • All tool calls logged with arguments; alerts on first-seen destinations, anomalous data volume in arguments, and read-sensitive-then-egress sequences.
  • An incident path exists: kill switch / capability revocation, and a process to turn every real incident into a permanent regression test.
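
The alerting item above (first-seen destinations, read-sensitive-then-egress sequences) can be sketched as a pass over one session's logged tool calls. Tool names, the `destination` argument, and the log shape are assumptions for illustration; real logs will need their own adapter.

```python
SENSITIVE_READS = {"read_customer_record", "read_secrets"}
EGRESS_TOOLS = {"http_post", "send_email"}

def alerts_for_session(calls, known_destinations):
    """Scan an ordered list of (tool_name, args) pairs from one agent session.

    Returns (call_index, alert_kind, destination) tuples.
    """
    alerts = []
    read_sensitive = False
    for index, (tool, args) in enumerate(calls):
        if tool in SENSITIVE_READS:
            read_sensitive = True
        if tool in EGRESS_TOOLS:
            destination = args.get("destination", "")
            if destination not in known_destinations:
                alerts.append((index, "first-seen destination", destination))
            if read_sensitive:
                alerts.append((index, "sensitive read followed by egress", destination))
    return alerts

SESSION = [
    ("read_customer_record", {"id": "7"}),
    ("http_post", {"destination": "paste.example.net", "body": "..."}),
]
```

Because this is detection, it only reports after the call was attempted; the egress allowlist and pre-execution checks are what actually stop it.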

┌────────────────────────────────────────────────────────┐
│  PRE-SHIP GATE  (top = fail-closed, highest leverage)  │
│                                                        │
│  ① capability & least privilege                        │
│  ② trust boundary · inputs · MCP supply chain          │
│  ③ action checks · exfiltration sinks · approvals      │
│  ④ alignment / objective hygiene                       │
│  ⑤ eval · monitoring · incident response (visibility)  │
│                                                        │
│  re-run ①–③ on every model/prompt/tool/MCP change      │
└────────────────────────────────────────────────────────┘

Question
We can't satisfy everything before the deadline. What is the minimum viable safe ship?

If you do only three things: (1) ruthless least privilege so the worst tool an injection could reach is acceptable on its own; (2) treat every external input and every MCP/third-party tool as untrusted and isolate them from privileged actions; (3) human approval on the short list of irreversible/external actions. Those are all fail-closed and cover the bulk of real incidents. Detection and fancy filtering are improvements on top, not substitutes for these.

Question
This looks like a one-time launch checklist. Isn't agent safety continuous?

It is both. The gate is the launch bar; the fail-closed steps in particular (Steps 2–4, items ①–③ in the gate) are re-run triggers, not one-offs, because a model swap, a prompt edit, a new tool, or an MCP-server update can each silently invalidate a prior pass. Treat the checklist as a CI gate plus a change-triggered re-review, with monitoring and the growing regression suite running continuously between gates.
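
A change-triggered re-review can be wired into CI by comparing the configuration that was last reviewed with the one about to ship. The manifest keys and values below are hypothetical; the sketch only shows the comparison, and a CI job would fail the build whenever the result is non-empty.

```python
def changed_components(reviewed, current):
    """Top-level manifest entries that differ from the last reviewed manifest.

    Covers entries that were added, removed, or modified.
    """
    keys = set(reviewed) | set(current)
    return sorted(key for key in keys if reviewed.get(key) != current.get(key))

REVIEWED = {
    "model": "model-a",
    "prompt_sha256": "3f2a",                  # digest of the reviewed system prompt
    "tools": {"lookup_order": "9c1e"},        # tool name -> definition digest
    "mcp_servers": {"wiki-server": "1.4.2"},  # server -> pinned version
}

# Same system, except the MCP server was upgraded since the review.
CURRENT = dict(REVIEWED, mcp_servers={"wiki-server": "1.4.3"})
```

Here `changed_components(REVIEWED, CURRENT)` returns `["mcp_servers"]`, which would send the change back through the fail-closed steps before release.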