Safety & Security

Prompt injection, sandboxing, exfiltration, red-teaming, deployment safety — the threat model an agent's environment creates.

The Agentic Threat Model

Why autonomy and tool use widen the attack surface, and the four channels through which attacker-influenced text reaches an agent.
Prompt Injection: Direct & Indirect

How prompt injection works, why no clean fix exists, and the layered defense pattern that contains it.
Data Exfiltration & Tool Misuse

The confused-deputy pattern in agents: exfiltration sources, hidden sinks, and how to cut the chain.
Guardrails: Filtering, Sandboxing & Scoping

Probabilistic vs deterministic guardrails and how to layer input, output, sandbox and capability controls.
Human-in-the-Loop & Least Privilege

Bounded autonomy by design: least privilege as default and consequence-based approval gates.
Red-Teaming & Safety Evaluation

Adversarial testing of agents as a repeatable, outcome-graded pipeline gate, not a one-off session.
Alignment Basics: Intent & Oversight

Instruction-following vs intent, reward hacking, and scalable oversight as the practical builder lever.
The Pre-Ship Safety Review

A practical, fail-closed-first deployment checklist, including MCP and third-party supply-chain trust.
RAG Pipeline Security

Why retrieved context is untrusted input that skipped the guard — corpus poisoning, indirect injection, embedding leakage, and the trust-boundary design that contains them.