Safety & Security
Prompt injection, sandboxing, exfiltration, red-teaming, and deployment safety: the threat model an agent's environment creates.
- The Agentic Threat Model
  Why autonomy and tool use widen the attack surface, and the four channels attacker-influenced text reaches an agent.
- Prompt Injection: Direct & Indirect
  How prompt injection works, why no clean fix exists, and the layered defense pattern for defenders.
- Data Exfiltration & Tool Misuse
  The confused-deputy pattern in agents: exfiltration sources, hidden sinks, and how to cut the chain.
- Guardrails: Filtering, Sandboxing & Scoping
  Probabilistic vs deterministic guardrails and how to layer input, output, sandbox and capability controls.
- Human-in-the-Loop & Least Privilege
  Bounded autonomy by design: least privilege as default and consequence-based approval gates.
- Red-Teaming & Safety Evaluation
  Adversarial testing of agents as a repeatable, outcome-graded pipeline gate, not a one-off session.
- Alignment Basics: Intent & Oversight
  Instruction-following vs intent, reward hacking, and scalable oversight as the practical builder lever.
- The Pre-Ship Safety Review
  A practical, fail-closed-first deployment checklist, including MCP/third-party supply-chain trust.
- RAG Pipeline Security
  Why retrieved context is untrusted input that skipped the guard: corpus poisoning, indirect injection, embedding leakage, and the trust-boundary design that contains them.