Safety & Security

Prompt injection, sandboxing, exfiltration, red-teaming, deployment safety — the threat model an agent's environment creates.

  1. The Agentic Threat Model
    Why autonomy and tool use widen the attack surface, and the four channels through which attacker-influenced text reaches an agent.
  2. Prompt Injection: Direct & Indirect
    How prompt injection works, why no clean fix exists, and the layered defense pattern to use instead.
  3. Data Exfiltration & Tool Misuse
    The confused-deputy pattern in agents: exfiltration sources, hidden sinks, and how to cut the chain.
  4. Guardrails: Filtering, Sandboxing & Scoping
    Probabilistic vs deterministic guardrails, and how to layer input, output, sandbox, and capability controls.
  5. Human-in-the-Loop & Least Privilege
    Bounded autonomy by design: least privilege as default and consequence-based approval gates.
  6. Red-Teaming & Safety Evaluation
    Adversarial testing of agents as a repeatable, outcome-graded pipeline gate, not a one-off session.
  7. Alignment Basics: Intent & Oversight
    Instruction-following vs intent, reward hacking, and scalable oversight as the practical builder lever.
  8. The Pre-Ship Safety Review
    A practical, fail-closed-first deployment checklist including MCP/third-party supply-chain trust.
  9. RAG Pipeline Security
    Why retrieved context is untrusted input that skipped the guard — corpus poisoning, indirect injection, embedding leakage, and the trust-boundary design that contains them.