Adapting a playbook to your domain: the method, not the menu.
The previous five playbooks are instances of one method. The point was never to memorize the support or analytics recipe — it was to internalize the derivation so you can produce the playbook for a domain nobody has written up yet: legal review, clinical intake, code migration, claims processing. This essay extracts the meta-method as a repeatable sequence — job, autonomy, tools, eval, guardrails — and the questions that, answered honestly, hand you the playbook for any vertical.
Every playbook is the same five questions in the same order.
Read back across support, analytics, SRE, research, and GTM and the surface differs but the skeleton is identical: define the job precisely, set the autonomy by reversibility, ground via tools, pick the eval signal that mirrors the business cost, and bound the top failure mode with guardrails. Each domain just fills the slots with different specifics. A new playbook is not invented; it is derived by answering these five questions for your vertical honestly — the method is domain-independent, only the answers are local.
Define the job by what must be true, not what it does.
The recurring first mistake is framing the job as the activity ("answer tickets", "send emails", "fix outages") instead of the success and failure conditions. The sharp version is always two predicates: what does a good outcome look like, and what is the one wrong outcome whose cost dominates. Support: resolved vs confidently wrong policy. Analytics: defensible number vs confident wrong number. SRE: real fix vs harmful "fix". The job definition is done when you can state the dominating failure in one sentence — because that sentence designs everything downstream.
Find the dominating failure with one question: "what is the single output that, if wrong and shipped, would get this project cancelled?" That answer is your hard constraint, not a metric to trade against the headline number.
Set autonomy by reversibility, decomposed into zones.
Autonomy is never a single dial for the whole agent — every playbook split the surface into a high-value low-risk zone (informational answers, diagnosis, research, drafting) that runs autonomously, and a consequential irreversible zone (transactions, remediation, send) that is gated. The recurring rule:
# the autonomy law every playbook obeys autonomy = f(reversibility, blast_radius, human_scrutiny) if irreversible(action) or blast_radius(action) > tolerable: require_human_gate(action) # draft, do not do else: autonomous(action) # the value lives here
Decompose first, then assign autonomy per zone. An agent given one global autonomy level is either too timid to be useful or too free to be safe — there is no single setting that is both.
Tools are the grounding and the blast-radius limit, not just capability.
Across every playbook, tools did double duty: they ground the model in truth it must not guess (account state, schema, fetched sources, consent status) and they are where the limit lives (refund caps, read-only roles, suppression filters, scoped runbooks). The derivation question is two-part: what must the agent never invent — that becomes a retrieval or state tool; and what is the worst single call — that becomes a typed signature with the limit enforced at the boundary, not in the prompt. A guardrail expressed as a prompt instruction is a wish; the same guardrail as a tool that refuses out-of-bounds input is a guarantee. This is the single most transferable lesson.
The eval must mirror the business cost function, especially the asymmetry.
Every playbook's eval rejected the headline metric (deflection, open rate, "reads well") because it hides the dominating failure, and weighted that failure to match its real-world asymmetry. The transferable construction:
- Score on real, representative inputs — replayed transcripts, a labeled question bank, past incidents, pre-send drafts — not synthetic happy paths.
- Grade the dominating failure as a hard, heavily-weighted axis — a confident wrong answer must cost far more in the eval than a clean abstention, because it does in the business.
- Reward calibrated abstention — "I cannot answer / I am not confident / escalate" is a correct output, not a miss, whenever the alternative is a confident error.
- Refuse trades that lower the failure rate by also lowering coverage you cannot afford — and never the reverse.
The meta-checklist: derive your domain's playbook.
- Job: state the good outcome and, in one sentence, the dominating failure whose cost would cancel the project.
- Autonomy: decompose into zones; autonomous where reversible and low-blast, gated where irreversible or high-reach.
- Tools: what must it never invent → grounding tool; what is the worst call → typed tool with the limit at the boundary, not the prompt.
- Eval: real representative inputs; dominating failure as a hard heavily-weighted axis; calibrated abstention rewarded.
- Guardrails: bound the top failure structurally before deployment; fail-closed kill switch the resume path respects.
- Changelog the assumptions: write down which answers were guesses so the playbook is revisited when they are tested.
The honest tradeoff: this method front-loads the unglamorous work — naming the failure, decomposing autonomy, building tools that refuse — before the impressive demo, and that ordering is the entire point: every vertical that skipped it shipped a demo that became an incident, and the method is just the discipline of paying that cost on purpose, early, instead of by surprise, late.