Make it steerable, parallel, interruptible.
Now add the features deliberately skipped in Phase 1: parallel tool calls, subagents for context isolation, planning, and interrupts. Each one solves a specific failure mode you saw in traces, not a hypothetical one.
Enable parallel tool calls.
Look back at your Phase 1 traces. How often did the agent run searches sequentially when they were obviously independent? "Find docs about X" → wait → "find docs about Y" → wait. Each wait is a round-trip to the model plus a round-trip to your tools. Parallel tool calls cut that to one round-trip total.
Both APIs support multiple tool calls per turn — the model returns several tool_use (or function_call) blocks in a single response. Your loop just needs to execute them concurrently and pair each result to the right call ID.
The change: async execution
We use Python's asyncio to run tool handlers concurrently. Handlers don't need to be async themselves — we wrap synchronous ones with asyncio.to_thread.
# agent/loop.py (parallel-capable, Anthropic Messages API)
# HANDLERS, TOOLS and SYSTEM_PROMPT are the Phase 1 definitions.
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def run_one_tool(block):
    # Run one handler in a worker thread and tag the result
    # with the ID of the tool_use block that requested it.
    try:
        result = await asyncio.to_thread(
            HANDLERS[block.name], **block.input
        )
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(result),
        }
    except Exception as e:
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": f"error: {e}",
            "is_error": True,
        }

async def run_agent(user_query, max_steps=10):
    messages = [{"role": "user", "content": user_query}]
    for step in range(max_steps):
        response = await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        tool_blocks = [b for b in response.content if b.type == "tool_use"]
        if not tool_blocks:
            return {"status": "halted"}
        # Check for submit_answer first (don't run others)
        for b in tool_blocks:
            if b.name == "submit_answer":
                return {"status": "answered", **b.input}
        # PARALLEL: run all tool blocks at once
        results = await asyncio.gather(
            *[run_one_tool(b) for b in tool_blocks]
        )
        messages.append({"role": "user", "content": results})
    return {"status": "step_limit"}
# agent/loop.py (parallel-capable, OpenAI Responses API)
# HANDLERS, TOOLS and SYSTEM_PROMPT are the Phase 1 definitions.
import asyncio
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def run_one_tool(call):
    try:
        # Parse inside the try so malformed arguments become an error
        # result for the model instead of crashing the whole gather.
        args = json.loads(call.arguments)
        result = await asyncio.to_thread(
            HANDLERS[call.name], **args
        )
        output = str(result)
    except Exception as e:
        output = f"error: {e}"
    return {
        "type": "function_call_output",
        "call_id": call.call_id,
        "output": output,
    }

async def run_agent(user_query, max_steps=10):
    input_items = [{"role": "user", "content": user_query}]
    for step in range(max_steps):
        response = await client.responses.create(
            model="gpt-5.5",
            instructions=SYSTEM_PROMPT,
            tools=TOOLS,
            input=input_items,
        )
        for item in response.output:
            input_items.append(item.model_dump())
        calls = [i for i in response.output if i.type == "function_call"]
        if not calls:
            return {"status": "halted"}
        for call in calls:
            if call.name == "submit_answer":
                return {"status": "answered", **json.loads(call.arguments)}
        # PARALLEL: run all calls at once
        results = await asyncio.gather(
            *[run_one_tool(c) for c in calls]
        )
        input_items.extend(results)
    return {"status": "step_limit"}
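To see the result-to-call pairing without an API key, here is a minimal sketch that uses plain dicts in place of SDK objects. The handler `slow_search`, the call IDs, and the dict shapes are illustrative assumptions, not part of either SDK. It relies on one documented property of `asyncio.gather`: results come back in the order the awaitables were passed in, whichever finishes first.

```python
import asyncio
import time

# Hypothetical stand-in for a real, blocking tool handler.
def slow_search(query: str) -> str:
    time.sleep(0.05)  # simulate blocking I/O
    return f"results for {query!r}"

HANDLERS = {"search_docs": slow_search}

async def run_one(call: dict) -> dict:
    # Run the blocking handler in a worker thread, then tag the
    # output with the ID of the call that produced it.
    output = await asyncio.to_thread(HANDLERS[call["name"]], **call["input"])
    return {"tool_use_id": call["id"], "content": output}

async def run_all(calls: list[dict]) -> list[dict]:
    # gather preserves input order, so results line up with calls.
    return await asyncio.gather(*(run_one(c) for c in calls))

calls = [
    {"id": "call_a", "name": "search_docs", "input": {"query": "PgBouncer"}},
    {"id": "call_b", "name": "search_docs", "input": {"query": "prepared statements"}},
]
results = asyncio.run(run_all(calls))
for r in results:
    print(r["tool_use_id"], "->", r["content"])
```

Carrying the ID inside each result, rather than relying on list position alone, keeps the pairing correct even if the results are later filtered or reordered.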
Watch the behavior change
Re-run the multi-hop question from Phase 1. The model knows it can fan out now (the API tells it parallel calls are available), so it tends to use that capacity automatically.
$ python scripts/run.py "How does PgBouncer transaction pooling
interact with prepared statements?"
──────────────────── Step 0 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ I need info on two separate topics. I'll │
│ search for both in parallel. │
└────────────────────────────────────────────────┘
→ search_docs({'query': 'PgBouncer transaction pooling'})
→ search_docs({'query': 'prepared statements PostgreSQL'})
[both run concurrently, ~180ms total]
──────────────────── Step 1 ────────────────────
→ fetch_doc({'chunk_id': 'pgbouncer-modes::1'})
→ fetch_doc({'chunk_id': 'sql-prepare::0'})
[both run concurrently, ~40ms total]
──────────────────── Step 2 ────────────────────
→ submit_answer({...})
status: answered (3 steps)
5 steps → 3 steps. The same work got done in 40% fewer model round-trips. For a real production agent doing 20–50 tool calls per investigation, parallelization can compress runtime dramatically, often 3–5x for retrieval-heavy workflows.
The cost: handlers must be safe to run concurrently. search_docs and fetch_doc are pure reads — fine. If you had a save_note tool that wrote to a shared file, you'd need a lock.
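As a rough sketch of that lock, assuming a hypothetical `save_note` handler and a throwaway notes file in the temp directory: because `asyncio.to_thread` runs handlers in worker threads, the lock has to be a `threading.Lock`. An `asyncio.Lock` only coordinates coroutines on the event loop.

```python
import asyncio
import tempfile
import threading
from pathlib import Path

# Hypothetical shared file; a real tool would use its own location.
NOTES_PATH = Path(tempfile.gettempdir()) / "agent_notes_demo.txt"
_notes_lock = threading.Lock()  # guards every write to NOTES_PATH

def save_note(text: str) -> str:
    # Only one worker thread may append at a time, so lines
    # from concurrent calls cannot interleave.
    with _notes_lock:
        with NOTES_PATH.open("a", encoding="utf-8") as f:
            f.write(text + "\n")
    return "saved"

async def demo() -> list[str]:
    NOTES_PATH.write_text("", encoding="utf-8")  # start from an empty file
    await asyncio.gather(
        *(asyncio.to_thread(save_note, f"note {i}") for i in range(20))
    )
    return NOTES_PATH.read_text(encoding="utf-8").splitlines()

lines = asyncio.run(demo())
print(len(lines), "notes written")
```

The order of the notes is not guaranteed; the lock only guarantees that each one lands as a complete line.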
Both models will parallelize when the calls are clearly independent (different searches on different topics). For ambiguous cases, you can nudge the model with system-prompt language like "When you need information on multiple unrelated topics, search for them in parallel rather than sequentially."
Don't over-prompt for parallelism — sometimes sequential is correct. "First search for X. Then based on what you find, search for Y." is a legitimate plan that parallelism would break.
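Is asyncio.to_thread wrong?
It works, but it isn't always optimal. to_thread moves the synchronous function to a worker thread, which is fine for blocking I/O but occupies a thread per call. If you have native async handlers (e.g., ones built on httpx.AsyncClient), call them directly with await instead.
For Phase 1's substring-scan handlers, to_thread is still a reasonable choice: they are synchronous, and running them in worker threads keeps the event loop responsive. Because they are CPU-bound pure Python, though, the GIL in standard CPython keeps them from truly executing in parallel; the real speedup comes from overlapping handlers that wait on I/O.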
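One way to support both kinds of handler is to check each function before dispatching it. The sketch below is an assumption about how the dispatch could look, not the tutorial's own code; `fetch_remote` and `call_handler` are hypothetical names. It awaits coroutine functions directly and sends everything else through `asyncio.to_thread`.

```python
import asyncio
import inspect

def search_docs(query: str) -> str:        # synchronous stand-in handler
    return f"sync:{query}"

async def fetch_remote(url: str) -> str:   # hypothetical native-async handler
    await asyncio.sleep(0)                 # placeholder for real async I/O
    return f"async:{url}"

HANDLERS = {"search_docs": search_docs, "fetch_remote": fetch_remote}

async def call_handler(name: str, args: dict) -> str:
    handler = HANDLERS[name]
    if inspect.iscoroutinefunction(handler):
        # Native coroutine: await it on the event loop, no thread needed.
        result = await handler(**args)
    else:
        # Blocking function: run it in a worker thread.
        result = await asyncio.to_thread(handler, **args)
    return str(result)

async def demo() -> list[str]:
    return await asyncio.gather(
        call_handler("search_docs", {"query": "pgbouncer"}),
        call_handler("fetch_remote", {"url": "https://example.com"}),
    )

outputs = asyncio.run(demo())
print(outputs)
```

The check matters as soon as any handler is declared with `async def`: passing a coroutine function to `asyncio.to_thread` returns an un-awaited coroutine object instead of the handler's result.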
Add subagents for context isolation.
This is the single most important pattern in modern agent design, and the hardest one to get right. The problem: the main agent's context window is precious. Fill it with raw search results, fetched documents, and verbose tool outputs, and quality collapses. The model loses track of the original question, struggles to keep facts straight, starts repeating itself.
The solution: delegate research to subagents. A subagent has its own context window. It does deep research on a narrow task, accumulates whatever intermediate state it needs, then returns a compressed summary to the parent. The parent never sees the raw work.
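Claude's Research feature is a publicly described example of this orchestrator-and-subagent pattern, and other deep-research products, such as Perplexity's deep research and OpenAI's research agents, appear to rely on similar delegation.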
Implementation: a subagent is just recursion
Add a new tool: spawn_subagent(task, focus). The handler calls run_agent recursively with the task as its query, but with a different system prompt and the spawn tool removed (to prevent runaway recursion).
# agent/tools.py: add to TOOLS list
{
    "name": "spawn_subagent",
    "description": (
        "Delegate a focused research task to a subagent. "
        "The subagent has its own context window, runs its "
        "own retrieval, and returns a concise summary with "
        "citations. Use when researching multiple topics "
        "that would each require many sources."
    ),
    "input_schema": {  # or "parameters" for OpenAI
        "type": "object",
        "properties": {
            "task": {"type": "string",
                     "description": "the question to research"},
            "focus": {"type": "string",
                      "description": "what aspect matters most"},
        },
        "required": ["task"],
    },
}
# agent/subagent.py
# Assumes run_agent has been extended to accept system_prompt and tools
# (the earlier version hard-codes SYSTEM_PROMPT and TOOLS) and that its
# result dict carries "answer", "citations" and "steps_used".
SUBAGENT_PROMPT = """You are a research subagent. Your task: {task}
Focus on: {focus}

You have search_docs, fetch_doc, and submit_answer.
You cannot spawn further subagents.

Return a summary under 1500 tokens with citations.
Be dense: the parent agent only sees what you submit."""

async def spawn_subagent(task: str, focus: str = "") -> dict:
    # Same loop, different prompt, restricted tools
    sub_tools = [t for t in TOOLS if t.get("name") != "spawn_subagent"]
    sub_prompt = SUBAGENT_PROMPT.format(task=task, focus=focus)
    result = await run_agent(
        user_query=task,
        system_prompt=sub_prompt,
        tools=sub_tools,
        max_steps=15,
    )
    # Only this compressed dict reaches the parent's context.
    return {
        "summary": result.get("answer", ""),
        "citations": result.get("citations", []),
        "steps_used": result.get("steps_used", 0),
    }
Now watch what happens on a comparison query
$ python scripts/run.py "Compare PgBouncer modes for use with
prepared statements and ORM features"
──────────────────── Step 0 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ This needs deep research on (1) PgBouncer │
│ modes and (2) prepared statements / ORM │
│ implications. I'll spawn two subagents in │
│ parallel to keep my context lean. │
└────────────────────────────────────────────────┘
→ spawn_subagent(task='Explain all three PgBouncer
pool modes in detail',
focus='which features each mode breaks')
→ spawn_subagent(task='How do prepared statements
and ORM features interact with connection pooling',
focus='session vs transaction scoping')
[each subagent runs its own ~10-step investigation
in its own context window, in parallel]
──────────────────── Step 1 ────────────────────
subagent A returned: 1400-token summary citing
pgbouncer-modes::0, pgbouncer-modes::1,
pgbouncer-modes::2, pgbouncer-features::3
subagent B returned: 1200-token summary citing
sql-prepare::0, libpq-prepare::1,
pgbouncer-features::4
┌─ thinking ─────────────────────────────────────┐
│ I have both research summaries. Synthesizing. │
└────────────────────────────────────────────────┘
→ submit_answer({
'answer': 'PgBouncer offers three pool modes...
Session pooling is safe for all features but
limits concurrency. Transaction pooling breaks
prepared statements, SET LOCAL, advisory locks...
ORMs that rely on prepared statement caches
(SQLAlchemy, Django ORM) require session mode
or PREPARE workarounds...',
'citations': ['pgbouncer-modes::0', ...]
})
status: answered (2 steps, but subagents ran 18 internal steps)
The main agent made 2 visible steps. But underneath, two subagents did 9 steps each. The total work was 20 model calls — yet the main context window saw only ~3000 tokens of summaries, not the 80,000 tokens of raw research.
Without subagents, that same investigation would have either hit context limits or produced a degraded answer because the model was overwhelmed by intermediate results. With subagents, the main agent stays sharp because it only ever reads final answers, never working notes.
Bound the recursion depth. The subagent's tool list explicitly excludes spawn_subagent. Without this, a misbehaving model can spawn subagents that spawn subagents, and you'll learn this at 3am when your API credits hit zero. Depth=1 is enough for almost all cases.
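Removing the tool is one guard. A second, complementary guard is an explicit depth counter that refuses to spawn even if the tool somehow leaks into a subagent's tool list. The sketch below illustrates that idea under stated assumptions: `MAX_DEPTH`, `tools_for_depth`, the `parent_depth` parameter, and `fake_run_agent` (a stand-in for the real loop) are all hypothetical names.

```python
import asyncio

MAX_DEPTH = 1  # depth 0 = main agent, depth 1 = subagent

# Trimmed stand-ins for the real tool definitions.
TOOLS = [{"name": "search_docs"}, {"name": "spawn_subagent"}]

def tools_for_depth(depth: int) -> list[dict]:
    # Agents at the maximum depth never see the spawn tool.
    if depth >= MAX_DEPTH:
        return [t for t in TOOLS if t["name"] != "spawn_subagent"]
    return list(TOOLS)

async def fake_run_agent(task: str, depth: int) -> dict:
    # Stand-in for run_agent; a real loop would call the model here.
    names = [t["name"] for t in tools_for_depth(depth)]
    return {"answer": f"summary of {task}", "tools_seen": names}

async def spawn_subagent(task: str, parent_depth: int = 0) -> dict:
    child_depth = parent_depth + 1
    if child_depth > MAX_DEPTH:
        # Second line of defence: refuse even if the tool leaked through.
        return {"error": f"max subagent depth {MAX_DEPTH} reached"}
    return await fake_run_agent(task, depth=child_depth)

child = asyncio.run(spawn_subagent("pool modes", parent_depth=0))
grandchild = asyncio.run(spawn_subagent("pool modes", parent_depth=1))
print(child["tools_seen"], grandchild)
```

Returning an error dict instead of raising lets the calling model read the refusal as an ordinary tool result and carry on.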
For simple lookups, spawning is pure overhead — you pay an extra model call and the orchestration of a subagent for what a single search would have answered. The agent should only spawn when the task has substructure that benefits from isolation:
- Comparing two or more things (each thing → one subagent)
- Investigations that need many tool calls on a narrow topic
- Tasks where intermediate results are bulky and you want to summarize them
You can encode this in the system prompt: "Use subagents only when researching topics that would each require 4+ tool calls. For simple lookups, search directly."
Parallel tool calls let the agent fan out at the tool-call level. Subagents let it fan out at the investigation level. Two different layers of parallelism that compose:
- Parallel tool calls: one model decides "search for X, Y, Z at once" and reads all three results itself.
- Subagents: the main model delegates "research X" to a copy of itself that runs its own multi-step investigation with its own parallel searches.
Use both. Subagents internally use parallel tool calls. The main agent uses both subagents and parallel tool calls. The combination is what makes deep-research agents fast.
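The two layers can be pictured as nested `asyncio.gather` calls. This toy sketch uses no model at all: `search`, `subagent`, and `main_agent` are hypothetical stand-ins, and the summary string is invented purely to show that only compressed output crosses the boundary.

```python
import asyncio

async def search(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a retrieval call
    return f"hit:{query}"

async def subagent(topic: str, queries: list[str]) -> dict:
    # Inner layer: one subagent fans out its own searches in parallel.
    hits = await asyncio.gather(*(search(q) for q in queries))
    # Only a compressed summary leaves the subagent, not the raw hits.
    return {"topic": topic, "summary": f"{len(hits)} sources on {topic}"}

async def main_agent() -> list[dict]:
    # Outer layer: the main agent runs whole investigations in parallel.
    return await asyncio.gather(
        subagent("pool modes", ["session mode", "transaction mode", "statement mode"]),
        subagent("prepared statements", ["PREPARE", "libpq prepare"]),
    )

summaries = asyncio.run(main_agent())
for s in summaries:
    print(s["topic"], "->", s["summary"])
```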
Add a planning step (user-editable).
For complex queries (multi-hop, comparative, ambiguous) it helps to generate a plan before executing. Two reasons. First, the model tends to run a better investigation when it commits to a plan upfront. Second, the user can edit the plan before the agent burns tokens going down the wrong path. That edit step is what "steerable" actually means in practice.
The planner: a focused model call
PLANNER_PROMPT = """You are planning an investigation.
Given a user's question, produce a numbered plan of
3-5 steps describing how an agent should research
the answer.
Each step should be one of:
- SEARCH: a specific keyword search
- COMPARE: comparing two things found via search
- VERIFY: confirming a claim with a second source
- SYNTHESIZE: combining findings into the answer
For simple lookups (one fact, one doc), output
exactly the word: SIMPLE
Output the plan or SIMPLE, nothing else."""
# agent/planner.py (Anthropic Messages API)
async def make_plan(query: str) -> list[str] | None:
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=PLANNER_PROMPT,
        messages=[{"role": "user", "content": query}],
    )
    text = response.content[0].text.strip()
    if text == "SIMPLE":
        return None  # simple lookup: no plan needed
    return [line for line in text.split("\n") if line.strip()]
# agent/planner.py (OpenAI Responses API)
async def make_plan(query: str) -> list[str] | None:
    response = await client.responses.create(
        model="gpt-5.5",
        instructions=PLANNER_PROMPT,
        input=query,
    )
    text = response.output_text.strip()
    if text == "SIMPLE":
        return None  # simple lookup: no plan needed
    return [line for line in text.split("\n") if line.strip()]
The user-editing loop
async def run_with_plan(query: str):
    plan = await make_plan(query)
    if plan is None:
        # Simple query: skip planning
        return await run_agent(query)

    console.print("\n[bold]Proposed plan:[/]")
    for i, step in enumerate(plan, 1):
        console.print(f"  {i}. {step}")
    console.print("\n[Enter] approve  [e] edit  [s] skip plan")

    choice = input("> ").strip().lower()
    if choice == "e":
        plan = edit_plan_interactive(plan)
    elif choice == "s":
        return await run_agent(query)

    # Append the approved plan to the query so the agent follows it.
    plan_block = "\n".join(f"{i+1}. {s}" for i, s in enumerate(plan))
    enriched = f"{query}\n\nFollow this plan:\n{plan_block}"
    return await run_agent(enriched)
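The `edit_plan_interactive` helper is called above but not defined in the text. One possible sketch follows, with key bindings taken from the sample session in this section (Enter to keep, `r` to replace, `d` to delete, `a` to add after). The `ask` parameter is an assumption added here so the function can be driven by scripted answers as well as by the keyboard.

```python
from typing import Callable

def edit_plan_interactive(
    plan: list[str], ask: Callable[[str], str] = input
) -> list[str]:
    """Walk the plan step by step and return the edited copy."""
    edited: list[str] = []
    for number, step in enumerate(plan, 1):
        print(f"Current step {number}: {step}")
        choice = ask(
            "[Enter] keep  [r] replace  [d] delete  [a] add after\n> "
        ).strip().lower()
        if choice == "r":
            edited.append(ask("New step: ").strip())
        elif choice == "d":
            continue  # drop this step
        elif choice == "a":
            edited.append(step)
            edited.append(ask("Step to add: ").strip())
        else:
            edited.append(step)  # Enter (or anything else) keeps the step
    return edited

# Scripted answers stand in for keyboard input: replace step 1, keep step 2.
answers = iter(["r", "SEARCH: pg_hba.conf authentication methods", ""])
result = edit_plan_interactive(
    [
        "SEARCH: authentication methods PostgreSQL pg_hba",
        "SEARCH: scram-sha-256 vs md5 password auth",
    ],
    ask=lambda prompt: next(answers),
)
print(result)
```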
What it looks like in practice
$ python scripts/run.py "Compare auth approaches in Postgres
for a web app, focused on changes since version 14"
Proposed plan:
1. SEARCH: authentication methods PostgreSQL pg_hba
2. SEARCH: scram-sha-256 vs md5 password auth
3. SEARCH: PostgreSQL 14 15 16 release notes auth
4. COMPARE: trade-offs of each method for web apps
5. SYNTHESIZE: recommendation with version caveats
[Enter] approve [e] edit [s] skip plan
> e
Editing plan. Current step 1: SEARCH: authentication...
[Enter] keep [r] replace [d] delete [a] add after
> r
New step 1: SEARCH: pg_hba.conf authentication methods
[Enter] keep [r] replace [d] delete [a] add after
> [Enter] (keeps step 2)
...
Approved plan:
1. SEARCH: pg_hba.conf authentication methods
2. SEARCH: scram-sha-256 vs md5 password auth
3. SEARCH: PostgreSQL 14 15 16 release notes auth
4. COMPARE: trade-offs for web apps
5. SYNTHESIZE: recommendation with version caveats
──────────────── starting agent ────────────────
→ search_docs({'query': 'pg_hba.conf authentication
methods'})
...
The model's first plan was reasonable, but a real Postgres user might know that "pg_hba.conf" is the specific term to search for, not "authentication methods." Editing step 1 saves 1–2 follow-up searches the agent would have needed to realize it should look at pg_hba.
This is the steerability handle. Without it, the agent's plan is a black box that you can only critique after it spends 10 model calls. With it, you can correct the trajectory before any expensive work happens.
Planning would be wasted overhead on simple questions, which is why the planner outputs SIMPLE for queries that don't need a plan. "What port does Postgres use?" gets SIMPLE; "Compare three things across versions" gets a plan. The user never sees a plan for trivial questions.
For non-trivial queries, the plan-approval step takes ~10 seconds and saves dozens of API calls when it catches a misdirected investigation early. Net win for everyone except the simplest cases — which the planner skips automatically.
Make runs interruptible and resumable.
Long investigations might take minutes. You want to be able to pause one, inspect what the agent has decided so far, edit a tool result that looks wrong, and resume from that point. This is what real Agent SDKs give you. Building it yourself teaches you exactly what state matters.
The trick: persist messages to disk after every iteration. The message history is the agent's state. If you have the history, you can resume.
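# agent/state.py
import json
import uuid
from pathlib import Path

class RunState:
    def __init__(self, run_id: str | None = None):
        self.run_id = run_id or str(uuid.uuid4())[:8]
        self.path = Path(f"runs/{self.run_id}.jsonl")
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def checkpoint(self, step: int, history: list):
        # Append one JSON line per step. history should hold plain dicts:
        # default=str only keeps the write from failing, and any SDK object
        # it stringifies cannot be sent back to the API on resume, so
        # convert response blocks (e.g. with model_dump()) before storing.
        with self.path.open("a") as f:
            f.write(json.dumps({
                "step": step,
                "history": history,
            }, default=str) + "\n")

    def resume(self) -> list:
        # The last line of the file holds the most recent full history.
        if not self.path.exists():
            return []
        last = None
        with self.path.open() as f:
            for line in f:
                last = json.loads(line)
        return last["history"] if last else []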
Wire it into the loop
async def run_agent(query, max_steps=10, run_id=None):
    state = RunState(run_id)
    saved = state.resume()  # read the run file once, not three times
    history = saved or [{"role": "user", "content": query}]
    # Each completed step adds two messages (assistant turn + tool results)
    # after the initial user message, so len // 2 is the next step index.
    start_step = len(saved) // 2
    for step in range(start_step, max_steps):
        response = await client.messages.create(...)
        history.append({"role": "assistant", ...})
        # ... dispatch tools, append results ...
        state.checkpoint(step, history)  # persist every step
    return {"run_id": state.run_id, ...}
What you can now do
Stop a run mid-flight (Ctrl-C). Resume it later:
$ python scripts/run.py "Compare auth methods..."
# ... runs for a few steps, you hit Ctrl-C ...
^C
Interrupted at step 3. Run ID: a4f7c2e1
$ python scripts/run.py --resume a4f7c2e1
# resumes from step 4, history intact
Or inspect a run and edit a tool result:
$ python scripts/inspect.py a4f7c2e1
Step 2: search_docs({'query': 'auth methods'})
→ returned 5 chunks, but the top one is wrong
→ press [e] to edit the result before resuming
> e
Editing tool_result at step 2...
[edit in your editor]
Resumed. Re-running from step 3 with edited result.
This is the foundation of "time-travel debugging" for agents. When evals in Phase 4 find a failing query, you'll want to inspect exactly where the agent went wrong and try fixes — different prompts, different tool results, different parameters — without re-running expensive steps before the failure point.
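The inspect script itself is not shown, but its core operation can be sketched as a pure function over the saved history: find the tool result, overwrite it, and drop every message that came after it so the agent re-derives its later reasoning. The name `edit_tool_result` is hypothetical, and the message shapes follow the Anthropic-style `tool_use` / `tool_result` dicts used earlier; an OpenAI-style history would need to match on `call_id` instead.

```python
import copy

def edit_tool_result(history: list[dict], tool_use_id: str, new_content: str) -> list[dict]:
    """Return a copy of history, truncated just after the edited tool result."""
    for index, message in enumerate(history):
        # Tool results live in user messages whose content is a list of blocks.
        if message["role"] != "user" or not isinstance(message["content"], list):
            continue
        ids = [b.get("tool_use_id") for b in message["content"]]
        if tool_use_id not in ids:
            continue
        truncated = copy.deepcopy(history[: index + 1])  # leave the original untouched
        for block in truncated[index]["content"]:
            if block.get("tool_use_id") == tool_use_id:
                block["content"] = new_content
        return truncated
    raise KeyError(f"no tool_result with id {tool_use_id!r}")

history = [
    {"role": "user", "content": "Compare auth methods"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "search_docs",
         "input": {"query": "auth methods"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "wrong chunk"}]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "reasoning built on the wrong chunk"}]},
]
fixed = edit_tool_result(history, "t1", "pg_hba.conf chunk")
print(len(fixed), fixed[-1]["content"][0]["content"])
```

Writing `fixed` out as a new checkpoint, rather than rewriting the run file in place, keeps the original trajectory available for comparison.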
Production SDKs do this too. Both Anthropic's Agent SDK and OpenAI's Agents SDK ship with their own mechanisms for persisting run state and resuming a session. You've now built a minimal version of that mechanism.
Now if you adopt an SDK, you'll know exactly what the abstraction is doing under the hood — and you'll know when its choices don't fit your needs and you need to build something custom instead.
JSONL is append-only, human-readable, and crash-tolerant: a crash mid-write can at worst leave one truncated final line, and every earlier checkpoint stays intact. You can cat a run file, grep for specific tool calls, diff two runs. SQLite would also work and gives you easier queries; switch when you have hundreds of runs and want to analyze across them.
At this scale, JSONL is the right starting point.
Deliverable
An agent that plans before complex queries, fans out searches in parallel, isolates deep research in subagents, and can be paused and resumed. The same 10 questions from Phase 1 should show qualitatively different (faster, cleaner) behavior.
- Async parallel tool execution
- Subagent spawning with depth bound
- Planner with user-editable plans
- JSONL checkpoint + resume from any step
- Trace diff: phase 2 vs phase 3 on the test set