1.1
Part I / Build · Week 1

Build the smallest possible agent.

One model, three tools, a while-loop. No frameworks. The goal isn't to ship — it's to feel every primitive in your hands so you have opinions later.

STEP 1

Set up the project skeleton.

Before any code, we need three things: a corpus (the documents the agent will read), a place to put our agent code, and a Python environment with an SDK installed. That's it. No vector database, no framework, no orchestration layer. Every piece of infrastructure you add is one more thing hiding the primitives you're trying to learn.

Picking a corpus

A corpus is just a folder of text files. Pick something you care about — you'll know immediately when the agent is wrong. Good choices: the PostgreSQL documentation, your own markdown notes, or a single open-source project's docs (React, Rust book, Kubernetes concepts). Aim for 50–500 markdown files.

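With fewer than 50 files the agent rarely needs to think; with more than 500 you'll spend too much time waiting on retrieval iterations. This tutorial assumes the PostgreSQL docs, converted to markdown and stored as a flat folder of .md files (the search tool in Step 2 only scans the top level of corpus/).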
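Before moving on, it can help to check that your corpus actually lands in that range. The sketch below is one way to do it; corpus_report is a hypothetical helper name, and it counts only top-level .md files because that is what the search tool later in this section scans.

```python
from pathlib import Path


def corpus_report(corpus_dir: str = "corpus") -> dict:
    """Count top-level .md files and compare against the 50-500 guideline."""
    files = sorted(Path(corpus_dir).glob("*.md"))
    count = len(files)
    if count < 50:
        verdict = "too small"
    elif count > 500:
        verdict = "too large"
    else:
        verdict = "ok"
    return {"count": count, "verdict": verdict}
```

Call corpus_report() from the project root once corpus/ is populated. The thresholds are the tutorial's guideline, not hard limits.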

The folder layout

# Create the project
mkdir research-agent && cd research-agent
mkdir -p corpus agent evals scripts runs

# Files we'll fill in over the next steps
touch agent/__init__.py
touch agent/loop.py        # the agentic loop
touch agent/tools.py       # tool definitions + handlers
touch agent/prompts.py     # system prompts
touch scripts/run.py       # CLI entry point
touch .env                 # for API keys

Run tree -a (or ls -aR; the -a flag is needed to show the hidden .env file) and you'll see something like:

research-agent/
├── agent/
│   ├── __init__.py
│   ├── loop.py
│   ├── prompts.py
│   └── tools.py
├── corpus/         # populate with your .md files
├── evals/
├── runs/
├── scripts/
│   └── run.py
└── .env

Installing dependencies

The Python environment is identical for both APIs — only the SDK package differs.

# Create a virtual environment (same for both APIs)
python -m venv .venv
source .venv/bin/activate

# Anthropic: install the SDK + utilities
pip install anthropic python-dotenv rich

# OpenAI: install the SDK + utilities
pip install openai python-dotenv rich

Three packages, only one provider-specific. python-dotenv reads your API key from .env. rich makes traces readable.

The .env file

# .env (Anthropic): get this key from console.anthropic.com
ANTHROPIC_API_KEY=sk-ant-api03-...your-key...

# .env (OpenAI): get this key from platform.openai.com/api-keys
OPENAI_API_KEY=sk-proj-...your-key...

Add .env (and .venv/) to .gitignore immediately, before your first commit, for example with echo ".env" >> .gitignore. Keys pushed to a public repository can be scraped within hours.

Question
Can I use both APIs in the same project?

Yes. The agent's loop, tools, and corpus stay identical; only the model call swaps. By the end of Phase 1 you'll have two versions of loop.py with the same behavior, which is the cleanest way to feel what each API's design choices actually cost or buy you.

Install both SDKs and put both keys in .env. Otherwise pick one for now.

Question
Why no framework? LangChain has all of this.

Because the framework hides what you're trying to learn. LangChain's AgentExecutor is a single large class (several hundred lines) that manages the loop, tool dispatch, retries, memory, and parsing: exactly the primitives this tutorial is about. Start with it and you'll know how to configure an agent, not how one works.

After you've built this from scratch, frameworks become tools instead of black boxes.

Run git init && git add . && git commit -m "scaffold" now. You'll want a clean checkpoint to diff against once the agent works.

STEP 2

Define three tools. Exactly three.

An agent is a model that calls tools in a loop. Without tools, you have a chatbot. With tools, you have an agent. Three tools is the minimum that lets a research agent show its reasoning.

The three tools and why

search_docs(query) finds relevant documents — returns doc IDs plus snippets, not full text. Snippets keep context lean and force the agent to decide which docs are worth fetching.

fetch_doc(doc_id) reads a full document. Separating "find" from "read" is deliberate: it makes the agent's relevance judgments visible in the trace.

submit_answer(answer, citations) ends the loop with structured output. Without this, the agent might just say the answer in plain text and we'd have no way to extract citations.

Writing the tool schemas

Both APIs use JSON Schema for parameters, but the wrapping differs. Anthropic wraps in {name, description, input_schema}. OpenAI's Responses API uses {type: "function", name, description, parameters}. The parameter schemas themselves are identical apart from OpenAI's additionalProperties flag. The Anthropic version comes first, then the OpenAI version.

# agent/tools.py (Anthropic Messages API version)
TOOLS = [
    {
        "name": "search_docs",
        "description": (
            "Search the corpus by keyword. Returns up to "
            "5 matches, each with doc_id and 300-char snippet."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"],
        },
    },
    {
        "name": "fetch_doc",
        "description": "Fetch full text of a doc by doc_id.",
        "input_schema": {
            "type": "object",
            "properties": {
                "doc_id": {"type": "string"}
            },
            "required": ["doc_id"],
        },
    },
    {
        "name": "submit_answer",
        "description": "Submit final answer. Ends conversation.",
        "input_schema": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "citations": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["answer", "citations"],
        },
    },
]

# agent/tools.py (OpenAI Responses API version)
TOOLS = [
    {
        "type": "function",
        "name": "search_docs",
        "description": (
            "Search the corpus by keyword. Returns up to "
            "5 matches, each with doc_id and 300-char snippet."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"],
            "additionalProperties": False,
        },
    },
    {
        "type": "function",
        "name": "fetch_doc",
        "description": "Fetch full text of a doc by doc_id.",
        "parameters": {
            "type": "object",
            "properties": {
                "doc_id": {"type": "string"}
            },
            "required": ["doc_id"],
            "additionalProperties": False,
        },
    },
    {
        "type": "function",
        "name": "submit_answer",
        "description": "Submit final answer. Ends conversation.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "citations": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["answer", "citations"],
            "additionalProperties": False,
        },
    },
]
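If you keep both versions, you don't have to maintain two hand-written lists. The sketch below derives the OpenAI shape from the Anthropic one; to_openai_tool is a hypothetical helper, and it only sets additionalProperties on the top-level object, which is enough for the flat schemas used here.

```python
import copy


def to_openai_tool(anthropic_tool: dict) -> dict:
    """Rewrap an Anthropic-style tool dict in the OpenAI Responses shape."""
    # Deep-copy so the original Anthropic definition is left untouched.
    params = copy.deepcopy(anthropic_tool["input_schema"])
    # Mirror the hand-written OpenAI list, which sets this flag
    # on the top-level object schema.
    params["additionalProperties"] = False
    return {
        "type": "function",
        "name": anthropic_tool["name"],
        "description": anthropic_tool["description"],
        "parameters": params,
    }


fetch_doc_tool = {
    "name": "fetch_doc",
    "description": "Fetch full text of a doc by doc_id.",
    "input_schema": {
        "type": "object",
        "properties": {"doc_id": {"type": "string"}},
        "required": ["doc_id"],
    },
}

converted = to_openai_tool(fetch_doc_tool)
```

A schema with nested objects would need the flag applied recursively; this sketch doesn't attempt that.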

Writing the handlers

Handlers are pure Python — no provider-specific code. They're the same for both APIs.

# agent/tools.py (continued — shared)
from pathlib import Path

CORPUS = Path("corpus")

def search_docs(query: str) -> list[dict]:
    """Dumb substring scan. Intentionally bad."""
    results = []
    q = query.lower()
    # sorted() keeps result order stable across runs and platforms
    for path in sorted(CORPUS.glob("*.md")):
        text = path.read_text(encoding="utf-8")
        idx = text.lower().find(q)  # -1 when the query is absent
        if idx == -1:
            continue
        # ~300-char window: 100 chars before the match, 200 after
        start = max(0, idx - 100)
        end = min(len(text), idx + 200)
        results.append({
            "doc_id": path.stem,
            "snippet": text[start:end].strip(),
        })
        if len(results) >= 5:
            break
    return results

def fetch_doc(doc_id: str) -> dict:
    path = CORPUS / f"{doc_id}.md"
    if not path.exists():
        return {"error": f"no doc: {doc_id}"}
    return {
        "doc_id": doc_id,
        "content": path.read_text(encoding="utf-8"),
    }

HANDLERS = {
    "search_docs": search_docs,
    "fetch_doc": fetch_doc,
    # submit_answer is handled in the loop, not here
}
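One thing fetch_doc doesn't guard against: the model chooses doc_id, and a value like "../notes" would resolve to a file outside corpus/. A minimal sketch of a guard follows, assuming a flat corpus folder; resolve_doc_path is a hypothetical helper, not part of the tutorial's code.

```python
from pathlib import Path
from typing import Optional


def resolve_doc_path(corpus: Path, doc_id: str) -> Optional[Path]:
    """Return the file path for doc_id, or None if it would leave the corpus."""
    corpus = corpus.resolve()
    candidate = (corpus / f"{doc_id}.md").resolve()
    # Reject ids such as "../secrets" or "sub/doc": in a flat corpus,
    # every valid document sits directly inside the corpus folder.
    if candidate.parent != corpus:
        return None
    return candidate
```

fetch_doc could call this first and return its existing {"error": ...} dict when the result is None.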

Testing the tools in isolation

Before touching the loop, prove the tools work. Open a Python REPL from the project root (so that both the agent package and the relative corpus/ path resolve):

>>> from agent.tools import search_docs
>>> results = search_docs("connection pool")
>>> for r in results:
...     print(r["doc_id"], "→", r["snippet"][:60])
runtime-config-connection → ...connection pool can hold up to max_connections...
runtime-config-resource → ...each connection consumes shared memory, so pool size...
pgbouncer-modes → ...connection pooling in PgBouncer operates in three distinct...
libpq-connect → ...PQconnectdb opens a single connection; for connection pool...
ddl-system-columns → ...system catalogs in the connection pool are shared...

Five real results, each with a snippet. Crude but functional.

Question
The two schemas differ in two ways. What's going on with additionalProperties and the top-level "type": "function"?

Top-level type: OpenAI Responses supports built-in tools (web_search, file_search, computer_use) alongside functions, so "type": "function" distinguishes user-defined functions from those built-ins. Anthropic also ships built-in tools with their own type values, but a user-defined tool needs no type field, so the Anthropic list omits it.

additionalProperties: false: OpenAI's "strict mode" constrains the model's arguments to match your schema exactly, and it requires this flag on every object in the schema (along with every property being listed in required). The flag doesn't switch strict mode on by itself; that is the job of the tool's strict field. Without strict mode, the model might invent fields. Anthropic's API handles schema conformance differently and doesn't need this flag.

Question
Why not let the agent see all 5 snippets and the full docs in one tool call?

Context window is a resource the agent has to manage. If search_docs returned full documents, every search would dump ~20k tokens into the conversation. After two searches the agent is drowning; after four it hits the context limit.

Snippets-then-fetch forces the agent to decide which docs are worth the cost. That decision is a behavior we want to see in traces.
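The arithmetic behind that claim is worth doing once. The sketch below uses the 300-character snippets and 5-result cap from the tool description, the ~20k-token figure quoted above, and a rough 4-characters-per-token rule of thumb (an assumption, not a real tokenizer).

```python
def estimate_tokens(num_chars: int, chars_per_token: int = 4) -> int:
    """Rough estimate only; 4 chars/token is a rule of thumb, not a tokenizer."""
    return num_chars // chars_per_token


SNIPPET_CHARS = 300              # from the search_docs description
RESULTS_PER_SEARCH = 5           # from the search_docs description
FULL_TOKENS_PER_SEARCH = 20_000  # the ~20k figure quoted above

snippet_tokens_per_search = RESULTS_PER_SEARCH * estimate_tokens(SNIPPET_CHARS)

# Compare cumulative context cost after 1, 2, and 4 searches.
for searches in (1, 2, 4):
    print(
        searches,
        searches * snippet_tokens_per_search,
        searches * FULL_TOKENS_PER_SEARCH,
    )
```

Under these assumptions, four searches cost roughly 1,500 tokens with snippets versus roughly 80,000 with full documents.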

Don't add a list_all_docs tool. You'll be tempted, but it teaches the wrong instinct — you want the agent to reason about queries, not browse-and-grep.

STEP 3

Write the loop. Make it ugly.

The core idea of agents: a while-loop where each iteration calls the model, executes any tools it asked for, and feeds results back. Everything else — frameworks, orchestration, agent SDKs — is decoration around this loop.

The mental model

┌─────────────────────────────────────────────┐
│ history = [user query]                      │
│                                             │
│ loop:                                       │
│   response = model.call(history, tools)     │
│   history.append(response)                  │
│                                             │
│   if response wants to submit:              │
│     → return answer                         │
│                                             │
│   for each tool_call in response:           │
│     result = run_tool(name, args)           │
│     history.append(tool_result)             │
│                                             │
│   (repeat)                                  │
└─────────────────────────────────────────────┘

The history list grows on every iteration. The model sees the full history on every turn — it remembers what it searched, what came back, what it decided. That's the agent's "memory" for this conversation.

This mental model is identical for both APIs. They differ in names: Anthropic calls the history messages, OpenAI calls it input. Anthropic returns content blocks; OpenAI returns output items. Anthropic uses tool_use/tool_result; OpenAI uses function_call/function_call_output. Same shape, different vocabulary.
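You can run this mental model without any API at all. The sketch below replaces the model with a scripted stand-in that returns one tool call per turn. Everything here (fake_model, run_loop, the simplified history items) is hypothetical scaffolding to show the loop's shape, not either provider's real message format.

```python
def fake_model(history, script):
    """Stand-in for a model call: return the next scripted tool call.

    The turn index is the number of assistant turns already in history.
    """
    turn = sum(1 for item in history if item["role"] == "assistant")
    return script[turn]


def run_loop(user_query, script, handlers, max_steps=10):
    history = [{"role": "user", "content": user_query}]
    for step in range(max_steps):
        call = fake_model(history, script)            # model.call(history, tools)
        history.append({"role": "assistant", "call": call})
        if call["name"] == "submit_answer":           # response wants to submit
            return {"status": "answered", "answer": call["args"]["answer"],
                    "steps_used": step + 1, "history": history}
        result = handlers[call["name"]](**call["args"])  # run_tool(name, args)
        history.append({"role": "tool", "result": result})
    return {"status": "step_limit", "history": history}


SCRIPT = [
    {"name": "search_docs", "args": {"query": "default port"}},
    {"name": "fetch_doc", "args": {"doc_id": "runtime-config-connection"}},
    {"name": "submit_answer",
     "args": {"answer": "5432", "citations": ["runtime-config-connection"]}},
]
FAKE_HANDLERS = {
    "search_docs": lambda query: [
        {"doc_id": "runtime-config-connection", "snippet": "port 5432"}],
    "fetch_doc": lambda doc_id: {"doc_id": doc_id, "content": "Default is 5432."},
}

result = run_loop("What's the default PostgreSQL port?", SCRIPT, FAKE_HANDLERS)
```

The history ends up with six items: the user query, two assistant/tool pairs, and the final assistant turn. That growth is what the real loops below manage.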

The system prompt

Identical for both APIs.

# agent/prompts.py
SYSTEM_PROMPT = """You are a research assistant for a documentation corpus.

You have three tools:
- search_docs(query): find relevant documents
- fetch_doc(doc_id): read a full document
- submit_answer(answer, citations): finish

Process:
1. Search for terms related to the user's question.
2. If a snippet looks promising, fetch the full doc.
3. Repeat until you have enough to answer confidently.
4. Submit your answer with citations to doc_ids you used.

Rules:
- Every claim must be supported by a cited doc.
- If the corpus doesn't have the answer, say so honestly.
- Don't search for the same thing twice.
- Aim for 3-6 tool calls before submitting."""

The loop itself

# agent/loop.py — Anthropic Messages API
from anthropic import Anthropic
from agent.tools import TOOLS, HANDLERS
from agent.prompts import SYSTEM_PROMPT

client = Anthropic()

def run_agent(user_query: str, max_steps: int = 10):
    messages = [{"role": "user", "content": user_query}]
    trace = []

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )

        # Append assistant turn — required by API
        messages.append({
            "role": "assistant",
            "content": response.content,
        })
        trace.append({"step": step, "response": response})

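        # Any stop reason other than "tool_use" (end_turn,
        # max_tokens, ...) means there is nothing to dispatch.
        if response.stop_reason != "tool_use":
            return {"status": "halted_no_answer",
                    "trace": trace}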

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue

            if block.name == "submit_answer":
                return {
                    "status": "answered",
                    "answer": block.input["answer"],
                    "citations": block.input["citations"],
                    "steps_used": step + 1,
                    "trace": trace,
                }

            try:
                result = HANDLERS[block.name](**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result),
                })
            except Exception as e:
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": f"error: {e}",
                    "is_error": True,
                })

        # All tool results in one user turn
        messages.append({"role": "user",
                         "content": tool_results})

    return {"status": "step_limit", "trace": trace}

# agent/loop.py (OpenAI Responses API)
import json
from openai import OpenAI
from agent.tools import TOOLS, HANDLERS
from agent.prompts import SYSTEM_PROMPT

client = OpenAI()

def run_agent(user_query: str, max_steps: int = 10):
    # Responses API uses a flat list of "items"
    input_items = [
        {"role": "user", "content": user_query}
    ]
    trace = []

    for step in range(max_steps):
        response = client.responses.create(
            model="gpt-5.5",
            instructions=SYSTEM_PROMPT,
            tools=TOOLS,
            input=input_items,
        )

        # Append every output item to history
        for item in response.output:
            input_items.append(item.model_dump())
        trace.append({"step": step, "response": response})

        # Find function_call items in this turn
        calls = [i for i in response.output
                 if i.type == "function_call"]

        if not calls:
            # Model produced text without calling a tool
            return {"status": "halted_no_answer",
                    "trace": trace}

        for call in calls:
            args = json.loads(call.arguments)

            if call.name == "submit_answer":
                return {
                    "status": "answered",
                    "answer": args["answer"],
                    "citations": args["citations"],
                    "steps_used": step + 1,
                    "trace": trace,
                }

            try:
                result = HANDLERS[call.name](**args)
                output = str(result)
            except Exception as e:
                output = f"error: {e}"

            # Match output to call by call_id
            input_items.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": output,
            })

    return {"status": "step_limit", "trace": trace}

Same logic — different shapes

The loops are structurally identical: send history + tools, append response, dispatch any tool calls, append results, repeat. But the data shapes differ in ways worth understanding.

Anthropic uses messages with content blocks. Each turn is a {role, content} object where content is either a string (for plain user input) or a list of blocks (text, tool_use, and tool_result). The API enforces ordering: a user turn carrying tool_result blocks must come immediately after the assistant turn containing the matching tool_use blocks.

OpenAI Responses uses a flat item list. The input parameter is a list of items, each with a type: user messages, function_call items, function_call_output items, and so on. No strict alternation; items are correlated by call_id instead of by position.
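Seeing the same exchange in both shapes makes the difference concrete. The literals below are a hand-written sketch of one search call and its result. The IDs ("toolu_01", "call_01") and the result strings are made-up placeholders; real IDs are generated by each API.

```python
import json

# Anthropic: nested. Tool calls and results live inside content blocks.
anthropic_messages = [
    {"role": "user", "content": "What's the default PostgreSQL port?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "search_docs",
         "input": {"query": "default port"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": "placeholder search results"},
    ]},
]

# OpenAI Responses: flat. Calls and outputs are sibling items,
# and the arguments arrive as a JSON string rather than a dict.
openai_input = [
    {"role": "user", "content": "What's the default PostgreSQL port?"},
    {"type": "function_call", "call_id": "call_01", "name": "search_docs",
     "arguments": json.dumps({"query": "default port"})},
    {"type": "function_call_output", "call_id": "call_01",
     "output": "placeholder search results"},
]
```

In both cases the result points back at its call by ID: tool_use_id matches the tool_use block's id, and the output's call_id matches the function_call's call_id.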

The most common bug, in both APIs

anthropic.BadRequestError: Error code: 400 -
{'error': {'type': 'invalid_request_error',
'message': 'messages.1: tool_result block found without
corresponding tool_use block'}}

Your loop forgot to append the assistant turn before the user turn that carries the tool results. Fix the append order.

openai.BadRequestError: Error code: 400 -
{'error': {'message': 'No tool call found for
function_call_output with call_id call_xyz...',
'type': 'invalid_request_error'}}

You appended a function_call_output without the matching function_call in input. You forgot to append the model's output items before adding the tool result.

Different error messages, same underlying mistake: tool results need their corresponding calls in the history.
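You can catch this mistake before the API does. The sketch below checks an OpenAI-style flat item list for outputs that have no earlier matching call. orphaned_outputs is a hypothetical helper; an Anthropic equivalent would walk the content blocks of each message instead.

```python
def orphaned_outputs(input_items: list[dict]) -> list[str]:
    """Return call_ids of function_call_output items with no earlier function_call."""
    seen_calls = set()
    orphans = []
    for item in input_items:
        kind = item.get("type")  # plain user messages have no "type" key
        if kind == "function_call":
            seen_calls.add(item["call_id"])
        elif kind == "function_call_output" and item["call_id"] not in seen_calls:
            orphans.append(item["call_id"])
    return orphans


good = [
    {"role": "user", "content": "q"},
    {"type": "function_call", "call_id": "call_1", "name": "search_docs",
     "arguments": "{}"},
    {"type": "function_call_output", "call_id": "call_1", "output": "[]"},
]
# Same history with the model's function_call item left out.
bad = [good[0], good[2]]
```

Here orphaned_outputs(good) is empty and orphaned_outputs(bad) returns ["call_1"], which is the situation the 400 error describes.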

Question
What's stop_reason in Anthropic, and what's its OpenAI equivalent?

Anthropic returns response.stop_reason with values like "tool_use" (wants to call tools, loop continues) or "end_turn" (finished without tools, loop ends).

OpenAI Responses doesn't expose a single stop_reason the same way. Instead, you inspect response.output — if it contains any function_call items, dispatch them; if it only contains text/message items, the model is done.

Question
Why does OpenAI use call_id while Anthropic uses tool_use_id?

Same concept, different name. Both are opaque strings the API generates to correlate a tool call with its result. The model returns it on the call; you return it on the output; the API matches them up.

Why they exist: when the model makes multiple tool calls in one turn (parallel tools, covered in Phase 3), the API needs to know which result goes with which call. Without IDs, you'd have to rely on position order, which is fragile.

Question
Why max_steps=10?

Empirical. For a 3-tool research agent on a small corpus, 10 steps is generous — most queries finish in 4–6. The bound exists for safety: if the model gets stuck in a loop, we don't want it burning API credits forever.

If most runs hit the step limit, the bound isn't the problem — the agent is stuck. Read the trace and fix the underlying issue (usually a prompt problem).
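When you read a stuck trace, the first thing to look for is the same call repeated with the same arguments. A small sketch that flags those follows, assuming you have already pulled (name, args) pairs out of the trace; repeated_calls is a hypothetical helper.

```python
import json
from collections import Counter


def repeated_calls(calls: list[tuple[str, dict]]) -> list[tuple[str, dict]]:
    """Return (name, args) pairs that occur more than once, in first-seen order."""
    # Serialize args with sorted keys so equal dicts compare equal as strings.
    keys = [(name, json.dumps(args, sort_keys=True)) for name, args in calls]
    counts = Counter(keys)
    seen = set()
    repeats = []
    for call, key in zip(calls, keys):
        if counts[key] > 1 and key not in seen:
            seen.add(key)
            repeats.append(call)
    return repeats


calls = [
    ("search_docs", {"query": "VACUUM"}),
    ("search_docs", {"query": "VACUUM FULL"}),
    ("search_docs", {"query": "VACUUM"}),
]
```

For this list, repeated_calls(calls) returns the single pair ("search_docs", {"query": "VACUUM"}). Near-duplicates with slightly different wording won't be caught; you still need to read the trace for those.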

STEP 4

Run it. Watch every trace. Take notes.

Now we run the agent on real questions and read what it does, line by line. This is the most important step in Phase 1. Everything in Phases 2 and 3 is a response to behavior we see here.

Example 1: a clean run

$ python scripts/run.py "What's the default PostgreSQL port?"

Here's the kind of trace you'd see. The structure is the same whichever API you used; the exact wording of the thinking text varies by model and run:

──────────────────── Step 0 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ I'll search for "default port" to find the     │
│ PostgreSQL configuration.                      │
└────────────────────────────────────────────────┘
→ search_docs({'query': 'default port'})

──────────────────── Step 1 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ The snippet from runtime-config-connection     │
│ mentions port 5432. Let me fetch the full doc  │
│ to confirm.                                    │
└────────────────────────────────────────────────┘
→ fetch_doc({'doc_id': 'runtime-config-connection'})

──────────────────── Step 2 ────────────────────
→ submit_answer({
    'answer': 'PostgreSQL listens on TCP port
    5432 by default. This can be changed via the
    `port` parameter in postgresql.conf.',
    'citations': ['runtime-config-connection']
  })

─────────────── Final answer ───────────────
PostgreSQL listens on TCP port 5432 by default.
This can be changed via the `port` parameter in
postgresql.conf.
citations: ['runtime-config-connection']
status: answered

What to notice

Three steps, clean trajectory. Search → fetch → submit. The agent didn't over-research a simple question, didn't fetch unnecessary docs, cited exactly what it used.

The "thinking" blocks are real model output — both Claude and GPT often emit brief text alongside tool calls explaining their plan. Useful for debugging.
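The printing side of scripts/run.py isn't shown in this section, so here is one possible sketch of a step renderer. It assumes each step's content has been converted to plain dicts shaped like Anthropic content blocks ("text" and "tool_use"), and it uses plain strings rather than rich so the logic stays visible. render_step is a hypothetical name.

```python
def render_step(step: int, blocks: list[dict]) -> str:
    """Render one loop iteration: any model text first, then each tool call."""
    lines = [f"--- Step {step} ---"]
    for block in blocks:
        if block["type"] == "text" and block["text"].strip():
            lines.append(f"[thinking] {block['text'].strip()}")
        elif block["type"] == "tool_use":
            # The dict repr matches the call style used in the traces above.
            lines.append(f"-> {block['name']}({block['input']})")
    return "\n".join(lines)


example_blocks = [
    {"type": "text", "text": "I'll search for the default port."},
    {"type": "tool_use", "name": "search_docs",
     "input": {"query": "default port"}},
]
rendered = render_step(0, example_blocks)
```

Swapping the plain strings for rich panels and rules is a presentation change only; the block-walking logic stays the same.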

Example 2: a multi-hop question

$ python scripts/run.py "How does PgBouncer's transaction pooling
   interact with prepared statements?"

──────────────────── Step 0 ────────────────────
→ search_docs({'query': 'PgBouncer transaction pooling'})

──────────────────── Step 1 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ Got pgbouncer-modes. I need to fetch the full  │
│ doc AND search for prepared statements.        │
└────────────────────────────────────────────────┘
→ fetch_doc({'doc_id': 'pgbouncer-modes'})

──────────────────── Step 2 ────────────────────
→ search_docs({'query': 'prepared statements'})

──────────────────── Step 3 ────────────────────
→ fetch_doc({'doc_id': 'sql-prepare'})

──────────────────── Step 4 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ Prepared statements are session-scoped. In     │
│ transaction pooling, server connections rotate │
│ between transactions. They conflict.           │
└────────────────────────────────────────────────┘
→ submit_answer({
    'answer': 'Transaction pooling and prepared
    statements conflict. Prepared statements are
    scoped to a session, but transaction pooling
    rotates server connections between
    transactions...',
    'citations': ['pgbouncer-modes', 'sql-prepare']
  })

status: answered (5 steps)

What to notice

Two-topic decomposition. The agent recognized this is a question about two things and searched separately. Correct multi-hop behavior.

Sequential search-then-fetch. Each search yields one fetch decision, not all five. The snippets-then-fetch design is working.

Example 3: a failure mode you'll definitely see

Now a question where the agent goes wrong. Crude substring search makes Phase 1 failure modes show up clearly:

$ python scripts/run.py "When should I VACUUM versus VACUUM FULL?"

──────────────────── Step 0 ────────────────────
→ search_docs({'query': 'VACUUM'})

──────────────────── Step 1 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ Got 5 results but they all mention VACUUM      │
│ briefly. Let me search more specifically.      │
└────────────────────────────────────────────────┘
→ search_docs({'query': 'VACUUM FULL'})

──────────────────── Step 2 ────────────────────
→ search_docs({'query': 'when to use VACUUM'})

──────────────────── Step 3 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ The third search returned nothing — substring  │
│ "when to use" doesn't appear literally. Let    │
│ me try fetching routine-vacuuming directly.    │
└────────────────────────────────────────────────┘
→ fetch_doc({'doc_id': 'routine-vacuuming'})

──────────────────── Step 4 ────────────────────
→ submit_answer({...})

status: answered (5 steps)

Three problems in one run

Problem 1: redundant searching. Three searches because substring match returned too-broad results. Semantic search would likely have hit the right doc on the first try.

Problem 2: phrase search fails on natural language. "when to use VACUUM" returned nothing because no doc contains that literal phrase.

Problem 3: recovery by lucky guess. It guessed routine-vacuuming because that's a plausible doc name. On another corpus this wouldn't work.

This is the data we need. Three concrete reasons to upgrade retrieval in Phase 2.
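Problem 2 is easy to reproduce without the agent. The sketch below applies the same case-insensitive containment test that search_docs uses to a single sentence. The sentence is an illustrative stand-in, not a quote from the corpus.

```python
def substring_hit(query: str, text: str) -> bool:
    """The same test search_docs applies: case-insensitive literal containment."""
    return query.lower() in text.lower()


# Illustrative sentence, not taken from the real docs.
doc = ("VACUUM FULL rewrites the whole table, while plain VACUUM "
       "only marks dead rows as reusable.")

hits = {
    query: substring_hit(query, doc)
    for query in ["VACUUM", "VACUUM FULL", "when to use VACUUM"]
}
```

The first two queries match and the third doesn't, even though this sentence is exactly what the question is about. The phrase "when to use" never appears literally, so the substring test has nothing to find.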

The trace log

Don't just watch traces — write them down. Keep runs/notes.md:

# Phase 1 trace log

## 2026-05-16 — first runs

### Q: "default postgres port"
- 3 steps, clean. ✓ search → fetch → submit

### Q: "PgBouncer pooling + prepared statements"
- 5 steps, correct answer
- ✓ decomposed into two sub-searches

### Q: "VACUUM vs VACUUM FULL"
- 5 steps, eventually correct
- ✗ 3 redundant searches before progress
- ✗ "when to use VACUUM" returned 0 results
- ✗ recovered by guessing doc_id, lucky
→ Phase 2 needs: semantic similarity, reranker

### Q: "How do I configure SSL?"
- 8 steps, hit token limit on fetch
- ✗ libpq-ssl doc is ~30k tokens
→ need chunking, not full-doc fetches

This log is the design pressure for Phase 2. You'll re-read it before writing the retrieval stack.
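If typing entries by hand gets tedious, a few lines of Python can append them for you. This is a minimal sketch, assuming the runs/notes.md layout shown above; log_run is a hypothetical helper and the observations are still yours to write.

```python
from pathlib import Path


def log_run(notes_path, question, steps, status, observations):
    """Append one run entry to the markdown trace log."""
    entry = [f'### Q: "{question}"', f"- {steps} steps, {status}"]
    entry += [f"- {obs}" for obs in observations]
    path = Path(notes_path)
    # Create runs/ on first use so the helper works in a fresh checkout.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write("\n".join(entry) + "\n\n")
```

For example, log_run("runs/notes.md", "default postgres port", 3, "answered", ["clean: search, fetch, submit"]) adds one entry in the same style as the first one in the log above.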

Question
If I built both Anthropic and OpenAI versions, how should they compare?

On simple queries, near-identical — both models handle this kind of tool use well. Differences you might notice:

  • Strict mode. OpenAI's strict mode (which requires additionalProperties: false) makes argument parsing more reliable on complex schemas. Anthropic's models tend to follow schemas well without this hint.
  • Thinking style. Claude tends toward brief plans, GPT toward more verbose narration. Adjust your system prompt if you want one style.
  • Recovery from tool errors. Both recover well. We'll test this systematically in Phase 4.

Question
My agent answered correctly even with crude search. Do I need Phase 2?

Probably your corpus is small or questions are simple enough that substring matching covers them. Try harder questions: multi-hop, paraphrased terms, conceptual questions where the answer is implied. Crude search will fail.

If it still works, you've found a useful truth: retrieval complexity should match question complexity. For a 50-doc personal wiki, plain keyword search (substring matching, or a ranked variant such as BM25) might be enough forever. For 50,000 enterprise docs with vague questions, you need everything in Phase 2.

Question
The agent sometimes ends with halted_no_answer. How do I fix it?

The model produced text without calling submit_answer. Common causes:

  • Model thinks it already answered. Tool result contained the answer verbatim; model paraphrased without the tool call.
  • Model gave up. Several searches returned nothing; model said "I can't find this." Fix: prompt should say "call submit_answer with 'not in corpus' if the answer isn't there."

Usually a prompt tweak, not a code change.

Save trace logs forever. When tuning Phase 4 evals, you'll want to remember which questions were hard in Phase 1 — they make excellent test cases.

End of week 1

Deliverable

A working CLI that returns answers with citations or fails gracefully. A trace log of ~10 runs with observations. You should be able to articulate the top 3 failure modes that motivate Phase 2.

  • Agent loop in under 100 lines of Python
  • Three tools with descriptions, no framework
  • Pretty-printed traces with rich
  • 10+ runs logged with notes
  • Top 3 failure modes identified and named