Research & synthesis agents: the failure mode is a fluent answer built on a source that does not exist.
A deep-research agent that browses, reads, and writes a cited report is one of the most useful agents of 2025–2026 and one of the easiest to trust too much: it produces the surface form of rigor — sections, citations, confident synthesis — whether or not the sources are real or say what the agent claims. This playbook is the opinionated version: multi-source synthesis without false consensus, citation discipline as a hard constraint, depth-versus-breadth as a budget decision, and why a fabricated source is the only failure that destroys the whole deliverable.
The job is a defensible synthesis, not a confident essay.
A research agent's output is structurally indistinguishable from a good one: headings, a narrative, footnotes. The reader cannot see whether citation [4] supports the sentence it is attached to, whether two "independent" sources are the same press release reworded, or whether a key claim rests on a URL the agent never actually fetched. The job is not "write a report on X" — it is "produce claims a skeptical reader can verify by following the citations, and flag what you could not confirm." A research agent that says "sources disagree, here is the spread, and I could not verify this figure" is doing the job; one that resolves every tension into a smooth confident narrative is hiding the work that matters.
Citation discipline is a hard constraint, enforced after retrieval.
The non-negotiable rule: every non-trivial claim is attributed to a source the agent actually retrieved, and the attribution is verified — not generated. The dangerous failure is a hallucinated citation: a plausible title, author, and year for a paper that does not exist, or a real URL that does not contain the claimed fact. Enforce attribution as a post-step that re-checks each citation against the fetched content, and drop or flag claims that fail.
# every claim must map to fetched content; verify, do not trust for claim, cite in draft.claims: doc = fetched.get(cite.url) # must have been actually retrieved if doc is None or not supports(doc, claim): claim.flag("unverified") # do not silently keep it
An agent asked to "add citations" to an already-written draft will reverse-engineer plausible-looking sources to fit the prose — citations bolted on after the fact are the single richest source of fabricated references. Retrieve first, write from what was retrieved, then verify; never write then cite.
Multi-source synthesis must preserve disagreement, not average it.
The value of a research agent is cross-source synthesis; the trap is manufacturing a false consensus by smoothing over real disagreement. Three sources saying the same thing because they all cite one original is one source, not three — corroboration requires independent provenance. The agent must distinguish agreement from echo, surface conflicting estimates as a range rather than a midpoint, and treat a single unconfirmed source as a lead, not a fact. Synthesis that erases uncertainty is not synthesis, it is laundering.
Depth versus breadth is a budget decision, made explicit.
An unbounded research agent either reads three sources shallowly and misses the answer, or chases citations forever and burns the budget. Depth versus breadth is not an emergent behavior to hope for — it is a planning decision the agent should make explicit and you should bound: a fixed source budget, a stopping rule (stop when new sources stop changing the conclusion, not when the agent feels done), and a structure that allocates more depth to load-bearing claims and less to background. The right autonomy: autonomous gathering and drafting under a hard source/step/cost budget, with the synthesis surfaced for human judgment when stakes are high.
A good stopping rule is information-gain based: keep gathering while each new source still moves a key claim or its confidence; stop when sources only repeat what you have. "Read N sources" is a budget; "read until marginal sources stop changing the answer" is research.
The eval signal: verifiable claims, not reader satisfaction.
"The report reads well" is the metric that lets fabrication through — fluent and wrong is the exact failure. The eval that steers a research agent grades, on a bank of questions with known answers and known good sources:
- Claim accuracy — sampled claims checked against ground truth; a confidently stated wrong fact is weighted far below a hedged correct one.
- Citation faithfulness — sampled citations verified to exist and to actually support the attached claim; a single fabricated source is a report-level failure.
- Coverage and calibration — did it find the sources that mattered, and did it correctly flag what it could not confirm rather than paper over it.
The research & synthesis checklist.
- Every non-trivial claim attributed to an actually-retrieved source; attribution verified in a post-step, not generated.
- Retrieve → write from retrieved → verify; never write then bolt on citations.
- Corroboration requires independent provenance; conflicting estimates shown as a range, not a midpoint.
- Single unconfirmed source is a lead, not a fact; uncertainty is preserved, not smoothed.
- Depth vs breadth is a bounded budget with an information-gain stopping rule.
- Autonomous within a hard source/step/cost budget; synthesis surfaced for human judgment on high-stakes work.
- Eval grades claim accuracy and citation faithfulness against ground truth; one fabricated source fails the report.
The honest tradeoff: a research agent that flags uncertainty and refuses to invent corroboration produces shorter, hedgier, less impressive reports — but the impressive version's polish is precisely the camouflage over a single fabricated source that voids the entire deliverable.