Search strategies — branching over trajectories.
When a single trajectory is too fragile, you explore several and select the best. Best-of-N, self-consistency, tree-of-thought and graph-of-thought are points on one spectrum: spend more inference compute to widen the search, then pay a selector to pick. This essay covers the spectrum, the multiplicative cost, and the load-bearing dependency on a usable scoring signal.
The unifying idea: generate candidates, then select.
Single-trajectory patterns (ReAct, plan-and-execute) commit to one path. If an early step is wrong, the whole run is wrong. Search patterns hedge by exploring multiple paths and choosing among them. Every search method answers three questions: how do we branch (sample N independent answers, or expand a tree of partial states), how do we score a node or a finished candidate, and how do we select the winner. Get the scoring wrong and the whole family collapses — that is the recurring theme.
- Best-of-N / rejection sampling. Generate N independent attempts, score each, return the best. No structure shared between attempts. Embarrassingly parallel.
- Self-consistency. Sample N chains of thought, take the majority answer. The "scorer" is a vote — works only when the answer space is small and discrete (final numeric answer, label).
- Tree-of-thought (ToT). Decompose the problem into steps; at each step generate several candidate continuations, score partial states, and search (BFS/DFS/beam) with backtracking. Spends compute where the problem is hard.
- Graph-of-thought (GoT). Generalizes the tree to a DAG: thoughts can be merged and refined, not only branched. Powerful for problems where partial solutions should be combined; the most engineering-heavy.
The shape of a tree search.
# Tree-of-thought, beam search variant frontier = [root(task)] for depth in range(MAX_DEPTH): candidates = [] for node in frontier: for thought in generator.expand(node, k=BRANCH): child = node.extend(thought) child.score = evaluate(child) # the load-bearing call candidates.append(child) frontier = top_k(candidates, BEAM_WIDTH) # prune; backtrack implicit if any(c.is_solution for c in frontier): break return best(frontier)
Note the cost: model calls scale as roughly BRANCH × BEAM_WIDTH × MAX_DEPTH, and each node also pays an evaluate() call. A modest tree (branch 3, beam 3, depth 4) is on the order of dozens of model calls for one answer. This is not a tweak; it is a different cost regime.
The scoring signal decides whether any of this works.
Every search method's quality is capped by the quality of its evaluator. The hierarchy of signals, best to worst:
- Ground-truth verifier (tests pass, equation balances, tool succeeds). Search shines here — it is parallel hypothesis testing against an oracle. This is where ToT/best-of-N give their largest, most reliable gains.
- Structured rubric / programmatic heuristic. Decent when the rubric is concrete and discriminative.
- LLM-as-judge self-evaluation. Weak and biased. The judge often cannot rank candidates better than chance on the hard cases that motivated search in the first place. Best-of-N over a bad judge just finds the answer the judge is most confident about, not the correct one.
- Majority vote (self-consistency). Works only when the correct answer is the modal sample. Fails when the model is systematically biased toward a wrong answer — voting amplifies the bias.
The dominant anti-pattern: deploying tree search with an LLM self-evaluator on a task with no external check. You pay 30x the inference cost to explore a space you cannot navigate, and the selector confidently returns a wrong leaf. Before adding branching, prove you have a discriminative scorer. Architecture amplifies a signal; it cannot manufacture one.
When to reach for search — and when not.
Use it when: the task has a verifiable or sharply discriminable answer (competitive math, code with tests, constraint satisfaction, planning/puzzles); single-pass accuracy is the bottleneck and you have measured it; latency and cost budgets can absorb a 5–30x multiplier; the value of a correct answer dwarfs the inference cost.
Avoid it when: outputs are open-ended with no discriminative scorer (free-form writing, advice) — branching just samples more unverifiable text; the task is latency-sensitive or high-QPS — the cost multiplier is unaffordable; a cheaper pattern already clears the quality bar. Always exhaust ReAct and a verifier-backed reflection loop before introducing tree search; it is the heaviest pattern in this section and the easiest to deploy where it does not pay.
Pragmatic ladder: start with best-of-N (trivial to add, embarrassingly parallel, often 80% of the gain). Move to self-consistency only for small discrete answer spaces. Reserve full ToT/GoT for problems where partial-state scoring and backtracking demonstrably beat independent samples — that is a minority of real workloads.
The honest tradeoff: search buys robustness against unlucky single trajectories at a multiplicative inference cost, and it pays off strictly in proportion to how good your scorer is. With a ground-truth verifier it is one of the strongest levers available; without one it is the most expensive way to be confidently wrong.