Tree and graph of thought are search — and search is only as good as its scorer.
Tree-of-Thought (Yao et al., 2023) decomposes a problem into steps, branches several candidate continuations per step, scores partial states, and searches with backtracking. Graph-of-Thought generalizes the tree to a DAG where thoughts can be merged and refined, not just branched. Both are deliberate search over partial solutions. This essay covers the algorithm, the multiplicative cost, and the single dependency that makes or breaks the whole family: the state evaluator.
The generalization ladder: IO ⊂ CoT ⊂ CoT-SC ⊂ ToT ⊂ GoT.
These are not separate ideas; they are points on one axis of search structure. Direct IO is a single point. CoT is one path. Self-consistency is a set of independent paths voted at the leaves. ToT adds intermediate branching with a state evaluator and backtracking — you can abandon a bad partial line and explore a sibling. GoT adds aggregation: distinct partial solutions can be merged into a stronger one, and a thought can be refined in place. The price of moving up the ladder is exploding compute and a growing dependency on being able to score things you have not finished yet.
The shape of the search.
# Tree-of-thought, beam-search variant frontier = [root(task)] for depth in range(MAX_DEPTH): candidates = [] for node in frontier: for thought in generator.expand(node, k=BRANCH): child = node.extend(thought) child.score = evaluate(child) # the load-bearing call candidates.append(child) frontier = top_k(candidates, BEAM) # prune; backtrack is implicit if any(c.is_solution for c in frontier): break return best(frontier)
Two LLM calls live in this loop: expand (the generator) and evaluate (the scorer). Their product, not their sum, sets the cost.
The scorer is the entire system; everything else is plumbing.
Search does not create correctness — it amplifies a scoring signal. Quality is capped by the evaluator. Ranked best to worst: a ground-truth verifier on partial state (tests, a solver, a checkable invariant) — search shines, this is parallel hypothesis testing against an oracle; a concrete programmatic heuristic — decent if discriminative; an LLM self-evaluation — weak and biased, and worst precisely on the hard nodes that motivated search. The hardest case in ToT/GoT is that you must score partial work — a half-finished derivation — which is strictly harder than judging a finished answer, and exactly where LLM scorers are least reliable.
The dominant anti-pattern: deploying ToT with an LLM self-evaluator on a task with no external check. You pay an order of magnitude more compute to explore a space you cannot navigate, and the search confidently returns a wrong leaf. Architecture amplifies a signal; it cannot manufacture one. Prove you have a discriminative partial-state scorer before you add branching.
The cost is multiplicative, and it compounds the scorer cost.
Generator calls scale roughly as BRANCH × BEAM × MAX_DEPTH, and every generated node also pays an evaluate call. A modest tree (branch 3, beam 3, depth 4) is on the order of dozens of generator calls plus dozens of scorer calls for one answer — easily 20–50x a single CoT pass, before GoT's merge/refine steps add more. This is not a hyperparameter tweak; it is a different cost and latency regime. GoT is also the most engineering-heavy: you own state representation, the merge operator, cycle avoidance, and a controller — operational surface that single-pass methods do not have.
When tree/graph search actually beats best-of-N.
ToT/GoT only earns its multiplier over the far simpler best-of-N when the problem has genuine intermediate structure: pruning a doomed prefix early saves the compute that would have finished it (Game of 24, constraint puzzles, multi-step planning, proof search), or partial solutions are composable so merging beats restarting (GoT's sweet spot — sorting/merging, multi-document synthesis with reconcilable parts). If the task has no scorable partial state, or independent samples already converge, the tree's backtracking and aggregation buy nothing and best-of-N wins on cost.
Pragmatic ladder: best-of-N first (trivial, embarrassingly parallel, often 80% of the gain), self-consistency only for small discrete answers, and ToT/GoT only where partial-state scoring and backtracking measurably beat independent samples on a held-out set. That last condition holds for a minority of real workloads — make it pass before you ship the search.
The honest tradeoff.
Tree/graph of thought buys the ability to prune dead ends and compose partials at an order-of-magnitude multiplicative cost — and it pays off strictly in proportion to how good your partial-state scorer is. With a ground-truth partial verifier it is one of the strongest levers in this section; without one it is the most expensive way to be confidently wrong.