Data & Analytics Agents

Data & analytics agents: the failure mode is a confidently wrong number.

A text-to-SQL or notebook agent feels magical in a demo and dangerous in production for one reason: a number with no error bars and a fluent narrative is indistinguishable from a correct answer until someone makes a decision on it. This playbook treats the analytics agent as an instrument that must report its own uncertainty: how to ground it in the schema, how to verify the result it produces, the autonomy that fits an irreversible business decision, and why the wrong number — delivered confidently — is the only failure mode that matters.

STEP 1

The job is a defensible number, not a plausible one.

A traditional bug returns an error; an analytics agent's bug returns $4.2M with a confident sentence. The output looks trustworthy by construction (a clean figure, a chart, prose) regardless of whether the query joined the wrong table or filtered out half the rows. The job is therefore not "answer the question" but "produce a number that is defensible enough to base a decision on, or refuse." A wrong dashboard number that nobody catches becomes a wrong board slide, which becomes a wrong strategy. An analytics agent that says "I am not confident, here is the query, check it" is more valuable than one that always answers, because the always-answering agent is occasionally and invisibly catastrophic.

STEP 2

Schema grounding is the whole accuracy problem.

Most wrong numbers are not arithmetic errors; they are semantic ones: the model joined orders to users on the wrong key, used created_at when the business means closed_at, or summed a column that is already an average. The fix is not a smarter model; it is grounding. Feed the agent the real schema, column descriptions, known join paths, and a curated set of verified example queries, and constrain it to a read-only role on a warehouse view, never raw production.

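# Ground in schema + verified examples and run read-only; verification is a separate stage.
sql = model(question, schema=catalog, examples=verified_queries)
# Raise instead of assert: assert statements are skipped when Python runs with -O.
if not (is_select_only(sql) and tables(sql) <= allowed_views):
    raise PermissionError("query is not SELECT-only over the allowed views")
rows = warehouse.run(sql, role="analytics_ro", row_limit=1_000_000)  # integer limit, not the float 1e6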

Semantic layers and metric stores (dbt metrics, a metrics catalog) move the agent from "write SQL against raw tables" to "select a governed metric." A pre-defined, reviewed revenue metric that the agent merely parameterizes takes the joins and filters out of the agent's hands, which removes that entire class of error for every metric the layer covers.
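To make "merely parameterizes" concrete, here is a minimal sketch of a governed metric in plain Python. It is not the dbt API: the Metric class, the METRICS registry, render_metric, and the table and column names are all hypothetical. The point is that the agent chooses a metric name, a dimension, and a date range, while the reviewed SQL stays fixed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Metric:
    """A reviewed metric definition; the agent may only fill in its parameters."""
    name: str
    sql_template: str  # reviewed SQL with a {group_by} slot and ? bind placeholders
    allowed_dimensions: frozenset


# Hypothetical registry entry; the view and column names are illustrative only.
METRICS = {
    "net_revenue": Metric(
        name="net_revenue",
        sql_template=(
            "SELECT {group_by} AS dimension, SUM(amount - refunds) AS net_revenue "
            "FROM analytics.orders_v "
            "WHERE closed_at >= ? AND closed_at < ? "
            "GROUP BY {group_by}"
        ),
        allowed_dimensions=frozenset({"region", "plan"}),
    )
}


def render_metric(name: str, dimension: str, start: date, end: date):
    """Turn the agent's (metric, dimension, date range) choice into SQL plus bind parameters."""
    metric = METRICS.get(name)
    if metric is None:
        raise KeyError(f"unknown metric: {name}")
    # The dimension is checked against an allow-list before it is formatted
    # into the SQL text, so the agent cannot inject arbitrary expressions.
    if dimension not in metric.allowed_dimensions:
        raise ValueError(f"dimension {dimension!r} is not allowed for {name}")
    if start >= end:
        raise ValueError("start must be before end")
    sql = metric.sql_template.format(group_by=dimension)
    # Dates travel as bind parameters, never as interpolated text.
    return sql, (start.isoformat(), end.isoformat())
```

A call such as render_metric("net_revenue", "region", date(2024, 1, 1), date(2024, 4, 1)) returns the reviewed SQL grouped by region plus the two date parameters. A request for an unlisted dimension fails before anything reaches the warehouse.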

STEP 3

Verification is a separate step, not a vibe.

The single highest-leverage design move is to never trust the first query. Make verification an explicit pipeline stage that runs sanity checks the agent did not write: row counts within the expected magnitude, no silent NULL-collapsing joins, totals that reconcile against a known control figure, and a second, independently generated query that should produce the same answer. Disagreement between two derivations of the same number is the cheapest hallucination detector you have, and it costs one extra query.

An agent asked to check its own work will usually rationalize its first answer — self-verification in the same context is theater. Verification must be a fresh derivation or an external assertion (a control total, a constraint), not "are you sure?" appended to the prompt.
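The verification stage described above could be sketched as a function that sits outside the agent. This is an illustration under stated assumptions, not a reference implementation: verify_number, Verdict, the 0.5% tolerance, and the join check (which assumes a many-to-one join that should preserve the row count) are all hypothetical choices.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    ok: bool
    failures: list = field(default_factory=list)


def verify_number(
    primary: float,
    independent: float,
    control_total: Optional[float] = None,
    expected_range: Optional[tuple] = None,
    join_counts: Optional[tuple] = None,
    rel_tol: float = 0.005,
) -> Verdict:
    """Check a result against evidence the agent did not author in the same context."""
    failures = []
    # 1. A second, independently generated query should give the same number.
    if not math.isclose(primary, independent, rel_tol=rel_tol):
        failures.append(f"derivations disagree: {primary} vs {independent}")
    # 2. The number should reconcile against a known control figure, if one exists.
    if control_total is not None and not math.isclose(primary, control_total, rel_tol=rel_tol):
        failures.append(f"does not reconcile with control total {control_total}")
    # 3. The number should fall within the expected order of magnitude.
    if expected_range is not None:
        low, high = expected_range
        if not (low <= primary <= high):
            failures.append(f"outside expected range [{low}, {high}]")
    # 4. Assumed many-to-one join: (rows before, rows after) should match,
    #    otherwise the join silently dropped or duplicated rows.
    if join_counts is not None:
        before, after = join_counts
        if before != after:
            failures.append(f"join changed row count: {before} -> {after}")
    return Verdict(ok=not failures, failures=failures)
```

A non-empty failures list is a reason to abstain or escalate, not a prompt for the agent to explain the discrepancy away.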

STEP 4

The right autonomy: autonomous to explore, gated to decide.

Match autonomy to the reversibility of what the number feeds. Exploratory analysis — a human iterating in a notebook, eyeballing every result — can be fully autonomous; the human is the verifier in the loop. A scheduled metric that lands on an executive dashboard with no human between query and decision is the dangerous configuration: there, the agent's query must be reviewed once and pinned, not regenerated nightly by a non-deterministic model. The rule: the less a human will scrutinize the number, the less the agent should be allowed to author it freshly.

STEP 5

The eval signal: correctness on a labeled question bank.

"Looks reasonable" is not an eval. The signal that steers an analytics agent is execution accuracy on a curated bank of natural-language questions with known-correct answers, scored on:

  • Result correctness — does the returned value match the verified ground truth, exactly or within tolerance; this is the only metric the business cares about.
  • Query validity — does it execute, scoped read-only, without scanning the whole warehouse or timing out.
  • Calibrated abstention — on questions the schema cannot answer, does it say so instead of fabricating a join; a confident wrong answer is scored far below a correct "I cannot answer that."

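Track the confidently-wrong rate as a first-class number. Refuse to ship a model that lowers it only by abstaining so often that correct-answer coverage falls below what you can afford, but never make the reverse trade: do not accept a higher confidently-wrong rate in exchange for more coverage.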
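As a rough sketch of how these numbers might be computed, the code below scores a labeled question bank. It assumes numeric answers and uses None both for "the schema cannot answer this" and for "the agent abstained". EvalCase, score, should_ship, and the tolerance are hypothetical, and query validity is left out because it needs execution metadata rather than answers.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: Optional[float]  # None: the schema cannot answer this question
    answer: Optional[float]    # None: the agent abstained


def score(cases, rel_tol: float = 1e-6) -> dict:
    """Compute confidently-wrong rate, coverage, and abstention rate for one run."""
    if not cases:
        raise ValueError("question bank is empty")
    correct = wrong = abstained = 0
    for case in cases:
        if case.answer is None:
            abstained += 1
        elif case.expected is not None and math.isclose(case.answer, case.expected, rel_tol=rel_tol):
            correct += 1
        else:
            # Wrong value, or a fabricated answer to an unanswerable question.
            wrong += 1
    answerable = sum(case.expected is not None for case in cases)
    return {
        "confidently_wrong_rate": wrong / len(cases),
        "coverage": correct / answerable if answerable else 0.0,
        "abstention_rate": abstained / len(cases),
    }


def should_ship(candidate: dict, baseline: dict, min_coverage: float) -> bool:
    """Never accept more confident errors; require the coverage you can afford."""
    no_worse = candidate["confidently_wrong_rate"] <= baseline["confidently_wrong_rate"]
    return no_worse and candidate["coverage"] >= min_coverage
```

Coverage here is measured only over answerable questions, so a model cannot raise it by answering questions the schema cannot support; those answers count toward the confidently-wrong rate instead.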

STEP 6

The data & analytics checklist.

  • Grounded in real schema, column semantics, join paths, and verified example queries — ideally a governed metric/semantic layer.
  • Read-only role on a warehouse view; SELECT-only assertion and table allow-list enforced before execution.
  • Verification is a separate stage: control totals, magnitude sanity checks, and a second independent derivation must agree.
  • Self-"are you sure?" is not verification; use fresh derivations or external constraints.
  • Autonomy scales with human scrutiny: free in exploration, query reviewed-and-pinned for unattended scheduled metrics.
  • Every answer ships with its query and assumptions so a human can audit the derivation.
  • Eval is execution accuracy on a labeled question bank; confidently-wrong rate tracked as a hard, heavily-weighted metric.
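The checklist item about shipping every answer with its query and assumptions can be made structural rather than left to the prompt. The sketch below is one possible shape; AnalyticsAnswer and its fields are hypothetical names, and the rendering is deliberately plain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalyticsAnswer:
    question: str
    sql: str
    value: Optional[float]     # None means the agent abstained
    assumptions: tuple = ()
    failed_checks: tuple = ()

    def render(self) -> str:
        """Always show the query; lead with a warning if the number is not defensible."""
        if self.value is None or self.failed_checks:
            headline = "Not confident in this result; please check the query."
        else:
            headline = f"Answer: {self.value:,.2f}"
        lines = [headline, f"Question: {self.question}", f"Query: {self.sql}"]
        lines += [f"Assumption: {a}" for a in self.assumptions]
        lines += [f"Failed check: {c}" for c in self.failed_checks]
        return "\n".join(lines)
```

Because the query is a required field, an answer without its derivation cannot be constructed, and a failed check replaces the headline number instead of appearing beneath it.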

The honest tradeoff: an analytics agent that abstains and surfaces its query is slower and less impressive than one that always answers — but the always-answers version is a number generator, not an analyst, and the difference is invisible exactly until the decision that depends on it.