Rollout, versioning and pinning: the model changed under you and nobody deployed.
The number-one production surprise with agents is not a bug you wrote — it is behavior shifting when you changed nothing, because the model provider rotated a checkpoint behind a floating alias. An agent's behavior is a function of three coupled artifacts — model, prompt, tools — and shipping any of them is a deploy whether your CI knows it or not. This essay is about treating that triple as a versioned, pinned, canaried, rollback-able release.
"The model changed under us" is the default, not the exception.
Calling a model by a floating alias such as gpt-4o or claude-sonnet means the provider can, and does, move that pointer to a new checkpoint with different behavior, on their schedule, with no diff in your repo and no entry in your changelog. Your evals passed last week against a model that no longer exists at that name. An unpinned model is a continuous, silent, un-reviewed deploy of your most behavior-critical dependency.
The mental reframe that fixes this: the agent's behavioral contract is the tuple (model_version, prompt_version, tool_schema_version). Two of those live in your repo and you already version them by reflex. The third is the one that moves without you — so it must be pinned with the same discipline as a library version, not referenced by a moving tag.
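To make the tuple concrete, here is a minimal sketch of the contract as an immutable value with a guard against floating aliases. The `Release` class, the helper names, and the "ends in a date stamp" heuristic are assumptions made for illustration; providers name snapshots differently, so a real check would use an explicit allowlist of the snapshots you have evaluated.

```python
import re
from dataclasses import dataclass

# Assumed convention (hypothetical): a pinned model id ends in a date stamp,
# either YYYY-MM-DD or YYYYMMDD. Adjust to your provider's real naming.
_DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


@dataclass(frozen=True)
class Release:
    """The behavioral contract: (model_version, prompt_version, tool_schema_version)."""
    model: str
    prompt: str
    tools: str


def is_pinned_model(model_id: str) -> bool:
    """Return True if the id carries a date stamp, i.e. is not a bare family alias."""
    return _DATED_SUFFIX.search(model_id) is not None


def validate_release(rel: Release) -> Release:
    """Reject a release whose model is a moving tag or whose prompt is not hashed."""
    if not is_pinned_model(rel.model):
        raise ValueError(f"model {rel.model!r} looks like a floating alias; pin a dated snapshot")
    if not rel.prompt.startswith("sha256:"):
        raise ValueError(f"prompt {rel.prompt!r} is not content-addressed")
    return rel
```

Running a check like this in CI turns "someone referenced the alias" into a failed build rather than a production surprise.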
Pin the whole triple and stamp it on every run.
Pin to immutable, dated model snapshots — never a bare family alias. Version prompts and tool schemas as content-addressed artifacts. Then record the resolved triple in the run's journal, so every trace is attributable to an exact behavioral contract.
```python
# release/pin.py: one immutable behavioral contract
RELEASE = {
    "model": "claude-sonnet-2025-09-01",   # dated snapshot, NOT a float
    "prompt": "sha256:9af3...e1",          # content-addressed
    "tools": "toolset@v7",                 # pinned schema set
}

def start_run(req):
    rel = active_release(req.tenant)       # may be canary or stable
    journal.record(req.run_id, 0, "PLAN", {"req": req, "release": rel})  # stamp it
    return rel                             # run is bound to THIS triple
```
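The content-addressed identifiers in the pin above could be derived roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and canonicalizing tool schemas as key-sorted JSON is one reasonable choice rather than a standard. A human-readable label such as a toolset version can then map to the hash it was built from.

```python
import hashlib
import json


def content_address(data: bytes) -> str:
    """Name an artifact by the SHA-256 of its bytes, so any edit yields a new name."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def prompt_address(prompt_text: str) -> str:
    """Address a prompt by its exact text; whitespace changes count as changes."""
    return content_address(prompt_text.encode("utf-8"))


def tools_address(tool_schemas: list[dict]) -> str:
    """Address a tool schema set via canonical JSON.

    Keys are sorted so dict ordering does not change the hash; the order of
    tools in the list is preserved and therefore does affect it.
    """
    canonical = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return content_address(canonical.encode("utf-8"))
```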
Stamping the release on the run also fixes the resumability hazard described in durable-state-and-resumability: a run that crashes under release R must resume under R, not under whatever release is current. A run whose first half was produced by one model and whose second half was produced by another leaves a trace that cannot be debugged.
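A minimal sketch of resume-under-the-stamped-release, assuming a journal whose `record` call takes the same four arguments as the snippet above. `InMemoryJournal`, `first_entry`, and `release_for` are hypothetical names; a real journal would be durable storage.

```python
class InMemoryJournal:
    """Toy stand-in for a durable run journal (illustration only)."""

    def __init__(self):
        self._entries = {}  # run_id -> list of (step, kind, payload)

    def record(self, run_id, step, kind, payload):
        self._entries.setdefault(run_id, []).append((step, kind, payload))

    def first_entry(self, run_id):
        entries = self._entries.get(run_id)
        return entries[0] if entries else None


def release_for(journal, run_id, current_release):
    """Return the release a run is bound to.

    A new run is stamped with the current release. A resumed run gets the
    release recorded at step 0, even if the active release has moved since.
    """
    first = journal.first_entry(run_id)
    if first is None:
        journal.record(run_id, 0, "PLAN", {"release": current_release})
        return current_release
    return first[2]["release"]
```

The design choice is that the journal, not the live configuration, is the source of truth for any run that has already started.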
Canary and shadow: prove the new triple before it touches everyone.
A new model snapshot or prompt edit is a behavior change with no compiler to catch regressions. Roll it out as a controlled experiment, not a flip:
- Shadow — run the candidate triple alongside the live one on real traffic, serve only the live result, and diff outputs offline. Zero user risk; surfaces behavioral drift before any customer sees it. The best first gate for a model snapshot bump.
- Canary — route a small, well-chosen slice (say 1–5%, ideally internal/low-stakes tenants first) to the new triple, with automatic comparison of quality and safety metrics against the control slice.
- Progressive promotion — widen only while the gate stays green; the moment a guardrail metric regresses, freeze the rollout — do not "monitor and see."
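The canary and progressive-promotion steps above can be sketched as deterministic bucketing plus a stage ladder. Everything here is an illustrative assumption: the salt, the stage percentages, and the function names are not taken from any particular rollout system.

```python
import hashlib

# Hypothetical promotion ladder: fraction of tenants routed to the candidate.
STAGES = [0.01, 0.05, 0.25, 1.0]


def bucket(tenant_id: str, salt: str = "release-canary") -> float:
    """Map a tenant to a stable value in [0, 1) so routing does not flap between requests."""
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def pick_release(tenant_id, stable, candidate, canary_fraction):
    """Route a tenant to the candidate only if its bucket falls inside the canary slice."""
    return candidate if bucket(tenant_id) < canary_fraction else stable


def next_fraction(current: float, guardrails_green: bool) -> float:
    """Widen one stage while guardrails are green; otherwise freeze at the current size."""
    if not guardrails_green:
        return current
    for stage in STAGES:
        if stage > current:
            return stage
    return current  # already at full rollout
```

Because a tenant's bucket is fixed, widening the slice only adds tenants: anyone in the 1% canary stays on the candidate at 5%, which keeps per-tenant behavior stable during promotion.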
The eval gate is the promotion criterion — no green eval, no promote.
Canary tells you what production thinks; the eval suite tells you whether you should have asked. Promotion from canary to full rollout must be gated on a versioned offline eval run against the candidate triple, comparing to the incumbent on the metrics that matter — task success, regression set, and the adversarial/safety suite. This is not optional polish; it is the only thing standing between "the provider's new checkpoint is 3% better at coding and 20% worse at refusing prompt injection" and that shipping silently.
Provider model upgrades are not monotone. A newer snapshot can be better on the headline benchmark and worse on your task or your safety surface. Treat every provider snapshot change as a candidate that must clear the same eval gate as your own prompt change — the vendor's release notes are not your eval.
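As a sketch of what such a gate might look like, the check below compares candidate scores to incumbent scores metric by metric and blocks promotion on any regression beyond a per-metric tolerance. The metric names and tolerances are hypothetical, and all scores are assumed to be "higher is better" rates in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    name: str
    max_drop: float  # largest tolerated decrease versus the incumbent


# Hypothetical rules: small noise allowed on task success, none on safety.
RULES = [
    MetricRule("task_success", 0.01),
    MetricRule("regression_set_pass_rate", 0.0),
    MetricRule("injection_refusal_rate", 0.0),
]


def eval_gate(incumbent: dict, candidate: dict, rules=RULES) -> tuple[bool, list[str]]:
    """Return (promote?, reasons). A missing score fails closed."""
    failures = []
    for rule in rules:
        if rule.name not in incumbent or rule.name not in candidate:
            failures.append(f"{rule.name}: missing score")
            continue
        drop = incumbent[rule.name] - candidate[rule.name]
        if drop > rule.max_drop:
            failures.append(f"{rule.name}: dropped {drop:.3f} (allowed {rule.max_drop:.3f})")
    return (not failures, failures)
```

With a gate of this shape, a candidate that improves `task_success` but loses ground on `injection_refusal_rate` is rejected, because an improvement on one metric never offsets a regression on another.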
Rollback discipline: a release you cannot revert in minutes is not a release.
Because behavior regressions are often subtle and only visible in aggregate, the recovery primitive must be instant and boring:
- Rollback is a config flip, not a redeploy. The active release is data (a pointer per tenant/segment), so reverting is changing one value, not rebuilding and shipping under incident pressure.
- Keep N previous releases warm. The prior triple — model snapshot included — must remain callable; do not let a pin point at a provider snapshot that gets retired out from under your rollback path.
- In-flight runs finish on their stamped release. Rollback changes what new runs get; it must not retroactively rewrite the behavioral contract of runs already mid-loop.
- One reason to roll back is enough. A guardrail-metric regression is a revert, not a debate. Investigate from the safe state.
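The rollback rules above can be sketched as a small registry in which the active release is just the newest entry in a per-segment history. `ReleaseRegistry` and its methods are hypothetical names, and the sketch cannot verify the one thing only the provider controls: that an older model snapshot is still callable.

```python
class ReleaseRegistry:
    """Active release per segment, with a bounded history kept for rollback."""

    def __init__(self, keep: int = 3):
        if keep < 2:
            raise ValueError("keep must be at least 2 to leave a rollback target")
        self._keep = keep
        self._history = {}  # segment -> list of releases, newest last

    def promote(self, segment, release):
        """Make `release` active for the segment, retaining the newest `keep` entries."""
        hist = self._history.setdefault(segment, [])
        hist.append(release)
        del hist[:-self._keep]

    def active(self, segment):
        """Release that NEW runs in this segment should be stamped with."""
        return self._history[segment][-1]

    def rollback(self, segment):
        """Revert to the previous release: a data change, not a redeploy."""
        hist = self._history[segment]
        if len(hist) < 2:
            raise RuntimeError("no previous release kept warm to roll back to")
        hist.pop()
        return hist[-1]
```

A run that read `active()` before the rollback keeps the value it was stamped with, so only new runs see the reverted release.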
When strict pinning is more rigor than the use case needs.
Pinning, shadowing, and an eval gate are real release machinery, and they add real latency to every upgrade. A low-stakes internal summarizer, where any reasonable model is fine and a wrong answer costs nothing, does not need a canary pipeline; a floating alias and a spot check are proportionate. The cost of being unpinned scales with the impact of a behavioral surprise: it is negligible for a throwaway helper and catastrophic for an agent that moves money or makes irreversible changes. Pin and gate in proportion to what a silent behavior shift would cost you, but never let an agent with real-world side effects ride a floating model alias.