Rollout, Versioning & Pinning

O5
Operation · AgentOps: Deploy & Operate

Rollout, versioning and pinning: the model changed under you and nobody deployed.

The number-one production surprise with agents is not a bug you wrote — it is behavior shifting when you changed nothing, because the model provider rotated a checkpoint behind a floating alias. An agent's behavior is a function of three coupled artifacts — model, prompt, tools — and shipping any of them is a deploy whether your CI knows it or not. This essay is about treating that triple as a versioned, pinned, canaried, rollback-able release.

STEP 1

"The model changed under us" is the default, not the exception.

Calling gpt-4o or claude-sonnet by a floating alias means the provider can — and does — move that pointer to a new checkpoint with different behavior, on their schedule, with no diff in your repo and no entry in your changelog. Your evals passed last week against a model that no longer exists at that name. An unpinned model is a continuous, silent, un-reviewed deploy of your most behavior-critical dependency.

The mental reframe that fixes this: the agent's behavioral contract is the tuple (model_version, prompt_version, tool_schema_version). Two of those live in your repo and you already version them by reflex. The third is the one that moves without you — so it must be pinned with the same discipline as a library version, not referenced by a moving tag.

STEP 2

Pin the whole triple and stamp it on every run.

Pin to immutable, dated model snapshots — never a bare family alias. Version prompts and tool schemas as content-addressed artifacts. Then record the resolved triple in the run's journal, so every trace is attributable to an exact behavioral contract.

# release/pin.py — one immutable behavioral contract
RELEASE = {
    "model":  "claude-sonnet-2025-09-01",  # dated snapshot, NOT a float
    "prompt": "sha256:9af3...e1",           # content-addressed
    "tools":  "toolset@v7",                # pinned schema set
}

def start_run(req):
    rel = active_release(req.tenant)        # may be canary or stable
    journal.record(req.run_id, 0, "PLAN",
                   {"req": req, "release": rel})  # stamp it
    return rel                              # run is bound to THIS triple

Stamping the release on the run also fixes the resumability hazard from durable-state-and-resumability: a run that crashes under release R must resume under R, not whatever is current. A run whose first half was produced by one model and second half by another is an un-debuggable trace.

STEP 3

Canary and shadow: prove the new triple before it touches everyone.

A new model snapshot or prompt edit is a behavior change with no compiler to catch regressions. Roll it out as a controlled experiment, not a flip:

  • Shadow — run the candidate triple alongside the live one on real traffic, serve only the live result, and diff outputs offline. Zero user risk; surfaces behavioral drift before any customer sees it. The best first gate for a model snapshot bump.
  • Canary — route a small, well-chosen slice (say 1–5%, ideally internal/low-stakes tenants first) to the new triple, with automatic comparison of quality and safety metrics against the control slice.
  • Progressive promotion — widen only while the gate stays green; the moment a guardrail metric regresses, freeze the rollout — do not "monitor and see."
STEP 4

The eval gate is the promotion criterion — no green eval, no promote.

Canary tells you what production thinks; the eval suite tells you whether you should have asked. Promotion from canary to full rollout must be gated on a versioned offline eval run against the candidate triple, comparing to the incumbent on the metrics that matter — task success, regression set, and the adversarial/safety suite. This is not optional polish; it is the only thing standing between "the provider's new checkpoint is 3% better at coding and 20% worse at refusing prompt injection" and that shipping silently.

Provider model upgrades are not monotone. A newer snapshot can be better on the headline benchmark and worse on your task or your safety surface. Treat every provider snapshot change as a candidate that must clear the same eval gate as your own prompt change — the vendor's release notes are not your eval.

STEP 5

Rollback discipline: a release you cannot revert in minutes is not a release.

Because behavior regressions are often subtle and only visible in aggregate, the recovery primitive must be instant and boring:

  • Rollback is a config flip, not a redeploy. The active release is data (a pointer per tenant/segment), so reverting is changing one value, not rebuilding and shipping under incident pressure.
  • Keep N previous releases warm. The prior triple — model snapshot included — must remain callable; do not let a pin point at a provider snapshot that gets retired out from under your rollback path.
  • In-flight runs finish on their stamped release. Rollback changes what new runs get; it must not retroactively rewrite the behavioral contract of runs already mid-loop.
  • One reason to roll back is enough. A guardrail-metric regression is a revert, not a debate. Investigate from the safe state.
STEP 6

When strict pinning is more rigor than the use case needs.

Pinning, shadowing, and an eval gate are real release machinery with real latency-to-upgrade: a low-stakes internal summarizer where any reasonable model is fine, and a wrong answer costs nothing, does not need a canary pipeline — a floating alias and a spot check is proportionate. The cost of being unpinned is paid in behavioral surprise per dollar of impact: it is negligible for a throwaway helper and catastrophic for an agent that moves money or makes irreversible changes. Pin and gate in proportion to what a silent behavior shift would cost you — but never let an agent with real-world side effects ride a floating model alias.