DevOps & SRE Agents

Y3
Playbook · Domain Playbooks

DevOps & SRE agents: read-only first, because the blast radius is production.

An agent with a kubectl, a cloud API, and a shell is the highest-leverage and highest-blast-radius agent you will deploy: the same call that fixes an incident can cause one. This playbook is the opinionated version of operations autonomy — why diagnosis and remediation are different products with different risk, why the agent earns write access by proving itself read-only, how runbooks become tools instead of prose, and why change control is not optional just because the operator is a model.

STEP 1

Diagnosis and remediation are two products, not one.

The instinct is "an agent that fixes outages." Split it. Diagnosis — correlate logs, metrics, traces, recent deploys; localize the fault; propose a hypothesis — is high value, low risk, and where the agent shines: it tirelessly cross-references signals a tired on-call would miss at 3am. Remediation — restart, scale, roll back, drain, modify config — is where a wrong action is the next incident. Most of the value of an SRE agent is captured by diagnosis alone. Ship the diagnosis agent first; treat remediation as a separate, later, gated product, not a feature flag you flip on the same loop.

STEP 2

Read-only is not the limited version; it is the trustworthy one.

The correct default is an agent whose entire toolset is observation: query metrics, tail logs, describe resources, read the deploy history. Read-only is not a crippled mode you upgrade away from — it is the configuration that lets the agent run unattended without a credible path to causing an incident. Earn write access narrowly and later, per action, with limits, never as a blanket grant.

# default toolset is observation-only; writes are a separate, gated grant
tools = [metrics_query, logs_tail, k8s_describe, deploy_history]
if mode == "remediate" and approved(action, blast_radius):
    tools += [scoped_rollback, scoped_restart]   # bounded, audited

A read-only SRE agent that posts a ranked hypothesis and the exact remediation command for a human to run captures most of the value with none of the blast radius — the human keeps the keys, the agent does the cross-referencing humans are bad at under pressure.

STEP 3

Blast radius is a property of the tool, bounded before the incident.

If remediation is in scope, the limit must live in the tool, not the prompt. A restart_service tool that can restart any service is unbounded; one scoped to a single namespace, rate-limited to N actions per window, and refusing changes outside a maintenance/severity context is bounded. The blast radius of an autonomous action is decided when you write the tool's signature, not when the agent calls it under incident pressure. Pair every write tool with a fail-closed kill switch the resume path also respects, exactly as for any production agent.

The catastrophic SRE failure is a confidently wrong remediation: an agent "fixes" a symptom by scaling down the wrong service or rolling back the wrong deploy, turning a partial outage into a full one — faster than a human could have. Confidence is not correctness; bound what a confident wrong action can touch.

STEP 4

Runbooks become tools, not prose the model reinterprets.

An agent that reads a wiki runbook and improvises the steps will eventually improvise wrong. The high-leverage move is to encode each runbook as a parameterized, tested tool — drain_node(node, reason), rollback_release(service, to_version) — with its own preconditions, dry-run, and post-checks baked in. The model's job is to select and parameterize the right runbook from a vetted library, not to author operational procedure live. A runbook-as-tool is reviewable, testable, and rate-limitable; a runbook-as-prose is a suggestion the model is free to misread.

STEP 5

Change control still applies — the operator being a model changes nothing.

An autonomous production change is still a production change: it goes through the same controls a human's would, because the audit, rollback, and accountability requirements do not care who initiated it. The eval signal and the operational guardrails for an SRE agent:

  • Diagnostic accuracy — on a bank of replayed past incidents, does the agent localize the real root cause, scored against the actual postmortem.
  • Remediation safety — on the same incidents, would its proposed action have helped or harmed; a harmful "fix" is weighted catastrophically.
  • Auditability — every action carries who/what/why, links to the triggering signal, and is reversible or has a tested compensating action.
  • Change-window respect — destructive actions refuse outside an incident/maintenance context, freeze windows are honored, and high-blast actions still page a human.
STEP 6

The DevOps & SRE checklist.

  • Diagnosis and remediation are separate products; ship diagnosis first, remediation later and gated.
  • Default toolset is observation-only; write access is per-action, scoped, rate-limited, and earned.
  • Blast radius is bounded in the tool signature, not the prompt; a fail-closed kill switch the resume path respects.
  • Runbooks are parameterized, tested tools selected from a vetted library — never prose the model reinterprets.
  • Autonomous changes pass the same change control, audit, and rollback requirements as human ones.
  • High-blast or out-of-window actions still page a human; freeze windows are enforced in the tool.
  • Eval is replayed past incidents scored on diagnostic accuracy and remediation safety, harmful fixes weighted catastrophically.

The honest tradeoff: full remediation autonomy is the dream and the trap — the agent that can end an incident in seconds is the same agent that can start one in seconds, so you buy speed only at the price of a blast radius you must bound before, not during, the incident.