IT operations / managed services · Global · PRODUCTION

Runbook execution and on-call digest for a managed-services team.

Hub composition: Tools- and Background-heavy across /it-ops.

THE SITUATION

What the team was already living with.

A managed-services team ran a defined set of services across multiple customers. Runbooks lived in three different drives, and every on-call shift started with a half-hour archaeology dig before any real work could begin.

Leadership wanted runbook execution to be auditable, on-call hand-overs to be shorter, and triage decisions to be visible after the fact.

WHAT WE BUILT

The agents that shipped.

  • 01

    /it-ops Tools in Claude Code. Executes named runbooks with parameters; logs the run with the policy applied and the steps taken.

  • 02

    /it-ops Q&A on Slack. Answers questions about service health, recent incidents, and ownership.

  • 03

    /it-ops Background. A morning digest summarising the overnight period: what fired, what was suppressed, and what is open at hand-over.
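
To make the first agent concrete, here is a minimal sketch of what "executes named runbooks with parameters; logs the run with the policy applied and the steps taken" could look like. Everything in it is an assumption made for illustration: the RunRecord fields, the run_runbook function, the step functions, and the policy name "drill-window" are hypothetical and are not the team's actual schema or tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Hypothetical audit record; the real log schema is not described in the source.
@dataclass
class RunRecord:
    runbook: str
    version: int
    params: Dict[str, str]
    policy: str
    steps: List[str] = field(default_factory=list)
    started_at: str = ""
    status: str = "running"

# A step takes the run parameters and returns a short description of what it did.
Step = Callable[[Dict[str, str]], str]

def run_runbook(name: str, version: int, steps: List[Step],
                params: Dict[str, str], policy: str) -> RunRecord:
    """Run each step in order and record the outcome of every step."""
    record = RunRecord(name, version, dict(params), policy,
                       started_at=datetime.now(timezone.utc).isoformat())
    total = len(steps)
    for i, step in enumerate(steps, start=1):
        try:
            outcome = step(params)
        except Exception as exc:
            # Stop at the first failure, but keep the failure in the record.
            record.steps.append(f"Step {i}/{total}: FAILED ({exc})")
            record.status = "failed"
            return record
        record.steps.append(f"Step {i}/{total}: {outcome}")
    record.status = "completed"
    return record

# Illustrative steps only; real steps would call infrastructure APIs.
def drain_traffic(params: Dict[str, str]) -> str:
    return f"drained traffic from {params['cluster']}"

def promote_standby(params: Dict[str, str]) -> str:
    return "promoted standby"

record = run_runbook("failover-drill", 3, [drain_traffic, promote_standby],
                     {"cluster": "cluster-east"}, policy="drill-window")
for line in record.steps:
    print(line)
```

The point of the design is that the record is built as a side effect of running the steps, so a run cannot complete without leaving an audit trail, and a failed step is logged rather than lost.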

A WORKING EXCHANGE

Real questions. Real answers.

$ /it-ops run failover-drill on cluster-east
runbook → Loaded failover-drill v3. Pre-checks pass. Step 1/6: drained traffic. Step 2/6: promoted standby. Health check 200. Run logged as DRL-2904.

$ /it-ops what changed overnight
digest → 2 alerts fired, both auto-suppressed per policy. 1 ticket open at hand-over: customer cluster-west, latency spike 0240-0312 UTC, currently green.

THE OUTCOME

What changed, concretely.

  • Outcome 01

    Runbook auditability

    Every runbook execution carries a logged record of the steps taken and the policy applied.

  • Outcome 02

    On-call hand-over time

    Compressed from a manual archaeology dig to a generated digest at the start of each shift. [TBD: average minutes saved per shift.]

  • Outcome 03

    Triage visibility

    Triage decisions are reviewable after the fact; alerts that were auto-suppressed are visible in the digest with the policy that suppressed them.
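
As a rough illustration of the triage-visibility outcome, the sketch below builds a digest in which every auto-suppressed alert is listed next to the policy that suppressed it. The Alert and Ticket types, the build_digest function, and the alert and policy names are all hypothetical; the source does not describe the real data model.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical data model, assumed for illustration only.
@dataclass
class Alert:
    name: str
    suppressed_by: Optional[str] = None  # policy name, or None if not suppressed

@dataclass
class Ticket:
    customer: str
    summary: str
    is_open: bool = True

def build_digest(alerts: List[Alert], tickets: List[Ticket]) -> str:
    """Summarise what fired, what was suppressed (and why), and what is open."""
    suppressed = [a for a in alerts if a.suppressed_by is not None]
    open_tickets = [t for t in tickets if t.is_open]
    lines = [f"{len(alerts)} alerts fired, {len(suppressed)} auto-suppressed."]
    for alert in suppressed:
        # Naming the policy is what makes the suppression reviewable later.
        lines.append(f"  suppressed: {alert.name} (policy: {alert.suppressed_by})")
    lines.append(f"{len(open_tickets)} ticket(s) open at hand-over.")
    for ticket in open_tickets:
        lines.append(f"  open: {ticket.customer}: {ticket.summary}")
    return "\n".join(lines)

# Made-up overnight data; alert and policy names are placeholders.
digest = build_digest(
    alerts=[
        Alert("disk-latency", suppressed_by="maintenance-window"),
        Alert("cert-expiry-warning", suppressed_by="known-issue"),
    ],
    tickets=[Ticket("cluster-west", "latency spike, currently green")],
)
print(digest)
```

Keeping the suppressing policy on the alert itself, rather than only in the suppression engine's logs, means the digest can show the decision and its reason in one place.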