How to review AI agent production incidents

An AI agent incident review should produce operating changes, not just a story.

The review should answer:

  • what failed,
  • why existing controls did not stop it,
  • what evidence was missing,
  • which eval should have caught it,
  • which alert should have surfaced it,
  • and what release or approval rule changes before the next rollout.

If the review ends with “the model made a bad choice” and nothing changes, the team has not learned enough.

The weak mental model of an incident is:

“The agent hallucinated. We improved the prompt.”

Sometimes that is true. Often it is incomplete.

Incidents may come from:

  • wrong tool selection,
  • weak retrieval,
  • missing approval boundary,
  • unclear workflow ownership,
  • prompt drift,
  • model-route change,
  • untested edge case,
  • or downstream system behavior.

The review should classify the system failure, not only the output failure.

Capture:

  • incident ID,
  • date and detection source,
  • affected workflow,
  • severity,
  • run IDs,
  • agent version,
  • model lane,
  • tool configuration,
  • approval policy,
  • release or configuration changes,
  • customer or operator impact,
  • containment action,
  • and final corrective actions.

This record is the evidence base for improving how the agent is operated.
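The capture list above can be held in a single structured record so that empty fields show up as explicit evidence gaps. This is a minimal sketch, not a prescribed schema: the class name `IncidentRecord`, the field names, the helper `missing_evidence`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field, fields
from datetime import date


@dataclass
class IncidentRecord:
    """One incident's evidence. Field names are illustrative assumptions."""
    incident_id: str
    detected_on: date
    detection_source: str
    affected_workflow: str
    severity: str
    run_ids: list[str] = field(default_factory=list)
    agent_version: str = ""
    model_lane: str = ""
    tool_configuration: str = ""
    approval_policy: str = ""
    release_or_config_changes: list[str] = field(default_factory=list)
    impact: str = ""
    containment_action: str = ""
    corrective_actions: list[str] = field(default_factory=list)


def missing_evidence(record: IncidentRecord) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical example values, for illustration only.
record = IncidentRecord(
    incident_id="INC-001",
    detected_on=date(2024, 1, 15),
    detection_source="operator report",
    affected_workflow="refund-approval",
    severity="high",
    run_ids=["run-123"],
)
gaps = missing_evidence(record)  # fields the review still has to fill in
```

Each name in `gaps` is a candidate logging or tracing task if the value cannot be recovered after the fact.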

Every incident should receive one primary failure class and optional secondary classes.

Useful classes include:

  • instruction failure,
  • tool selection failure,
  • tool argument failure,
  • retrieval or evidence failure,
  • approval boundary failure,
  • workflow routing failure,
  • escalation failure,
  • cost-control failure,
  • latency or timeout failure,
  • and release-process failure.

The taxonomy matters because each class has a different fix.
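One way to enforce "one primary class, optional secondary classes" is to encode the taxonomy as an enumeration. The sketch below assumes the class list above is the full taxonomy; the names `FailureClass` and `Classification` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    INSTRUCTION = "instruction failure"
    TOOL_SELECTION = "tool selection failure"
    TOOL_ARGUMENT = "tool argument failure"
    RETRIEVAL_OR_EVIDENCE = "retrieval or evidence failure"
    APPROVAL_BOUNDARY = "approval boundary failure"
    WORKFLOW_ROUTING = "workflow routing failure"
    ESCALATION = "escalation failure"
    COST_CONTROL = "cost-control failure"
    LATENCY_OR_TIMEOUT = "latency or timeout failure"
    RELEASE_PROCESS = "release-process failure"


@dataclass(frozen=True)
class Classification:
    """Exactly one primary class plus any number of secondary classes."""
    primary: FailureClass
    secondary: frozenset[FailureClass] = frozenset()

    def __post_init__(self) -> None:
        # The primary class should not be repeated as a secondary class.
        if self.primary in self.secondary:
            raise ValueError("primary class must not repeat as secondary")


# Hypothetical example: bad tool arguments that an approval step also missed.
c = Classification(
    primary=FailureClass.TOOL_ARGUMENT,
    secondary=frozenset({FailureClass.APPROVAL_BOUNDARY}),
)
```

Because the primary class is a required argument, an incident cannot be filed without one.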

The trigger is what started the incident.

Examples:

  • new prompt version,
  • changed tool schema,
  • model lane switch,
  • updated retrieval corpus,
  • larger customer workload,
  • or unusual user input.

The failed control is what should have contained it.

Examples:

  • eval gap,
  • missing canary,
  • weak approval policy,
  • absent alert,
  • incomplete logging,
  • no fallback lane,
  • or unclear incident owner.

Good reviews separate these two. Otherwise the team fixes the trigger and misses the control gap.
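The separation can be checked mechanically if each corrective action is tagged with the side it addresses. This is an illustrative sketch; `CorrectiveAction`, `uncovered_sides`, and the sample action are hypothetical names and data.

```python
from dataclasses import dataclass


@dataclass
class CorrectiveAction:
    description: str
    addresses: str  # either "trigger" or "control"


def uncovered_sides(actions: list[CorrectiveAction]) -> set[str]:
    """Return which of "trigger" and "control" has no corrective action yet."""
    return {"trigger", "control"} - {a.addresses for a in actions}


# Hypothetical review that only fixed the trigger.
actions = [CorrectiveAction("Revert the new prompt version", "trigger")]
remaining = uncovered_sides(actions)
```

Here `remaining` contains `"control"`, which flags exactly the case described above: the trigger was fixed and the control gap was left open.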

For every serious incident, decide whether to add:

  • the exact run to a regression set,
  • a simplified synthetic version to a release gate,
  • a tool-selection test,
  • an approval-boundary test,
  • a retrieval evidence case,
  • or a reviewer-training example.

The point is not to overfit to one incident. The point is to protect the class of failure.
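To check that the regression set protects classes of failure rather than single incidents, each case can carry its failure class and the set can be counted per class. A minimal sketch, assuming hypothetical names (`RegressionCase`, `coverage_by_class`) and made-up case data:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    failure_class: str
    source_incident: str
    kind: str  # "exact_run" or "synthetic"


def coverage_by_class(cases: list[RegressionCase]) -> Counter:
    """Count regression cases per failure class."""
    return Counter(case.failure_class for case in cases)


# Hypothetical cases derived from one incident.
cases = [
    RegressionCase("case-1", "tool selection failure", "INC-001", "exact_run"),
    RegressionCase("case-2", "tool selection failure", "INC-001", "synthetic"),
]
coverage = coverage_by_class(cases)
```

A `Counter` returns zero for any class with no cases, so `coverage["approval boundary failure"]` shows a class that is still unprotected.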

Ask:

  • Was the incident detected by a user, operator, dashboard, alert, or review queue?
  • Should it have been urgent?
  • Which metric or event changed first?
  • Did the alert include enough examples to act?
  • Was the owner obvious?

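If detection came late or through the wrong channel, the alert lacked enough examples to act on, or the owner was unclear, update alert design or review sampling.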

Incidents often reveal weak rollout discipline.

Possible release changes:

  • new canary threshold,
  • stricter approval for one action class,
  • mandatory eval pass before release,
  • release notes for model-route changes,
  • stronger rollback metadata,
  • or a freeze rule after repeated failures.

The review should specify which release gate changes and who owns it.
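A review can reject its own output when a gate change has no owner. The following is a small sketch of that rule; `GateChange`, `unowned`, and the example entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateChange:
    gate: str
    change: str
    owner: Optional[str] = None


def unowned(changes: list[GateChange]) -> list[GateChange]:
    """Return the gate changes that nobody owns yet."""
    return [c for c in changes if not c.owner]


# Hypothetical review output: one owned change, one orphaned change.
changes = [
    GateChange("canary", "lower the error-rate threshold", owner="platform-team"),
    GateChange("eval", "require an eval pass before release"),
]
orphans = unowned(changes)
```

An empty `orphans` list is a reasonable exit condition for the review meeting.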

Keep the meeting tight:

  1. Facts and timeline.
  2. Impact and severity.
  3. Failure taxonomy.
  4. Missed controls.
  5. Evidence gaps.
  6. Eval, alert, release, and ownership changes.

Avoid long speculation about model personality. Focus on observable system behavior.

Your review process is probably healthy when:

  • every serious incident gets a failure class;
  • trigger and failed control are separated;
  • missing evidence becomes a logging or tracing task;
  • representative examples enter eval or reviewer workflows;
  • alert thresholds or review sampling improve;
  • and release gates change when rollout discipline failed.
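The checklist above can be applied per incident as a simple audit. This sketch assumes each item reduces to a yes/no answer recorded by the reviewer; `ReviewOutcome` and `failed_checks` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    has_failure_class: bool
    trigger_and_control_separated: bool
    evidence_gaps_ticketed: bool
    examples_added_to_evals: bool
    alerting_or_sampling_updated: bool
    rollout_discipline_failed: bool
    release_gate_updated: bool


def failed_checks(o: ReviewOutcome) -> list[str]:
    """Return the checklist items this review did not satisfy."""
    checks = {
        "failure class assigned": o.has_failure_class,
        "trigger and failed control separated": o.trigger_and_control_separated,
        "missing evidence became a logging or tracing task": o.evidence_gaps_ticketed,
        "examples entered eval or reviewer workflows": o.examples_added_to_evals,
        "alert thresholds or review sampling improved": o.alerting_or_sampling_updated,
        # A gate change is only required when rollout discipline failed.
        "release gate changed": o.release_gate_updated or not o.rollout_discipline_failed,
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical review that left an evidence gap without a follow-up task.
outcome = ReviewOutcome(
    has_failure_class=True,
    trigger_and_control_separated=True,
    evidence_gaps_ticketed=False,
    examples_added_to_evals=True,
    alerting_or_sampling_updated=True,
    rollout_discipline_failed=False,
    release_gate_updated=False,
)
open_items = failed_checks(outcome)
```

Tracking `open_items` across incidents shows which part of the review loop is skipped most often.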

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. A guide to reviewing AI agent production incidents is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

| Check | What the reader should be able to answer |
| --- | --- |
| Signal quality | Can the team explain what behavior the signal proves, and what it does not prove? |
| Release use | Does the page help decide whether to ship, hold, roll back, or collect more evidence? |
| Failure learning | Does each miss become a reusable eval case instead of a one-off complaint? |
| Owner | Is there a named person or team responsible for maintaining the scorecard or review loop? |

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.