Trace grading for tool-using AI agents

Trace grading means evaluating the whole agent run:

  • what it planned,
  • which tools it chose,
  • how it used them,
  • where it escalated,
  • and whether the final outcome was acceptable.

If you only score the last answer, you miss the most expensive agent failures.

Tool-using agents can fail in ways that a final-text score hides:

  • wrong plan but lucky final answer,
  • right plan but wrong tool,
  • right tool with wrong arguments,
  • no approval when approval was required,
  • expensive or unnecessary tool use,
  • failure to stop when evidence was insufficient.

Those are system failures, not wording failures.
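Several of the failure modes above can be detected mechanically once a run is recorded as structured data. The sketch below is one minimal way to do that. The `ToolCall` and `Expectation` schemas, the tool names, and the failure labels are all hypothetical and not taken from any particular framework. The argument check only verifies that required argument names are present, which is a weaker test than checking that the values are correct.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation in a recorded run (hypothetical schema)."""
    tool: str
    args: dict
    approved: bool = False  # True if a human or policy approved this call


@dataclass
class Expectation:
    """What a reviewer expected for this task (hypothetical schema)."""
    allowed_tools: set
    required_args: dict      # tool name -> set of argument names that must be present
    approval_required: set   # tools that must not run without approval
    max_calls: int           # call budget for the whole run


def find_failures(calls, expect):
    """Return a sorted list of failure labels found in the trace."""
    failures = set()
    for call in calls:
        if call.tool not in expect.allowed_tools:
            failures.add("wrong_tool")
        # Presence check only: a present argument can still hold a wrong value.
        missing = expect.required_args.get(call.tool, set()) - set(call.args)
        if missing:
            failures.add("wrong_arguments")
        if call.tool in expect.approval_required and not call.approved:
            failures.add("missing_approval")
    if len(calls) > expect.max_calls:
        failures.add("excess_tool_use")
    return sorted(failures)


trace = [
    ToolCall("search_orders", {"customer_id": "c-17"}),
    ToolCall("issue_refund", {"order_id": "o-42"}),  # no amount, no approval
]
expect = Expectation(
    allowed_tools={"search_orders", "issue_refund"},
    required_args={"issue_refund": {"order_id", "amount"}},
    approval_required={"issue_refund"},
    max_calls=3,
)
print(find_failures(trace, expect))  # ['missing_approval', 'wrong_arguments']
```

Plan quality and the decision to stop on insufficient evidence are harder to check this way and usually still need a human or model grader.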

A practical trace-grading rubric should usually cover:

Dimension           What to check
Plan quality        Did the agent choose a reasonable approach for the task?
Tool selection      Were the right tools used and the wrong ones avoided?
Tool arguments      Were inputs specific and correct enough to trust the call?
Approval behavior   Did the agent pause, escalate, or seek review when required?
Outcome quality     Did the final result solve the task acceptably?

Grading these dimensions brings the evaluation closer to real operating risk than a final-answer score alone.
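As a rough illustration, the five rubric dimensions can be captured in a small scorecard type so that every run is graded on the same fields. The 0/1/2 scale, the `TraceGrade` name, and the rule that any failed dimension fails the run are assumptions made for this sketch, not part of the rubric itself.

```python
from dataclasses import dataclass, field

# The five rubric dimensions, as identifiers.
DIMENSIONS = (
    "plan_quality",
    "tool_selection",
    "tool_arguments",
    "approval_behavior",
    "outcome_quality",
)


@dataclass(frozen=True)
class TraceGrade:
    """Scorecard for one run. Assumed scale: 0 = fail, 1 = partial, 2 = pass."""
    scores: dict
    notes: dict = field(default_factory=dict)  # dimension -> reviewer explanation

    def __post_init__(self):
        # Refuse partially graded runs so scores stay comparable across graders.
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"ungraded dimensions: {missing}")

    def failed_dimensions(self):
        return [d for d in DIMENSIONS if self.scores[d] == 0]

    def passed(self):
        # Assumed policy: one failed dimension fails the run,
        # even when the final outcome looks acceptable.
        return not self.failed_dimensions()


grade = TraceGrade(
    scores={
        "plan_quality": 2,
        "tool_selection": 2,
        "tool_arguments": 1,
        "approval_behavior": 0,
        "outcome_quality": 2,
    },
    notes={"approval_behavior": "Refund issued without the required review."},
)
print(grade.passed())             # False
print(grade.failed_dimensions())  # ['approval_behavior']
```

In this example the outcome scored well, yet the run still fails on approval behavior, which is the kind of case a final-answer score would miss.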

Trace grading is especially important when:

  • the agent can call multiple tools,
  • tool usage carries real cost,
  • approvals are part of the workflow,
  • or a bad decision can still produce a superficially plausible answer.

These conditions become more common as agents gain tools and autonomy, which is why trace grading matters more as agents become more capable.

When reviewing a trace, ask:

  1. Did the agent understand what kind of task this was?
  2. Did it choose the correct tool path?
  3. Did it stop or escalate when uncertainty increased?
  4. Did it make unnecessary calls that increased spend without value?
  5. Did the trace reveal a repeatable failure class?

Good trace grading should explain why a run failed, not just that it failed.
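The last question, whether a trace reveals a repeatable failure class, can be approached by counting failure labels across graded runs. The following is a minimal sketch that assumes each graded run is a dict with a `failures` list of labels. That shape, the label names, and the `min_count` threshold are illustrative assumptions.

```python
from collections import Counter


def repeatable_failures(graded_runs, min_count=2):
    """Return (label, run_count) pairs for failures seen in at least min_count runs."""
    # set() so a label repeated inside one run counts once for that run.
    counts = Counter(
        label for run in graded_runs for label in set(run["failures"])
    )
    return [(label, n) for label, n in counts.most_common() if n >= min_count]


runs = [
    {"run_id": "r1", "failures": ["missing_approval"]},
    {"run_id": "r2", "failures": ["missing_approval", "excess_tool_use"]},
    {"run_id": "r3", "failures": []},
    {"run_id": "r4", "failures": ["missing_approval"]},
]
print(repeatable_failures(runs))  # [('missing_approval', 3)]
```

A label that recurs across runs points to a system-level cause worth fixing once, while a label seen in a single run may only justify a new eval case.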

Do not create a rubric so detailed that graders cannot use it consistently. A good trace rubric is:

  • tight,
  • repeatable,
  • and connected to real deployment risk.

Too many categories create noise. Too few hide system behavior.

Use output grading to decide whether the user-facing result is acceptable. Use trace grading to decide whether the agent behavior is safe, efficient, and governable. Teams that separate those two layers usually improve faster.
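One way to keep the two layers separate is to track a pass rate for each and let a release gate report which layer is blocking. The function below is a sketch of that idea. The threshold values and the returned labels are placeholder assumptions, not recommended settings.

```python
def release_decision(output_pass_rate, trace_pass_rate,
                     min_output=0.95, min_trace=0.98):
    """Combine the two grading layers into one gate result.

    The default thresholds are placeholders, not recommended values.
    """
    output_ok = output_pass_rate >= min_output
    trace_ok = trace_pass_rate >= min_trace
    if output_ok and trace_ok:
        return "ship"
    if output_ok:
        # Answers look acceptable, but agent behavior is not yet good enough.
        return "hold_trace"
    if trace_ok:
        # Behavior is acceptable, but user-facing results are not.
        return "hold_output"
    return "hold_both"


print(release_decision(0.97, 0.90))  # hold_trace
```

Reporting the blocking layer by name keeps an acceptable output score from masking a behavior problem, and the reverse.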

This page should help a reader decide whether an eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. For trace grading of tool-using AI agents, explaining the vocabulary is not enough: the guidance should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check             What the reader should be able to answer
Signal quality    Can the team explain what behavior the signal proves, and what it does not prove?
Release use       Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning  Does each miss become a reusable eval case instead of a one-off complaint?
Owner             Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.