Trace grading for tool-using AI agents

What matters first

Trace grading means evaluating the whole agent run:

what it planned,
which tools it chose,
how it used them,
where it escalated,
and whether the final outcome was acceptable.
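
To grade those elements, the run has to be recorded in a form that keeps them separate. The following is a minimal sketch of one possible trace record. The class names, field names, and the refund example are all illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation inside a run.
    tool: str
    arguments: dict[str, Any]
    approved: bool = False  # True if an approval was recorded for this call
    cost_usd: float = 0.0   # assumed per-call cost, for spend checks


@dataclass
class Trace:
    # Hypothetical record of a whole agent run.
    task: str
    plan: list[str]
    tool_calls: list[ToolCall] = field(default_factory=list)
    escalations: list[str] = field(default_factory=list)
    final_answer: str = ""


# Example run (made-up data) that a grader could inspect step by step.
trace = Trace(
    task="Refund order 1001",
    plan=["look up order", "issue refund"],
    tool_calls=[
        ToolCall("lookup_order", {"order_id": "1001"}, cost_usd=0.01),
        ToolCall(
            "issue_refund",
            {"order_id": "1001", "amount": 40.0},
            approved=True,
            cost_usd=0.02,
        ),
    ],
    escalations=["requested approval before issue_refund"],
    final_answer="Refund of $40.00 issued for order 1001.",
)
```

Keeping the plan, tool calls, escalations, and final answer as separate fields lets each one be graded on its own rather than inferred from the final text.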

If you only score the last answer, you miss failures that happen earlier in the run, and those are often the most expensive ones.

Why output-only scoring is weak

Tool-using agents can fail in ways that a final-text score hides:

wrong plan but lucky final answer,
right plan but wrong tool,
right tool with wrong arguments,
no approval when approval was required,
expensive or unnecessary tool use,
failure to stop when evidence was insufficient.

Those are system failures, not wording failures.
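
Some of these failures can be detected mechanically from the trace. The sketch below covers only the checkable ones: wrong tool, missing arguments, missing approval, and excessive tool use. Plan quality and failure to stop usually need a human or model grader. The tool names, policy tables, and call limit are hypothetical.

```python
# Hypothetical policy for a refund workflow; all names are illustrative.
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}
REQUIRED_ARGS = {
    "lookup_order": {"order_id"},
    "issue_refund": {"order_id", "amount"},
}
NEEDS_APPROVAL = {"issue_refund"}
MAX_CALLS = 5  # assumed budget for calls per run


def find_failures(tool_calls):
    """Return (call_index, failure_class) pairs found in a list of tool calls."""
    failures = []
    for i, call in enumerate(tool_calls):
        tool = call["tool"]
        if tool not in ALLOWED_TOOLS:
            failures.append((i, "wrong_tool"))
            continue  # no argument or approval policy exists for unknown tools
        missing = REQUIRED_ARGS[tool] - set(call.get("arguments", {}))
        if missing:
            failures.append((i, "bad_arguments"))
        if tool in NEEDS_APPROVAL and not call.get("approved", False):
            failures.append((i, "missing_approval"))
    if len(tool_calls) > MAX_CALLS:
        failures.append((None, "excessive_tool_use"))
    return failures


calls = [
    {"tool": "lookup_order", "arguments": {"order_id": "1001"}},
    {"tool": "issue_refund", "arguments": {"order_id": "1001"}, "approved": False},
]
result = find_failures(calls)
# The refund call lacks "amount" and has no approval, so both are flagged.
```

A final-text score would not see either problem if the agent's answer happened to read well.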

What a trace should be graded on

A practical trace-grading rubric should usually cover:

Dimension	What to check
Plan quality	Did the agent choose a reasonable approach for the task?
Tool selection	Were the right tools used and the wrong ones avoided?
Tool arguments	Were inputs specific and correct enough to trust the call?
Approval behavior	Did the agent pause, escalate, or seek review when required?
Outcome quality	Did the final result solve the task acceptably?

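Scoring these dimensions separately ties the evaluation to real operating risk: a run can pass on outcome quality and still fail on approval behavior or tool arguments.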
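
The rubric can be sketched as a per-dimension scorecard. The example below assumes a 0 to 2 scale and a rule that any dimension below the pass mark fails the run. Both are illustrative choices rather than something the rubric itself prescribes.

```python
# The five rubric dimensions; identifiers are illustrative.
DIMENSIONS = (
    "plan_quality",
    "tool_selection",
    "tool_arguments",
    "approval_behavior",
    "outcome_quality",
)


def grade_trace(scores, pass_mark=1):
    """Grade one run from per-dimension scores (assumed scale: 0, 1, or 2)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    # Report which dimensions failed instead of a single blended number.
    failed = [d for d in DIMENSIONS if scores[d] < pass_mark]
    return {"passed": not failed, "failed_dimensions": failed}


scores = {
    "plan_quality": 2,
    "tool_selection": 2,
    "tool_arguments": 1,
    "approval_behavior": 0,  # agent skipped a required approval
    "outcome_quality": 2,
}
report = grade_trace(scores)
```

Here the run fails on approval behavior even though the outcome scored well, which an averaged score could hide.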

When trace grading matters most

Trace grading is especially important when:

the agent can call multiple tools,
tool usage carries real cost,
approvals are part of the workflow,
or a bad decision can still produce a superficially plausible answer.

Each of these conditions becomes more common as agents gain tools and autonomy, which is why trace grading matters more as agents become more capable.

The most useful trace questions

Ask:

Did the agent understand what kind of task this was?
Did it choose the correct tool path?
Did it stop or escalate when uncertainty increased?
Did it make unnecessary calls that increased spend without value?
Did the trace reveal a repeatable failure class?

Good trace grading should explain why a run failed, not just that it failed.
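
One way to surface repeatable failure classes is to tally the class assigned to each graded run. This is a minimal sketch: the run records, the class labels, and the threshold of two occurrences are made-up assumptions.

```python
from collections import Counter

# Hypothetical graded runs; failure_class is None when the run passed.
graded_runs = [
    {"run_id": "r1", "failure_class": "missing_approval"},
    {"run_id": "r2", "failure_class": None},
    {"run_id": "r3", "failure_class": "missing_approval"},
    {"run_id": "r4", "failure_class": "wrong_tool"},
]


def repeatable_failures(runs, min_count=2):
    """Return failure classes seen at least min_count times, with their counts."""
    counts = Counter(r["failure_class"] for r in runs if r["failure_class"])
    return {cls: n for cls, n in counts.items() if n >= min_count}


repeated = repeatable_failures(graded_runs)
```

A class that recurs across runs points to a fix in the system, such as a tool description or approval rule, rather than a one-off bad answer.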

What to avoid

Do not create a rubric so detailed that graders cannot use it consistently. A good trace rubric is:

tight,
repeatable,
and connected to real deployment risk.

Too many categories create noise. Too few hide system behavior.

A strong operating model

Use output grading to decide whether the user-facing result is acceptable. Use trace grading to decide whether the agent behavior is safe, efficient, and governable. Teams that separate those two layers usually improve faster.
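
The two layers can be kept as separate inputs to a single decision. The sketch below is one hedged way to combine them. The function name and the decision labels are assumptions for illustration.

```python
def release_decision(output_ok: bool, trace_ok: bool) -> str:
    """Combine the output-grading and trace-grading layers into one decision."""
    if output_ok and trace_ok:
        return "ship"
    if output_ok:
        # The answer looked fine, but the behavior was unsafe or wasteful.
        return "hold_trace_failure"
    if trace_ok:
        # The behavior was sound, but the user-facing result was not acceptable.
        return "hold_output_failure"
    return "hold_both_failed"


decision = release_decision(output_ok=True, trace_ok=False)
```

Keeping the two results separate until the final step preserves the reason for a hold, which tells the team whether to fix the answer or the agent's behavior.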

Reader value check

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. A page on trace grading for tool-using AI agents is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check	What the reader should be able to answer
Signal quality	Can the team explain what behavior the signal proves, and what it does not prove?
Release use	Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning	Does each miss become a reusable eval case instead of a one-off complaint?
Owner	Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.