Tool-call success rates and ground truth for agent evals

Teams often say an agent “worked” because the final answer looked plausible. That is not enough once the system starts using search, retrieval, file access, browser control, or internal APIs. A tool-using workflow can fail long before the last answer. If evals only grade the answer, the team never learns whether the failure came from the wrong tool, the wrong arguments, the wrong sequence, or a weak approval decision.

What matters first

Tool-using agent evals should measure at least three layers:

tool-call success: did the tool run correctly and return the needed data?
workflow success: did the agent choose and sequence tools correctly?
task success: did the overall outcome satisfy the user need or business rule?

If those layers are collapsed into one pass/fail score, the eval system will hide the exact engineering work that needs to change.
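One way to keep the layers apart is to record a separate verdict for each layer on every run and report three pass rates next to the collapsed one. The sketch below is illustrative only: the `LayeredResult` record, its field names, and the sample runs are assumptions, not part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayeredResult:
    """Hypothetical per-run record; field names are illustrative."""
    run_id: str
    tool_call_ok: bool  # tools ran correctly and returned the needed data
    workflow_ok: bool   # tools were chosen and sequenced correctly
    task_ok: bool       # outcome satisfied the user need or business rule


def layer_pass_rates(results):
    """Return the pass rate of each layer separately."""
    n = len(results)
    if n == 0:
        return {"tool_call": 0.0, "workflow": 0.0, "task": 0.0}
    return {
        "tool_call": sum(r.tool_call_ok for r in results) / n,
        "workflow": sum(r.workflow_ok for r in results) / n,
        "task": sum(r.task_ok for r in results) / n,
    }


def collapsed_pass_rate(results):
    """Single pass/fail score: a run passes only if every layer passed."""
    n = len(results)
    if n == 0:
        return 0.0
    return sum(r.tool_call_ok and r.workflow_ok and r.task_ok for r in results) / n


# Made-up sample runs for illustration.
results = [
    LayeredResult("run-1", True, True, True),
    LayeredResult("run-2", True, False, True),    # plausible answer, bad sequencing
    LayeredResult("run-3", False, False, False),
    LayeredResult("run-4", True, True, False),    # good trace, weak final answer
]

rates = layer_pass_rates(results)
print(rates)
print(collapsed_pass_rate(results))
```

In this made-up sample the collapsed score says only that three of four runs failed. The per-layer rates show that tool calls mostly worked and that the misses sit in sequencing and in the final answer.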

What should count as tool-call success

A tool call should usually be graded successful only if all of these are true:

the tool selected was acceptable for the step;
required arguments were present and materially correct;
the call completed without invalid side effects;
the returned data was actually usable for the next step.

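This matters because “HTTP 200” only shows that the call completed. It does not show that the right tool was called, that the arguments were correct, or that the returned data was usable for the next step.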
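The four conditions above can be sketched as a small grader. Everything here is a hypothetical shape for illustration: the `ToolCall` and `StepExpectation` records, the `search_orders` tool, and the exact-match argument check are assumptions. A real grader would likely need a looser notion of "materially correct" than strict equality.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one observed tool call."""
    tool: str
    arguments: dict
    status_code: int
    side_effects: list = field(default_factory=list)
    result: Any = None


@dataclass
class StepExpectation:
    """Hypothetical tool-layer ground truth for one step."""
    acceptable_tools: set
    required_arguments: dict
    allowed_side_effects: set = field(default_factory=set)
    required_result_keys: set = field(default_factory=set)


def grade_tool_call(call, expected):
    """Grade one call against the four conditions; the status code is not one of them."""
    checks = {
        "tool_acceptable": call.tool in expected.acceptable_tools,
        # Assumption: exact match stands in for "materially correct".
        "arguments_correct": all(
            call.arguments.get(name) == value
            for name, value in expected.required_arguments.items()
        ),
        "no_invalid_side_effects": set(call.side_effects) <= expected.allowed_side_effects,
        "result_usable": isinstance(call.result, dict)
        and expected.required_result_keys.issubset(call.result.keys()),
    }
    checks["success"] = all(checks.values())
    return checks


expected = StepExpectation(
    acceptable_tools={"search_orders"},
    required_arguments={"customer_id": "C-42"},
    required_result_keys={"orders"},
)

# The call returned HTTP 200, but the payload lacks the data the next step needs.
call = ToolCall(
    tool="search_orders",
    arguments={"customer_id": "C-42"},
    status_code=200,
    result={"error": "index stale"},
)

grade = grade_tool_call(call, expected)
print(grade)
```

The grader never reads `status_code` on purpose: a completed call with an unusable payload still fails the `result_usable` check.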

The three ground-truth layers

Layer	What ground truth should describe
Tool layer	Which tool should have been used, with what argument expectations
Workflow layer	What sequence, approvals, retries, or fallbacks were acceptable
Outcome layer	What answer, state change, or artifact was the real desired result

Teams often have outcome labels but no tool-layer or workflow-layer truth. That makes diagnosis weak.
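A quick audit can show how much of an eval set carries only outcome labels. The sketch below assumes a hypothetical `GroundTruth` record with one optional field per layer; the field contents and the sample cases are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundTruth:
    """Hypothetical ground-truth record with one optional entry per layer."""
    case_id: str
    tool_layer: Optional[dict] = None      # expected tool and argument expectations
    workflow_layer: Optional[dict] = None  # acceptable sequence, approvals, retries, fallbacks
    outcome_layer: Optional[dict] = None   # desired answer, state change, or artifact


LAYERS = ("tool_layer", "workflow_layer", "outcome_layer")


def missing_layers(case):
    """List the layers for which this case has no ground truth."""
    return [name for name in LAYERS if getattr(case, name) is None]


# Made-up cases for illustration.
cases = [
    GroundTruth("case-1", outcome_layer={"answer": "refund issued"}),
    GroundTruth(
        "case-2",
        tool_layer={"tool": "lookup_invoice", "arguments": {"invoice_id": "INV-7"}},
        workflow_layer={"sequence": ["lookup_invoice", "request_approval"]},
        outcome_layer={"state_change": "invoice voided"},
    ),
]

for case in cases:
    print(case.case_id, missing_layers(case))
```

A case like `case-1` can still grade the final answer, but a failure on it cannot be attributed to tool choice or sequencing.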

The failure taxonomy that matters

At minimum, split failures into these buckets:

wrong tool chosen;
correct tool, bad arguments;
correct tool and arguments, bad sequencing;
approval or permission failure;
good trace, weak synthesis or final answer;
infrastructure failure such as timeout or stale dependency.

This taxonomy matters because each bucket points to a different owner: prompting or policy, tool schema design, runtime reliability, or evaluation data quality.
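A minimal sketch of the bucketing, assuming each graded run is a dictionary of hypothetical boolean flags. The check order is also an assumption: infrastructure errors are tested first because a timeout can mask every later check, and each run is assigned to the first bucket that fails.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Failure(Enum):
    WRONG_TOOL = "wrong tool chosen"
    BAD_ARGUMENTS = "correct tool, bad arguments"
    BAD_SEQUENCING = "correct tool and arguments, bad sequencing"
    APPROVAL = "approval or permission failure"
    WEAK_SYNTHESIS = "good trace, weak synthesis or final answer"
    INFRASTRUCTURE = "infrastructure failure"


def classify(run: dict) -> Optional[Failure]:
    """Return the first failing bucket, or None if the run passed every check.

    The flag names are hypothetical; a missing flag is treated as a pass.
    """
    if run.get("infra_error", False):
        return Failure.INFRASTRUCTURE
    if not run.get("tool_ok", True):
        return Failure.WRONG_TOOL
    if not run.get("args_ok", True):
        return Failure.BAD_ARGUMENTS
    if not run.get("sequence_ok", True):
        return Failure.BAD_SEQUENCING
    if not run.get("approval_ok", True):
        return Failure.APPROVAL
    if not run.get("answer_ok", True):
        return Failure.WEAK_SYNTHESIS
    return None


# Made-up graded runs for illustration.
runs = [
    {"args_ok": False, "answer_ok": False},  # bad arguments caused the bad answer
    {"infra_error": True},
    {"answer_ok": False},
    {},                                      # clean run
    {"tool_ok": False},
]

labels = [classify(run) for run in runs]
buckets = Counter(label.value for label in labels if label is not None)
print(buckets)
```

Assigning only the first failing bucket keeps the counts easy to route, at the cost of hiding secondary failures in the same run. Teams that need both could return every failing bucket instead.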

What high-value eval sets look like

Strong eval sets for tool use usually include:

tasks with one clearly right tool choice;
tasks with multiple plausible tools but one better path;
tasks where the correct behavior is to refuse or escalate;
tasks where tool output is noisy or incomplete;
tasks where retries should stop instead of continue.

If the eval set only covers happy-path calls, the team is measuring demo quality, not operating quality.
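A coverage check over the task types named above can flag a happy-path-only set before anyone trusts its scores. The category labels, the `category` field on each case, and the per-category minimum are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical labels for the task types an eval set should cover.
CATEGORIES = (
    "single_right_tool",
    "multiple_plausible_tools",
    "refuse_or_escalate",
    "noisy_or_incomplete_output",
    "stop_retrying",
)


def coverage_gaps(cases, minimum=1):
    """Return the categories with fewer than `minimum` eval cases."""
    counts = Counter(case["category"] for case in cases)
    return [category for category in CATEGORIES if counts[category] < minimum]


# Made-up eval set that mostly covers happy-path calls.
cases = [
    {"id": "t1", "category": "single_right_tool"},
    {"id": "t2", "category": "single_right_tool"},
    {"id": "t3", "category": "multiple_plausible_tools"},
]

gaps = coverage_gaps(cases)
print(gaps)
```

Raising `minimum` turns the same check into a rough gate on how many cases each category needs before the eval set is used for a release decision.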

Next-step references

Agent evals for tool-using AI systems: use this page for the broader evaluation model around tool-connected agent systems.

Trace grading for tool-using agents: use this page when the team needs to inspect the full run instead of only the final answer.

Tool selection evals and failure taxonomy: use this page when the main problem is separating tool-choice mistakes from other agent failures.

Reader value check

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. It is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check	What the reader should be able to answer
Signal quality	Can the team explain what behavior the signal proves, and what it does not prove?
Release use	Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning	Does each miss become a reusable eval case instead of a one-off complaint?
Owner	Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.