Tool-call success rates and ground truth for agent evals
Teams often say an agent “worked” because the final answer looked plausible. That is not enough once the system starts using search, retrieval, file access, browser control, or internal APIs. A tool-using workflow can fail long before the last answer. If evals only grade the answer, the team never learns whether the failure came from the wrong tool, the wrong arguments, the wrong sequence, or a weak approval decision.
What matters first
Tool-using agent evals should measure at least three layers:
- tool-call success: did the tool run correctly and return the needed data?
- workflow success: did the agent choose and sequence tools correctly?
- task success: did the overall outcome satisfy the user need or business rule?
If those layers are collapsed into one pass/fail score, the eval system will hide the exact engineering work that needs to change.
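One way to keep the layers apart is to store a separate verdict per layer and report a pass rate for each. The sketch below is a minimal illustration, not a prescribed schema: `LayeredResult`, its field names, and the sample cases are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayeredResult:
    """One eval case scored on three separate layers (hypothetical schema)."""
    case_id: str
    tool_call_ok: bool  # did the tool run correctly and return the needed data?
    workflow_ok: bool   # were tools chosen and sequenced correctly?
    task_ok: bool       # did the outcome satisfy the user need or business rule?


def layer_pass_rates(results: list[LayeredResult]) -> dict[str, float]:
    """Return one pass rate per layer instead of a single collapsed score."""
    n = len(results)
    if n == 0:
        return {"tool_call": 0.0, "workflow": 0.0, "task": 0.0}
    return {
        "tool_call": sum(r.tool_call_ok for r in results) / n,
        "workflow": sum(r.workflow_ok for r in results) / n,
        "task": sum(r.task_ok for r in results) / n,
    }


# Made-up cases for illustration only.
results = [
    LayeredResult("case-1", True, True, True),
    LayeredResult("case-2", True, False, True),   # plausible answer, wrong path
    LayeredResult("case-3", False, False, False),
    LayeredResult("case-4", True, True, False),
]
rates = layer_pass_rates(results)
print(rates)
```

In this made-up sample the task pass rate alone would hide that case-2 reached an acceptable answer through an unacceptable workflow.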
What should count as tool-call success
A tool call should usually be graded successful only if all of these are true:
- the tool selected was acceptable for the step;
- required arguments were present and materially correct;
- the call completed without invalid side effects;
- the returned data was actually usable for the next step.
This matters because “HTTP 200” is not the same thing as success: a call can complete cleanly and still carry the wrong arguments or return data the next step cannot use.
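The four conditions above can be sketched as a grader that reports each check separately. This is an illustration under stated assumptions: `ToolCallRecord`, `ToolCallExpectation`, and the `search_orders` tool are hypothetical names, and exact equality stands in for "materially correct", which in practice may need tolerance or normalization.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCallRecord:
    """Hypothetical trace record for a single tool call."""
    tool: str
    args: dict
    side_effects: list[str] = field(default_factory=list)
    output_usable: bool = False  # set by a reviewer or a downstream check


@dataclass
class ToolCallExpectation:
    """Hypothetical ground truth for the same step."""
    acceptable_tools: set[str]
    required_args: dict  # argument name -> expected value
    allowed_side_effects: set[str] = field(default_factory=set)


def grade_tool_call(call: ToolCallRecord, expected: ToolCallExpectation) -> dict[str, bool]:
    """Evaluate each condition on its own, then combine them."""
    checks = {
        "tool_acceptable": call.tool in expected.acceptable_tools,
        # Simplification: exact match on every required argument.
        "args_correct": all(
            call.args.get(name) == value
            for name, value in expected.required_args.items()
        ),
        "no_invalid_side_effects": set(call.side_effects) <= expected.allowed_side_effects,
        "output_usable": call.output_usable,
    }
    checks["success"] = all(checks.values())
    return checks


# The call completed (think "HTTP 200") but returned nothing usable.
call = ToolCallRecord(
    tool="search_orders",
    args={"customer_id": "c-42", "status": "open"},
    output_usable=False,
)
expected = ToolCallExpectation(
    acceptable_tools={"search_orders"},
    required_args={"customer_id": "c-42"},
)
grade = grade_tool_call(call, expected)
print(grade)
```

Keeping the individual checks in the result, rather than returning only a boolean, lets a reviewer see which condition failed.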
The three ground-truth layers
| Layer | What ground truth should describe |
|---|---|
| Tool layer | Which tool should have been used, with what argument expectations |
| Workflow layer | What sequence, approvals, retries, or fallbacks were acceptable |
| Outcome layer | What answer, state change, or artifact was the real desired result |
Teams often have outcome labels but no tool-layer or workflow-layer truth. Without those two layers, a failed task cannot be attributed to a specific tool choice, argument, or sequencing decision.
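A label record that carries all three layers makes the gap visible. The sketch below assumes a hypothetical `GroundTruth` structure with optional fields per layer; the field names and the sample case are illustrative, not taken from any particular eval framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundTruth:
    """Hypothetical label record covering the three layers in the table."""
    case_id: str
    # Tool layer: which tools were acceptable for this case.
    expected_tools: Optional[list[str]] = None
    # Workflow layer: each inner list is one acceptable tool sequence.
    acceptable_sequences: Optional[list[list[str]]] = None
    # Outcome layer: the desired answer, state change, or artifact.
    expected_outcome: Optional[str] = None


def missing_layers(gt: GroundTruth) -> list[str]:
    """List the layers that have no label for this case."""
    missing = []
    if not gt.expected_tools:
        missing.append("tool")
    if not gt.acceptable_sequences:
        missing.append("workflow")
    if gt.expected_outcome is None:
        missing.append("outcome")
    return missing


# A case labeled only at the outcome layer.
outcome_only = GroundTruth("case-7", expected_outcome="refund issued")
gaps = missing_layers(outcome_only)
print(gaps)
```

Running a check like this over the whole label set gives a rough count of how many cases can support diagnosis below the outcome layer.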
The failure taxonomy that matters
At minimum, split failures into these buckets:
- wrong tool chosen;
- correct tool, bad arguments;
- correct tool and arguments, bad sequencing;
- approval or permission failure;
- good trace, weak synthesis or final answer;
- infrastructure failure such as timeout or stale dependency.
This taxonomy matters because each bucket points to a different owner: prompting or policy, tool schema design, runtime reliability, or evaluation data quality.
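The buckets can be encoded as an enum so every failed trace gets exactly one label. The sketch below is one possible approach, with assumptions called out: the per-trace check keys (`infra_ok`, `tool_ok`, and so on) are hypothetical, and the rule that the first failing check in a fixed order wins is a design choice made here, not something the taxonomy itself dictates.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Failure(Enum):
    WRONG_TOOL = "wrong tool chosen"
    BAD_ARGUMENTS = "correct tool, bad arguments"
    BAD_SEQUENCING = "correct tool and arguments, bad sequencing"
    APPROVAL_OR_PERMISSION = "approval or permission failure"
    WEAK_SYNTHESIS = "good trace, weak synthesis or final answer"
    INFRASTRUCTURE = "infrastructure failure"


# Assumed precedence: earlier checks mask later ones, so the first
# failing check determines the bucket.
ORDERED_CHECKS = [
    ("infra_ok", Failure.INFRASTRUCTURE),
    ("approval_ok", Failure.APPROVAL_OR_PERMISSION),
    ("tool_ok", Failure.WRONG_TOOL),
    ("args_ok", Failure.BAD_ARGUMENTS),
    ("sequence_ok", Failure.BAD_SEQUENCING),
    ("answer_ok", Failure.WEAK_SYNTHESIS),
]


def classify(trace_checks: dict[str, bool]) -> Optional[Failure]:
    """Return the first failing bucket, or None if every check passed.

    Checks missing from the dict are treated as passed.
    """
    for key, bucket in ORDERED_CHECKS:
        if not trace_checks.get(key, True):
            return bucket
    return None


# Made-up per-trace check results for illustration.
traces = [
    {"tool_ok": False},
    {"args_ok": False},
    {"args_ok": False, "answer_ok": False},
    {"infra_ok": False, "tool_ok": False},
    {},  # clean trace
]
labels = [classify(t) for t in traces]
bucket_counts = Counter(label for label in labels if label is not None)
print(bucket_counts)
```

Counting by bucket is what makes routing possible: each count can be handed to whichever owner the team has assigned to that bucket.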
What high-value eval sets look like
Strong eval sets for tool use usually include:
- tasks with one clearly right tool choice;
- tasks with multiple plausible tools but one better path;
- tasks where the correct behavior is to refuse or escalate;
- tasks where tool output is noisy or incomplete;
- tasks where retries should stop instead of continue.
If the eval set only covers happy-path calls, the team is measuring demo quality, not operating quality.
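A simple coverage check can flag an eval set that leans on happy-path cases. The sketch below assumes each case carries a `category` tag; the tag names and the `min_per_category` threshold are hypothetical and would need to match however the team labels its own cases.

```python
from collections import Counter

# Hypothetical category tags, one per kind of task the eval set should include.
REQUIRED_CATEGORIES = {
    "single_right_tool",
    "multiple_plausible_tools",
    "refuse_or_escalate",
    "noisy_or_incomplete_output",
    "stop_retrying",
}


def coverage_gaps(cases: list[dict], min_per_category: int = 1) -> list[str]:
    """Return the required categories that have too few cases, sorted by name."""
    counts = Counter(case["category"] for case in cases)
    return sorted(
        category
        for category in REQUIRED_CATEGORIES
        if counts[category] < min_per_category
    )


# A made-up eval set that only covers tool-selection cases.
cases = [
    {"id": "case-1", "category": "single_right_tool"},
    {"id": "case-2", "category": "single_right_tool"},
    {"id": "case-3", "category": "multiple_plausible_tools"},
]
gaps = coverage_gaps(cases)
print(gaps)
```

A non-empty result means the set cannot yet say anything about refusal, noisy output, or retry behavior.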
Reader value check
This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. For Tool-call success rates and ground truth for agent evals, the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.
Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.
| Check | What the reader should be able to answer |
|---|---|
| Signal quality | Can the team explain what behavior the signal proves, and what it does not prove? |
| Release use | Does the page help decide whether to ship, hold, roll back, or collect more evidence? |
| Failure learning | Does each miss become a reusable eval case instead of a one-off complaint? |
| Owner | Is there a named person or team responsible for maintaining the scorecard or review loop? |
Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.
For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.