Tool-call success rates and ground truth for agent evals

Teams often say an agent “worked” because the final answer looked plausible. That is not enough once the system starts using search, retrieval, file access, browser control, or internal APIs. A tool-using workflow can fail long before the last answer. If evals only grade the answer, the team never learns whether the failure came from the wrong tool, the wrong arguments, the wrong sequence, or a weak approval decision.

Tool-using agent evals should measure at least three layers:

  1. tool-call success: did the tool run correctly and return the needed data?
  2. workflow success: did the agent choose and sequence tools correctly?
  3. task success: did the overall outcome satisfy the user need or business rule?

If those layers are collapsed into one pass/fail score, the eval system will hide the exact engineering work that needs to change.
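As a minimal sketch of keeping the three layers separate, the snippet below records one boolean per layer for each eval case and reports a pass rate per layer. The class name, field names, and sample data are illustrative assumptions, not part of any particular eval framework.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LayeredResult:
    """One eval case, scored separately at each layer (hypothetical schema)."""
    tool_call_ok: bool  # did the tools run correctly and return the needed data?
    workflow_ok: bool   # were tools chosen and sequenced correctly?
    task_ok: bool       # did the outcome satisfy the user need or business rule?

    def first_failing_layer(self) -> Optional[str]:
        """Return the lowest layer that failed, or None if every layer passed."""
        if not self.tool_call_ok:
            return "tool_call"
        if not self.workflow_ok:
            return "workflow"
        if not self.task_ok:
            return "task"
        return None


def layer_pass_rates(results: List[LayeredResult]) -> Dict[str, float]:
    """Report one pass rate per layer instead of a single collapsed score."""
    if not results:
        return {}
    n = len(results)
    return {
        "tool_call": sum(r.tool_call_ok for r in results) / n,
        "workflow": sum(r.workflow_ok for r in results) / n,
        "task": sum(r.task_ok for r in results) / n,
    }


# Illustrative data only: the second case has a passing final answer
# even though the workflow layer failed.
results = [
    LayeredResult(True, True, True),
    LayeredResult(True, False, True),
    LayeredResult(False, False, False),
    LayeredResult(True, True, False),
]
rates = layer_pass_rates(results)
```

With this shape, a case whose final answer passes while the workflow layer fails stays visible in the per-layer rates instead of being absorbed into a single pass.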

A tool call should usually be graded successful only if all of these are true:

  • the tool selected was acceptable for the step;
  • required arguments were present and materially correct;
  • the call completed without invalid side effects;
  • the returned data was actually usable for the next step.

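This matters because “HTTP 200” is not the same thing as a successful tool call, let alone workflow success: a call can complete without error and still return data the next step cannot use.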
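The four conditions above can be sketched as a small grader. This is one possible shape under stated assumptions: `ToolCallRecord`, `ToolCallExpectation`, and every field name are hypothetical, and "usable" is approximated here as "every required payload key is present and not None".

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Set


@dataclass
class ToolCallRecord:
    """What the trace captured for one call (hypothetical schema)."""
    tool_name: str
    arguments: Dict[str, Any]
    status_code: int
    side_effects: List[str] = field(default_factory=list)
    payload: Optional[Dict[str, Any]] = None


@dataclass
class ToolCallExpectation:
    """Tool-layer ground truth for one step (hypothetical schema)."""
    acceptable_tools: Set[str]
    argument_checks: Dict[str, Callable[[Any], bool]]
    allowed_side_effects: Set[str]
    required_payload_keys: Set[str]


def grade_tool_call(call: ToolCallRecord, expected: ToolCallExpectation) -> Dict[str, bool]:
    """Grade each condition separately; 'success' requires all of them."""
    payload = call.payload or {}
    checks = {
        "tool_acceptable": call.tool_name in expected.acceptable_tools,
        "arguments_correct": all(
            name in call.arguments and check(call.arguments[name])
            for name, check in expected.argument_checks.items()
        ),
        "completed_cleanly": 200 <= call.status_code < 300
        and set(call.side_effects) <= expected.allowed_side_effects,
        "data_usable": call.payload is not None
        and all(payload.get(key) is not None for key in expected.required_payload_keys),
    }
    checks["success"] = all(checks.values())
    return checks


expected = ToolCallExpectation(
    acceptable_tools={"lookup_order"},
    argument_checks={"order_id": lambda v: isinstance(v, str) and v != ""},
    allowed_side_effects=set(),
    required_payload_keys={"status"},
)
# Status 200, but the payload is missing the field the next step needs.
empty_ok = ToolCallRecord("lookup_order", {"order_id": "A1"}, 200, [], {})
graded = grade_tool_call(empty_ok, expected)
```

In this example the call returns status 200 yet is graded unsuccessful because `data_usable` is false, which is the gap between transport success and tool-call success.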

Layer | What ground truth should describe
Tool layer | Which tool should have been used, with what argument expectations
Workflow layer | What sequence, approvals, retries, or fallbacks were acceptable
Outcome layer | What answer, state change, or artifact was the real desired result

Teams often have outcome labels but no tool-layer or workflow-layer truth. Without those labels, a failed task cannot be traced back to a specific tool choice, argument, or sequencing decision.
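One way to make missing ground truth visible is to store all three layers in a single record and report which layers are unlabeled. The sketch below assumes a hypothetical `GroundTruthRecord` schema; the field names and the sample task are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GroundTruthRecord:
    """Ground truth for one eval task, one group of fields per layer (hypothetical)."""
    task_id: str
    # Tool layer: which tools were acceptable, and what the arguments should look like.
    acceptable_tools: Optional[List[str]] = None
    argument_expectations: Optional[Dict[str, str]] = None
    # Workflow layer: acceptable tool sequences, including approvals or fallbacks.
    acceptable_sequences: Optional[List[List[str]]] = None
    # Outcome layer: the answer, state change, or artifact that was actually wanted.
    desired_outcome: Optional[str] = None

    def labeled_layers(self) -> Dict[str, bool]:
        """Report, per layer, whether ground truth has been filled in."""
        return {
            "tool": self.acceptable_tools is not None
            and self.argument_expectations is not None,
            "workflow": self.acceptable_sequences is not None,
            "outcome": self.desired_outcome is not None,
        }


def unlabeled_layers(record: GroundTruthRecord) -> List[str]:
    """List the layers that cannot be graded yet for this task."""
    return [layer for layer, labeled in record.labeled_layers().items() if not labeled]


# A record with only an outcome label, the common case described above.
outcome_only = GroundTruthRecord(task_id="refund-017", desired_outcome="refund issued")
gaps = unlabeled_layers(outcome_only)
```

Running `unlabeled_layers` over the whole eval set gives a quick count of how many tasks can only be graded at the outcome layer.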

At minimum, split failures into these buckets:

  • wrong tool chosen;
  • correct tool, bad arguments;
  • correct tool and arguments, bad sequencing;
  • approval or permission failure;
  • good trace, weak synthesis or final answer;
  • infrastructure failure such as timeout or stale dependency.

This taxonomy matters because each bucket points to a different owner: prompting or policy, tool schema design, runtime reliability, or evaluation data quality.
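The buckets above can be sketched as an enum plus a first-match classifier. The signal names, the check order, and the bucket-to-owner mapping below are all assumptions made for illustration; the text names the owner groups but does not say which bucket belongs to which, so a team would substitute its own routing.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Optional


class FailureBucket(Enum):
    WRONG_TOOL = "wrong tool chosen"
    BAD_ARGUMENTS = "correct tool, bad arguments"
    BAD_SEQUENCING = "correct tool and arguments, bad sequencing"
    APPROVAL_OR_PERMISSION = "approval or permission failure"
    WEAK_SYNTHESIS = "good trace, weak synthesis or final answer"
    INFRASTRUCTURE = "infrastructure failure"


# Hypothetical routing table: an assumption for this sketch, not a prescribed mapping.
ASSUMED_OWNER = {
    FailureBucket.WRONG_TOOL: "prompting or policy",
    FailureBucket.BAD_ARGUMENTS: "tool schema design",
    FailureBucket.BAD_SEQUENCING: "prompting or policy",
    FailureBucket.APPROVAL_OR_PERMISSION: "prompting or policy",
    FailureBucket.WEAK_SYNTHESIS: "prompting or policy",
    FailureBucket.INFRASTRUCTURE: "runtime reliability",
}


def classify_failure(signals: Dict[str, bool]) -> Optional[FailureBucket]:
    """Return the first matching bucket, or None if no failure signal is set.

    Infrastructure is checked first on the assumption that a timeout or
    stale dependency makes the remaining signals unreliable.
    """
    if signals.get("infra_error", False):
        return FailureBucket.INFRASTRUCTURE
    if not signals.get("tool_acceptable", True):
        return FailureBucket.WRONG_TOOL
    if not signals.get("arguments_correct", True):
        return FailureBucket.BAD_ARGUMENTS
    if not signals.get("sequence_acceptable", True):
        return FailureBucket.BAD_SEQUENCING
    if not signals.get("approval_ok", True):
        return FailureBucket.APPROVAL_OR_PERMISSION
    if not signals.get("final_answer_ok", True):
        return FailureBucket.WEAK_SYNTHESIS
    return None


def tally(failures: List[Dict[str, bool]]) -> Counter:
    """Count failures per bucket, skipping cases with no failure signal."""
    buckets = (classify_failure(s) for s in failures)
    return Counter(b for b in buckets if b is not None)


counts = tally([
    {"tool_acceptable": False},
    {"arguments_correct": False},
    {"arguments_correct": False, "final_answer_ok": False},
    {"infra_error": True, "tool_acceptable": False},
])
```

A first-match classifier assigns each failed case to exactly one bucket, which keeps the tallies additive; a team that wants to see co-occurring failures could return every matching bucket instead.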

Strong eval sets for tool use usually include:

  • tasks with one clearly right tool choice;
  • tasks with multiple plausible tools but one better path;
  • tasks where the correct behavior is to refuse or escalate;
  • tasks where tool output is noisy or incomplete;
  • tasks where retries should stop instead of continue.

If the eval set only covers happy-path calls, the team is measuring demo quality, not operating quality.
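A coverage check along these lines can flag an eval set that leans on happy-path calls. The sketch assumes each eval case is a dict carrying a `category` label; the category identifiers are hypothetical names for the task types described above.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical identifiers for the task types an eval set should cover.
REQUIRED_CATEGORIES = (
    "single_right_tool",
    "multiple_plausible_tools",
    "refuse_or_escalate",
    "noisy_or_incomplete_output",
    "retry_should_stop",
)


def coverage_report(cases: Iterable[Dict[str, str]]) -> Dict[str, object]:
    """Count eval cases per category and list the categories with no cases."""
    counts = Counter(case["category"] for case in cases)
    missing: List[str] = [c for c in REQUIRED_CATEGORIES if counts[c] == 0]
    unknown = sorted(set(counts) - set(REQUIRED_CATEGORIES))
    return {
        "counts": {c: counts[c] for c in REQUIRED_CATEGORIES},
        "missing": missing,
        "unknown": unknown,
    }


# Illustrative eval set that only exercises the easy cases.
eval_cases = [
    {"id": "c1", "category": "single_right_tool"},
    {"id": "c2", "category": "single_right_tool"},
    {"id": "c3", "category": "multiple_plausible_tools"},
]
report = coverage_report(eval_cases)
```

Here `report["missing"]` lists the refusal, noisy-output, and retry-stop categories, the gap that separates demo quality from operating quality.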

This page should help a reader decide whether an eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. Explaining vocabulary is not enough: guidance on tool-call success rates and ground truth should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check | What the reader should be able to answer
Signal quality | Can the team explain what behavior the signal proves, and what it does not prove?
Release use | Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning | Does each miss become a reusable eval case instead of a one-off complaint?
Owner | Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.