Ground truth collection and labeling for agent eval ops

The best evaluation dataset for an agent system usually starts in production-like work, not in synthetic benchmark prompts.

Ground truth should be collected from:

  • real user tasks,
  • representative internal tasks,
  • trace failures,
  • approval decisions,
  • and corrected outcomes produced by humans.

If the dataset does not reflect the real workflow, the eval program will measure the wrong thing very efficiently.

Agent systems create more labeling surface than simple text generation.

A single run may involve:

  • tool choice,
  • tool arguments,
  • retrieved evidence,
  • approval decisions,
  • retries,
  • partial failures,
  • and a final answer.

That means “label the answer” is often too shallow to train or evaluate the system properly.
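To make the size of that labeling surface concrete, here is a minimal sketch of a trace record for one run. The class and field names are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass

# Illustrative trace record for one agent run; all names are assumptions.
@dataclass
class ToolCall:
    tool: str
    arguments: dict
    retries: int = 0
    failed: bool = False

@dataclass
class AgentRun:
    run_id: str
    tool_calls: list[ToolCall]
    retrieved_evidence: list[str]   # IDs of retrieved documents/records
    approvals: list[bool]           # human approval decisions during the run
    final_answer: str

def labeling_surface(run: AgentRun) -> int:
    """Count the decision points a labeler could judge, beyond the final answer."""
    return (len(run.tool_calls)            # tool choice and arguments
            + len(run.retrieved_evidence)  # evidence relevance
            + len(run.approvals)           # approval decisions
            + 1)                           # the final answer itself
```

Even a short run with one tool call, two retrieved documents, and one approval exposes five judgeable decision points, not one.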

Ground truth is not only the final correct answer.

Depending on the workflow, it may include:

  • the correct tool choice,
  • the acceptable set of tool choices,
  • the required evidence set,
  • the right approval outcome,
  • the correct stop condition,
  • and the minimum acceptable final artifact.

This is why agent evaluation datasets need structure, not just text labels.
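One way to carry that structure is to label each dimension separately and grade them independently. This is a sketch under assumed field names (nothing here is a standard format):

```python
from dataclasses import dataclass

# Hypothetical structured ground truth for one workflow example.
@dataclass
class GroundTruth:
    task_id: str
    acceptable_tools: set[str]   # any of these tool choices counts as correct
    required_evidence: set[str]  # evidence IDs the run must have retrieved
    approval_expected: bool      # should a human approval gate fire?
    stop_condition: str          # e.g. "answer_delivered" or "escalate"
    min_final_artifact: str      # minimum acceptable final answer/artifact

def grade_run(truth: GroundTruth, tool_used: str, evidence: set[str],
              approved: bool, stopped_at: str) -> dict:
    """Grade each labeled dimension independently, not just the final answer."""
    return {
        "tool_choice": tool_used in truth.acceptable_tools,
        "evidence": truth.required_evidence <= evidence,
        "approval": approved == truth.approval_expected,
        "stop": stopped_at == truth.stop_condition,
    }
```

Per-dimension grades make it possible to say a run picked the right tool but stopped early, which a single pass/fail label cannot express.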

The best sources are usually:

  • Real user and internal tasks. These show the actual work the system is being asked to do.
  • Trace failures. These reveal where the system drifted and what a correct recovery looked like.
  • Approval decisions and overrides. These show where the system crossed a boundary that humans would not accept.
  • Human-corrected outcomes. These capture edge cases that should stay in the regression set permanently.

Do not try to label everything.

Start with the labels that support the most useful decisions:

  • success or failure of the workflow,
  • failure type,
  • tool-choice correctness,
  • approval correctness,
  • and whether human intervention was required.

That is enough to build a meaningful evaluation loop before expanding into deeper annotation.
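Those five labels are already enough to compute the metrics an eval loop needs. A sketch, with illustrative label names and toy data:

```python
from collections import Counter

# Minimal per-run labels matching the five decisions above; names are assumptions.
runs = [
    {"success": True,  "failure_type": None,         "tool_ok": True,  "approval_ok": True, "human_needed": False},
    {"success": False, "failure_type": "wrong_tool", "tool_ok": False, "approval_ok": True, "human_needed": True},
    {"success": False, "failure_type": "timeout",    "tool_ok": True,  "approval_ok": True, "human_needed": False},
]

def summarize(labels: list[dict]) -> dict:
    """Turn minimal labels into the handful of metrics the eval loop reports."""
    n = len(labels)
    return {
        "success_rate": sum(r["success"] for r in labels) / n,
        "failure_types": Counter(r["failure_type"] for r in labels if not r["success"]),
        "tool_choice_accuracy": sum(r["tool_ok"] for r in labels) / n,
        "human_intervention_rate": sum(r["human_needed"] for r in labels) / n,
    }
```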

For many agent teams, a minimal taxonomy is:

  • correct outcome,
  • wrong tool,
  • wrong arguments,
  • insufficient evidence,
  • policy or approval failure,
  • timeout or incomplete execution,
  • unnecessary human escalation,
  • and hidden failure masked by a plausible final answer.

That taxonomy is much more actionable than one vague “bad answer” bucket.
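Keeping the taxonomy as a closed enumeration, rather than free-text labels, keeps annotations machine-comparable across labelers. The member names below are one possible encoding of the list above:

```python
from enum import Enum

# The minimal taxonomy above as a closed enum; member names are illustrative.
class Outcome(Enum):
    CORRECT = "correct_outcome"
    WRONG_TOOL = "wrong_tool"
    WRONG_ARGUMENTS = "wrong_arguments"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"
    POLICY_FAILURE = "policy_or_approval_failure"
    TIMEOUT = "timeout_or_incomplete"
    UNNECESSARY_ESCALATION = "unnecessary_human_escalation"
    HIDDEN_FAILURE = "hidden_failure_plausible_answer"

def is_failure(outcome: Outcome) -> bool:
    """Everything except a correct outcome counts as some flavor of failure."""
    return outcome is not Outcome.CORRECT
```

A closed enum also makes it cheap to notice when a new failure mode does not fit, which is exactly the signal that the taxonomy needs to grow.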

Annotation programs become expensive when they chase completeness instead of leverage.

A healthier operating model is:

  • sample important workflows,
  • label failures more aggressively than successes,
  • preserve canonical difficult cases,
  • and refresh slices when the product meaningfully changes.

You do not need a perfect universal dataset. You need a living one that tracks business-critical failures.
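The "label failures more aggressively than successes" rule can be expressed as stratified sampling over raw runs. The rates below are illustrative defaults, not recommendations:

```python
import random

def sample_for_labeling(runs: list[dict], failure_rate: float = 1.0,
                        success_rate: float = 0.1, seed: int = 0) -> list[dict]:
    """Stratified sampling sketch: keep (nearly) all failures, a thin slice
    of successes. Assumes each run dict carries a boolean "failed" flag."""
    rng = random.Random(seed)  # fixed seed so samples are reproducible
    picked = []
    for run in runs:
        rate = failure_rate if run["failed"] else success_rate
        if rng.random() < rate:
            picked.append(run)
    return picked
```

In practice the strata would be per-workflow, not global, so a rare but important workflow is not drowned out by a high-volume one.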

Every eval example should justify its presence by doing at least one of these:

  • representing a common production task,
  • guarding a known expensive failure,
  • protecting an approval or security boundary,
  • or tracking a newly shipped capability.

If an example does none of those, it is probably annotation debt.
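That justification rule is easy to automate as a periodic audit. A sketch, assuming each example carries a list of retention-reason tags (the tag names are made up for illustration):

```python
# Retention reasons matching the four justifications above; tags are illustrative.
VALID_REASONS = {"common_task", "expensive_failure_guard",
                 "approval_boundary", "new_capability"}

def audit(dataset: list[dict]) -> list[dict]:
    """Return examples that justify none of the retention reasons:
    candidates for pruning as annotation debt."""
    return [ex for ex in dataset
            if not (set(ex.get("reasons", [])) & VALID_REASONS)]
```

Running this on each refresh keeps the dataset living rather than merely growing.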