
How should AI teams sample live traffic for agent evals?

The usual failure pattern is simple: teams say they are evaluating live traffic, but what they are really evaluating is a thin random slice of easy requests. That produces good-looking dashboards and weak operational truth. A healthy sampling strategy does not start with percentages. It starts with risk classes, expensive failure modes, and the slices of traffic that most reliably break the system.

Random sampling is useful, but it should never be the only sampling model. A live eval program normally needs three layers:

  • baseline random sampling to catch general drift,
  • risk-weighted sampling for workflows where failure is expensive,
  • always-review slices for policy-heavy, approval-heavy, or customer-sensitive cases.

If the team only samples at random, it will almost always under-sample the traffic that matters most.
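As a minimal sketch of how the three layers could be combined into one per-request decision: the slice tags, workflow names, and rates below are hypothetical placeholders, not recommended values, and a real system would derive them from its own risk classes.

```python
import random
from dataclasses import dataclass

# Hypothetical slice labels and rates, chosen only for illustration.
ALWAYS_REVIEW_TAGS = {"policy", "approval", "customer_sensitive"}
RISK_WEIGHTED_RATES = {"refund": 0.25, "account_change": 0.40}
BASELINE_RATE = 0.02


@dataclass
class Request:
    request_id: str
    workflow: str
    tags: frozenset = frozenset()


def sampling_layer(req, rng):
    """Return the layer that selected this request, or None if it is skipped."""
    # Layer 3: always-review slices bypass sampling entirely.
    if req.tags & ALWAYS_REVIEW_TAGS:
        return "always_review"
    # Layer 2: workflows where failure is expensive get a higher rate.
    rate = RISK_WEIGHTED_RATES.get(req.workflow)
    if rate is not None and rng.random() < rate:
        return "risk_weighted"
    # Layer 1: a thin random slice of everything else catches general drift.
    if rng.random() < BASELINE_RATE:
        return "baseline_random"
    return None


rng = random.Random(7)  # seeded only so the example is repeatable
req = Request("r-1", "faq", frozenset({"policy"}))
print(sampling_layer(req, rng))
```

The order of the checks is the point of the sketch: the always-review test runs first, so a sensitive case is never left to chance.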

Offline regression is essential, but it cannot show everything that changes once a system meets real users, real documents, and real tool behavior. Live sampling is where teams catch:

  • changed user prompts,
  • ambiguous real-world inputs,
  • new failure patterns in retrieval or tools,
  • and the support burden created by systems that still look healthy offline.

That is why live sampling belongs to EvalOps, not only analytics.

At minimum, live sampling should deliberately cover:

  • the highest-value workflow types,
  • cases that cross approval or escalation boundaries,
  • tasks with tool side effects,
  • slices known to be historically brittle,
  • and a random slice of ordinary traffic for general drift detection.

If the sample is built only from convenience or low-cost review, it becomes a comfort exercise.
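One way to keep the sample from drifting toward convenience is to check each review cycle against explicit per-slice minimums. The sketch below assumes every sampled case already carries a slice label; the slice names mirror the list above, and the minimum counts are hypothetical.

```python
from collections import Counter

# Hypothetical minimum number of reviewed cases per slice per review cycle.
REQUIRED_MINIMUMS = {
    "high_value_workflow": 20,
    "approval_or_escalation": 10,
    "tool_side_effect": 10,
    "historically_brittle": 10,
    "ordinary_random": 30,
}


def coverage_gaps(sampled_slices, minimums=REQUIRED_MINIMUMS):
    """Return {slice: shortfall} for every slice sampled below its minimum."""
    counts = Counter(sampled_slices)
    return {
        name: needed - counts[name]
        for name, needed in minimums.items()
        if counts[name] < needed
    }


sample = ["ordinary_random"] * 30 + ["high_value_workflow"] * 20 + ["tool_side_effect"] * 4
print(coverage_gaps(sample))
```

A non-empty result means the cycle under-covered a slice the team said it cared about, which is a cue to pull more cases from that slice rather than to sign off on the sample.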

Some traffic should not depend on sampling at all. Teams should review every case, or nearly every case, when:

  • the task can trigger a consequential action,
  • the user is high-value or high-risk,
  • the system is newly rolled out,
  • the slice has known instability,
  • or policy/compliance exposure is meaningful.

Sampling is useful. It is not a replacement for judgment about where review is structurally necessary.

Most sampling mistakes are not statistical mistakes. They are capacity mistakes.

Teams often choose the sample they can afford to review instead of the sample they need to understand. The fix is not only to add reviewers. It is to stratify the work:

  • lightweight automated checks on all traffic,
  • sampled human review on medium-risk slices,
  • mandatory review on high-risk slices.

That is how review scales without becoming blind.
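The stratification above can be sketched as a small routing function. The tier names and the 10% medium-risk rate are assumptions made for illustration; the only fixed behavior is that automated checks run on everything and high-risk cases always reach a human.

```python
import random

# Hypothetical human-review rates per risk tier.
HUMAN_REVIEW_RATE = {"low": 0.0, "medium": 0.10, "high": 1.0}


def review_plan(risk_tier, rng=random):
    """Return the ordered review steps for one request."""
    steps = ["automated_checks"]  # lightweight checks run on all traffic
    rate = HUMAN_REVIEW_RATE[risk_tier]
    # High-risk slices are mandatory; medium-risk slices are sampled.
    if rate >= 1.0 or rng.random() < rate:
        steps.append("human_review")
    return steps


print(review_plan("high"))
```

Keeping the rates in one table makes the capacity trade-off explicit: when reviewers are overloaded, the team changes a number it can see instead of quietly reviewing less.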

Live sampling does not replace full regression. Full regression still matters when:

  • a major model, tool, or routing change is shipping,
  • the product is widening rollout,
  • a high-risk workflow changed,
  • or the team needs to prove the system is still safe on known critical examples.

Use full regression to protect the known edge set. Use live sampling to catch the unknown edge set.

For many teams, a healthy weekly operating model looks like this:

  • random sample from ordinary traffic,
  • targeted sample from each high-value workflow,
  • full review of high-risk actions or approvals,
  • trigger-based review when alerts, cost drift, or escalation anomalies appear.

The exact percentages matter less than whether the slices represent real operational risk.
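The trigger-based part of that operating model could look like the sketch below. The signal fields and both thresholds are hypothetical; each team would need to set them from its own baselines.

```python
from dataclasses import dataclass


@dataclass
class WeeklySignals:
    open_alerts: int
    cost_per_task: float
    baseline_cost_per_task: float
    escalation_rate: float
    baseline_escalation_rate: float


# Hypothetical thresholds: 20% relative cost drift, 1.5x baseline escalations.
COST_DRIFT_LIMIT = 0.20
ESCALATION_RATIO_LIMIT = 1.5


def review_triggers(s):
    """Return the names of the triggers that call for extra review this week."""
    triggers = []
    if s.open_alerts > 0:
        triggers.append("alerts")
    if s.baseline_cost_per_task > 0:
        drift = (s.cost_per_task - s.baseline_cost_per_task) / s.baseline_cost_per_task
        if abs(drift) > COST_DRIFT_LIMIT:
            triggers.append("cost_drift")
    if (s.baseline_escalation_rate > 0
            and s.escalation_rate / s.baseline_escalation_rate > ESCALATION_RATIO_LIMIT):
        triggers.append("escalation_anomaly")
    return triggers


print(review_triggers(WeeklySignals(2, 0.10, 0.10, 0.09, 0.05)))
```

An empty list means the scheduled samples are the only review that week; any trigger adds a targeted review on top of them.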

The common signs of a weak sampling strategy are:

  • sampling only low-risk traffic because it is faster,
  • not separating by workflow type,
  • mixing reviewer burden with quality signals,
  • or waiting for complaints before increasing review depth.

Those teams usually believe the product is healthier than it is.

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. On the question of how AI teams should sample live traffic for agent evals, the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check | What the reader should be able to answer
Signal quality | Can the team explain what behavior the signal proves, and what it does not prove?
Release use | Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning | Does each miss become a reusable eval case instead of a one-off complaint?
Owner | Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.