How should AI teams sample live traffic for agent evals?

The usual failure pattern is simple: teams say they are evaluating live traffic, but what they are really evaluating is a thin random slice of easy requests. That produces good-looking dashboards that reveal little about how the system behaves where failure is costly. A healthy sampling strategy does not start with percentages. It starts with risk classes, expensive failure modes, and the slices of traffic that most reliably break the system.

What matters first

Random sampling is useful, but it should never be the only sampling model. A live eval program normally needs three layers:

baseline random sampling to catch general drift,
risk-weighted sampling for workflows where failure is expensive,
always-review slices for policy-heavy, approval-heavy, or customer-sensitive cases.

If the team only samples at random, it will almost always under-sample the traffic that matters most.

Why live sampling exists at all

Offline regression is essential, but it cannot show everything that changes once a system meets real users, real documents, and real tool behavior. Live sampling is where teams catch:

changed user prompts,
ambiguous real-world inputs,
new failure patterns in retrieval or tools,
and the support burden created by systems that still look healthy offline.

That is why live sampling belongs to EvalOps, not only analytics.

What should always be in the sample pool

At minimum, live sampling should deliberately cover:

the highest-value workflow types,
cases that cross approval or escalation boundaries,
tasks with tool side effects,
slices known to be historically brittle,
and a random slice of ordinary traffic for general drift detection.

If the sample is built only from convenience or low-cost review, it becomes a comfort exercise.
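One way to guard against a convenience-only sample is to audit each review batch against the required slices. The sketch below assumes every sampled item carries a set of slice labels; the label names are hypothetical and simply mirror the list above.

```python
from collections import Counter

# Hypothetical slice labels mirroring the minimum coverage list.
REQUIRED_SLICES = {
    "high_value_workflow",
    "approval_or_escalation",
    "tool_side_effects",
    "historically_brittle",
    "ordinary_random",
}


def coverage_gaps(sampled_items, min_per_slice=1):
    """Return required slices with fewer than min_per_slice sampled items.

    Each item is a dict whose "slices" key holds a set of labels, because
    one request can belong to several slices at once.
    """
    counts = Counter(
        label for item in sampled_items for label in item["slices"]
    )
    return sorted(s for s in REQUIRED_SLICES if counts[s] < min_per_slice)
```

A non-empty result is a signal to pull more cases from the missing slices before the batch goes to reviewers.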

What should be always-review traffic

Some traffic should not depend on sampling at all. Teams should review every case, or nearly every case, when:

the task can trigger a consequential action,
the user is high-value or high-risk,
the system is newly rolled out,
the slice has known instability,
or policy/compliance exposure is meaningful.

Sampling is useful. It is not a replacement for judgment about where review is structurally necessary.

The real tradeoff is reviewer capacity

Most sampling mistakes are not statistical mistakes. They are capacity mistakes.

Teams often choose the sample they can afford to review instead of the sample they need to understand. The fix is not only to add reviewers. It is to stratify the work:

lightweight automated checks on all traffic,
sampled human review on medium-risk slices,
mandatory review on high-risk slices.

That is how review scales without becoming blind.
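The capacity tradeoff can be made explicit with a small budgeting sketch. It assumes human review capacity is counted in cases per period, that high-risk traffic is reviewed in full, and that automated checks run on all traffic without consuming the human budget. The function and field names are illustrative.

```python
def plan_human_review(high_risk_count, medium_risk_count, reviewer_capacity):
    """Split a fixed human-review budget across risk tiers.

    High-risk traffic is reviewed in full. Whatever capacity remains is
    spent on a sample of medium-risk traffic. If high-risk volume alone
    exceeds capacity, the gap is reported instead of hidden.
    """
    shortfall = max(0, high_risk_count - reviewer_capacity)
    remaining = max(0, reviewer_capacity - high_risk_count)
    medium_reviews = min(medium_risk_count, remaining)
    if medium_risk_count:
        medium_rate = medium_reviews / medium_risk_count
    else:
        medium_rate = 0.0
    return {
        "mandatory_reviews": high_risk_count,
        "medium_reviews": medium_reviews,
        "medium_sample_rate": medium_rate,
        "capacity_shortfall": shortfall,
    }
```

For example, plan_human_review(120, 2000, 400) leaves 280 reviews for medium-risk traffic, a 14% sample rate. A positive capacity_shortfall shows that mandatory review no longer fits the budget, which is the point at which a team is tempted to shrink the sample to what it can afford.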

When full regression still matters

Live sampling does not replace full regression. Full regression still matters when:

a major model, tool, or routing change is shipping,
the product is widening rollout,
a high-risk workflow changed,
or the team needs to prove the system is still safe on known critical examples.

Use full regression to protect the known edge set. Use live sampling to catch the unknown edge set.

A practical sampling model

For many teams, a healthy weekly operating model looks like this:

random sample from ordinary traffic,
targeted sample from each high-value workflow,
full review of high-risk actions or approvals,
trigger-based review when alerts, cost drift, or escalation anomalies appear.

The exact percentages matter less than whether the slices represent real operational risk.
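The weekly model above could be assembled roughly as follows. This is a sketch under stated assumptions: each trace is a dict with "id", "workflow", and "high_risk" keys, triggered_ids comes from whatever alerting the team already runs, and the sample sizes are placeholders rather than recommendations.

```python
import random
from collections import defaultdict


def weekly_review_set(traces, random_n, per_workflow_n, high_value_workflows,
                      triggered_ids=frozenset(), seed=0):
    """Build one week's review set; returns {trace id: selection reason}."""
    rng = random.Random(seed)  # seeded so the weekly draw is reproducible
    selected = {}

    # 1. Full review of high-risk actions or approvals.
    for t in traces:
        if t["high_risk"]:
            selected[t["id"]] = "high_risk_full_review"

    # 2. Trigger-based review for traces flagged by alerts or anomalies.
    for t in traces:
        if t["id"] in triggered_ids:
            selected.setdefault(t["id"], "trigger")

    # 3. Targeted sample from each high-value workflow.
    by_workflow = defaultdict(list)
    for t in traces:
        if t["id"] not in selected:
            by_workflow[t["workflow"]].append(t)
    for wf in sorted(high_value_workflows):
        pool = by_workflow.get(wf, [])
        for t in rng.sample(pool, min(per_workflow_n, len(pool))):
            selected[t["id"]] = "targeted:" + wf

    # 4. Random sample from everything not already selected.
    rest = [t for t in traces if t["id"] not in selected]
    for t in rng.sample(rest, min(random_n, len(rest))):
        selected[t["id"]] = "random"

    return selected
```

Recording the reason for each selection keeps the slices separable later, so quality results from the random slice are not blended with results from the high-risk slice.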

What weak live sampling looks like

The common signs are:

sampling only low-risk traffic because it is faster,
not separating by workflow type,
reporting reviewer burden and quality signals as if they were the same measure,
or waiting for complaints before increasing review depth.

Those teams usually believe the product is healthier than it is.

Compare next

Shadow evals and canary rollouts: Use this page when live sampling is part of a staged rollout rather than steady-state monitoring.

What is EvalOps for AI teams?: Use this page when the broader operating model for datasets, traces, scorecards, and release gates still needs to be clarified.

What should an agent eval scorecard actually measure?: Use this page when sampling is clear but the team still has not defined what the live review should score.

Traces vs logs for agent eval ops: Use this page when sampled live traffic needs better run-level evidence rather than only aggregate logs.

Reader value check

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. On the question of how AI teams should sample live traffic for agent evals, the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check	What the reader should be able to answer
Signal quality	Can the team explain what behavior the signal proves, and what it does not prove?
Release use	Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning	Does each miss become a reusable eval case instead of a one-off complaint?
Owner	Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.