How should AI teams sample live traffic for agent evals?
The usual failure pattern is simple: teams say they are evaluating live traffic, but what they are really evaluating is a thin random slice of easy requests. That produces good-looking dashboards and weak operational truth. A healthy sampling strategy does not start with percentages. It starts with risk classes, expensive failure modes, and the slices of traffic that most reliably break the system.
What matters first
Random sampling is useful, but it should never be the only sampling model. A live eval program normally needs three layers:
- baseline random sampling to catch general drift,
- risk-weighted sampling for workflows where failure is expensive,
- always-review slices for policy-heavy, approval-heavy, or customer-sensitive cases.
If the team only samples at random, it will almost always under-sample the traffic that matters most.
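The three layers above can be sketched as a single per-request decision. This is a minimal illustration, not a reference implementation: the `Request` fields, the workflow names in `HIGH_COST_WORKFLOWS`, and both rates are hypothetical placeholders. Hashing the request id is one way to keep the decision reproducible, so the same request always lands in the same layer.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical rates for illustration only; set them from risk and reviewer capacity.
BASELINE_RATE = 0.02
RISK_WEIGHTED_RATE = 0.25

# Hypothetical names for workflows where failure is expensive.
HIGH_COST_WORKFLOWS = {"refund", "account_change"}


@dataclass(frozen=True)
class Request:
    request_id: str
    workflow: str
    needs_approval: bool = False
    customer_sensitive: bool = False


def _bucket(request_id: str) -> float:
    """Map an id to a stable value in [0, 1) so sampling is reproducible."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sampling_layer(req: Request) -> Optional[str]:
    """Return the layer that selects this request, or None if it is skipped."""
    # Layer 3: always-review slices bypass sampling entirely.
    if req.needs_approval or req.customer_sensitive:
        return "always_review"
    # Layer 2: expensive workflows are sampled at a higher rate.
    if req.workflow in HIGH_COST_WORKFLOWS:
        layer, rate = "risk_weighted", RISK_WEIGHTED_RATE
    else:
        # Layer 1: baseline random sampling for general drift.
        layer, rate = "baseline_random", BASELINE_RATE
    return layer if _bucket(req.request_id) < rate else None
```

Checking the always-review condition first means those cases can never be dropped by an unlucky draw, which is the point of keeping them outside the sampling model.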
Why live sampling exists at all
Offline regression is essential, but it cannot show everything that changes once a system meets real users, real documents, and real tool behavior. Live sampling is where teams catch:
- changed user prompts,
- ambiguous real-world inputs,
- new failure patterns in retrieval or tools,
- and the support burden created by systems that still look healthy offline.
That is why live sampling belongs to EvalOps, not only to analytics.
What should always be in the sample pool
At minimum, live sampling should deliberately cover:
- the highest-value workflow types,
- cases that cross approval or escalation boundaries,
- tasks with tool side effects,
- slices known to be historically brittle,
- and a random slice of ordinary traffic for general drift detection.
If the sample is built only from convenience or low-cost review, it becomes a comfort exercise.
What should be always-review traffic
Some traffic should not depend on sampling at all. Teams should review every case, or nearly every case, when:
- the task can trigger a consequential action,
- the user is high-value or high-risk,
- the system is newly rolled out,
- the slice has known instability,
- or policy/compliance exposure is meaningful.
Sampling is useful. It is not a replacement for judgment about where review is structurally necessary.
The real tradeoff is reviewer capacity
Most sampling mistakes are not statistical mistakes. They are capacity mistakes.
Teams often choose the sample they can afford to review instead of the sample they need to understand. The fix is not only to add reviewers. It is to stratify the work:
- lightweight automated checks on all traffic,
- sampled human review on medium-risk slices,
- mandatory review on high-risk slices.
That is how review scales without becoming blind.
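The capacity tradeoff can be made explicit with simple arithmetic: reserve reviewer time for mandatory high-risk review first, then see what sampling rate the remaining capacity supports on medium-risk traffic. The function below is a rough sketch under that assumption. Its name and parameters are hypothetical, and it treats every review as costing the same amount of reviewer time.

```python
def medium_risk_sample_rate(high_risk_volume: int,
                            medium_risk_volume: int,
                            reviewer_capacity: int) -> float:
    """Share of medium-risk traffic humans can review once mandatory review is covered.

    All three arguments are counts of cases per review period (for example, per week).
    """
    if reviewer_capacity < high_risk_volume:
        # Mandatory review is not optional, so this is a staffing problem,
        # not something a lower sampling rate can fix.
        raise ValueError("capacity does not cover mandatory high-risk review")
    if medium_risk_volume == 0:
        return 0.0
    remaining = reviewer_capacity - high_risk_volume
    return min(1.0, remaining / medium_risk_volume)


# Example: 120 high-risk cases, 4000 medium-risk cases, 520 reviews of capacity.
# 400 reviews remain after mandatory review, so about 10% of medium-risk traffic.
rate = medium_risk_sample_rate(120, 4000, 520)
```

If the resulting rate is too low to say anything useful about a medium-risk slice, that is a signal to shift more checks to automation or to narrow the slice, rather than to quietly sample easier traffic.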
When full regression still matters
Live sampling does not replace full regression. Full regression still matters when:
- a major model, tool, or routing change is shipping,
- the product is widening rollout,
- a high-risk workflow changed,
- or the team needs to prove the system is still safe on known critical examples.
Use full regression to protect the known edge set. Use live sampling to catch the unknown edge set.
A practical sampling model
For many teams, a healthy weekly operating model looks like this:
- random sample from ordinary traffic,
- targeted sample from each high-value workflow,
- full review of high-risk actions or approvals,
- trigger-based review when alerts, cost drift, or escalation anomalies appear.
The exact percentages matter less than whether the slices represent real operational risk.
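One way to assemble such a weekly review set from logged traces is sketched below. It assumes each trace is a dict with hypothetical keys `id`, `workflow`, `high_risk`, and `alert`; the quota defaults are placeholders, not recommendations. High-risk and alert-triggered traces are taken in full, then each high-value workflow and the ordinary remainder are sampled separately.

```python
import random
from collections import defaultdict


def weekly_review_set(traces, high_value_workflows,
                      per_workflow=20, ordinary=50, seed=0):
    """Return {trace_id: reason} describing why each trace was selected."""
    rng = random.Random(seed)  # fixed seed keeps the weekly draw reproducible
    selected = {}

    # Full review of high-risk actions, plus trigger-based review on alerts.
    for t in traces:
        if t["high_risk"]:
            selected[t["id"]] = "full_review"
        elif t["alert"]:
            selected[t["id"]] = "trigger"

    # Split what is left into high-value workflows and ordinary traffic.
    by_workflow = defaultdict(list)
    rest = []
    for t in traces:
        if t["id"] in selected:
            continue
        if t["workflow"] in high_value_workflows:
            by_workflow[t["workflow"]].append(t)
        else:
            rest.append(t)

    # Targeted sample from each high-value workflow.
    for wf, items in by_workflow.items():
        for t in rng.sample(items, min(per_workflow, len(items))):
            selected[t["id"]] = "targeted:" + wf

    # Random sample from ordinary traffic for general drift detection.
    for t in rng.sample(rest, min(ordinary, len(rest))):
        selected[t["id"]] = "random"

    return selected
```

Recording the selection reason alongside each trace keeps the slices separable later, so drift in the random slice is not averaged together with findings from the high-risk slice.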
What weak live sampling looks like
The common signs are:
- sampling only low-risk traffic because it is faster,
- not separating by workflow type,
- reporting reviewer burden and quality signals as one blended metric,
- or waiting for complaints before increasing review depth.
Those teams usually believe the product is healthier than it is.
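A quick way to check for the first two signs is to compare what was reviewed against what actually ran, per workflow. The helper below is a small sketch; the function name and the idea of passing plain lists of workflow labels are assumptions made for illustration.

```python
from collections import Counter


def review_coverage(traffic_workflows, reviewed_workflows):
    """Return {workflow: reviewed / total} for every workflow seen in traffic."""
    total = Counter(traffic_workflows)
    reviewed = Counter(reviewed_workflows)
    # Counter returns 0 for workflows that were never reviewed.
    return {wf: reviewed[wf] / n for wf, n in total.items()}


# Hypothetical week: 90 "faq" requests and 10 "refund" requests, 9 reviews in total.
coverage = review_coverage(["faq"] * 90 + ["refund"] * 10, ["faq"] * 9)
unreviewed = [wf for wf, share in coverage.items() if share == 0.0]
```

In this made-up week the overall review rate looks reasonable, yet every review went to the low-risk workflow and the expensive one received none.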
Reader value check
This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. For the question of how AI teams should sample live traffic for agent evals, the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.
Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.
| Check | What the reader should be able to answer |
|---|---|
| Signal quality | Can the team explain what behavior the signal proves, and what it does not prove? |
| Release use | Does the page help decide whether to ship, hold, roll back, or collect more evidence? |
| Failure learning | Does each miss become a reusable eval case instead of a one-off complaint? |
| Owner | Is there a named person or team responsible for maintaining the scorecard or review loop? |
Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.
For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.