What should you log for an AI agent in production?
What matters first
Log enough to answer five questions later:
- what task the agent was trying to complete,
- what path it took,
- what tools or approvals it used,
- what happened in the end,
- and what it cost to get there.
If the team cannot reconstruct those five things, the logging layer is still too thin for production.
The minimum useful event model
At a minimum, production logs should capture:
- a stable run ID,
- workflow or task type,
- actor or tenant scope,
- model lane,
- tool calls and outcomes,
- approval requests and decisions,
- final status,
- latency,
- cost,
- and a compact failure reason when the run did not succeed.
That is the minimum base for debugging and evaluation.
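As an illustration, the fields listed above could be carried by one structured record per run. This is a minimal sketch, not a prescribed schema: the class names (`RunRecord`, `ToolCall`, `ApprovalEvent`), the field names, and the status strings are all assumptions chosen for the example.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    name: str     # hypothetical tool name, e.g. "lookup_order"
    outcome: str  # assumed vocabulary: "ok", "error", "timeout"


@dataclass
class ApprovalEvent:
    requested_by: str
    decision: str  # assumed vocabulary: "approved", "denied", "pending"


@dataclass
class RunRecord:
    workflow: str    # workflow or task type
    tenant: str      # actor or tenant scope
    model_lane: str  # which model tier or route handled the run
    # Stable run ID, generated once when the record is created.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tool_calls: List[ToolCall] = field(default_factory=list)
    approvals: List[ApprovalEvent] = field(default_factory=list)
    final_status: str = "running"
    latency_ms: Optional[int] = None
    cost_usd: Optional[float] = None
    failure_reason: Optional[str] = None  # compact, set only when the run fails

    def to_json(self) -> str:
        """Serialize the record as one JSON line for a structured log sink."""
        return json.dumps(asdict(self), sort_keys=True)


run = RunRecord(workflow="refund_review", tenant="acme", model_lane="small")
run.tool_calls.append(ToolCall(name="lookup_order", outcome="ok"))
run.final_status = "failed"
run.failure_reason = "tool_timeout"
print(run.to_json())
```

Keeping the failure reason as a short code rather than a free-text message makes it easier to group failures by class later.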
What the log should explain
Healthy agent logging should tell you:
- why a run started,
- which branch or route it took,
- whether retrieval or search happened,
- whether a human touched the workflow,
- whether any side effect was attempted,
- and whether the result was useful, blocked, escalated, or rescued manually.
Logs are not only for engineers. They are for operators, reviewers, and incident owners too.
The fields most teams regret not having
Teams often discover too late that they failed to log:
- approval reason,
- tool arguments after normalization,
- retry counts,
- fallback path taken,
- source or evidence set,
- reviewer outcome,
- and whether the run ended in silent abandonment instead of explicit failure.
Those missing fields make postmortems slow and eval design weak.
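Silent abandonment is the hardest of these to recover after the fact, because nothing is written when it happens. One possible approach, sketched below, is to derive it from the event stream: a run with no terminal event and no recent activity gets labeled as abandoned. The status names, the `classify_end_state` helper, and the 30-minute idle limit are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumed set of statuses that count as an explicit end of a run.
TERMINAL = {"succeeded", "failed", "blocked", "escalated"}


def classify_end_state(events, now, idle_limit=timedelta(minutes=30)):
    """Label how a run ended.

    events: list of (timestamp, status) tuples in chronological order.
    Returns the terminal status if the run ended explicitly, "abandoned"
    if it went quiet for longer than idle_limit, "in_progress" otherwise,
    and "unknown" when there are no events at all.
    """
    if not events:
        return "unknown"
    last_time, last_status = events[-1]
    if last_status in TERMINAL:
        return last_status
    if now - last_time > idle_limit:
        return "abandoned"
    return "in_progress"


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [(start, "started"), (start + timedelta(minutes=2), "tool_call")]
print(classify_end_state(events, now=start + timedelta(hours=2)))
```

A periodic sweep that writes this label back onto the run record turns an absence of data into an explicit, countable outcome.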
What not to log blindly
Do not turn logging into uncontrolled transcript hoarding.
Be careful with:
- raw secrets,
- tokens and credentials,
- full customer payloads with no retention rule,
- unnecessary personal data,
- or entire tool outputs that are expensive, sensitive, or irrelevant to debugging.
The right approach is structured logging plus selective secure retention, not infinite storage of everything the model ever saw.
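One way to apply this at the logging boundary is to pass every payload through a redaction step before it is written. The sketch below is an assumption-heavy example: the key list, the 200-character limit, and the `redact` helper are hypothetical, and key-based matching will not catch secrets embedded inside free text.

```python
# Hypothetical deny-list of field names whose values are never logged.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}
# Assumed cap on how much of any single string value is kept.
MAX_VALUE_CHARS = 200


def redact(value, key=""):
    """Return a copy of value that is safer to log.

    Values under sensitive keys are replaced outright, nested dicts and
    lists are walked recursively, and long strings are truncated.
    """
    if str(key).lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str) and len(value) > MAX_VALUE_CHARS:
        dropped = len(value) - MAX_VALUE_CHARS
        return value[:MAX_VALUE_CHARS] + f"...[truncated {dropped} chars]"
    return value


payload = {
    "user": "u1",
    "api_key": "sk-123",
    "nested": {"Token": "abc", "note": "x" * 250},
}
print(redact(payload))
```

Redacting before the write, rather than at query time, means the sensitive value never reaches storage in the first place.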
The best logging split
The healthiest production pattern usually separates logs into:
- run metadata for routing, cost, and outcome,
- trace events for debugging path and tool use,
- approval events for human-control boundaries,
- evaluation labels for pass, fail, or rescue,
- and incident notes when something unusual happened.
That keeps one log stream from trying to do every job badly.
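The split above can be expressed as a small routing table from event kind to stream. This is only a sketch: the event kinds, the stream names, and the `route` helper are assumed for the example, and a real system would write to separate sinks rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical mapping from event kind to the stream that owns it.
STREAM_FOR_KIND = {
    "run_started": "run_metadata",
    "run_finished": "run_metadata",
    "tool_call": "trace",
    "model_call": "trace",
    "approval_requested": "approval",
    "approval_decided": "approval",
    "eval_label": "evaluation",
    "incident_note": "incident",
}


def route(events):
    """Group events by destination stream.

    Unknown kinds go to an "unrouted" stream so they stay visible
    instead of being dropped or mixed into the trace.
    """
    streams = defaultdict(list)
    for event in events:
        stream = STREAM_FOR_KIND.get(event["kind"], "unrouted")
        streams[stream].append(event)
    return dict(streams)


events = [
    {"kind": "run_started", "run_id": "r1"},
    {"kind": "tool_call", "run_id": "r1"},
    {"kind": "approval_decided", "run_id": "r1"},
]
print(sorted(route(events)))
```

Because every event still carries the run ID, the streams can be joined back together when a single run needs to be reconstructed end to end.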
What good logging unlocks
When logging is strong, the team can:
- audit approval behavior,
- measure cost per successful task,
- diagnose failures by class,
- build regression datasets from real runs,
- and decide where autonomy should expand or shrink.
Without structured logs, production quality decisions drift back into anecdote.
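To make one of these concrete, cost per successful task can be computed directly from run records once workflow, status, and cost are logged as fields. The sketch below assumes dict-shaped records with `workflow`, `final_status`, and `cost_usd` keys and a `"succeeded"` status value; none of these names come from a specific tool.

```python
def cost_per_success(runs):
    """Return total cost divided by successful runs, per workflow.

    Failed runs still count toward cost, because they were paid for.
    A workflow with no successes maps to None instead of dividing by zero.
    """
    totals = {}
    for run in runs:
        bucket = totals.setdefault(run["workflow"], {"cost": 0.0, "successes": 0})
        bucket["cost"] += run["cost_usd"]
        if run["final_status"] == "succeeded":
            bucket["successes"] += 1
    return {
        workflow: (b["cost"] / b["successes"] if b["successes"] else None)
        for workflow, b in totals.items()
    }


runs = [
    {"workflow": "refund_review", "final_status": "succeeded", "cost_usd": 0.25},
    {"workflow": "refund_review", "final_status": "failed", "cost_usd": 0.5},
    {"workflow": "refund_review", "final_status": "succeeded", "cost_usd": 0.25},
    {"workflow": "triage", "final_status": "failed", "cost_usd": 0.125},
]
print(cost_per_success(runs))
```

Dividing total spend by successes, rather than averaging the cost of successful runs alone, shows what failures and retries add to the price of each useful result.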
A simple retention rule
Keep high-value structured fields longer than bulky raw payloads.
If retention pressure appears, preserve:
- identifiers,
- workflow class,
- tool traces,
- approval decisions,
- status codes,
- evaluation labels,
- and cost fields
before preserving every token of conversation history.
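A retention pass along these lines might drop only the bulky fields once a record ages past a threshold and leave the structured fields in place. In this sketch the field names, the 30-day window, and the `apply_retention` helper are all assumptions; actual windows depend on legal and contractual requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names for large raw payload fields that expire first.
BULKY_FIELDS = {"transcript", "raw_tool_output", "prompt"}


def apply_retention(record, now, raw_ttl=timedelta(days=30)):
    """Return the record as it should be stored at time `now`.

    Records newer than raw_ttl are returned unchanged. Older records
    come back as a copy without the bulky fields; identifiers, status,
    and cost fields are kept. The input record is never mutated.
    """
    if now - record["finished_at"] <= raw_ttl:
        return record
    return {k: v for k, v in record.items() if k not in BULKY_FIELDS}


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old_run = {
    "run_id": "r1",
    "workflow": "refund_review",
    "final_status": "succeeded",
    "cost_usd": 0.4,
    "transcript": "full conversation text",
    "finished_at": now - timedelta(days=90),
}
print(sorted(apply_retention(old_run, now)))
```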
Implementation checklist
Your logging layer is probably healthy when:
- every run has a stable ID and final status;
- tool calls, approvals, and retries are explicit events;
- outcome and rescue states are distinguishable;
- cost and latency can be attached to workflow class;
- and sensitive data is governed deliberately instead of logged by default.
Reader value check
This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. A page on what to log for an AI agent in production is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.
Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.
| Check | What the reader should be able to answer |
|---|---|
| Signal quality | Can the team explain what behavior the signal proves, and what it does not prove? |
| Release use | Does the page help decide whether to ship, hold, roll back, or collect more evidence? |
| Failure learning | Does each miss become a reusable eval case instead of a one-off complaint? |
| Owner | Is there a named person or team responsible for maintaining the scorecard or review loop? |
Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.
For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.