What should you log for an AI agent in production?

What matters first

Log enough to answer five questions later:

what task the agent was trying to complete,
what path it took,
what tools or approvals it used,
what happened in the end,
and what it cost to get there.

If the team cannot reconstruct those five things, the logging layer is still too thin for production.

The minimum useful event model

At a minimum, production logs should capture:

a stable run ID,
workflow or task type,
actor or tenant scope,
model lane,
tool calls and outcomes,
approval requests and decisions,
final status,
latency,
cost,
and a compact failure reason when the run did not succeed.

That is the minimum base for debugging and evaluation.
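
The list above can be sketched as a single record type. This is a minimal illustration, not a prescribed schema: the class names (`RunRecord`, `ToolCall`, `ApprovalEvent`), the field names, and the example status strings are all assumptions chosen for the example.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    name: str
    outcome: str  # illustrative values: "ok", "error", "timeout"


@dataclass
class ApprovalEvent:
    requested_action: str
    decision: str  # illustrative values: "approved", "denied"


@dataclass
class RunRecord:
    # Hypothetical field names; one field per item in the minimum event model.
    workflow_type: str
    tenant: str
    model_lane: str
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    tool_calls: List[ToolCall] = field(default_factory=list)
    approvals: List[ApprovalEvent] = field(default_factory=list)
    final_status: str = "in_progress"
    latency_ms: Optional[float] = None
    cost_usd: Optional[float] = None
    failure_reason: Optional[str] = None  # compact reason, set only on non-success

    def to_json(self) -> str:
        """Serialize the whole run as one structured log line."""
        return json.dumps(asdict(self), sort_keys=True)
```

A run that was blocked by a denied approval would carry `final_status="blocked"` and a short `failure_reason` such as `"approval_denied"`, which is enough to group failures later without reading transcripts.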

What the log should explain

Healthy agent logging should tell you:

why a run started,
which branch or route it took,
whether retrieval or search happened,
whether a human touched the workflow,
whether any side effect was attempted,
and whether the result was useful, blocked, escalated, or rescued manually.

Logs are not only for engineers. They are for operators, reviewers, and incident owners too.

The fields most teams regret not having

Teams often discover too late that they failed to log:

approval reason,
tool arguments after normalization,
retry counts,
fallback path taken,
source or evidence set,
reviewer outcome,
and whether the run ended in silent abandonment instead of explicit failure.

Those missing fields make postmortems slow and eval design weak.
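
Several of these fields can be derived from the event stream if the events exist in the first place. The sketch below assumes a hypothetical event shape (dicts with a `type` key, plus `path` on fallback events and `status` on the final event) and treats any run without a recognized final event as silently abandoned.

```python
# Illustrative terminal statuses; a real system would define its own set.
TERMINAL_STATUSES = {"succeeded", "failed", "blocked", "escalated", "rescued"}


def summarize_run(events):
    """Derive retry count, fallback paths, and terminal status from a run's events."""
    retry_count = sum(1 for e in events if e.get("type") == "retry")
    fallback_paths = [e["path"] for e in events if e.get("type") == "fallback"]

    # Use the last "final" event, if any. A missing or unrecognized final
    # status means the run never ended explicitly.
    final = next(
        (e.get("status") for e in reversed(events) if e.get("type") == "final"),
        None,
    )
    status = final if final in TERMINAL_STATUSES else "abandoned"

    return {
        "retry_count": retry_count,
        "fallback_paths": fallback_paths,
        "status": status,
    }
```

The point of the `"abandoned"` label is that it is assigned by the logging layer, not by the agent, so a run that simply stops emitting events still shows up in failure counts.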

What not to log blindly

Do not turn logging into uncontrolled transcript hoarding.

Be careful with:

raw secrets,
tokens and credentials,
full customer payloads with no retention rule,
unnecessary personal data,
or entire tool outputs that are expensive, sensitive, or irrelevant to debugging.

The right approach is structured logging plus selective secure retention, not infinite storage of everything the model ever saw.
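
One way to make that concrete is to pass every payload through a redaction step before it reaches the log. The sketch below is an assumption-laden example: the key names in `SENSITIVE_KEYS`, the `[REDACTED]` marker, and the 200-character cap are illustrative choices, not a standard.

```python
# Hypothetical list of key names treated as sensitive (matched case-insensitively).
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}
MAX_VALUE_CHARS = 200  # illustrative cap on any single logged string


def redact(payload, max_chars=MAX_VALUE_CHARS):
    """Return a copy of payload with sensitive keys masked and long strings truncated."""
    if isinstance(payload, dict):
        return {
            key: "[REDACTED]"
            if str(key).lower() in SENSITIVE_KEYS
            else redact(value, max_chars)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item, max_chars) for item in payload]
    if isinstance(payload, str) and len(payload) > max_chars:
        dropped = len(payload) - max_chars
        return payload[:max_chars] + f"...[truncated {dropped} chars]"
    return payload
```

Key-name matching only catches secrets stored under recognizable keys. It does not detect a credential pasted into free text, so it complements a retention rule rather than replacing one.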

The best logging split

The healthiest production pattern usually separates logs into:

run metadata for routing, cost, and outcome,
trace events for debugging path and tool use,
approval events for human-control boundaries,
evaluation labels for pass, fail, or rescue,
and incident notes when something unusual happened.

That keeps one log stream from trying to do every job badly.
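
As a rough sketch of that split, each event can be tagged with exactly one stream and joined back together by run ID when needed. The `Stream` names and the in-memory `SplitLogger` below are hypothetical stand-ins for whatever storage the team actually uses.

```python
from collections import defaultdict
from enum import Enum


class Stream(Enum):
    RUN_METADATA = "run_metadata"    # routing, cost, outcome
    TRACE = "trace"                  # debugging path and tool use
    APPROVAL = "approval"            # human-control boundaries
    EVAL_LABEL = "eval_label"        # pass, fail, or rescue
    INCIDENT_NOTE = "incident_note"  # anything unusual


class SplitLogger:
    """In-memory stand-in: one list of records per stream."""

    def __init__(self):
        self.sinks = defaultdict(list)

    def emit(self, stream, run_id, **fields):
        # Stream(...) raises ValueError for an unknown stream name, so an
        # event cannot be written to an undeclared catch-all stream.
        record = {"run_id": run_id, **fields}
        self.sinks[Stream(stream)].append(record)
        return record

    def for_run(self, run_id):
        """Collect one run's records across streams, keyed by stream name."""
        return {
            stream.value: [r for r in records if r["run_id"] == run_id]
            for stream, records in self.sinks.items()
        }
```

Because every record carries the run ID, the streams can have different retention and access rules and still be reassembled for a single postmortem.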

What good logging unlocks

When logging is strong, the team can:

audit approval behavior,
measure cost per successful task,
diagnose failures by class,
build regression datasets from real runs,
and decide where autonomy should expand or shrink.

Without structured logs, production quality decisions drift back into anecdote.
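
Cost per successful task is a good example of a number that only exists if the fields were logged. The sketch below assumes each run record is a dict with hypothetical keys `workflow_type`, `cost_usd`, and `final_status`, and that `"succeeded"` marks a success. It charges the cost of failed runs to the workflow as well, since those runs were still paid for.

```python
from collections import defaultdict


def cost_per_successful_task(runs):
    """Total spend per workflow divided by its number of successful runs.

    Returns None for a workflow with no successes, since the ratio is undefined.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0})
    for run in runs:
        bucket = totals[run["workflow_type"]]
        bucket["cost"] += run["cost_usd"]  # failed runs still cost money
        if run["final_status"] == "succeeded":
            bucket["successes"] += 1

    return {
        workflow: (t["cost"] / t["successes"] if t["successes"] else None)
        for workflow, t in totals.items()
    }
```

For example, a workflow with two successful runs at 0.25 each and one failed run at 0.50 has spent 1.00 in total, so its cost per successful task is 0.50, not 0.25.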

A simple retention rule

Keep high-value structured fields longer than bulky raw payloads.

If retention pressure appears, preserve:

identifiers,
workflow class,
tool traces,
approval decisions,
status codes,
evaluation labels,
and cost fields

before preserving every token of conversation history.
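
A retention rule like this can be expressed as a function that strips bulky fields once a record passes a raw-payload window. The field names in `KEEP_FIELDS`, the `created_at` key, and the 30-day window are assumptions made for the sketch, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names for the structured fields that outlive raw payloads.
KEEP_FIELDS = {
    "run_id", "created_at", "workflow_type", "tool_trace",
    "approval_decisions", "final_status", "eval_label", "cost_usd",
}
RAW_RETENTION = timedelta(days=30)  # illustrative window for raw payloads


def apply_retention(record, now, raw_retention=RAW_RETENTION):
    """Keep the full record while it is fresh; afterwards keep only structured fields."""
    age = now - record["created_at"]
    if age <= raw_retention:
        return dict(record)
    return {key: value for key, value in record.items() if key in KEEP_FIELDS}
```

Anything not named in `KEEP_FIELDS`, such as a full transcript, is dropped once the record ages out, while the identifiers, status, and cost remain available for evaluation and audit.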

Implementation checklist

Your logging layer is probably healthy when:

every run has a stable ID and final status;
tool calls, approvals, and retries are explicit events;
outcome and rescue states are distinguishable;
cost and latency can be attached to workflow class;
and sensitive data is governed deliberately instead of logged by default.

Compare next

How do you evaluate AI agents in production? Use this page when the logging question is part of a larger production-evaluation design.

Tool-call success rates and ground truth: Use this page when the team needs to separate tool success, workflow success, and final-answer quality.

What is a good success rate for an AI agent in production? Use this page when logging now needs to support a real success-rate definition rather than a vague pass metric.

Do AI agents need human approval in production? Use this page when your logging model needs to prove whether approval gates are actually changing outcomes.

Reader value check

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. For the question "What should you log for an AI agent in production?", the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check	What the reader should be able to answer
Signal quality	Can the team explain what behavior the signal proves, and what it does not prove?
Release use	Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning	Does each miss become a reusable eval case instead of a one-off complaint?
Owner	Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.