# Cost per success and tool economics for agentic products
## Quick answer

The real cost metric for an agentic product is usually cost per successful task, not cost per model call.
Per-call pricing is still necessary, but it is not enough. If a workflow uses search, retrieval, code execution, browsing, or multi-step orchestration, the team needs to ask:
- how often the task succeeds,
- how often it succeeds without human cleanup,
- and what the whole workflow costs when it does.
That is the only level where tool use can be judged honestly.
## Why per-call thinking breaks down

Per-call thinking fails because agentic systems do not sell API calls. They sell completed outcomes.
A product can have:
- a cheap average call cost but terrible completion rate,
- a low model bill but a heavy managed-tools bill,
- or a polished agent loop that still loses money once retries, human review, and failure cleanup are included.
This is why a lower token bill can still hide a worse business.
## Current official signals (checked April 15, 2026)

| Source | Current signal | Why it matters |
|---|---|---|
| OpenAI API pricing | Model pricing is only one part of runtime economics | Tool-connected systems need economics above the token layer |
| OpenAI API priority processing | Service tier changes can materially affect runtime spend | Product economics now include routing and service-tier choices |
| OpenAI tools guides | Search, file access, and execution are first-class workflow primitives | Tool calls should be budgeted as workflow decisions, not hidden implementation details |
## The better metric stack

For most serious products, track these together:
- Cost per request
- Cost per attempted task
- Cost per successful task
- Cost per accepted result
- Cost per retained user workflow
Only the first number is easy. The third and fourth are where product truth usually appears.
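As a sketch, all five metrics can be computed from one log of per-workflow records. The schema below is hypothetical; field names like `accepted` and `retained` are illustrative, not from any particular product's telemetry:

```python
from dataclasses import dataclass

@dataclass
class WorkflowRecord:
    # Hypothetical per-workflow log entry; all fields are illustrative.
    cost: float       # total spend for this attempt (model + tool calls)
    requests: int     # individual model/tool requests made
    succeeded: bool   # the workflow completed correctly
    accepted: bool    # result accepted without human rework
    retained: bool    # the user kept using this workflow afterward

def metric_stack(records: list[WorkflowRecord]) -> dict:
    """Compute the five cost metrics; None when a denominator is zero."""
    total = sum(r.cost for r in records)

    def per(count: int):
        return total / count if count else None

    return {
        "cost_per_request": per(sum(r.requests for r in records)),
        "cost_per_attempted_task": per(len(records)),
        "cost_per_successful_task": per(sum(r.succeeded for r in records)),
        "cost_per_accepted_result": per(sum(r.accepted for r in records)),
        "cost_per_retained_workflow": per(sum(r.retained for r in records)),
    }
```

Note that all five metrics share the same numerator, total spend; only the denominator changes. That is why cost per success and cost per accepted result can diverge sharply from cost per request.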
## What counts as success

Do not define success as “the model answered.”
Success should usually mean:
- the correct workflow completed,
- required evidence or tool output was gathered,
- the result met policy requirements,
- and the user did not need to redo or manually repair the work.
Anything weaker inflates apparent efficiency.
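The strict definition above can be encoded as a single predicate. A minimal sketch, with illustrative argument names rather than fields from any real system:

```python
def is_success(completed: bool, evidence_gathered: bool,
               met_policy: bool, needed_rework: bool) -> bool:
    # Strict success: every condition must hold, and the user
    # must not have had to redo or manually repair the work.
    return completed and evidence_gathered and met_policy and not needed_rework
```

Counting only rows where this predicate is true keeps the denominator of every cost-per-success number honest.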
## A practical cost-per-success formula

Use a simple version first:

cost per success = total workflow cost / successful workflow completions
Include:
- model calls,
- tool calls,
- retries,
- background runs,
- review overhead,
- and known cleanup or fallback cost when the workflow fails.
Even a rough version of this metric is more useful than a precise token spreadsheet that ignores human rework.
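A minimal sketch of the simple version, assuming the cost buckets listed above have already been summed; the parameter names are illustrative:

```python
def cost_per_success(model_cost: float, tool_cost: float, retry_cost: float,
                     background_cost: float, review_cost: float,
                     cleanup_cost: float, successes: int) -> float:
    # Total workflow cost includes human review and failure cleanup,
    # not just the model and tool bills.
    total = (model_cost + tool_cost + retry_cost
             + background_cost + review_cost + cleanup_cost)
    if successes == 0:
        raise ValueError("no successful completions: cost per success is undefined")
    return total / successes
```

For example, $250 of total workflow cost spread over 200 successful completions gives $1.25 per success, even when the model bill alone looked far cheaper per call.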
## Where tool economics usually go wrong

### Search everywhere

Teams enable search on every request, even when the task is stable or closed-world. That improves demos but inflates cost per success on routine work.
### Retrieval without evidence of uplift

Managed file search or external retrieval is added because it feels architecturally mature, not because it improves successful completion enough to justify the stack.
### Execution without product necessity

Execution tools are often used because they make answers look serious. If the user value barely changes, the runtime cost is simply product drag.
### Retry loops disguised as resilience

Retries, fallbacks, and multiple tool paths can rescue some tasks, but they also raise the cost of every eventual success. The product must decide which rescues are worth buying.
## The operating rule that works

Every major tool path should answer three questions:
- What does this tool measurably increase: success rate, accepted-result rate, or quality?
- What does it cost across the full workflow, including retries?
- Which user or business outcome justifies the difference?
If the team cannot answer all three, the tool is probably being paid for on hope rather than evidence.
## A useful experiment design

For a target workflow, compare:
- model-only baseline,
- minimal-tool version,
- full-tool version.
Measure:
- latency,
- total workflow cost,
- success rate,
- accepted-result rate,
- and need for human intervention.
This usually reveals whether tools are increasing real success or only expensive sophistication.
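A sketch of how each variant's runs could be summarized into comparable rows. The per-run record is hypothetical; the keys are illustrative:

```python
def summarize_variant(name: str, runs: list[dict]) -> dict:
    # Each run is a dict with hypothetical keys:
    #   cost, latency_s, succeeded, accepted, needed_human
    n = len(runs)
    return {
        "variant": name,
        "total_cost": sum(r["cost"] for r in runs),
        "mean_latency_s": sum(r["latency_s"] for r in runs) / n,
        "success_rate": sum(r["succeeded"] for r in runs) / n,
        "accepted_rate": sum(r["accepted"] for r in runs) / n,
        "human_intervention_rate": sum(r["needed_human"] for r in runs) / n,
    }
```

Running this over the model-only, minimal-tool, and full-tool conditions puts all three variants side by side on the same axes, so tool uplift has to show up in the numbers rather than in the demo.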
## The product-level implication

The right decision is not always “cheaper.”
Sometimes a more expensive tool path is justified because it:
- increases completion rate enough to lower cost per accepted result,
- cuts rework sharply,
- or raises quality enough to support a higher-value use case.
But teams need to prove that at the workflow level, not assume it from model or tool marketing.
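A toy example with invented numbers shows how this can play out: a path that costs twice as much per attempt can still be cheaper per accepted result if its acceptance rate is high enough.

```python
# Invented, illustrative numbers; not benchmarks from any real product.
cheap = {"cost_per_attempt": 0.50, "accepted_rate": 0.40}
expensive = {"cost_per_attempt": 1.00, "accepted_rate": 0.90}

def cost_per_accepted(path: dict, attempts: int = 100) -> float:
    total = path["cost_per_attempt"] * attempts
    accepted = path["accepted_rate"] * attempts
    return total / accepted

# cheap path:     50.0 spend / 40 accepted  = 1.25 per accepted result
# expensive path: 100.0 spend / 90 accepted ≈ 1.11 per accepted result
```

Here the expensive path wins at the only level that matters to the business, which is exactly the kind of claim that should be proven with workflow data rather than assumed.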