Skip to content

Prompt injection defenses for tool-using agents

Prompt injection defense starts with architecture, not wording.

The minimum viable defense is:

  • treat tool outputs and retrieved content as untrusted;
  • restrict which tools the agent may call from untrusted contexts;
  • require approval before side-effecting actions;
  • and use explicit allowlists for browsing, execution, or system actions.

If the system relies mainly on “the model should ignore malicious instructions,” the defense is weak.

The most important rule is simple: tool outputs are data, not instructions. A browsed web page, retrieved chunk, support ticket, PDF, repository file, or third-party API response can contain text that looks like an instruction. The agent runtime has to preserve the authority boundary anyway.

That means the system should not let content returned by a tool decide:

  • which higher-privilege tool becomes available;
  • whether an approval gate is skipped;
  • whether secrets, customer data, or internal files are disclosed;
  • whether the agent can write, delete, purchase, publish, or message externally.

This is the practical meaning behind the common warning that tool outputs are untrusted. The warning is only useful when it becomes runtime design: narrow tools, explicit scopes, side-effect gates, and reviewable traces.

Tool-using agents now read:

  • web pages,
  • documents,
  • tickets,
  • code repositories,
  • and tool responses that may contain attacker-controlled text.

That means the model is no longer only interpreting user input. It is interpreting untrusted operational content that can try to redirect tool use or policy behavior.

Official sourceCurrent signalWhy it matters
Computer use guideOpenAI explicitly calls out prompt injection risk and recommends allowlists for expected websitesBrowser-facing agents need control-plane restrictions, not only prompt instructions
MCP authorization specificationAuthorization structure remains a separate layer around tool accessTool connectivity does not remove the need for strict permission and approval design
OpenAI Agents SDKGuardrails, tools, and handoffs are framework-level conceptsInjection defense has to be expressed at runtime and orchestration layers too

Prompt injection usually enters through:

  • web search results,
  • browsed pages,
  • uploaded files,
  • retrieved knowledge chunks,
  • tool output that includes attacker-controlled text.

The risk is not only bad prose. It is that the agent changes plan, tool choice, or action scope because it treated untrusted content as instructions.

System instructions, user instructions, and tool content should not be treated as the same authority.

Untrusted context should not unlock broad write-capable tools.

Any meaningful external side effect should require review or explicit confirmation.

Especially for browser and computer-use workflows, the safest design is to limit reachable domains or action classes.

Tools should be specific enough that even a manipulated plan has limited blast radius.

These are weak by themselves:

  • longer prompts that say “ignore malicious instructions”;
  • generic safety statements with no runtime enforcement;
  • broad tools with no approval layer;
  • or post hoc logging with no prevention.

They may help, but they do not meaningfully change the control boundary.

This page should help a reader decide which authority, data access, tool scope, and runtime boundary the agent system should receive. For Prompt injection defenses for tool-using agents, the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring tool lists, auth scopes, sandbox limits, customer data classes, audit trails, and examples of unsafe tool output. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

CheckWhat the reader should be able to answer
AuthorityDoes the page distinguish advice, draft, write, delete, payment, and permission-changing actions?
IdentityIs it clear whether the agent acts as a user, service account, or constrained system role?
Runtime boundaryAre tools, network access, files, and secrets scoped to the smallest practical surface?
AuditabilityCan the team explain after the fact what the agent saw, decided, and changed?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For agent-system pages, the value is a safer architecture decision. The page should help readers reduce hidden authority before they add more tools or autonomy.