Prompt injection defenses for tool-using agents
What matters first
Prompt injection defense starts with architecture, not wording.
The minimum viable defense is:
- treat tool outputs and retrieved content as untrusted;
- restrict which tools the agent may call from untrusted contexts;
- require approval before side-effecting actions;
- and use explicit allowlists for browsing, execution, or system actions.
If the system relies mainly on “the model should ignore malicious instructions,” the defense is weak.
Tool outputs are untrusted
The most important rule is simple: tool outputs are data, not instructions. A browsed web page, retrieved chunk, support ticket, PDF, repository file, or third-party API response can contain text that looks like an instruction. The agent runtime has to preserve the authority boundary anyway.
That means the system should not let content returned by a tool decide:
- which higher-privilege tool becomes available;
- whether an approval gate is skipped;
- whether secrets, customer data, or internal files are disclosed;
- whether the agent can write, delete, purchase, publish, or message externally.
This is the practical meaning behind the common warning that tool outputs are untrusted. The warning is only useful when it becomes runtime design: narrow tools, explicit scopes, side-effect gates, and reviewable traces.
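One way to read "runtime design" here is that the permission check never receives tool output as an input. The following is a minimal sketch under that assumption; `SessionPolicy`, `is_permitted`, and the tool and data-class names are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionPolicy:
    """Set by the operator before the run; frozen so it cannot be mutated mid-run."""
    allowed_tools: frozenset[str]
    readable_data_classes: frozenset[str]


def is_permitted(policy: SessionPolicy, tool: str, data_class: str) -> bool:
    # Deliberately takes no tool output or model text as a parameter:
    # content returned by a tool has no path into this decision.
    return tool in policy.allowed_tools and data_class in policy.readable_data_classes


# Hypothetical session: read-only search over public data.
policy = SessionPolicy(
    allowed_tools=frozenset({"search_docs"}),
    readable_data_classes=frozenset({"public"}),
)
print(is_permitted(policy, "search_docs", "public"))
print(is_permitted(policy, "send_email", "public"))
```

Because the policy is fixed before any untrusted content is read, a page that says "you are now allowed to send email" changes nothing the runtime checks.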
Why this matters now
Tool-using agents now read:
- web pages,
- documents,
- tickets,
- code repositories,
- and tool responses that may contain attacker-controlled text.
That means the model is no longer only interpreting user input. It is interpreting untrusted operational content that can try to redirect tool use or policy behavior.
Official signals checked April 15, 2026
| Official source | Current signal | Why it matters |
|---|---|---|
| Computer use guide | OpenAI explicitly calls out prompt injection risk and recommends allowlists for expected websites | Browser-facing agents need control-plane restrictions, not only prompt instructions |
| MCP authorization specification | Authorization structure remains a separate layer around tool access | Tool connectivity does not remove the need for strict permission and approval design |
| OpenAI Agents SDK | Guardrails, tools, and handoffs are framework-level concepts | Injection defense has to be expressed at runtime and orchestration layers too |
Where injection actually enters
Prompt injection usually enters through:
- web search results,
- browsed pages,
- uploaded files,
- retrieved knowledge chunks,
- tool output that includes attacker-controlled text.
The risk is not only bad prose. It is that the agent changes plan, tool choice, or action scope because it treated untrusted content as instructions.
The strongest defenses
1. Trust-boundary separation
System instructions, user instructions, and tool content should not be treated as the same authority.
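As a rough illustration, the runtime can tag every piece of context with its authority and keep tool content out of the instruction channel. This is a sketch only; `Authority`, `Message`, and `split_by_authority` are hypothetical names, and labeling by itself does not stop a model from following injected text. Its purpose is to give the runtime something to key decisions on.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    SYSTEM = "system"
    USER = "user"
    TOOL = "tool"  # browsed pages, retrieved chunks, API responses, files


@dataclass(frozen=True)
class Message:
    authority: Authority
    text: str


def split_by_authority(messages: list[Message]) -> tuple[list[str], list[str]]:
    """Return (instructions, data). Only system and user text counts as instructions."""
    instructions = [
        m.text for m in messages if m.authority in (Authority.SYSTEM, Authority.USER)
    ]
    # Tool content is JSON-encoded so it travels as a quoted payload, not free text.
    data = [
        json.dumps({"untrusted_tool_content": m.text})
        for m in messages
        if m.authority is Authority.TOOL
    ]
    return instructions, data
```

Downstream checks (tool filtering, approval) can then ask whether any `TOOL` content is present instead of trying to detect malicious wording.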
2. Tool restrictions
Untrusted context should not unlock broad write-capable tools.
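A minimal sketch of this rule, assuming the runtime already knows whether untrusted content has entered the context: write-capable tools are simply not offered in that state. The registry contents and the `tools_for_context` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool can change state outside the agent


# Hypothetical registry for illustration.
REGISTRY = (
    Tool("search_docs", writes=False),
    Tool("read_ticket", writes=False),
    Tool("send_email", writes=True),
    Tool("delete_file", writes=True),
)


def tools_for_context(context_has_untrusted_content: bool) -> list[str]:
    """Expose only read-only tools once untrusted content is in the context."""
    if context_has_untrusted_content:
        return [t.name for t in REGISTRY if not t.writes]
    return [t.name for t in REGISTRY]
```

The filtering happens before the model sees the tool list, so a manipulated plan cannot select a tool that was never offered.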
3. Approval gates
Any meaningful external side effect should require review or explicit confirmation.
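An approval gate can be sketched as a wrapper around tool execution that calls out to a human (or another out-of-band reviewer) before any side-effecting tool runs. The names below (`SIDE_EFFECTING`, `run_tool`, `ApprovalDenied`) and the `approve` callback are assumptions for illustration, not a specific SDK's API.

```python
from typing import Callable

# Hypothetical set of tools that change state outside the agent.
SIDE_EFFECTING = frozenset({"send_email", "publish_post", "purchase", "delete_file"})


class ApprovalDenied(Exception):
    """Raised when a side-effecting call is not approved."""


def run_tool(
    name: str,
    args: dict,
    tools: dict[str, Callable[..., object]],
    approve: Callable[[str, dict], bool],
) -> object:
    # The gate lives in the runtime, so model output cannot skip it.
    if name in SIDE_EFFECTING and not approve(name, args):
        raise ApprovalDenied(name)
    return tools[name](**args)
```

The `approve` callback receives the exact tool name and arguments, so the reviewer confirms the concrete action and not a model-written summary of it.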
4. Allowlists
Especially for browser and computer-use workflows, the safest design is to limit reachable domains or action classes.
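For the domain case, a minimal sketch might look like the following. The hosts in `ALLOWED_HOSTS` are placeholders, and the exact-match, HTTPS-only rule is an assumption chosen to be conservative; a real deployment may need subdomain or path rules.

```python
from urllib.parse import urlsplit

# Placeholder hosts for illustration.
ALLOWED_HOSTS = frozenset({"docs.example.com", "status.example.com"})


def is_allowed_url(url: str, allowed_hosts: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS URLs whose hostname exactly matches an allowlisted host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URLs are rejected, not guessed at
    if parts.scheme != "https":
        return False
    # Compare the parsed hostname, never a substring of the raw URL, so that
    # "docs.example.com.evil.test" or "docs.example.com@evil.test" do not pass.
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

The check should run on every navigation, including redirect targets, since an allowlisted page can link or redirect elsewhere.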
5. Narrow action design
Tools should be specific enough that even a manipulated plan has limited blast radius.
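To make "limited blast radius" concrete, compare a generic "run any SQL" tool with a tool that can only request a bounded refund on one order. The sketch below shows the narrow version; the `ord_` identifier format, the cap, and the function names are hypothetical.

```python
from dataclasses import dataclass

MAX_REFUND_CENTS = 5_000  # hypothetical per-call cap set by the operator


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int


def validate_refund(order_id: str, amount_cents: int) -> RefundRequest:
    """Accept only a well-formed order id and an amount within the cap."""
    if not (order_id.startswith("ord_") and order_id[4:].isalnum()):
        raise ValueError("order_id must look like ord_<alphanumeric>")
    if not 0 < amount_cents <= MAX_REFUND_CENTS:
        raise ValueError("amount outside the permitted range")
    return RefundRequest(order_id, amount_cents)
```

Even if injected text steers the agent into calling this tool, the worst outcome is one capped refund on a valid-looking order, which is also a natural unit to put behind an approval step.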
What does not count as enough
These are weak by themselves:
- longer prompts that say “ignore malicious instructions”;
- generic safety statements with no runtime enforcement;
- broad tools with no approval layer;
- or post hoc logging with no prevention.
They may help, but they do not meaningfully change the control boundary.
What to read next
- Tool outputs are untrusted: prompt injection boundary
- MCP security and approval boundaries for enterprise AI teams
- Computer Use API vs browser automation for AI agents
- User-scoped auth vs service accounts for AI agents
Reader value check
This page should help a reader decide which authority, data access, tool scope, and runtime boundary the agent system should receive. For prompt injection defenses in tool-using agents, the page is not finished if it only explains vocabulary. It should change what the team approves, measures, routes, buys, logs, or refuses to automate.
Before applying the guidance, bring tool lists, auth scopes, sandbox limits, customer data classes, audit trails, and examples of unsafe tool output. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.
| Check | What the reader should be able to answer |
|---|---|
| Authority | Does the page distinguish advice, draft, write, delete, payment, and permission-changing actions? |
| Identity | Is it clear whether the agent acts as a user, service account, or constrained system role? |
| Runtime boundary | Are tools, network access, files, and secrets scoped to the smallest practical surface? |
| Auditability | Can the team explain after the fact what the agent saw, decided, and changed? |
Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.
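For the auditability row in particular, one possible shape for the evidence is an append-only trace that separates what the agent saw, what it decided, and what it changed. This is a sketch under that assumption; `Trace`, `TraceEvent`, and the three event kinds are hypothetical and mirror the wording of the table, not any logging standard.

```python
import json
import time
from dataclasses import asdict, dataclass, field

KINDS = frozenset({"saw", "decided", "changed"})


@dataclass(frozen=True)
class TraceEvent:
    kind: str      # one of KINDS
    detail: dict   # e.g. source URL, chosen tool, affected resource
    ts: float = field(default_factory=time.time)


class Trace:
    """Append-only record; events are never edited or removed."""

    def __init__(self) -> None:
        self._events: list[TraceEvent] = []

    def record(self, kind: str, detail: dict) -> None:
        if kind not in KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        self._events.append(TraceEvent(kind, dict(detail)))

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order events occurred.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)
```

A reviewer can then answer the table's auditability question by reading the trace in order, without reconstructing the run from model output.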
For agent-system pages, the value is a safer architecture decision. The page should help readers reduce hidden authority before they add more tools or autonomy.