What breaks when agents can auto-send messages or links — how to defend outbound actions
It breaks your assumption that “reading data” and “sending data” are separate, human-gated steps: once an agent can auto-send messages or embed links/images without per-action approval, outbound actions become a silent exfiltration channel that a few lines of indirect prompt injection can steer. Recent Copilot Cowork research shows that when delegated authority and automatic tool invocation are enabled, tiny payloads hidden in skills or documents can coerce the agent into reading tenant files and packaging them into outbound content that leaks when rendered or fetched.
Direct answer — what breaks and why
What breaks is the boundary between assistive automation and delegated autonomy. When an agent is allowed to post a Teams message, send an email “to the active user,” or embed remote resources without explicit, contextual approval, that outbound surface becomes equivalent to “network egress with tenant credentials.”
The specific failure mode is not that the model “decides to leak,” but that the agent runtime treats untrusted content (skills/docs/plugin metadata) as instructions, then executes high-impact actions under delegated authority. A small indirect injection can cause the agent to: (1) read internal files through connectors, and (2) transmit or trigger transmission via messages that contain links or externally hosted images. That combination turns routine collaboration features into a data-loss mechanism.
For a deeper treatment of the same class of failure (copilots that can send messages on your behalf), see: What breaks when a copilot can send messages for you — detecting and stopping file exfiltration.
How the attack works (technique primer)
The core technique is indirect prompt injection: the adversary places a short payload—public reports describe “only a few lines,” including a “five lines” example—inside content the agent will load and parse, such as a Copilot Skills file (e.g., a SKILL.md stored in OneDrive), a document, or plugin metadata.
From there, the chain looks like this (composite from public disclosures and guidance):
- The agent loads the poisoned skill/document as part of normal operation.
- The runtime mis-parses the injected text as high-priority directives.
- Through automatic tool invocation, the agent calls internal APIs/connectors to read additional tenant material (files, mailbox content) without a new human prompt.
- The agent crafts an outbound message that embeds:
  - a link that resolves to a pre-authenticated download URL for tenant files, and/or
  - an externally hosted image that “phones home” when the client renders it.
- When a recipient client (or subsequent automated processing) renders the message, it performs network requests that complete exfiltration—without a user explicitly approving either the read or the transfer.
The key mechanic is that the rendering/fetch step can move data (or authenticated URLs to data) outside your environment, even if your users never click anything.
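To make the render-time mechanic concrete, here is a minimal sketch that splits the URLs in an HTML message body into those a client fetches automatically on render (images) and those that need a click (anchors). The `TRUSTED_HOSTS` allowlist, the host names, and the function names are assumptions for illustration, not part of any Copilot or Teams API.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the tenant trusts for embedded resources.
TRUSTED_HOSTS = {"contoso.sharepoint.com"}


class OutboundScanner(HTMLParser):
    """Collects URLs from an HTML message body, split by whether a client
    fetches them on render (<img>) or only on click (<a>)."""

    def __init__(self):
        super().__init__()
        self.auto_fetched = []  # <img src=...>: requested without user action
        self.click_only = []    # <a href=...>: requested only when clicked

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("src"):
            self.auto_fetched.append(attr_map["src"])
        elif tag == "a" and attr_map.get("href"):
            self.click_only.append(attr_map["href"])


def untrusted_auto_fetches(html_body):
    """Return image URLs that would be fetched on render from hosts outside
    the allowlist. URLs without a host are flagged too (fail closed)."""
    scanner = OutboundScanner()
    scanner.feed(html_body)
    return [url for url in scanner.auto_fetched
            if urlparse(url).hostname not in TRUSTED_HOSTS]
```

Any non-empty result is a request that leaves your environment as soon as the message is displayed, which is why the image case deserves stricter handling than ordinary links.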
Where defenses fail in real systems
Three systemic assumptions tend to collapse once outbound actions are auto-approved:
- “Skills are code, so they’re trusted.” The research brief explicitly treats skill files and plugin metadata as injection carriers. If your agent ingests them as plain text instructions, they must be treated as untrusted input.
- “Outbound messages are harmless.” A message that embeds remote images or pre-authenticated links can be a transport layer. Even if the content looks like “just a notification,” rendering can trigger network access.
- “Tool invocation is internal.” Automatic tool invocation turns the agent into an orchestrator that can chain “read → summarize → send” across multiple connectors, with no separate checkpoint.
Microsoft’s own guidance on defending against indirect prompt injection frames this as a critical risk area requiring sandboxing, approvals, and network controls, because the vulnerability sits in the execution pipeline, not just in the model’s text generation.
Practical mitigations a solo builder can implement
Start by drawing a hard line around approvals for outbound actions that can carry tenant data.
- Per-action approvals for high-risk outbound operations
  Require explicit confirmation for send-message, post-link, and any action that could initiate downloads or embed remote resources. Make the approval prompt contextual: what data sources were accessed, what recipients/domains are targeted, and what artifacts (links/images) will be included. Record who approved and when.
- Least privilege for agent tokens
  Restrict readable scopes and avoid long-lived delegated tokens that work across broad content sets. Separate capabilities so “read internal” does not imply “share outbound.” This is the simplest way to reduce blast radius even if injection happens.
- Sandbox skill/plugin execution
  Run third-party skills in isolated processes with no direct access to tenant tokens or file APIs. Expose a narrow façade API for file reads and message sends, and audit every call through it. The goal is to prevent untrusted skill content from directly steering privileged connectors.
- Network and content policies for outbound messages
  Block agent-initiated outbound requests to arbitrary external hosts. Disallow or proxy remote images; strip or rewrite links through a vetted service that logs and scans requests. Since research demonstrates image/link rendering as the exfiltration trigger, controlling that surface is practical and directly relevant.
- Input handling: treat loaded skills/docs as untrusted
  Sanitize loaded content and enforce policy before the agent treats skill content as instructions. Heuristics can flag suspicious “prompt-like” patterns, ASCII smuggling, or conditional payload structures documented in public writeups, but fail closed when detection triggers.
- Observability and audit trails
  Log every tool call with: function name, caller context, and returned artifacts; include the exact prompt/context that triggered it; record approval events. Alert on bulk reads, repeated outbound sends, and first-seen external hosts.
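The per-action approval item in the list above can be sketched as a small gate in front of whatever function actually sends. This is an illustrative outline under stated assumptions: the action names, `OutboundRequest`, `ApprovalLog`, and `execute_outbound` are hypothetical and would need mapping onto your own agent runtime.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical action names; map these to whatever your agent runtime exposes.
HIGH_RISK_ACTIONS = {"send_message", "post_link", "embed_remote_resource"}


@dataclass
class OutboundRequest:
    action: str
    recipients: list
    sources_read: list  # files/connectors the agent touched before this action
    artifacts: list     # links or images the message would carry


@dataclass
class ApprovalLog:
    entries: list = field(default_factory=list)


def execute_outbound(request, approver, send, log):
    """Run `send(request)` only if a human approves a high-risk action.

    `approver(request)` returns the approver's id, or None to deny.
    """
    if request.action in HIGH_RISK_ACTIONS:
        approved_by = approver(request)
        # Record the decision together with the context shown to the approver.
        log.entries.append({
            "action": request.action,
            "recipients": request.recipients,
            "sources_read": request.sources_read,
            "artifacts": request.artifacts,
            "approved_by": approved_by,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if approved_by is None:
            return False  # fail closed: no approval, no send
    send(request)
    return True
```

Passing the data sources and artifacts into the request is the point of the design: the approver sees what was read and what will leave, not just "send a message?".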
These measures map directly onto both Microsoft’s mitigation guidance and third-party agent approval guidance (including OpenAI’s agent approvals security recommendations), and they’re implementable without needing vendor-level platform changes.
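As one example of the input-handling measure, the sketch below screens skill text before it reaches the agent and refuses to load it when anything is flagged. The phrase list is a deliberately small, hypothetical heuristic that will miss many payloads; the invisible-character check targets the Unicode Tags block (U+E0000 to U+E007F), which ASCII smuggling relies on. Function names are illustrative.

```python
import re

# Unicode Tags block (U+E0000 to U+E007F): invisible characters that can
# hide instructions inside otherwise normal-looking text.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")

# Hypothetical, intentionally tiny list of "prompt-like" phrases.
PROMPT_LIKE = re.compile(
    r"ignore (all |any )?(previous|prior) instructions"
    r"|you must now"
    r"|do not (tell|inform) the user",
    re.IGNORECASE,
)


def screen_skill_text(text):
    """Return a list of finding labels; an empty list means nothing flagged."""
    findings = []
    if TAG_CHARS.search(text):
        findings.append("invisible-unicode-tags")
    if PROMPT_LIKE.search(text):
        findings.append("prompt-like-directive")
    return findings


def load_skill(text):
    """Fail closed: refuse to hand flagged content to the agent."""
    findings = screen_skill_text(text)
    if findings:
        raise ValueError(f"skill rejected: {', '.join(findings)}")
    return text
```

Heuristics like these are a tripwire, not a boundary; they are only useful alongside the approval and egress controls described above.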
Design patterns that reduce attack surface
Use capability splitting as a design primitive: an agent that can read tenant files should not be able to share them outward without a second, explicit gate.
Two patterns follow from the Copilot Cowork-style exfil chain:
- Read vs. share separation: keep “message send/post” privileges in a different role/token from “file read.” Make the agent request an approval token (or a human-mediated workflow) to cross the boundary.
- Proxy + tokenization for links: if outbound content must include tenant material, generate short-lived, purpose-limited links via a proxy that enforces origin, scope, and expiry—and logs every fetch. This directly targets the “pre-authenticated download link” vector described in the research brief.
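The proxy + tokenization pattern can be approximated with an HMAC-signed, short-lived token that binds a file to one recipient and an expiry time. This is a minimal sketch under assumptions: `SIGNING_KEY`, the token layout, and both function names are hypothetical, and a real proxy would also log every fetch and check the request origin.

```python
import hashlib
import hmac
import time

# Hypothetical signing key; in practice this would come from a secret store.
SIGNING_KEY = b"replace-me"


def _signature(file_id, recipient, expires_at):
    payload = f"{file_id}|{recipient}|{expires_at}".encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def issue_link_token(file_id, recipient, ttl_seconds=300, now=None):
    """Create a token valid for one file, one recipient, and a short window."""
    now = time.time() if now is None else now
    expires_at = int(now + ttl_seconds)
    return {
        "file_id": file_id,
        "recipient": recipient,
        "expires_at": expires_at,
        "sig": _signature(file_id, recipient, expires_at),
    }


def verify_link_token(token, requester, now=None):
    """Check expiry, recipient binding, and signature before serving a file."""
    now = time.time() if now is None else now
    if now > token["expires_at"]:
        return False
    if requester != token["recipient"]:
        return False
    expected = _signature(token["file_id"], token["recipient"], token["expires_at"])
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(expected, token["sig"])
```

Compared with a pre-authenticated download URL, a token like this is useless to anyone but the named recipient and stops working after the expiry window.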
Observability playbook — what to monitor
If you can’t see it, you can’t prove an approval boundary held. Monitor egress and agent actions together:
- Tool invocations: function name, parameters, caller context, and artifacts returned from connectors (especially file APIs).
- Outbound endpoints: new or rare external domains; unusual volumes of remote fetches; outbound messages containing embedded images/links.
- Approval bypass attempts: any attempted send/post that conflicts with configured policy, including retries that vary recipients/domains.
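A minimal sketch of two of these signals, bulk reads and first-seen external hosts, run over a list of tool-call events. The event shape (`tool`, `session`, `urls`), the tool names, and the threshold are assumptions for illustration; adapt them to whatever your logging pipeline actually records.

```python
from collections import Counter
from urllib.parse import urlparse


def find_alerts(events, known_hosts, bulk_read_threshold=20):
    """Scan tool-call events in order and return (alert_type, subject) tuples.

    Each event is a dict with a "tool" name and a "session" id; send events
    may also carry a "urls" list of links/images embedded in the message.
    """
    alerts = []
    reads_per_session = Counter()
    seen_hosts = set(known_hosts)

    for event in events:
        if event["tool"] == "read_file":
            reads_per_session[event["session"]] += 1
            # Alert once, at the moment the threshold is reached.
            if reads_per_session[event["session"]] == bulk_read_threshold:
                alerts.append(("bulk-read", event["session"]))
        elif event["tool"] == "send_message":
            for url in event.get("urls", []):
                host = urlparse(url).hostname
                if host and host not in seen_hosts:
                    seen_hosts.add(host)
                    alerts.append(("first-seen-host", host))
    return alerts
```

Correlating the two is where the value lies: a bulk read followed shortly by a send to a first-seen host in the same session matches the read-then-send chain described earlier.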
This is also where many teams get tripped up operationally when automation scales; a related failure mode is noisy, low-signal agent output clogging systems of record (see What Breaks When Issue Trackers Fill Up With LLM‑Generated Bug Reports?).
Why It Matters Now
This moved from “theoretical” to “practical” with May 2026 disclosures highlighted in PromptArmor’s Copilot Cowork practitioner guidance and related reporting: researchers demonstrated exfiltration of Microsoft 365 files via indirect prompt injection in poisoned skills, using the agent’s ability to send messages without human approval and embedding external images that request pre-authenticated download links.
At the same time, enterprise assistants are being deployed with low-friction custom skills/plugins—exactly the ingestion path attackers need. Vendor mitigations exist, but the research brief is explicit that tenant and builder configuration determines real-world exposure: whether outbound actions are auto-approved, whether skills are sandboxed, and whether network egress is constrained.
What to Watch
- Updates to Copilot/Cowork approval models, sandboxing guidance, and indirect prompt injection mitigations from Microsoft.
- New research extending ASCII smuggling, conditional injections, or alternative exfil channels that don’t rely on obvious links/images.
- Product shifts in default posture (for example: outbound actions disabled or more tightly scoped by default) and improved platform-level observability/proxy options.
- Growing compliance scrutiny of delegated agent actions—especially where “automatic” outbound sends can constitute reportable data transfer.
Sources:
https://pulse24.ai/news/2026/5/25/23/microsoft-copilot-cowork-exfiltrates-m365-files
https://byteiota.com/microsoft-copilot-cowork-file-exfiltration/
https://learn.microsoft.com/en-us/security/zero-trust/sfi/defend-indirect-prompt-injection
https://developers.openai.com/codex/agent-approvals-security
About the Author
yrzhe
AI Product Thinker & Builder. Curating and analyzing tech news at TechScan AI.