Security

Email Prompt Injection Attacks

When an AI agent processes incoming email, every message becomes a potential attack vector. A malicious sender crafts text that hijacks the agent's behavior: overriding instructions, leaking data, triggering actions the operator never intended. This is prompt injection, and it's the defining security challenge for agentic email.

Below are the 10 major attack categories. Each page walks through how the attack works, shows real examples, and covers defenses. Worth reading even if you think your agent is safe (especially then).

Radial diagram showing 10 prompt injection attack categories with defense indicators

Instruction Override"Ignore all previous instructions." The most common attack, and still effective.Role PlayPersona hijacking. The attacker forces the AI into a new identity to sidestep safety controls.System Prompt MimicryFake [SYSTEM] tokens and control directives that look authoritative to the model.Delimiter AbuseCode fences, XML tags, separators. Anything that makes injected instructions look structural.Data ExfiltrationTricking the agent into leaking system prompts, training data, or internal state.Encoding EvasionBase64, hex, rot13. If the model can decode it, attackers will use it to hide payloads.Context ManipulationRewriting the conversation context so the agent thinks it's in a different situation entirely.Multi-Language AttacksMid-text language switches that slip past English-only detection rules.Social EngineeringUrgency, guilt, flattery. The same tricks that work on humans work on language models too.Token SmugglingZero-width characters, homoglyphs, unicode tricks. Invisible to humans, visible to tokenizers.

Why email is different

Chat interfaces have a human in the loop. Email agents don't. They forward messages, update CRMs, schedule meetings, draft replies, all autonomously. A successful injection can make the agent leak confidential data by forwarding it to the attacker, impersonate the account holder by sending replies they never wrote, or corrupt business workflows with fabricated data. The agent follows instructions. It can't tell whose.

No single defense works. You need layers: input scanning, output policy, decision traces, and guardrails that assume every inbound message is hostile.