Security
Your agent's inbox is an attack surface
Anyone can email your agent. That means anyone can try to inject instructions into it. We scan every inbound message for injection patterns, quarantine high-risk content, and use canary tokens to block exfiltration attempts before any data leaves.

From inbound to safe
Content sanitized on arrival
Invisible Unicode, data URIs, scripts, iframes, hidden elements: all stripped before the content is stored. Your agent never sees the raw attack surface.
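To make the stripping step concrete, here is a minimal Python sketch. It is not the service's actual sanitizer: the function names, the regexes, and the choice to treat Unicode category Cf as "invisible" are all assumptions for illustration, and a production pipeline would more likely use a real HTML parser than regular expressions.

```python
import re
import unicodedata

# Hypothetical patterns; the real sanitizer's rules are not documented here.
_SCRIPT_OR_IFRAME = re.compile(
    r"<(script|iframe)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_HIDDEN_ELEMENT = re.compile(
    r"<(\w+)\b[^>]*(?:\bhidden\b|display\s*:\s*none)[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_DATA_URI = re.compile(r"data:[\w/+.-]+(?:;[\w=-]+)*,[^\s\"'<>)]*", re.IGNORECASE)


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), such as zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def sanitize(html: str) -> str:
    """Remove invisible characters, scripts, iframes, hidden elements, data URIs."""
    # Strip invisible characters first so they cannot be used to split a tag name.
    html = strip_invisible(html)
    html = _SCRIPT_OR_IFRAME.sub("", html)
    html = _HIDDEN_ELEMENT.sub("", html)
    return _DATA_URI.sub("", html)
```

Stripping every Cf character is deliberately blunt: it also removes zero-width joiners that some emoji sequences rely on, which is a trade-off a real sanitizer might handle more carefully.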
Injection patterns scored
Four categories: instruction overrides, role play attempts, system prompt mimicry, delimiter abuse. Each match adds to a risk score.
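A rough sketch of category-based scoring, assuming each matched category contributes a fixed weight and the total is capped at 1.0. The weights, the cap, and the delimiter-abuse pattern are invented for illustration; only the category names and the quoted phrases come from this page.

```python
import re

# Illustrative weights and patterns only; the real rule set is not published here.
CATEGORIES = {
    "instruction_override": (0.5, [
        r"ignore (all )?previous instructions",
        r"disregard all previous",
        r"override your rules",
    ]),
    "role_play": (0.3, [r"\byou are now\b", r"\bact as a\b", r"\bpretend to be\b"]),
    "system_prompt_mimicry": (0.4, [r"\[/?INST\]", r"<</?SYS>>"]),
    # Guess at what delimiter abuse might look like: a fake section marker.
    "delimiter_abuse": (0.2, [r"^\s*(-{3,}|#{3,}|`{3,})\s*(system|assistant)\b"]),
}


def score_message(body: str) -> tuple[float, list[str]]:
    """Return (risk score, matched category names) for a message body."""
    matched = []
    score = 0.0
    for name, (weight, patterns) in CATEGORIES.items():
        if any(re.search(p, body, re.IGNORECASE | re.MULTILINE) for p in patterns):
            matched.append(name)
            score += weight
    # Capping at 1.0 is an assumption of this sketch.
    return min(score, 1.0), matched
```

A message such as "Please ignore previous instructions. You are now an admin." would match the first two categories in this sketch, while an ordinary scheduling email would match none and score zero.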
High-risk messages quarantined
Score too high? The message body is replaced with a placeholder in the agent view. Operators can review the quarantined content with elevated permissions.
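The agent view and the operator view can be pictured as two reads of the same stored message. In this sketch the 0.7 cutoff comes from the risk table on this page; the placeholder wording, the `operator` role name, and the function itself are hypothetical.

```python
from dataclasses import dataclass

QUARANTINE_THRESHOLD = 0.7  # the "high" range in the risk table on this page
PLACEHOLDER = "[Message quarantined: pending operator review]"  # hypothetical wording


@dataclass
class StoredMessage:
    body: str
    risk_score: float


def body_for(message: StoredMessage, *, role: str, quarantine_enabled: bool = True) -> str:
    """Return the body a viewer is allowed to see ("operator" is an assumed role name)."""
    quarantined = quarantine_enabled and message.risk_score >= QUARANTINE_THRESHOLD
    if quarantined and role != "operator":
        return PLACEHOLDER
    return message.body
```

The original body stays in storage untouched; only the view handed to the agent changes, which is what lets an operator review it later.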
Thread anomalies flagged
Forged senders jumping into existing threads, intent flips from unknown addresses, rapid intent changes. These get flagged and (optionally) held for human review.
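One way to picture these checks, as a heavily simplified sketch: it assumes every message already carries an intent label from some upstream classifier, and it uses "sender not seen earlier in the thread" as a stand-in for forgery detection. The flag names, the `max_flips` limit, and the data model are all invented here; real forgery checks would likely also look at mail headers and authentication results.

```python
from dataclasses import dataclass


@dataclass
class ThreadMessage:
    sender: str
    intent: str  # label from a hypothetical upstream intent classifier


def detect_anomalies(
    history: list[ThreadMessage], incoming: ThreadMessage, max_flips: int = 2
) -> list[str]:
    """Return anomaly flags for an incoming message, given the thread so far."""
    flags = []
    known_senders = {m.sender.lower() for m in history}
    unknown_sender = bool(history) and incoming.sender.lower() not in known_senders
    if unknown_sender:
        flags.append("unknown_sender_in_thread")
    # An intent flip only counts here when it comes from an address new to the thread.
    if unknown_sender and incoming.intent != history[-1].intent:
        flags.append("intent_flip_from_unknown_address")
    # Count how often the intent changed between consecutive messages.
    intents = [m.intent for m in history] + [incoming.intent]
    flips = sum(1 for a, b in zip(intents, intents[1:]) if a != b)
    if flips > max_flips:
        flags.append("rapid_intent_changes")
    return flags
```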
Canary tokens catch exfiltration
Every thread gets a canary token in the agent context. If an attacker tricks your agent into echoing that token in an outbound email, the send is blocked. The data never leaves.
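The canary check can be sketched in a few lines. The page describes tokens as deterministic per thread; deriving them with an HMAC over the thread ID is an assumption of this example, as are the `cnry-` prefix, the token length, and both function names.

```python
import hashlib
import hmac


def canary_for_thread(thread_id: str, secret: bytes) -> str:
    """Derive the same token every time for a given thread (assumed HMAC scheme)."""
    digest = hmac.new(secret, thread_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"cnry-{digest[:24]}"


def outbound_allowed(body: str, thread_id: str, secret: bytes) -> bool:
    """Block the send if the thread's canary token appears in the outbound body."""
    return canary_for_thread(thread_id, secret) not in body
```

Because the token is derived rather than stored, the outbound check only needs the thread ID and the secret. A plain substring check like this would miss a token that an attacker asks the agent to encode or split, so treat it as the simplest possible form of the idea.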
Configure protection
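If you call the API from a script, the safety-settings request in this section might be built roughly like this. The base URL and the bearer-token header are placeholders, since this page does not state the host or the authentication scheme; only the path, the method, and the three flags come from the page.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real one


def build_safety_settings_request(
    api_key: str,
    *,
    quarantine_high_injection: bool = True,
    hold_critical_anomalies: bool = True,
    block_canary_violations: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) the PUT request for the safety settings."""
    payload = {
        "quarantineHighInjection": quarantine_high_injection,
        "holdCriticalAnomalies": hold_critical_anomalies,
        "blockCanaryViolations": block_canary_violations,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/me/safety-settings",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it.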
// Configure safety settings
PUT /v1/me/safety-settings
{
  "quarantineHighInjection": true,
  "holdCriticalAnomalies": true,
  "blockCanaryViolations": true
}

What we scan for
Instruction Override Detection
"Ignore previous instructions," "disregard all previous," "override your rules." These patterns are caught and scored before the content reaches your agent.
Role Play Detection
"You are now," "act as a," "pretend to be." Attempts to redefine agent behavior are flagged automatically.
System Prompt Mimicry
Fake system prompts using [INST], <<SYS>>, and other LLM control tokens embedded in email body text. We catch these too.
Content Quarantine
High-risk messages get quarantined. The agent sees a placeholder. An operator reviews the real content when they're ready.
Canary Tokens
Deterministic tokens per thread. Leaked in outbound content? Send blocked. This is how you catch data exfiltration in practice.
Thread Anomaly Detection
Forged thread injections, intent flips, rapid intent changes. With holdCriticalAnomalies enabled, critical anomalies hold the message for human review before it reaches the agent.
Every message gets a risk score
The score is based on matched injection categories. High-risk messages are quarantined when protection is enabled (which it is, by default). You can tune thresholds, but the out-of-the-box settings are opinionated toward safety.
Risk Level | Score Range   | Action
none       | 0             | No action
low        | > 0, < 0.3    | Logged, content delivered
medium     | >= 0.3, < 0.7 | Flagged for review
high       | >= 0.7        | Quarantined automatically
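The score ranges listed here translate directly into a lookup. This sketch uses the default boundaries shown on this page; the function name is hypothetical, and treating negative scores as "none" is an assumption.

```python
def risk_level(score: float) -> tuple[str, str]:
    """Map a risk score to (level, action) using the default ranges on this page."""
    if score <= 0:
        return ("none", "No action")
    if score < 0.3:
        return ("low", "Logged, content delivered")
    if score < 0.7:
        return ("medium", "Flagged for review")
    return ("high", "Quarantined automatically")
```

Note that both boundaries are inclusive on the upper tier: a score of exactly 0.3 is medium, and exactly 0.7 is high.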
Protection is on from day one
Injection scanning, quarantine, canary tokens, thread anomaly detection. All enabled by default. Your agent is protected from the first inbound message.