Security

Your agent's inbox is an attack surface

Anyone can email your agent. That means anyone can try to inject instructions into it. We scan every inbound message for injection patterns, quarantine high-risk content, and use canary tokens to catch data exfiltration before it leaves.

Inbound email being scanned for prompt injection patterns with detected fragments highlighted and a verdict badge

From inbound to safe

1

Content sanitized on arrival

Invisible Unicode, data URIs, scripts, iframes, hidden elements: all stripped before the content is stored. Your agent never sees the raw attack surface.

2

Injection patterns scored

Four categories: instruction overrides, role play attempts, system prompt mimicry, delimiter abuse. Each match adds to a risk score.

3

High-risk messages quarantined

Score too high? The message body is replaced with a placeholder in the agent view. Operators can review the quarantined content with elevated permissions.

4

Thread anomalies flagged

Forged senders jumping into existing threads, intent flips from unknown addresses, rapid intent changes. These get flagged and (optionally) held for human review.

5

Canary tokens catch exfiltration

Every thread gets a canary token in the agent context. If an attacker tricks your agent into echoing that token in an outbound email, the send is blocked. The data never leaves.

Configure protection

PUT /v1/me/safety-settings

// Configure safety settings
PUT /v1/me/safety-settings
{
  "quarantineHighInjection": true,
  "holdCriticalAnomalies": true,
  "blockCanaryViolations": true
}

What we scan for

Instruction Override Detection

"Ignore previous instructions," "disregard all previous," "override your rules." These patterns are caught and scored before the content reaches your agent.

Role Play Detection

"You are now," "act as a," "pretend to be." Attempts to redefine agent behavior are flagged automatically.

System Prompt Mimicry

Fake system prompts using [INST], <<SYS>>, and other LLM control tokens embedded in email body text. We catch these too.

Content Quarantine

High-risk messages get quarantined. The agent sees a placeholder. An operator reviews the real content when they're ready.

Canary Tokens

Deterministic tokens per thread. Leaked in outbound content? Send blocked. This is how you catch data exfiltration in practice.

Thread Anomaly Detection

Forged thread injections, intent flips, rapid intent changes. Critical anomalies hold the message for human review before it reaches the agent.

Every message gets a risk score

The score is based on matched injection categories. High-risk messages are quarantined when protection is enabled (which it is, by default). You can tune thresholds, but the out-of-the-box settings are opinionated toward safety.

Risk Level    Score Range    Action
none          0              No action
low           > 0, < 0.3    Logged, content delivered
medium        >= 0.3, < 0.7 Flagged for review
high          >= 0.7        Quarantined automatically

Protection is on from day one

Injection scanning, quarantine, canary tokens, thread anomaly detection. All enabled by default. Your agent is protected from the first inbound message.