Multi-Provider Failover: Never Depend on One ESP
2026-04-13
Every major email delivery provider has had a significant outage in the last two years. Resend, Postmark, SendGrid, Amazon SES - they all go down. The incidents range from a few minutes of degraded delivery to multi-hour full outages where emails queue and fail.
For human-authored email, a short outage is annoying. You delay a newsletter by an hour and nobody notices. For AI agents, the failure mode is different. Agents react to events. A trial signup triggers an onboarding sequence. A payment fails and triggers a dunning flow. A support request triggers a confirmation. If delivery is down when those events happen, the sends either fail silently or back up in a queue that may not drain in the right order when the provider recovers.
The solution is straightforward: don't route all your sends through a single provider.
What failover actually means in practice
Failover is not the same as having a backup provider you switch to manually when things break. Manual failover requires someone to notice the outage, decide to switch, update configuration, and verify the new routing is working. By the time that happens, your agent has already missed the sends that matter.
Automatic failover means the delivery layer tries another provider the moment one fails, within the same send request. The agent calls POST /send. The infrastructure evaluates which provider to use, tries the primary, and if it fails, tries the next one. The agent gets a success response. Nothing failed from its perspective.
The distinction matters for agents specifically because agents don't have someone watching a dashboard. A human operator notices when an important email didn't go out. An agent's only feedback is the response it gets from the send API. If that response is "sent" but the send actually failed silently somewhere downstream, the agent moves on without triggering any recovery logic.
How Molted's failover router works
The delivery layer in Molted uses a weighted failover router. At a high level, it works like this:
- Providers are ordered by weight (configured priority). The highest-weight provider is tried first.
- If a send succeeds, the failure counter for that provider resets and the result is returned.
- If a send fails and the error is retryable (provider timeout, 5xx, connection error), the router tries the next provider in the ordered list.
- If the error is not retryable (invalid recipient, malformed payload), the router stops immediately and returns the error. There's no point trying another provider for a bad recipient address.
- If all providers fail, the router returns all_providers_failed and the send is queued for retry.
The consecutive failure tracking is per-provider, not per-request. A provider that has hit the failure threshold gets moved to the end of the ordered list but stays available as a last resort. This prevents a degraded provider from being treated as completely dead when it might be intermittently available.
Here is roughly what that looks like in the routing logic:
private getOrderedProviders() {
  return this.config.providers
    .filter((p) => {
      const failures = this.consecutiveFailures.get(p.name) ?? 0;
      return failures < this.config.failoverThreshold;
    })
    .sort((a, b) => b.weight - a.weight)
    .concat(
      // Add back failed providers as last resort
      this.config.providers.filter((p) => {
        const failures = this.consecutiveFailures.get(p.name) ?? 0;
        return failures >= this.config.failoverThreshold;
      })
    );
}
Healthy providers are sorted by weight and tried in priority order. Providers at or above the failure threshold are appended at the end, in their configured order - still available if everything else fails, but tried only after every healthy provider.
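To make the surrounding loop concrete, here is a minimal sketch of how a send could walk that ordered list. It is not Molted's implementation: the FailoverRouter class, the SendOutcome and RouterResult types, and the synchronous send signature are all assumptions made for illustration (a real provider call would be async). It also assumes that a non-retryable error does not count against a provider's failure counter, which the description above does not specify.

```typescript
// Illustrative types only; these names are assumptions, not Molted's API.
type SendOutcome =
  | { kind: "ok"; providerMessageId: string }
  | { kind: "error"; retryable: boolean; reason: string };

interface Provider {
  name: string;
  weight: number;
  // Synchronous for brevity; a real provider call would be async.
  send: (to: string, body: string) => SendOutcome;
}

type RouterResult =
  | { status: "sent"; provider: string; providerMessageId: string; tried: string[] }
  | { status: "failed"; reason: string; tried: string[] };

class FailoverRouter {
  private providers: Provider[];
  private failoverThreshold: number;
  private consecutiveFailures = new Map<string, number>();

  constructor(providers: Provider[], failoverThreshold: number) {
    this.providers = providers;
    this.failoverThreshold = failoverThreshold;
  }

  private failuresFor(name: string): number {
    return this.consecutiveFailures.get(name) ?? 0;
  }

  // Healthy providers by descending weight, then degraded ones as a last resort.
  private orderedProviders(): Provider[] {
    const healthy = this.providers.filter(
      (p) => this.failuresFor(p.name) < this.failoverThreshold
    );
    const degraded = this.providers.filter(
      (p) => this.failuresFor(p.name) >= this.failoverThreshold
    );
    return [...healthy.sort((a, b) => b.weight - a.weight), ...degraded];
  }

  send(to: string, body: string): RouterResult {
    const tried: string[] = [];
    for (const provider of this.orderedProviders()) {
      tried.push(provider.name);
      const outcome = provider.send(to, body);
      if (outcome.kind === "ok") {
        // Success resets this provider's consecutive failure counter.
        this.consecutiveFailures.set(provider.name, 0);
        return {
          status: "sent",
          provider: provider.name,
          providerMessageId: outcome.providerMessageId,
          tried,
        };
      }
      if (!outcome.retryable) {
        // A bad recipient or payload will not improve on another provider.
        // Assumption: this does not count against the provider's health.
        return { status: "failed", reason: outcome.reason, tried };
      }
      // Retryable infrastructure failure: record it and try the next provider.
      this.consecutiveFailures.set(provider.name, this.failuresFor(provider.name) + 1);
    }
    // Queuing for retry would happen in the caller, not in this sketch.
    return { status: "failed", reason: "all_providers_failed", tried };
  }
}

// Usage: the primary times out, so the secondary delivers within the same call.
const router = new FailoverRouter(
  [
    { name: "primary", weight: 10, send: () => ({ kind: "error", retryable: true, reason: "timeout" }) },
    { name: "secondary", weight: 5, send: () => ({ kind: "ok", providerMessageId: "msg-1" }) },
  ],
  2
);
const result = router.send("user@example.com", "Welcome!");
console.log(result.status, result.tried);
```

Because the counter is kept per provider across requests, a primary that keeps timing out drops to the back of the list after the threshold is reached, and later sends go straight to the secondary.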
What gets retried vs. what doesn't
Not every delivery failure is worth retrying across providers. The router distinguishes between retryable and non-retryable errors.
Retryable:
- Provider API timeout or connection failure
- 5xx responses from the provider (server errors on their end)
- Rate limit responses (429) when the limit is provider-side, not policy-side
Not retryable:
- Invalid recipient address (the address is bad; another provider won't fix that)
- Authentication failure (your API key is wrong for this provider)
- Malformed payload (your template rendered something invalid)
- Policy blocks (suppressed contact, rate limit exceeded by your policy - another provider won't change the policy decision)
The difference is important for agents. A policy block from Molted's policy engine is a deliberate decision: "this send should not happen right now." That decision doesn't become un-blocked by switching providers. If a contact is suppressed, they stay suppressed regardless of which delivery infrastructure you use.
Retryable errors, on the other hand, are infrastructure failures that have nothing to do with whether the send should happen. Trying a different provider is the right response.
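The two lists above can be expressed as a small classifier. This is a sketch under assumptions: the SendFailure shape, the source field, and the error code strings are hypothetical names chosen for illustration, not fields from any provider's or Molted's API.

```typescript
// Hypothetical failure shape; field names and codes are assumptions.
type FailureCode =
  | "timeout"
  | "connection_error"
  | "invalid_recipient"
  | "auth_failed"
  | "malformed_payload"
  | "suppressed"
  | "rate_limited";

interface SendFailure {
  // Who produced the failure: the delivery provider or the policy engine.
  source: "provider" | "policy";
  code?: FailureCode;
  httpStatus?: number;
}

function isRetryableAcrossProviders(failure: SendFailure): boolean {
  // Policy decisions (suppression, policy rate limits) never change with the provider.
  if (failure.source === "policy") return false;

  // Transport-level failures say nothing about the message itself.
  if (failure.code === "timeout" || failure.code === "connection_error") return true;

  // Problems with the request are treated as final.
  if (
    failure.code === "invalid_recipient" ||
    failure.code === "auth_failed" ||
    failure.code === "malformed_payload"
  ) {
    return false;
  }

  if (failure.httpStatus === undefined) return false;
  // A 429 reaching this point came from the provider, not from policy.
  if (failure.httpStatus === 429) return true;
  // 5xx: server errors on the provider's end.
  return failure.httpStatus >= 500 && failure.httpStatus <= 599;
}

// Usage: the same "rate limited" condition is classified by where it came from.
console.log(isRetryableAcrossProviders({ source: "provider", httpStatus: 429 }));
console.log(isRetryableAcrossProviders({ source: "policy", code: "rate_limited" }));
```

Checking the source first keeps the policy decision authoritative: nothing later in the function can turn a policy block into a failover attempt.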
Why single-provider setups are risky for agents specifically
Most teams start with a single delivery provider. It's simpler. One API key, one configuration, one dashboard. For low-volume human-authored email, the risk is manageable - you notice outages and respond.
AI agents change this in two ways.
Volume at events. Agents send email in response to events, not on a schedule. When you onboard 50 trial users in an afternoon, your agent sends 50 onboarding sequences. If your provider goes down for 30 minutes during that window, you've silently dropped up to 50 first-touch emails. The users signed up and heard nothing. First impressions matter in onboarding.
No human in the loop. An agent that calls POST /send and gets a failure response has two options: surface the error to whoever called it, or retry. Most agents retry with exponential backoff. If the provider is down for an hour, that's a lot of retries queuing up. When the provider recovers, they all fire at once - potentially in the wrong order, potentially violating cooldown windows, potentially re-contacting people who already received the email through another path.
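Multi-provider failover removes most of these failures at the source rather than managing them after the fact: an outage at a single provider no longer turns into failed sends or a backlog of retries.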
What this looks like from the agent's perspective
From the agent's side, none of this is visible. The agent calls the send endpoint, gets back a queued or sent response, and moves on. The decision trace records which provider actually delivered the message - that's accessible in the portal and via the API for audit purposes.
When a send fails over to a secondary provider, the response the agent receives is identical to a primary-provider success. The providerMessageId in the response will be from whichever provider actually delivered the message.
If all providers fail (which is rare but possible during major incidents), the send is marked as retryable and queued by the worker for automatic retry. The agent gets a queued response with a request ID it can use to check status.
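A minimal sketch of what this leaves for the agent to handle, assuming a response with a status of sent or queued. Only providerMessageId and the idea of a request ID come from the description above; the exact field names (status, requestId) and the AgentStep type are assumptions for illustration.

```typescript
// Assumed response shape; only providerMessageId is named in the text above.
type SendResponse =
  | { status: "sent"; requestId: string; providerMessageId: string }
  | { status: "queued"; requestId: string };

// What the agent does next. Note there is no retry branch:
// retries are the delivery layer's job in this model.
type AgentStep =
  | { action: "continue"; providerMessageId: string }
  | { action: "check_status_later"; requestId: string };

function nextStep(response: SendResponse): AgentStep {
  if (response.status === "sent") {
    // Same handling whether the primary or a secondary provider delivered.
    return { action: "continue", providerMessageId: response.providerMessageId };
  }
  // Queued: keep the request ID so the status can be checked later.
  return { action: "check_status_later", requestId: response.requestId };
}

// Usage with two hypothetical responses.
const delivered = nextStep({ status: "sent", requestId: "req_1", providerMessageId: "msg_1" });
const pending = nextStep({ status: "queued", requestId: "req_2" });
console.log(delivered.action, pending.action);
```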
Decision traces capture provider routing
Every send generates a decision trace that records the full path: which providers were tried, which succeeded, and the outcome. If you're debugging a delivery issue or reviewing send behavior, the trace shows you exactly what happened at the infrastructure layer without requiring you to cross-reference multiple provider dashboards.
For compliance and audit purposes, this matters. If a regulator asks "did this email go out, and how," the decision trace answers both questions in one place - the policy evaluation, the provider routing, and the delivery confirmation.
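As a rough picture of what such a record could contain, here is a hypothetical trace shape and a helper that answers "did this go out, and how". The DecisionTrace and ProviderAttempt types and their fields are assumptions for illustration, not Molted's actual trace schema.

```typescript
// Hypothetical trace schema; field names are assumptions.
interface ProviderAttempt {
  provider: string;
  outcome: "success" | "retryable_error" | "non_retryable_error";
}

interface DecisionTrace {
  requestId: string;
  policyDecision: "allowed" | "blocked";
  // Providers in the order they were tried.
  attempts: ProviderAttempt[];
}

// The provider that actually delivered the message, if any.
function deliveredBy(trace: DecisionTrace): string | undefined {
  return trace.attempts.find((a) => a.outcome === "success")?.provider;
}

// One-line audit summary covering policy, routing, and delivery.
function summarizeTrace(trace: DecisionTrace): string {
  if (trace.policyDecision === "blocked") {
    return `${trace.requestId}: blocked by policy, no provider tried`;
  }
  const tried = trace.attempts.map((a) => a.provider).join(" -> ");
  const winner = deliveredBy(trace);
  return winner
    ? `${trace.requestId}: delivered by ${winner} (tried ${tried})`
    : `${trace.requestId}: not delivered (tried ${tried})`;
}

// Usage: a send that failed over from the primary to the secondary.
const trace: DecisionTrace = {
  requestId: "req_123",
  policyDecision: "allowed",
  attempts: [
    { provider: "primary", outcome: "retryable_error" },
    { provider: "secondary", outcome: "success" },
  ],
};
console.log(summarizeTrace(trace));
```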
Delivery reliability is infrastructure. You don't build your own load balancer; you use one. Multi-provider failover is the same principle applied to email delivery. The complexity lives in the infrastructure layer so it doesn't have to live in your agent's retry logic.
Molted handles provider routing automatically on every send. If you're building an agent that sends email and want delivery reliability without managing multiple provider accounts yourself, try Molted or read the docs.
For the broader picture of what else can go wrong with agent email delivery, Email Deliverability for AI Agents: A Technical Guide covers the full set of risks and how to protect against them.