Retries & error handling

Three layers of error handling, in increasing scope:

Per-node retries: automatic, for transient failures.
Error-handling nodes: explicit recovery paths in the workflow.
Workflow-level error policy: what happens when a node failure isn't caught.

Pick the lightest tool that does the job.

Layer 1: Per-node retries

Many nodes (HTTP Request, connector nodes, database, etc.) have retry settings in their inspector:

Field	Notes
Max retries	Number of additional attempts after the first failure
Retry delay	Initial wait between attempts (ms)
Backoff strategy	`fixed` (same delay each time) or `exponential` (delay × 2 each retry)
Retry on	Which error types to retry: `5xx` only (default), `5xx + timeout`, or `all`
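
Read together, these settings describe a loop like the one below. This is a minimal Python sketch, not Flero's implementation: `call_with_retries`, `TransientError`, and the parameter names are hypothetical, and the delay is given in milliseconds to match the Retry delay field.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a failure that matches the 'Retry on' filter."""


def call_with_retries(operation, max_retries=3, retry_delay_ms=1000,
                      backoff="exponential", sleep=time.sleep):
    """Run operation(); on TransientError, retry up to max_retries more times."""
    delay_ms = retry_delay_ms
    for attempt in range(max_retries + 1):  # first attempt + max_retries retries
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: let the failure propagate
            sleep(delay_ms / 1000)
            if backoff == "exponential":
                delay_ms *= 2  # the delay doubles after each retry
```

The `sleep` argument is injectable only so the schedule can be inspected without actually waiting.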

Use for: transient network blips, brief upstream slowness, occasional rate-limit hiccups.

Don't use for:

Non-idempotent operations (sending email, charging a card), unless the service supports an idempotency key.
Authentication errors: retrying with a bad credential just generates more 401s.
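
The Retry on options can be read as a predicate over the failed attempt. The sketch below rests on assumptions: the function name and arguments are hypothetical, and treating `all` as "retry every failure, including 401s and other 4xx responses" is an inferred reading of the setting, not confirmed behaviour.

```python
def should_retry(retry_on, status=None, timed_out=False):
    """Decide whether a failed attempt qualifies for a retry.

    retry_on  -- one of "5xx", "5xx + timeout", or "all"
    status    -- HTTP status code, or None if no response arrived
    timed_out -- True if the attempt hit a timeout
    """
    is_5xx = status is not None and 500 <= status <= 599
    if retry_on == "all":
        return True
    if retry_on == "5xx + timeout":
        return is_5xx or timed_out
    if retry_on == "5xx":
        return is_5xx
    raise ValueError(f"unknown retry_on value: {retry_on!r}")
```

Under this reading, the default `5xx` setting never retries a 401, which is the behaviour you want for authentication errors.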

Layer 2: Error-handling nodes

When you need a real recovery path, wire the Error Handling node in front of the risky node:

[Upstream]──→[Error Handling]──→[Risky node: HTTP]
                  │                  │
                  ↓                  ↓
              [error path]      [success path]

The Error Handling node:

Wraps the risky node in a try/catch.
Catches by error type (filter by timeout, network, validation, authentication, or a custom regex).
Can retry the risky node N times itself before giving up.
Routes to the error port with the caught exception details for graceful handling.
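
In code terms, the node behaves roughly like a try/except with a fallback branch. A minimal Python sketch follows; `with_fallback` and its arguments are hypothetical names used for illustration, not part of the product.

```python
def with_fallback(primary, fallback, catch=(Exception,)):
    """Return primary(); if it raises one of `catch`, return fallback(error)."""
    try:
        return primary()
    except catch as error:
        # The caught exception is handed to the error path for graceful handling.
        return fallback(error)


# Usage: try the primary API, fall back to a cached value on timeout.
cache = {"price": 41}


def fetch_price():
    raise TimeoutError("primary API timed out")


price = with_fallback(fetch_price, lambda error: cache["price"], catch=(TimeoutError,))
```

Errors that are not listed in `catch` still propagate, which mirrors filtering by error type.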

Use for:

"Try the primary API, fall back to the cached value."
"Send to Slack; if that fails, send to email."
"Process this record; if it fails, log to a dead-letter queue and continue."

Layer 3: Workflow error policy

When a failure isn't caught by layer 1 or layer 2, the workflow's error policy decides what happens. Set it under Workflow settings → Error policy:

Policy	Behaviour
`fail workflow` (default)	The whole execution is marked `failed`. Downstream nodes don't run.
`continue`	The failing node's status is `failed`, but the execution continues with empty output on that port. The workflow as a whole completes as `success` if everything else succeeds.
`route to error path`	If the workflow has any node connected via an error edge, route there. Otherwise behaves like `fail workflow`.

Most production workflows want fail workflow: explicit failure is loud and obvious. Use continue when partial completion is acceptable (for example, a daily report that's allowed to skip a section it couldn't generate).
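
The difference between fail workflow and continue can be sketched with a toy runner. This is an illustration under assumptions, not the engine's code: `run_workflow` is a hypothetical helper, the workflow is a simple linear chain, and the route to error path policy is left out.

```python
def run_workflow(nodes, policy="fail workflow"):
    """Run (name, fn) nodes in order; each fn receives the previous output.

    Returns (execution_status, per_node_statuses).
    """
    statuses = {}
    data = None
    for name, fn in nodes:
        try:
            data = fn(data)
            statuses[name] = "success"
        except Exception:
            statuses[name] = "failed"
            if policy != "continue":
                return "failed", statuses  # downstream nodes never run
            data = None  # empty output on that port; keep going
    return "success", statuses
```

With continue, a chain whose middle node fails still reports `success` overall, and only the per-node status shows that something went wrong.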

Continue on fail (per node)

Some nodes have a Continue on fail toggle in their inspector. When it is on, a node failure is treated as success for the purposes of the fail workflow policy, so the workflow doesn't abort.

The failing node still records a failed status, but downstream nodes run with null (or the node's configured fallback) as input.

Use sparingly. It's a foot-gun: silent partial failure is harder to debug than loud full failure.

Idempotency

If you're going to retry an operation, make sure repeating it is safe:

Send email: providers like SendGrid accept an idempotency key (or an X-Idempotency-Key header). Use the trigger event ID as the key.
Create CRM record: most CRMs have upsert operations that look up by external ID; use those.
Charge a card: Stripe and similar payment APIs accept an Idempotency-Key header. Set it from your trigger event ID.
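
One way to derive such a key is sketched below in Python. The namespace URL, the `idempotency_key` helper, and the event ID `evt_123` are hypothetical. Mixing the step name into the key is a design choice of this sketch, so that two steps triggered by the same event don't share a key; using the raw event ID also works when the workflow has only one such step.

```python
import uuid

# Hypothetical namespace for this workflow; any fixed UUID works.
WORKFLOW_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.com/workflows/send-invoice"
)


def idempotency_key(event_id, step):
    """Derive a stable key from the trigger event ID and the step name."""
    return str(uuid.uuid5(WORKFLOW_NAMESPACE, f"{step}:{event_id}"))


# The same event and step always produce the same key, so a retry reuses it.
headers = {"Idempotency-Key": idempotency_key("evt_123", "charge-card")}
```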

For operations that genuinely have no idempotency support, don't retry. Wrap with an Error Handling node and surface the failure for manual review.

Dead-letter pattern

For workflows that process many items, the conventional pattern:

[Trigger: queue / batch]
   ↓
[Loop]──→[Try: process item]
              │
              ├──→[success: continue]
              └──→[error: log + push to dead-letter queue]
   ↓
[Done]

Failed items don't stop the whole batch; they're set aside for human review. Implement the dead-letter queue with a Database write, a Cache entry, or a connector to your real DLQ system.
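
The loop in the diagram can be sketched in a few lines of Python. The names are hypothetical, and a plain list stands in for the dead-letter queue; in a real workflow that would be the database write or DLQ connector.

```python
def process_batch(items, process_item, dead_letters):
    """Process every item; failures are recorded instead of aborting the batch."""
    processed = []
    for item in items:
        try:
            processed.append(process_item(item))
        except Exception as error:
            # Keep enough context for a human to review and replay the item.
            dead_letters.append({"item": item, "error": repr(error)})
    return processed
```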

Tips & gotchas

Retries multiply cost. A workflow that retries 3× on a paid API can quadruple your bill on a bad day.
Exponential backoff with a high retry count adds up fast. 5 retries with 2× backoff and a 1 s initial delay means 1 + 2 + 4 + 8 + 16 = 31 seconds of waiting per failure. Worth it for resilience; bad for latency.
Workflow timeout includes retry waits. If your timeout is 30 s and you have a node that retries with exponential backoff, you may time out during retries before the node even has a chance to fail definitively.
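Don't retry on 4xx by default. Client errors aren't transient, so retrying just confirms the same error. The main exception is 429 (Too Many Requests), which signals a rate limit and can succeed after a delay.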
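
To check a retry configuration against the workflow timeout before deploying it, the backoff arithmetic above can be turned into a quick budget check. This is a sketch with hypothetical helper names; it assumes every attempt takes the same worst-case time, `per_attempt_s`.

```python
def total_backoff_wait_s(max_retries, retry_delay_s, backoff="exponential"):
    """Total time spent waiting between attempts if every attempt fails."""
    if backoff == "exponential":
        # Geometric series: d + 2d + ... + 2^(n-1) * d = d * (2^n - 1)
        return retry_delay_s * (2 ** max_retries - 1)
    return retry_delay_s * max_retries


def fits_in_timeout(max_retries, retry_delay_s, per_attempt_s, timeout_s,
                    backoff="exponential"):
    """True if the worst case (all attempts fail) finishes within timeout_s."""
    attempts = max_retries + 1  # retries multiply cost: 3 retries = 4 calls
    waits = total_backoff_wait_s(max_retries, retry_delay_s, backoff)
    return attempts * per_attempt_s + waits <= timeout_s
```

For the 30 s timeout mentioned above, 5 exponential retries starting at 1 s do not fit even before counting the time spent in the attempts themselves, while 3 retries with 2 s attempts do.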
