Echo retries transient failures automatically and writes unrecoverable batches to a dead letter queue, so failed events are preserved for inspection or replay instead of silently disappearing. Every destination gets the same batching and DLQ baseline.

The reliability story

Server-side event delivery isn't always a one-shot success. Ad platform APIs occasionally hit rate limits. Networks blip. Platform backends experience transient 5xx errors during deploys. Without proper handling, any of these would cause events to silently disappear, producing under-reported attribution and broken audiences.

Echo handles this in three layers: batching (chunk events to safe per-request limits), automatic retry (retry transient failures with exponential backoff), and a dead letter queue (preserve failed batches for inspection or replay). Every CAPI-style destination in Echo has the same baseline so behavior is predictable regardless of which platform you're targeting.

Batching

Each ad platform has a maximum number of events you can include in a single API request. Echo chunks your events to a safe size below each platform's hard limit to avoid 400 Bad Request errors on oversized payloads.

Meta, TikTok, Pinterest, Snapchat, Reddit, GA4, and Google Ads chunk at 500 events per request (all platforms have 1000-event hard limits, so this gives a safety margin). LinkedIn chunks at 100 per its documented limit. Microsoft Ads chunks at 50 per its ApplyOfflineConversions documented cap. Floodlight is a special case, since it has no batch endpoint and requires one GET per event; Echo handles each event as its own batch for uniform retry and DLQ handling.

Per-batch error handling means one bad batch doesn't take down others. If an invocation has 1500 events that split into three batches of 500, and the middle batch fails after retries, the other two batches still deliver successfully. Only the failed batch lands in the DLQ.
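The chunking and per-batch isolation described above can be sketched as follows. This is a minimal illustration, not Echo's actual implementation: the destination keys, the `chunk` and `deliver` helpers, and the `send_batch` callable are all hypothetical names; only the chunk sizes come from the text.

```python
from typing import Callable, Dict, Iterator, List

# Chunk sizes quoted in this section. The dictionary keys are hypothetical labels.
BATCH_SIZES: Dict[str, int] = {
    "meta": 500, "tiktok": 500, "pinterest": 500, "snapchat": 500,
    "reddit": 500, "ga4": 500, "google_ads": 500,
    "linkedin": 100,
    "microsoft_ads": 50,
    "floodlight": 1,  # no batch endpoint, so each event is its own batch
}


def chunk(events: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def deliver(destination: str, events: List[dict],
            send_batch: Callable[[List[dict]], None]) -> List[List[dict]]:
    """Send every batch independently and return the batches that failed.

    `send_batch` stands in for "send with retries"; if it raises, only that
    batch is recorded as failed and the remaining batches are still sent.
    """
    failed: List[List[dict]] = []
    for batch in chunk(events, BATCH_SIZES[destination]):
        try:
            send_batch(batch)
        except Exception:
            failed.append(batch)  # candidate for the DLQ
    return failed
```

With 1500 events for a 500-per-request destination, `deliver` makes three calls to `send_batch`; if only the second one raises, the returned list contains that single batch.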

Automatic retry

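When a batch fails with a transient error, Echo retries up to 3 times with increasing backoff: it waits 500ms before the first retry, 2 seconds before the second, and 5 seconds before the third. Those waits add up to 7.5 seconds (0.5 + 2 + 5), so a failing batch adds at most about 7.5 seconds of backoff delay, plus request time, before Echo gives up.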
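As a rough sketch of that schedule, assuming one initial attempt followed by up to three retries: the `send_with_retry` function, the `TransientError` exception, and the injectable `sleep` parameter are hypothetical names chosen for illustration, not Echo's internals.

```python
import time
from typing import Callable, List, Sequence

# Waits before retry 1, 2, and 3, in seconds (from the text above).
BACKOFF_SECONDS = (0.5, 2.0, 5.0)


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


def send_with_retry(send: Callable[[List[dict]], None],
                    batch: List[dict],
                    delays: Sequence[float] = BACKOFF_SECONDS,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once a send succeeds, False when every retry has failed."""
    attempts = len(delays) + 1  # one initial attempt plus one per delay
    for attempt in range(attempts):
        try:
            send(batch)
            return True
        except TransientError:
            if attempt < len(delays):
                sleep(delays[attempt])  # wait before the next retry
    return False  # retries exhausted; the caller would write to the DLQ
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. In the worst case the function sleeps for 0.5 + 2.0 + 5.0 = 7.5 seconds in total.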

Not all errors are retryable. Echo retries on HTTP 429 (rate limit), HTTP 5xx (server errors), and network-level failures (connection refused, timeouts). These represent transient conditions where the next attempt might succeed. Echo does not retry HTTP 400 Bad Request, 401 Unauthorized, or 403 Forbidden errors, because those typically mean bad payload or bad credentials and retrying won't help; they go straight to the DLQ.
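The classification above fits in a few lines. This is a sketch under two assumptions: a network-level failure is represented as `None` (no HTTP response was received), and any status code the text does not mention is treated as non-retryable. The `is_retryable` name is hypothetical.

```python
from typing import Optional


def is_retryable(status: Optional[int]) -> bool:
    """Decide whether a failed request is worth retrying.

    `status` is the HTTP status code, or None for a network-level failure
    such as a refused connection or a timeout.
    """
    if status is None:
        return True  # network-level failure
    if status == 429:
        return True  # rate limited
    if 500 <= status <= 599:
        return True  # server error
    # 400, 401, 403 are explicitly not retried; treating every other
    # status the same way is an assumption of this sketch.
    return False
```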

Retry is only safe on destinations that support idempotency via event_id or similar. Meta, TikTok, Pinterest, Snapchat, and Reddit all support this. If Echo's first attempt succeeded but the response got lost in transit (a common timeout scenario), retrying would normally create a duplicate. These platforms handle the dedup so you never double-count. GA4 Measurement Protocol is not retried, because it has no native dedup and retrying on an ambiguous failure would inflate numbers.

event_id makes retry free

Every event Echo sends includes a unique event_id generated by the SDK. Ad platforms that support deduplication use that ID to discard repeated deliveries of the same event, regardless of what caused them. On those platforms, retries cost nothing in data quality: they improve delivery rates without inflating counts.
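To see why a retried batch does not inflate counts, here is a deliberately simplified model of receiver-side deduplication. It is not any platform's actual implementation; the `DedupingReceiver` class is hypothetical and only illustrates the idea of keying on event_id.

```python
from typing import Iterable, Set


class DedupingReceiver:
    """Toy model of a platform that counts each event_id at most once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.counted = 0

    def receive(self, events: Iterable[dict]) -> None:
        for event in events:
            event_id = event["event_id"]
            if event_id in self._seen:
                continue  # duplicate delivery, ignored
            self._seen.add(event_id)
            self.counted += 1


receiver = DedupingReceiver()
batch = [{"event_id": "a1"}, {"event_id": "a2"}, {"event_id": "a3"}]
receiver.receive(batch)  # first attempt succeeds, but the response is lost
receiver.receive(batch)  # the retry delivers the same events again
```

After both deliveries, `receiver.counted` is 3 rather than 6, because the second delivery carries only event_ids the receiver has already seen.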

Dead letter queue (DLQ)

If all retries are exhausted, Echo writes the failed batch to the DLQ. The DLQ uses a "claim check" pattern: the full batch payload goes to S3 at a path like s3://tagpipes-echo-dlq/failed/{date}/{destination}/{sourceId}/{batchId}.json, and a small pointer message goes to SQS with metadata (destination, source, reason, S3 location).
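A minimal sketch of the claim check pattern, with the storage and queue calls injected as plain callables so the example runs without AWS access. The function names, the ISO date format in the key, and every field name except failureReason are assumptions made for illustration; the real payload and pointer schemas may differ.

```python
import json
from datetime import date
from typing import Callable, List

DLQ_BUCKET = "tagpipes-echo-dlq"  # bucket name taken from the example path


def dlq_key(day: date, destination: str, source_id: str, batch_id: str) -> str:
    """Build the object key. The YYYY-MM-DD date format is an assumption."""
    return f"failed/{day.isoformat()}/{destination}/{source_id}/{batch_id}.json"


def write_to_dlq(put_object: Callable[[str, str, str], None],
                 send_message: Callable[[str], None],
                 *, day: date, destination: str, source_id: str,
                 batch_id: str, events: List[dict], reason: str) -> dict:
    """Store the full batch in S3, then enqueue a small pointer to it."""
    key = dlq_key(day, destination, source_id, batch_id)
    payload = {"failureReason": reason, "events": events}
    put_object(DLQ_BUCKET, key, json.dumps(payload))  # the large "check"
    pointer = {  # hypothetical field names
        "destination": destination,
        "sourceId": source_id,
        "reason": reason,
        "s3Location": f"s3://{DLQ_BUCKET}/{key}",
    }
    send_message(json.dumps(pointer))  # the small "claim"
    return pointer
```

Writing the payload before the pointer means a consumer never receives a pointer to an object that does not exist yet. Keeping the queue message small also avoids the SQS message size limit for large batches.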

This preserves the events for inspection or future replay. A separate replay worker can drain the SQS queue, fetch each batch from S3, and re-send to the destination. Because every event carries an event_id, replay is safe on idempotent destinations: any duplicates are deduped by the platform.
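A replay worker along those lines might look like the sketch below. It is an outline under stated assumptions, not a shipped component: the pointer fields (`destination`, `s3Location`), the `events` field in the payload, and the `fetch_payload` and `send_batch` callables are all hypothetical, and deleting messages from the queue is left to the caller.

```python
import json
from typing import Callable, Iterable, List


def replay(messages: Iterable[str],
           fetch_payload: Callable[[str], str],
           send_batch: Callable[[str, List[dict]], None]) -> List[str]:
    """Re-send each dead-lettered batch and return the messages that succeeded.

    `messages` are raw pointer message bodies from the queue.
    `fetch_payload` takes an S3 location and returns the stored JSON text.
    `send_batch` takes a destination name and a list of events.
    """
    replayed: List[str] = []
    for message in messages:
        pointer = json.loads(message)
        payload = json.loads(fetch_payload(pointer["s3Location"]))
        try:
            send_batch(pointer["destination"], payload["events"])
        except Exception:
            continue  # leave the message on the queue for a later run
        replayed.append(message)  # safe for the caller to delete
    return replayed
```

Returning only the successfully replayed messages lets the caller delete exactly those from the queue, so a batch that fails again stays available for the next run.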

DLQ payloads expire from S3 after 30 days via a bucket lifecycle policy. SQS pointer messages are retained for up to 14 days, which is the maximum retention period SQS allows. Replay before those limits if you need the data.
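For reference, the two retention settings could be expressed as the following configuration values, shaped for the boto3 calls named in the comments. The rule ID and the `failed/` prefix filter are assumptions; how the bucket and queue are actually provisioned is not covered here.

```python
# 30-day expiry for DLQ payloads. This dictionary has the shape expected by
# boto3's s3.put_bucket_lifecycle_configuration(
#     Bucket=..., LifecycleConfiguration=DLQ_LIFECYCLE)
DLQ_LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-dlq-payloads",     # hypothetical rule name
            "Filter": {"Prefix": "failed/"},  # assumed from the example path
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}

# 14 days expressed in seconds. SQS takes attribute values as strings, as in
# sqs.set_queue_attributes(QueueUrl=..., Attributes=DLQ_QUEUE_ATTRIBUTES)
FOURTEEN_DAYS_SECONDS = 14 * 24 * 60 * 60
DLQ_QUEUE_ATTRIBUTES = {"MessageRetentionPeriod": str(FOURTEEN_DAYS_SECONDS)}
```

Because the pointer messages expire first, a payload can remain in S3 for a further 16 days after its pointer has gone.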

What you see in the dashboard

The Echo dashboard inside TagPipes shows per-destination metrics updating in near-real-time:

Events Received: count of events the Processor ingested for this destination.
Events Sent Success: events successfully delivered.
Processing Errors: events that failed delivery even after retries.
DLQ Writes: number of batches pushed to the DLQ.
DLQ Events: total number of events contained in those batches (the sum of each batch's size).

Healthy operation: received and sent success climb together. DLQ metrics stay at zero. If DLQ writes are non-zero, something is consistently failing and worth investigating in CloudWatch logs for the tagpipes-echo-processor Lambda.
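If you export these metrics, the reading above can be turned into a simple check. The `DestinationMetrics` class, its field names, and the returned messages are hypothetical; the dashboard itself is the source of truth.

```python
from dataclasses import dataclass


@dataclass
class DestinationMetrics:
    events_received: int
    events_sent_success: int
    processing_errors: int
    dlq_writes: int
    dlq_events: int


def diagnose(m: DestinationMetrics) -> str:
    """Map one destination's metrics to a coarse health verdict."""
    if m.dlq_writes > 0:
        return "investigate: batches are landing in the DLQ"
    if m.processing_errors > 0:
        return "investigate: errors without DLQ writes"
    return "healthy"
```

For example, `diagnose(DestinationMetrics(1500, 1000, 500, 1, 500))` reports the DLQ case, matching one failed batch of 500 events out of three.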

Tips

Healthy operation has zero DLQ writes. Investigate any climb, even a small one, before it becomes a trend.
Access tokens expire on most platforms. A sudden DLQ climb for one destination usually means a token rotation is due.
CloudWatch logs for the Processor Lambda record each retry attempt, for example "Meta CAPI batch 2/3 failed (HTTP 503). Retrying in 2000ms.", so you can see the retry loop working.
DLQ payloads contain the full transformed event, so you can inspect exactly what was being sent when it failed.
GA4 is a single-attempt destination. Expect higher sensitivity to transient network issues for that destination compared to CAPI destinations with retry.

Troubleshooting

DLQ writes climbing for one destination

Open the S3 DLQ bucket and look at a recent payload. The failureReason field in the payload tells you what went wrong. Common causes: an expired access token (401/403), repeated rate limiting (429 persisting after retries are exhausted), or a platform API outage (5xx for an extended period).

Processing errors rising without DLQ writes

This pattern usually means the Lambda is failing outside the delivery path, so the failure never reaches the DLQ write: for example, a Lambda timeout or an exception during event transformation (before the API call is made). Check CloudWatch for Lambda errors and memory usage, and increase the Lambda timeout or memory if needed.

DLQ configured but writes not appearing

Check that the Lambda environment variables DLQ_S3_BUCKET and DLQ_SQS_QUEUE_URL are set. Also check that the Lambda execution role has s3:PutObject on the DLQ bucket and sqs:SendMessage on the queue. If either permission is missing, the DLQ write fails with a log line like "[DLQ] Failed to write... AccessDenied" but doesn't block the Processor.
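Both checks can be sketched in a few lines. The `missing_dlq_config` helper is hypothetical, and the policy below is only an example of the two permissions named above: the bucket ARN assumes the bucket name from the example S3 path, and the queue ARN is a placeholder you would replace with your own.

```python
import os
from typing import List, Mapping

REQUIRED_ENV_VARS = ("DLQ_S3_BUCKET", "DLQ_SQS_QUEUE_URL")


def missing_dlq_config(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Example IAM policy document granting the two actions the DLQ write needs.
DLQ_WRITE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            # Assumes the bucket name shown in the example DLQ path.
            "Resource": "arn:aws:s3:::tagpipes-echo-dlq/*",
        },
        {
            "Effect": "Allow",
            "Action": "sqs:SendMessage",
            "Resource": "<your DLQ queue ARN>",  # placeholder
        },
    ],
}
```

Calling `missing_dlq_config()` inside the Lambda returns an empty list when both variables are set.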
