Why Webhook Delivery Fails (and What to Do About It)
Most developers treat webhook delivery as an afterthought. You make an HTTP POST to a configured URL, check for a 200, and move on. That's fine for prototypes. But the moment webhooks carry real business data — payment confirmations, order status changes, subscription lifecycle events — reliability becomes the whole game.
The problems aren't exotic. Receiving endpoints go down. Networks time out. Third-party services get overloaded. What looks like a simple push notification turns into a distributed coordination problem the moment anything goes wrong. This article covers how to build a webhook system that handles failure gracefully on both sides: for teams shipping webhooks to customers, and for teams consuming them from external providers.
The Core Delivery Problem
A webhook is a fire-and-forget HTTP request — except nothing is truly fire-and-forget when money is involved. If the receiving server returns a 500, takes 30 seconds to respond, or is simply unreachable, your webhook failed to deliver.
The naive fix is to retry immediately. But that creates a new problem: if you retry too fast, you hammer a server that's already struggling. If you retry too many times without care, you replay events that may already have been processed. This is where idempotency enters the picture — but let's start with getting delivery right first.
Building a Reliable Webhook Delivery Queue
Webhook delivery should never happen inline with your request cycle. The moment a triggering event occurs — payment confirmed, order placed, status changed — you enqueue the webhook rather than send it. The pattern:
- Write the webhook event to a persistent queue (a database table, Redis, SQS, Laravel queues, or whatever matches your stack)
- A worker process picks up the job and attempts delivery
- On success, mark the event delivered and log the full response
- On failure, schedule a retry
This decouples the triggering event from the delivery attempt. Your API stays fast and your users don't wait on an outbound HTTP call during their request.
The delivery payload should be immutable once created. Log the full request body and headers alongside the event record so you can debug disputes later. "We sent that" is a claim you'll eventually need to prove.
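As a rough illustration of the enqueue-then-deliver pattern, here is a minimal sketch that uses a SQLite table as the persistent queue. The table name `webhook_outbox`, its columns, and the helper functions are all hypothetical names chosen for this example; a production system would more likely use its existing database or a dedicated queue service.

```python
import json
import sqlite3
import uuid


def create_outbox(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per webhook event awaiting delivery.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS webhook_outbox (
            event_id        TEXT PRIMARY KEY,
            endpoint_url    TEXT NOT NULL,
            payload         TEXT NOT NULL,
            status          TEXT NOT NULL DEFAULT 'pending',
            attempts        INTEGER NOT NULL DEFAULT 0,
            next_attempt_at REAL NOT NULL DEFAULT 0
        )
        """
    )


def enqueue_webhook(conn, endpoint_url, event_type, data) -> str:
    """Called from the request path: store the event, do not send it."""
    event_id = str(uuid.uuid4())
    # The payload is serialized once and never rewritten afterwards.
    payload = json.dumps({"id": event_id, "type": event_type, "data": data})
    with conn:
        conn.execute(
            "INSERT INTO webhook_outbox (event_id, endpoint_url, payload) "
            "VALUES (?, ?, ?)",
            (event_id, endpoint_url, payload),
        )
    return event_id


def claim_due(conn, now: float):
    """Called by the worker: fetch pending events whose retry time has come."""
    return conn.execute(
        "SELECT event_id, endpoint_url, payload, attempts FROM webhook_outbox "
        "WHERE status = 'pending' AND next_attempt_at <= ? "
        "ORDER BY next_attempt_at",
        (now,),
    ).fetchall()


def record_result(conn, event_id, delivered: bool, retry_at=None) -> None:
    """Mark delivered, schedule a retry, or mark dead when no retry is given."""
    if delivered:
        status = "delivered"
    elif retry_at is None:
        status = "dead"
    else:
        status = "pending"
    with conn:
        conn.execute(
            "UPDATE webhook_outbox SET status = ?, attempts = attempts + 1, "
            "next_attempt_at = COALESCE(?, next_attempt_at) WHERE event_id = ?",
            (status, retry_at, event_id),
        )
```

The request handler only ever calls `enqueue_webhook`; the outbound HTTP call happens in whatever worker loop polls `claim_due`. With several workers you would also need row locking or a queue with visibility timeouts, which this sketch leaves out.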
Retry Strategy: Exponential Backoff with Jitter
The most common retry failure mode is thundering herd — every failed webhook retrying at exactly the same time, all hammering a recovering endpoint at once. Exponential backoff with jitter prevents this.
A reasonable retry schedule:
- Attempt 1: immediate
- Attempt 2: 30 seconds
- Attempt 3: 5 minutes
- Attempt 4: 30 minutes
- Attempt 5: 2 hours
- Attempt 6: 12 hours
- Final attempt: 24 hours, then mark dead
Jitter means adding a small random offset (±20%) to each delay so that a batch of failures doesn't retry in lockstep.
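The schedule and jitter described above can be sketched in a few lines. The delays mirror the list in this section (which is a fixed table rather than a strict doubling); the function name `next_delay` and the `rng` parameter are illustrative assumptions.

```python
import random
from typing import Optional

# Base delay in seconds before attempts 1 through 7, following the schedule above.
RETRY_DELAYS_SECONDS = [0, 30, 5 * 60, 30 * 60, 2 * 3600, 12 * 3600, 24 * 3600]


def next_delay(attempt: int, rng=random, jitter: float = 0.2) -> Optional[float]:
    """Return the jittered delay before the given 1-based attempt.

    Returns None once the schedule is exhausted, which the caller can treat
    as the signal to move the event to the dead letter queue.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt > len(RETRY_DELAYS_SECONDS):
        return None
    base = RETRY_DELAYS_SECONDS[attempt - 1]
    # Scale the base delay by a random factor in [1 - jitter, 1 + jitter].
    return base * rng.uniform(1 - jitter, 1 + jitter)
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests, while the default uses the module-level generator.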
After the final attempt, move the event to a dead letter queue — a separate table or queue for permanently failed deliveries. This is not giving up. Dead letter queues give your team a place to investigate, replay, or escalate. A failed webhook that silently disappears is a support ticket waiting to happen.
Also respect your customers. If an endpoint returns 410 Gone, stop retrying immediately. If an endpoint has failed every delivery for 48 consecutive hours, pause delivery and notify the customer. Don't burn requests against a dead endpoint indefinitely.
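One possible way to track this is a small per-endpoint health record, sketched below. The class name `EndpointHealth`, its fields, and the choice to measure the 48 hours from the first failure after the last success are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PAUSE_AFTER_SECONDS = 48 * 3600  # 48 hours of uninterrupted failure


@dataclass
class EndpointHealth:
    failing_since: Optional[float] = None  # time of first failure in the current streak
    paused: bool = False

    def record(self, status: Optional[int], now: float) -> None:
        """Update health after one attempt; status is None for timeouts/network errors."""
        if status is not None and 200 <= status < 300:
            self.failing_since = None  # any success resets the failure streak
            return
        if status == 410:
            self.paused = True  # 410 Gone: stop delivering immediately
            return
        if self.failing_since is None:
            self.failing_since = now
        elif now - self.failing_since >= PAUSE_AFTER_SECONDS:
            self.paused = True  # caller should also notify the customer here
```

The worker would check `paused` before attempting delivery. Resuming a paused endpoint is left as an explicit action for the customer or your support team.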
What Counts as a Successful Delivery
Status code is the only signal you should trust. Any 2xx response (in practice usually a 200 or 204) is success. 4xx errors other than 429 Too Many Requests usually point to client configuration problems, such as a bad URL or a missing auth header, and retrying won't fix them. 5xx errors, 429s, and timeouts warrant retries.
Set a firm request timeout. Thirty seconds is generous; most receiving endpoints should respond in under five. A hanging connection isn't a success — close it and treat it as a failure.
Don't parse the response body to determine success. A webhook consumer should return a fast 200 before doing any meaningful processing. If you gate success on response content, you're coupling your retry logic to implementation details on the other end.
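The rules in this section might translate into code along these lines. This is a sketch using only the standard library; the function names, the ten-second timeout, and the decision to treat every 2xx as success and every other unlisted status as non-retryable are assumptions, not a prescribed policy.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional

TIMEOUT_SECONDS = 10  # assumed value; the text suggests staying well under 30


def classify(status: Optional[int]) -> str:
    """Map an HTTP status (or None for timeout/network error) to an action."""
    if status is None:
        return "retry"      # timeout or unreachable endpoint
    if 200 <= status < 300:
        return "success"
    if status == 410:
        return "disable"    # endpoint is gone: stop retrying
    if status == 429 or status >= 500:
        return "retry"
    return "drop"           # other statuses: retrying is unlikely to help


def attempt_delivery(url: str, body: bytes, headers: dict) -> Optional[int]:
    """POST once and return the status code, or None on timeout/network error.

    The response body is deliberately never read.
    """
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    try:
        # Note: this timeout bounds each blocking socket operation,
        # not the total duration of the request.
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code     # non-2xx responses arrive as HTTPError
    except (OSError, http.client.HTTPException):
        return None         # connection refused, DNS failure, timeout, bad response
```

A worker would call `classify(attempt_delivery(...))` and branch on the returned string. If you need a hard overall deadline, an HTTP client with total-timeout support is a better fit than `urllib`.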
Idempotency: The Receiver's Responsibility
Here's the part that often gets skipped: idempotency matters just as much for consumers as for senders.
Because webhooks get retried, your endpoint will sometimes receive the same event more than once. This is not a bug in the sender — it's a fundamental design constraint of any reliable delivery system. Your handler has to cope with it.
The standard approach uses the event ID as an idempotency key:
- Extract the event ID from the incoming webhook payload
- Check whether you've already processed an event with that ID
- If yes, return 200 immediately and do nothing
- If no, process the event, then record the event ID as processed
The processed event log should be written in the same database transaction as the actual work — or at minimum before you return 200. If you process the event but crash before recording the ID, you'll reprocess it on the next retry. That's exactly the bug you're trying to prevent.
For most SMB contexts, a unique constraint on event_id in a processed-events table with a 30–90 day retention window is sufficient. Index the column, add a cleanup job, and move on.
Signing and Verification
Reliability is only half the problem. Before your handler does anything, verify that the webhook actually came from the sender you expect.
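Most webhook providers use HMAC-SHA256 signatures. The sender computes a signature over the raw request body using a shared secret and includes it in a header (for example Stripe-Signature or X-Hub-Signature-256). Some providers, Stripe among them, also fold a timestamp into the signed string so receivers can reject stale replays, so check the provider's documentation for the exact format. Your receiver recomputes the same signature and compares.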
Two implementation details that trip people up:
- Use the raw body, not the parsed body. Parsing JSON and serializing it again can change whitespace and key ordering, which breaks the signature.
- Use a constant-time comparison function. Normal string equality is vulnerable to timing attacks. Most languages provide hmac.compare_digest or an equivalent.
If you're building the sender side, generate one secret per endpoint — not one global secret — so that a compromised customer credential doesn't affect anyone else.
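To make the signing flow concrete, here is a generic sketch covering both sides. It signs only the raw body with a hex-encoded HMAC-SHA256, which is a simplification; real providers each define their own header format, so the function names and the signature layout here are assumptions rather than any specific provider's scheme.

```python
import hashlib
import hmac
import secrets


def new_endpoint_secret() -> bytes:
    """Sender side: generate a fresh random secret for each registered endpoint."""
    return secrets.token_bytes(32)


def sign(secret: bytes, raw_body: bytes) -> str:
    """Sender side: hex-encoded HMAC-SHA256 over the exact bytes being sent."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    """Receiver side: recompute over the raw bytes and compare in constant time."""
    expected = sign(secret, raw_body)
    # Compare as bytes so a non-ASCII header value cannot raise TypeError.
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )
```

On the receiving side, `raw_body` must be the bytes read from the request before any JSON parsing. Most web frameworks expose these, though the exact attribute differs from one framework to another.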
Logging and Observability
Every webhook attempt — success or failure — should produce a record containing:
- Event ID and event type
- Target endpoint URL
- HTTP status returned (or timeout/network error description)
- Attempt number
- Response latency in milliseconds
- Timestamp
With these fields, you can answer the questions that always surface: "Did you send that webhook?" "When?" "What did you receive back?"
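One way to keep these records uniform is a small structured type that every delivery attempt fills in. The class name `DeliveryAttempt`, the field names, and the JSON-lines output are assumptions for illustration; the same fields map just as well onto a database table.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    event_id: str
    event_type: str
    endpoint_url: str
    attempt: int
    latency_ms: int
    timestamp: str
    status: Optional[int] = None   # HTTP status, if a response arrived
    error: Optional[str] = None    # timeout/network error description otherwise


def log_attempt(event_id, event_type, endpoint_url, attempt, latency_ms,
                status=None, error=None) -> str:
    """Build one JSON log line for a single delivery attempt."""
    record = DeliveryAttempt(
        event_id=event_id,
        event_type=event_type,
        endpoint_url=endpoint_url,
        attempt=attempt,
        latency_ms=latency_ms,
        timestamp=datetime.now(timezone.utc).isoformat(),
        status=status,
        error=error,
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Each call returns a single line that can be appended to a log file or shipped to whatever log store you already use. Filtering by `event_id` then reconstructs the full attempt history for one event.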
Build a delivery log view into your admin panel. This pays for itself the first time a customer opens a ticket claiming they never received an event. A complete delivery history — all attempts, response codes, timestamps — closes those tickets in minutes instead of hours.
On the consumer side, log the full raw payload before processing. Storage is cheap. Having the original payload when debugging a processing bug later is invaluable.
Common Mistakes That Compound Over Time
A few patterns that cause real production problems for teams that didn't address them early:
- Synchronous delivery: Sending webhooks inline with the user request. One slow downstream endpoint adds latency to every user action.
- No dead letter queue: Failed webhooks disappear silently. You have no way to replay or investigate what broke.
- Shared secrets across all endpoints: One exposed customer secret gives attackers a vector into all webhook traffic.
- Missing idempotency on the consumer side: Retried webhooks create duplicate orders, duplicate charges, or duplicate notifications.
- Timeouts that are too generous: Holding connections open for 60+ seconds burns workers and masks processing problems on the receiver.
- No endpoint health tracking: Continuing to hammer a consistently-failing endpoint wastes resources and delays detection of a misconfigured integration.
Getting This Right Is Worth the Effort
Webhooks are infrastructure — invisible when working, catastrophic when not. Developers integrating with your platform build workflows that depend on reliable delivery. A single missed payment confirmation, a dropped subscription lifecycle event, or a duplicated order because idempotency wasn't handled becomes a user-visible bug that erodes trust quickly.
The good news: getting this right isn't especially complex. It's mostly discipline — queue it, retry with backoff, sign it, log everything, and make your consumer idempotent. These aren't advanced techniques; they're the baseline for any reliable integration.
Dev Paragon has built webhook infrastructure for B2B SaaS platforms, marketplace integrations, and multi-system ops tools where event delivery reliability directly affected billing accuracy and customer trust. If you're designing an API or event-driven integration and want to get the plumbing right from the start, we're glad to talk through your architecture.