Event-Driven Architecture and Webhooks Done Right

A customer emails support: their data is wrong. An order they placed two days ago never showed up in your system, even though their side says it sent the webhook. You check your logs. There's no record of the request. Or there is a record, it returned a 500, and then nothing happened because nothing retried. Either way, an event that mattered vanished, and you found out about it because a human noticed, which is the worst possible monitoring system.

Now multiply that. If one event silently dropped, others did too. You have no idea how many, because the whole problem with silent drops is they're silent. The integration "works" in the demo and in the happy path, and then it loses 0.3% of events under load, and 0.3% of a payment stream is a number that ends up in a support escalation with the word "reconciliation" in it.

Event-driven systems fail quietly by default. Building one that fails loudly and recovers automatically is a specific set of decisions, and most teams skip all of them because the happy path ships fine.

What the silent drop actually costs

The naive webhook handler is a synchronous HTTP endpoint that receives a payload, does the work inline, and returns 200. It looks correct. It is correct exactly when the network never blips, your database never has a slow moment, the sender never retries, and events never arrive out of order. None of those hold in production.

Here's the failure chain. The sender posts an event. Your handler starts processing and takes four seconds because the database is busy. The sender, which has a three-second timeout, gives up and marks the delivery failed. It retries. Now your handler is processing the same event twice, concurrently, and double-charges a customer or creates a duplicate order. Meanwhile a different event timed out completely and the sender, after three retries, dropped it on the floor. You charged one customer twice and lost another's order, from the same naive handler, on the same Tuesday.

The cost isn't the engineering time to fix it. It's the reconciliation: someone exports both systems, diffs them by hand, finds the 200 events that disagree, and manually repairs each one — while customers who hit the gap are already churning. A data-integrity incident in a payment or order flow routinely costs a small team a full week of senior time plus whatever the refunds and lost trust add up to. Build the delivery guarantees in up front and that incident simply doesn't happen.

Receive fast, process reliably

The first rule: your webhook endpoint should do almost nothing. Verify the signature, write the raw event to durable storage, return 200. That's it. Processing happens after, in a worker, off the request path.

// the entire HTTP handler
verify(signature, body) || reject 401
enqueue(rawEvent)        // durable
return 200               // under 100ms, always

Why this matters: the sender's only contract with you is "did you get a 200 fast." If processing is inline and slow, the sender times out and retries, and you've manufactured duplicates. Acknowledge receipt the instant the event is safely persisted, then let a worker do the real work at its own pace, with its own retries, independent of the sender's timeout. This one split eliminates an entire category of failure.
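
A minimal sketch of that split in Python, assuming a SQLite table stands in for the durable queue. The names make_inbox and receive, the inbox table, and the verify callable are all hypothetical, chosen for illustration. The point is only the shape: no business logic runs before the 200.

```python
import sqlite3
from typing import Callable


def make_inbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open the durable store. A SQLite table stands in for a real queue."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " body BLOB NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def receive(
    conn: sqlite3.Connection,
    body: bytes,
    signature: str,
    verify: Callable[[bytes, str], bool],
) -> int:
    """Return the HTTP status to send back. No processing happens here."""
    if not verify(body, signature):
        return 401
    # Persist the raw bytes before acknowledging. If this write raises,
    # the caller should answer with a 5xx so the sender retries.
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO inbox (body) VALUES (?)", (body,))
    return 200
```

A worker would then read unprocessed rows from the inbox on its own schedule. In a real deployment you would pass a file path (or use a real queue) rather than the in-memory default, which is not durable.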

Idempotency is the load-bearing requirement

Once you accept that retries happen (from the sender, from your own worker, from a queue redelivering after a redeploy), you accept that any event may be processed more than once. The only safe response is to make processing idempotent: handling the same event twice produces the same result as handling it once.

The mechanism is a stable, unique event ID supplied by the sender. Before you process, you record that ID. If you've seen it, you skip. This has to be atomic with the work itself — insert the event ID and apply the side effect in the same transaction, or you'll have a race where two concurrent retries both check "not seen," both proceed, and both charge.
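
One way to get that atomicity, sketched with SQLite: a primary key on the event ID makes the database reject the second insert, and the ID insert shares a transaction with the side effect. The table names, process_once, and the charge fields are hypothetical. This only covers side effects that live in the same database. A call to an external payment API needs its own idempotency key.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY is what enforces "seen at most once".
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE charges (event_id TEXT, customer TEXT, cents INTEGER)"
    )
    conn.commit()
    return conn


def process_once(
    conn: sqlite3.Connection, event_id: str, customer: str, cents: int
) -> bool:
    """Apply the charge at most once per event_id. True if it was applied."""
    try:
        with conn:  # one transaction: both inserts commit, or neither does
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
            conn.execute(
                "INSERT INTO charges (event_id, customer, cents)"
                " VALUES (?, ?, ?)",
                (event_id, customer, cents),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate event ID: already handled, so skip without side effects.
        return False
```

Inserting first and letting the constraint fail avoids the check-then-act race: there is no window between "not seen" and "recorded" for a second retry to slip through.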

Idempotency is what makes "at-least-once delivery" safe. You can't get exactly-once delivery over a network — it's a fairy tale. What you can build is at-least-once delivery plus idempotent processing, which behaves like exactly-once. Every reliable event system in production is built on this pairing. Skip the idempotency half and at-least-once becomes "duplicate-charge-sometimes."

Retries, backoff, and the dead-letter queue

When processing fails — the downstream API is down, the database deadlocked — you retry. But naive immediate retries make a struggling system worse: you hammer the failing dependency at full speed and turn a blip into an outage.

Retry with exponential backoff and jitter: wait 1s, then 2s, then 4s, then 8s, with a random offset so a thousand failed events don't all retry in lockstep. Cap the attempts. And here's the part teams forget — where does an event go when it has failed every retry?
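
The delay schedule can be sketched in a few lines. The function name backoff_delay and the default values for base, cap, and jitter are assumptions for illustration, not recommended settings.

```python
import random


def backoff_delay(
    attempt: int, base: float = 1.0, cap: float = 60.0, jitter: float = 0.5
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential part doubles each attempt (1s, 2s, 4s, 8s with the
    defaults) and is capped. A random offset of up to `jitter` times that
    value is added so failed events do not all retry at the same instant.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + random.uniform(0.0, exponential * jitter)
```

With the defaults, the third retry (attempt 2) waits somewhere between 4 and 6 seconds. Capping the attempts is the caller's job: this function only answers how long to wait.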

It goes to a dead-letter queue. A DLQ is the holding pen for events that couldn't be processed after exhausting retries. It is the single most important reliability feature in the whole design, because it's the difference between "an event failed and we know exactly which one, with its full payload, ready to replay" and "an event failed and disappeared." The DLQ turns silent drops into visible, replayable, alertable items. Wire an alert to it: anything landing in the DLQ pages someone. That's how you find out about a problem from a dashboard instead of from a customer.
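
Here is a rough sketch of a worker loop that routes terminal failures to a DLQ. Everything named here (DeadLetter, run_with_dlq, the list standing in for the queue, the on_dead_letter hook standing in for a pager) is hypothetical. Delays are plain doubling without jitter to keep the example short.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeadLetter:
    event: dict      # full payload, kept so the event can be replayed
    error: str       # last error seen
    attempts: int    # how many tries were made


def run_with_dlq(
    event: dict,
    handler: Callable[[dict], None],
    dlq: list,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    on_dead_letter: Callable[[DeadLetter], None] = lambda dead: None,
) -> bool:
    """Try the handler up to max_attempts times. True if it succeeded."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            last_error = repr(exc)
            if attempt < max_attempts:
                sleep(2 ** (attempt - 1))  # 1s, 2s, 4s between tries
    # Retries exhausted: park the event instead of dropping it, then alert.
    dead = DeadLetter(event=event, error=last_error, attempts=max_attempts)
    dlq.append(dead)
    on_dead_letter(dead)
    return False
```

Because the dead letter carries the original payload, replay is just feeding dlq entries back through run_with_dlq once the underlying problem is fixed.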

Ordering and the outbox pattern

Two more sharp edges.

Ordering. Events often arrive out of order — an "order.updated" can land before the "order.created" it depends on. If your processing assumes order, it breaks. Either make handlers tolerate out-of-order arrival (process what you can, hold what you can't, reconcile when the prerequisite shows up) or partition the queue by entity so events for a single order process in sequence while different orders run in parallel. Don't assume global ordering. You won't get it.
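
The partitioning option might look like this. The order_id field, partition_for, and route are hypothetical names, and in-memory lists stand in for real queue partitions. Note what it does and doesn't give you: events for one order stay in the sequence they were enqueued, but nothing here repairs events the sender delivered out of order.

```python
import hashlib


def partition_for(entity_id: str, partitions: int) -> int:
    """Map an entity to a partition using a hash that is stable across runs.

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used here purely for its stability.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def route(events: list, partitions: int) -> list:
    """Distribute events so each order's events share one partition, in order."""
    queues = [[] for _ in range(partitions)]
    for event in events:
        queues[partition_for(event["order_id"], partitions)].append(event)
    return queues
```

One worker per partition then processes its queue sequentially, so a single order is never handled by two workers at once while different orders still run in parallel.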

The outbox pattern. This solves a subtle bug that bites teams emitting their own events. You update your database and then publish an event to a queue — two separate systems. If the database commit succeeds but the publish fails, you've changed state without telling anyone. If the publish succeeds but the commit rolls back, you've announced something that didn't happen. Both are corruption.

The outbox fixes it: write the event into an outbox table in the same transaction as the state change. Either both commit or neither does, and that atomicity is real because it's one database. A separate relay reads the outbox and publishes to the queue, marking rows sent. The relay can crash between publishing and marking, so it delivers at-least-once and consumers still need idempotent handling. What you gain is an event stream that can never disagree with your database about what happened, because the event and the state commit together.
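
A small sketch of the outbox with SQLite. The orders and outbox schemas, create_order, relay_once, and the publish callable are all assumptions made for the example. A production relay would also need batching and locking if several relays run at once.

```python
import json
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Write the state change and its event in a single transaction."""
    event = {"type": "order.created", "order_id": order_id}
    with conn:  # both rows commit together, or neither does
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'created')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),)
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish unsent outbox rows in insertion order. Returns how many went out."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    published = 0
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        published += 1
    return published
```

If the relay dies after publish but before the UPDATE, the row is published again on the next pass. That is the at-least-once behavior idempotent consumers are there to absorb.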

Signature verification — the part that's also security

Last: an inbound webhook is an unauthenticated POST from the public internet until you prove it isn't. Anyone who learns your endpoint URL can forge events (fake payments, fake order cancellations) unless you verify the signature the sender includes. Compute the expected HMAC over the raw body with the shared secret and compare in constant time. Reject mismatches with a 401. Verify against the raw bytes, before any parsing, because parsing and re-serializing the JSON can change whitespace or key order, and then the HMAC you compute no longer matches the one the sender signed. Skip this and your reliable, idempotent, retrying pipeline is a reliable, idempotent way to process attacker-supplied events.
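
A sketch of the check using Python's standard hmac module. It assumes the sender signs with HMAC-SHA256 and sends a hex digest. Real providers differ in algorithm, encoding, header name, and whether a timestamp is mixed in, so treat verify_signature as hypothetical and check your sender's documentation.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Check a hex HMAC-SHA256 signature against the raw request bytes."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking the expected signature through response timing.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_hex.encode("utf-8")
    )
```

The raw-bytes rule is easy to see here: a body of {"id": "evt_1"} and a body of {"id":"evt_1"} parse to the same JSON but produce different digests, so only the exact bytes the sender signed will verify.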

What fixed looks like

Fixed is a webhook endpoint that verifies, persists, and returns 200 in under 100ms — then a worker that processes idempotently, retries with backoff, and routes terminal failures to a dead-letter queue that pages a human. Nothing drops silently because there's nowhere for it to drop silently to.

Fixed is your own emitted events written through an outbox in the same transaction as the state they describe, so your event stream and your database can never disagree. Out-of-order arrivals are handled, not assumed away. Forged events are rejected at the door.

Fixed is finding out about integration failures from an alert on the DLQ, with the full payload ready to replay, instead of from a customer's reconciliation email three days too late.

This is for you if

You're moving real money, orders, or state changes across system boundaries, and a silent drop turns into a data-integrity incident with your name on it. We design and build event pipelines with the delivery guarantees baked in (idempotency, backoff, DLQs, the outbox, signature verification), typically $50k+ for a production-grade event and webhook layer, $100k+ when it's the spine of a multi-system platform with replay and reconciliation tooling. A teardown of your current integration that maps every place an event can vanish starts at $25k.

This is not for you if your events are low-stakes and an occasional drop is genuinely fine — a "user viewed page" stream doesn't need an outbox. It's not for you if you're fully on a managed platform that already provides these guarantees and you just need to use its primitives correctly. And it's not for you if you want the fast happy-path handler that demos well and quietly loses 0.3% of events — that's the handler you have now, and it's why a customer just emailed support.