Queue Architecture for Async Workloads: A Teardown of Where Background Jobs Go to Die

Background jobs pile up, retry forever, or vanish, and nobody knows until a customer complains. A teardown of the failure modes and the fix for each.

A customer emails: "I never got my invoice." You check. The order is there, the payment cleared, the record is correct. But the job that generates and sends the invoice — that ran somewhere in the dark, and it either failed silently three days ago, or it's been retrying every thirty seconds since and hammering your PDF service, or it simply vanished and there's no trace it ever existed. You don't know which, because the only visibility you have into your background work is a customer complaint.

This is the natural end state of background processing that grew by accretion. Somebody needed to send an email without blocking the request, so they pushed it onto a queue. Then it was thumbnail generation, then webhook delivery, then the nightly billing run, then a dozen more. Each was added in an afternoon. Nobody designed the queue layer — it accumulated. And accumulated queue layers all fail the same handful of ways.

Let's take those failures apart, because each one points directly at a design decision that prevents it.

Failure mode 1: jobs vanish

The worst failure is the silent one. A job gets pulled off the queue, the worker starts processing it, and the worker dies mid-flight — out of memory, a deploy restart, the box rebooting. The job is gone. It was removed from the queue when the worker grabbed it, never finished, and there's no record it existed. The invoice never sends and nothing anywhere says so.

This is an acknowledgement problem. A queue that deletes a message the instant a worker reads it has chosen speed over safety, and it loses jobs on every crash. The correct model is acknowledge after completion: the worker pulls the message, the message stays invisible-but-present on the queue with a visibility timeout, and only when the work succeeds does the worker acknowledge and the queue delete it. If the worker dies first, no ack arrives, the timeout expires, and the message reappears for another worker to pick up.

pull -> message hidden (visibility timeout running)
  work succeeds -> ACK -> message deleted        (done, once)
  worker dies   -> no ACK -> timeout -> message reappears  (retried)
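
To make that lifecycle concrete, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not how any particular queue product implements it: the class name VisibilityQueue, its methods, and the 30-second timeout are all hypothetical, and a real queue would persist messages rather than hold them in a dict.

```python
import time
import uuid


class VisibilityQueue:
    """Toy in-memory queue: a pulled message is hidden, not deleted."""

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.clock = clock
        self.messages = {}  # message id -> [body, time at which it is visible]

    def enqueue(self, body):
        msg_id = str(uuid.uuid4())
        self.messages[msg_id] = [body, float("-inf")]  # visible immediately
        return msg_id

    def pull(self):
        """Return (msg_id, body) for a visible message and hide it, else None."""
        now = self.clock()
        for msg_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # start the timeout
                return msg_id, entry[0]
        return None

    def ack(self, msg_id):
        """Delete the message. Call this only after the work has succeeded."""
        self.messages.pop(msg_id, None)


# Usage with a fake clock, so the timeout is simulated without sleeping.
fake_now = [0.0]
queue = VisibilityQueue(visibility_timeout=30.0, clock=lambda: fake_now[0])
msg_id = queue.enqueue("send-invoice-1001")
queue.pull()                 # worker A takes the job, then crashes: no ack
fake_now[0] = 31.0           # the visibility timeout expires
redelivered = queue.pull()   # worker B receives the same job
queue.ack(redelivered[0])    # work succeeded, so the message is deleted
```

The important detail is that pull never deletes anything. Deletion happens only in ack, so a crash between the two leaves the message in place to be redelivered.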

"At-least-once delivery" is the property you want, and it's why your jobs stop vanishing. It comes with a price you have to pay deliberately, which is the next failure mode.

Failure mode 2: jobs run twice (and you weren't ready)

At-least-once delivery means exactly what it says: a job runs at least once, and sometimes more than once. A worker finishes the work, then dies before it can send the acknowledgement. The queue, hearing no ack, redelivers. Now you've charged the card twice, sent two invoices, double-credited an account.

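The answer is not "make delivery exactly-once." In a distributed system that's a unicorn: the queue cannot tell a worker that died before doing the work from one that died just after doing it, so it must either risk redelivering the job or risk losing it. Chasing exactly-once is how teams build slow, fragile, over-complicated queues. The answer is idempotency: make the job safe to run more than once, so that a second execution produces the same result as the first and no extra side effects.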

You build idempotency with keys. Every job carries a unique idempotency key, and before performing any irreversible side effect, the worker checks whether that key has already been processed. Charged this order? Sent this invoice? If yes, skip the side effect and acknowledge — the work is already done. If no, do it, record the key, then acknowledge. The check and the record have to be durable and atomic, or two concurrent retries both pass the check and you're back where you started.
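
One way to get a check-and-record that is both durable and atomic is to let a database uniqueness constraint do the checking. The sketch below uses Python's built-in sqlite3 module; the table name processed_jobs, the helper names open_store and run_once, and the key format are assumptions made for illustration. It claims the key and runs the side effect inside one transaction, so a failed side effect rolls the claim back and leaves the job retryable.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store of processed keys (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_jobs ("
        "idempotency_key TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def run_once(conn, idempotency_key, side_effect):
    """Run side_effect at most once per key. Returns True if it ran."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # The PRIMARY KEY makes this insert the atomic check-and-record:
            # a second attempt with the same key raises IntegrityError.
            conn.execute(
                "INSERT INTO processed_jobs (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
            side_effect()
    except sqlite3.IntegrityError:
        return False  # already processed: skip the work and just acknowledge
    return True


# Usage: the second delivery of the same job is a no-op.
conn = open_store()
sent = []
run_once(conn, "invoice:order-1001", lambda: sent.append("invoice sent"))
run_once(conn, "invoice:order-1001", lambda: sent.append("invoice sent"))
```

A caveat on this sketch: when the side effect lives in another system (a card charge, an email), the database transaction cannot undo it. In that case the same key usually has to be passed to the downstream service as well, so that it can deduplicate on its side.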

Idempotency is the single most important property of a correct queue system, and it's the one most accretion-built systems lack entirely, because the first version "worked" — jobs ran once, in testing, where nothing crashed. The duplicates show up in production, on the worst day, when something restarts mid-flight.

Failure mode 3: jobs retry forever

A job fails. The system retries it. It fails again — because the failure is permanent, not transient. The PDF can't generate because the order references a deleted product. The webhook target returns 410 Gone forever. But the queue doesn't know the difference between "try again in a second, it'll work" and "this will fail until the heat death of the universe," so it retries, and retries, and retries.

This poison message does three kinds of damage at once. It wastes capacity, occupying workers that should be doing real work. It hammers downstream dependencies with doomed requests. And it can wedge the whole queue, blocking everything behind it from ever processing.

Two design decisions defuse it. First, bounded retries with exponential backoff and jitter — retry a handful of times, waiting longer between each attempt (1s, then 4s, then 16s) with randomness so a fleet of failures doesn't synchronize into a thundering herd. Second, the dead-letter queue (DLQ): after the retry budget is exhausted, the job doesn't get retried into oblivion and it doesn't get silently dropped — it gets moved to a separate queue where it sits, visible, waiting for a human. The DLQ is the difference between "jobs fail loudly into a tray you can inspect and replay" and "jobs fail into a black hole or an infinite loop." A system without a DLQ has no good answer for permanent failure, so it picks a bad one.
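
Both decisions fit in a few lines. The following is a rough sketch rather than a production worker loop: the function names, the three-attempt budget, the 60-second cap, and the list standing in for the DLQ are all assumptions. The jitter here is the "full jitter" variant (a uniform random delay between zero and the exponential ceiling), which is one common choice among several.

```python
import random
import time


def backoff_delay(attempt, base=1.0, factor=4.0, cap=60.0, rng=random.random):
    """Delay before the next try: uniform in [0, min(cap, base * factor**attempt)].

    With base=1 and factor=4 the ceilings are 1s, 4s, 16s, then capped at 60s.
    """
    ceiling = min(cap, base * factor ** attempt)
    return rng() * ceiling


def process_with_retries(job, handler, dead_letters, max_attempts=3,
                         sleep=time.sleep):
    """Try handler(job) up to max_attempts times, then dead-letter the job.

    Returns True on success, False if the job was moved to dead_letters.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(job)
            return True
        except Exception as exc:  # sketch only: real code would be narrower
            last_error = exc
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    # Retry budget exhausted: keep the job and its error visible for a human.
    dead_letters.append((job, repr(last_error)))
    return False


# Usage: a webhook whose target is permanently gone ends up in the DLQ.
def deliver_webhook(job):
    raise RuntimeError("410 Gone")


dlq = []
process_with_retries("webhook-7", deliver_webhook, dlq, sleep=lambda s: None)
```

Storing the last error next to the job is a small design choice that pays off later: whoever inspects the DLQ can see why the job failed without digging through logs.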

Failure mode 4: ordering you assumed but never guaranteed

Some work has to happen in order. "Account created" must process before "account upgraded." But most queues, especially ones with multiple parallel workers, give you no ordering guarantee at all — two workers pull two messages and finish in whatever order they finish, which can be backwards. The "upgraded" event processes against an account that doesn't exist yet, and errors, or worse, half-succeeds.

The mistake is assuming ordering you never designed for. The honest fixes, in order of preference: first, don't require ordering. Make jobs commutative so order doesn't matter, which is by far the most robust option. Second, if you genuinely need ordering, use a queue with ordering guarantees scoped to a key (all events for one account go to the same ordered partition, processed serially, while different accounts still run in parallel), so you get ordering where it matters without serializing your entire throughput. What you must not do is assume a parallel queue preserves order. It doesn't, and the bug that proves it will be a rare, unreproducible data-corruption ghost.
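
Key-scoped ordering usually comes down to a stable hash from key to partition. Here is a small sketch of that routing step, with hypothetical function names and a made-up partition count; it shows only how events are assigned, and assumes each partition is then consumed serially by a single worker.

```python
import hashlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Stable partition index for a key: the same key always maps the same way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Group (key, payload) events by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


# Usage: both events for acct-1 land in one partition, in arrival order.
events = [
    ("acct-1", "account created"),
    ("acct-2", "account created"),
    ("acct-1", "account upgraded"),
]
partitions = route(events, num_partitions=4)
acct_1_events = [
    payload
    for key, payload in partitions[partition_for("acct-1", 4)]
    if key == "acct-1"
]
```

A cryptographic hash is used here instead of Python's built-in hash() because the built-in is randomized per process for strings, which would send the same account to different partitions on different producers.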

Failure mode 5: no backpressure

Producers enqueue faster than consumers can drain. A traffic spike, a batch import, a fan-out that turns one event into ten thousand jobs. With no backpressure, the queue grows without bound. Latency climbs from seconds to hours — that "real-time" notification arrives tomorrow. The queue's backing store fills. Eventually something falls over.

Backpressure is the system noticing it's overwhelmed and responding deliberately rather than collapsing. Monitor queue depth and the age of the oldest message, the two numbers that tell you whether you're keeping up. When depth climbs, scale consumers to drain faster, slow producers (rate-limit enqueues), or shed load explicitly (drop or defer low-priority work). A system designed for backpressure degrades on purpose; one without it degrades by accident, which always looks like a total outage.
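
As a sketch of what "degrades on purpose" can look like at enqueue time, the toy queue below refuses work once it reaches a depth limit and sheds low-priority jobs earlier than that. The class name, the priority labels, and the half-full threshold are assumptions for illustration, and a real producer would also need to decide what to do with a rejected job (retry later, drop it, or return an error upstream).

```python
import time
from collections import deque


class BoundedJobQueue:
    """Toy FIFO queue that pushes back instead of growing without bound."""

    def __init__(self, max_depth, clock=time.monotonic):
        self.max_depth = max_depth
        self.clock = clock
        self.items = deque()  # (enqueued_at, job), oldest first

    def try_enqueue(self, job, priority="normal"):
        """Return False (job rejected) when the queue is under pressure."""
        depth = len(self.items)
        if depth >= self.max_depth:
            return False  # full: reject everything
        if priority == "low" and depth >= self.max_depth // 2:
            return False  # half full: shed low-priority work first
        self.items.append((self.clock(), job))
        return True

    def dequeue(self):
        """Return the oldest job, or None if the queue is empty."""
        return self.items.popleft()[1] if self.items else None

    def depth(self):
        return len(self.items)

    def oldest_age(self):
        """Seconds the oldest waiting job has been queued (0.0 if empty)."""
        return self.clock() - self.items[0][0] if self.items else 0.0


# Usage: at half capacity, low-priority work is refused but normal work is not.
jobs = BoundedJobQueue(max_depth=4)
jobs.try_enqueue("send-invoice")
jobs.try_enqueue("deliver-webhook")
accepted_low = jobs.try_enqueue("weekly-digest", priority="low")
accepted_normal = jobs.try_enqueue("send-receipt")
```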

The failure under all the others: no visibility

Notice the thread running through every failure: nobody knew until a customer complained. Jobs vanished and no alert fired. Duplicates ran and no dashboard showed it. The DLQ filled and nobody watched it. The queue backed up to twelve hours deep and the first signal was a support ticket.

Queues are background work, which means they fail in the background — invisibly — unless you deliberately instrument them. A production queue system needs, at minimum: queue depth and oldest-message-age (are we keeping up?), processing rate and failure rate (is it working?), DLQ size with an alert that pages when it grows (did something break permanently?), and per-job-type latency (is one job type starving the rest?). Without these, your queue is a black box that you only audit when it's already on fire. With them, you see the backup forming, the DLQ filling, the failure rate spiking — while it's still a graph, not yet a complaint.
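
Turning those questions into alerts can start very simply. The sketch below compares two metric snapshots and returns the alerts that should fire; the QueueSnapshot fields, the five-minute age limit, and the 5% failure-rate limit are hypothetical placeholders to be tuned per system, and per-job-type latency is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """One reading of the queue's health metrics (hypothetical shape)."""
    depth: int
    oldest_message_age_s: float
    processed: int  # jobs that succeeded since the previous snapshot
    failed: int     # jobs that failed since the previous snapshot
    dlq_size: int


def alerts(current, previous, max_age_s=300.0, max_failure_rate=0.05):
    """Return the list of alert messages implied by the current snapshot."""
    found = []
    if current.oldest_message_age_s > max_age_s:
        found.append("falling behind: oldest message is too old")
    total = current.processed + current.failed
    if total and current.failed / total > max_failure_rate:
        found.append("failure rate above threshold")
    if current.dlq_size > previous.dlq_size:
        found.append("DLQ grew: page a human")
    return found


# Usage: a backlog, a failure spike, and a growing DLQ all show up at once.
before = QueueSnapshot(depth=10, oldest_message_age_s=2.0,
                       processed=100, failed=1, dlq_size=0)
after = QueueSnapshot(depth=5000, oldest_message_age_s=900.0,
                      processed=80, failed=20, dlq_size=3)
fired = alerts(after, before)
```

Alerting on DLQ growth rather than on absolute DLQ size is a deliberate choice in this sketch: it pages when something newly breaks, not every minute that old dead letters are still waiting for review.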

What fixed looks like

A worker dies mid-deploy, exactly as it always will. The invoice job it was running never gets acknowledged, so after the visibility timeout it reappears and another worker picks it up. That worker checks the idempotency key, sees the invoice was already generated before the crash, skips the duplicate, and acknowledges. The customer gets one invoice. Meanwhile a different job — a webhook to a customer endpoint that's been decommissioned — fails three times, exhausts its retry budget, and lands cleanly in the DLQ, where an alert flags it for a human to inspect and either fix or discard. Queue depth ticked up during the deploy and drained within a minute, and the dashboard showed the whole thing.

No vanished jobs. No double charges. No infinite retry storm. No silent backlog. The background work is no longer a place jobs go to die — it's a system you can see, that recovers from crashes on its own, and that fails loudly and recoverably instead of quietly and permanently. The first you hear of a problem is a graph, not a customer.

This is for you if

You're a founder or engineering leader running a production system with real async work — billing runs, webhook delivery, notifications, document generation, anything that happens outside the request — and your queue layer grew by accretion rather than design. You've already had the "a job silently failed and we found out from a customer" incident, and you'd like it to be the last one.

A queue-architecture engagement runs $50k+: we audit every background workflow, fix the delivery and acknowledgement model so jobs stop vanishing, make the irreversible jobs idempotent, add bounded retries and dead-letter queues, resolve the ordering assumptions, build backpressure, and instrument the whole thing so failures surface on a dashboard instead of in support. For teams whose async workloads are mission-critical and constantly growing, we hold it as a reliability retainer at $15k–$25k/mo.

This isn't for a small app with one fire-and-forget email job and no real volume — a simple managed queue with sensible defaults is fine until the work matters. And it's not for teams that already run idempotent jobs with DLQs and proper monitoring and just want a design review. It's for the team whose background jobs pile up, retry forever, or vanish — and who only ever find out when a customer complains.