Designing for Failure: Circuit Breakers, Retries, and Why One Slow Dependency Takes You Down

A payment provider's API got slow on a Tuesday. Not down — slow. Responses that normally came back in 200ms started taking eight seconds. Your checkout service kept calling it, like it always does, and kept waiting. Within four minutes every request thread in your checkout service was parked, waiting on a dependency that was never going to answer in time. Checkout stopped responding. Then the homepage, which calls checkout to show cart state, stopped responding. Then the whole site was a spinner.

The payment provider was never down. It was slow. Your system turned "one dependency is slow" into "everything is down," and it did that because nothing in the request path was designed to fail. It was designed to succeed, and to wait patiently forever for success that wasn't coming.

This is the most common outage shape in production software, and it's also the most preventable. Let's take it apart.

How a slow dependency becomes a full outage

The mechanism has a name: cascading failure, driven by resource exhaustion. It's worth understanding precisely, because the fix follows directly from the mechanism.

Every service has a finite pool of something that handles concurrent requests: threads, connections, event-loop slots, whatever your runtime calls it. Say it's 200. Under normal load, requests come in, do their work in 200ms, and free the slot for the next request. Two hundred slots churning through 200ms requests can handle about 1,000 requests per second, which is a lot of traffic.

Now the downstream dependency slows to eight seconds. Each request still grabs a slot, but now it holds that slot for eight seconds instead of 200ms — forty times longer. The slots fill up. New requests have nowhere to go. They queue. The queue grows. From the outside, your service is now "down," even though it's not crashed, not erroring, just completely saturated waiting on someone else.
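The arithmetic behind that saturation can be checked directly. The sketch below uses the slot count and latencies from the scenario above; the 100 requests per second of incoming traffic is an assumed figure chosen only for illustration, and the function names are hypothetical.

```python
def throughput_ceiling(slots: int, latency_ms: int) -> float:
    """Most requests per second a fixed pool can complete at a given latency."""
    return slots * 1000 / latency_ms


def slots_needed(arrival_rate_per_s: int, latency_ms: int) -> float:
    """Little's law: average requests in flight = arrival rate x latency."""
    return arrival_rate_per_s * latency_ms / 1000


SLOTS = 200
ASSUMED_TRAFFIC = 100  # requests per second; illustrative, not a measured number

print(throughput_ceiling(SLOTS, 200))        # healthy dependency: 1000.0
print(throughput_ceiling(SLOTS, 8000))       # slow dependency: 25.0
print(slots_needed(ASSUMED_TRAFFIC, 200))    # healthy: 20.0 slots busy
print(slots_needed(ASSUMED_TRAFFIC, 8000))   # slow: 800.0 slots wanted, 200 exist
```

At the assumed traffic level the healthy service uses a tenth of its pool. At eight seconds per call the same traffic wants four times more slots than exist, so the excess has nowhere to go but the queue.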

The cruelty is the blast radius. The slow dependency only touched one code path — checkout calling payments. But because checkout is now saturated, anything that calls checkout is now blocked too. The failure propagates upward through every caller, and a problem isolated to one third-party API becomes a sitewide outage. One slow thing, no containment, total collapse.

payment API: 200ms -> 8000ms   (slow, not down)
  checkout: 200 slots, all parked waiting   (saturated)
    homepage: calls checkout, now blocked    (saturated)
      => full outage from a dependency that never went down

The root cause is not the slow API. Dependencies get slow; that's a given, not an anomaly. The root cause is that your system had no answer for "what do we do when something we depend on stops behaving."

Timeouts: the failure most people skip

Before circuit breakers, before retries, there is the single most-skipped resilience control: the timeout. Most outages of this shape happen because somewhere in the stack, a network call was made with no timeout, or with the default timeout, which on many HTTP clients is infinite or some absurd number like 120 seconds.

A call with no timeout is a promise to wait forever. In the scenario above, an eight-second dependency would have been survivable with a 500ms timeout — the request fails fast, frees the slot, and the slot goes back to serving traffic that can succeed. Instead, infinite timeouts meant slots were held until the dependency answered, which is what killed you.

Every network call needs a timeout, and the timeout should be set deliberately, not inherited. The right number is "how long am I willing to hold a slot waiting on this?" — usually well under a second for an internal call, a few seconds at most for an external one. The goal of a timeout is not to wait long enough to succeed. It's to give up fast enough to survive.
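As a minimal sketch of a deliberate timeout, here is one way to set it with Python's standard http.client. The function name, the DependencyTimeout exception, and the 500ms budget are assumptions for illustration, not a prescription for any particular client library.

```python
import http.client
import socket

PAYMENT_TIMEOUT_S = 0.5  # assumed budget: how long we will hold a slot


class DependencyTimeout(Exception):
    """Hypothetical error: the dependency did not answer within the budget."""


def get_with_timeout(host, port, path, timeout_s=PAYMENT_TIMEOUT_S):
    # The timeout applies to each blocking socket operation (connect, send,
    # receive). It is not a total deadline for the whole request.
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    except (socket.timeout, TimeoutError) as exc:
        # Give up fast and surface a clear error so the slot is freed.
        raise DependencyTimeout(f"{host}:{port}{path} exceeded {timeout_s}s") from exc
    finally:
        conn.close()
```

The caller decides what to do with DependencyTimeout (fail the request, fall back, or retry under the rules described later). The point of the sketch is only that the wait is bounded by a number someone chose on purpose.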

This is also where naive retries make everything worse. A team that adds "retry on failure" without timeouts and without backoff turns a slow dependency into a self-inflicted denial-of-service attack: every failed call immediately fires up to three more calls at the thing that's already struggling. You've as much as quadrupled the load on a dependency precisely when it's least able to handle it. Retries are a real tool, but only with the right shape, which we'll get to.

The circuit breaker: stop calling the thing that's broken

The circuit breaker is the pattern that directly addresses cascading failure, and it borrows its logic from the electrical kind. When a dependency is failing, you stop sending it traffic for a while. You "trip the breaker."

A circuit breaker wraps calls to a dependency and tracks their outcomes. It has three states:

Closed is normal. Calls pass through to the dependency. The breaker counts failures.

Open is tripped. After failures cross a threshold — say, 50% of calls failing over a 10-second window — the breaker opens, and now calls don't even reach the dependency. They fail instantly, locally, without holding a slot or adding load. This is the critical move: when the payment API is broken, the fastest, safest thing checkout can do is stop calling it and fail immediately, returning a useful error instead of a hung connection.

Half-open is the test. After a cooldown, the breaker lets a small number of trial calls through. If they succeed, the dependency has recovered and the breaker closes. If they fail, it opens again and waits longer.

The circuit breaker turns "wait eight seconds for every doomed call" into "fail in microseconds and free the slot." That single change is the difference between a contained degradation and a total outage. Checkout becomes unavailable, sure — but the homepage stays up, the rest of the site stays up, and the failure is isolated to the one feature that actually depends on the broken thing.
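The three states can be sketched as a small state machine. This is an illustrative, single-threaded sketch rather than a production implementation: the class name, the parameter names and defaults, the minimum-calls guard, and the choice to double the cooldown after a failed trial are all assumptions. A real service would more likely use an established resilience library and would need locking.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_s=10.0, min_calls=10,
                 cooldown_s=5.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio  # e.g. 50% of calls failing
        self.window_s = window_s            # e.g. over a 10-second window
        self.min_calls = min_calls          # assumed guard against tiny samples
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.outcomes = deque()             # (timestamp, succeeded) pairs

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                # Fail instantly and locally: no slot held, no load added.
                raise CircuitOpenError("dependency unavailable; failing fast")
            self.state = "half_open"        # cooldown over: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure(self.clock())
            raise
        self._on_success(self.clock())
        return result

    def _on_success(self, now):
        if self.state == "half_open":
            # Trial succeeded: treat the dependency as recovered.
            self.state = "closed"
            self.cooldown_s = self.base_cooldown_s
            self.outcomes.clear()
        else:
            self._record(now, True)

    def _on_failure(self, now):
        if self.state == "half_open":
            # Trial failed: reopen and wait longer before the next trial.
            self.cooldown_s *= 2
            self._open(now)
            return
        self._record(now, False)
        failures = sum(1 for _, ok in self.outcomes if not ok)
        total = len(self.outcomes)
        if total >= self.min_calls and failures / total >= self.failure_ratio:
            self._open(now)

    def _record(self, now, ok):
        self.outcomes.append((now, ok))
        # Drop outcomes that have aged out of the sliding window.
        while self.outcomes and now - self.outcomes[0][0] > self.window_s:
            self.outcomes.popleft()

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.outcomes.clear()
```

A caller would wrap each dependency call, for example breaker.call(charge_card, order) with a hypothetical charge_card function, and treat CircuitOpenError as the signal to return a clean error or a fallback.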

Retries that help instead of hurt

Retries are how you recover from transient failures — the dropped packet, the brief blip, the one request that timed out while the dependency is otherwise healthy. Done right, they paper over the small stuff invisibly. Done wrong, they amplify outages. The difference is three rules.

Exponential backoff. Don't retry immediately, and don't retry at a fixed interval. Double the wait after each failed attempt: 100ms, then 200ms, then 400ms. This gives a struggling dependency room to recover instead of hammering it on a fixed cadence.

Jitter. Add randomness to the backoff. Without it, a thousand clients that all failed at the same moment all retry at the same moment, producing a synchronized thundering herd that knocks the dependency back down the instant it recovers. Jitter spreads the retries out so recovery can actually happen.

A retry budget, and only on retryable errors. Cap retries — two or three, not "until it works." And only retry things that might succeed on a second attempt: timeouts, 503s, connection resets. Never retry a 400 or a 422. The request was malformed; it will be malformed again. Retrying a deterministic failure is just load with extra steps.
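The three rules fit in a few lines. The sketch below is one way to combine them, under stated assumptions: RetryableError is a hypothetical marker the caller raises for timeouts, 503s, and connection resets, and the jitter shown is the variant often called "full jitter" (a random delay between zero and the exponential ceiling), which is only one of several reasonable choices.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s, resets)."""


def backoff_delays(max_retries=3, base_s=0.1, cap_s=2.0, rng=random.random):
    """Yield one jittered delay per retry; ceilings are 100ms, 200ms, 400ms."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))  # exponential backoff
        yield rng() * ceiling                          # jitter: spread clients out


def call_with_retries(func, max_retries=3, base_s=0.1, cap_s=2.0,
                      sleep=time.sleep, rng=random.random):
    delays = backoff_delays(max_retries, base_s, cap_s, rng)
    while True:
        try:
            return func()
        except RetryableError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget spent: surface the failure
            sleep(delay)
        # Any other exception (a 400, a 422, a bug) propagates immediately:
        # a deterministic failure will fail the same way on every attempt.
```

Passing sleep and rng as parameters is a design choice made here so the behavior can be tested without real waiting. In production the defaults apply.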

And retries operate under the circuit breaker, not around it: every attempt is subject to the breaker, and a call the breaker rejects is not retried. The breaker is the backstop: if retries keep failing across many requests, the breaker trips and stops the retrying entirely. Layered correctly, retries handle the blips and the breaker handles the outages.

Graceful degradation: decide what "down" means

The last piece is a product decision disguised as an engineering one. When a dependency is unavailable and the breaker is open, what does the user see?

The lazy answer is an error page. The better answer is degraded function. If the recommendations service is down, show the page without recommendations instead of failing the whole page. If the live inventory count is unavailable, show the product as "in stock" with a cached number and reconcile at checkout. If the payment provider is down, queue the order and tell the user you'll confirm shortly, rather than losing the sale entirely.

Graceful degradation requires deciding, ahead of time, which dependencies are critical (the feature genuinely cannot function without them) and which are enhancing (nice, but the core still works). Enhancing dependencies should never be able to take down the core. The way you guarantee that is by wrapping them in breakers with fallbacks, so that when they fail, the system falls back instead of falling over.
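A fallback wrapper for an enhancing dependency can be very small. This sketch assumes a hypothetical fetch_recommendations callable and a dictionary-shaped product, and it stands in for whatever breaker-plus-fallback mechanism a real system uses.

```python
import logging

logger = logging.getLogger("degradation")


def with_fallback(primary, fallback_value):
    """Call an enhancing dependency; on any failure, return a safe default."""
    try:
        return primary()
    except Exception:
        # Record the failure for operators, but do not let it reach the user.
        logger.warning("enhancing dependency failed; serving fallback", exc_info=True)
        return fallback_value


def render_product_page(product, fetch_recommendations):
    # Recommendations are enhancing: the page must render without them.
    recommendations = with_fallback(
        lambda: fetch_recommendations(product["id"]), []
    )
    return {"product": product, "recommendations": recommendations}
```

Critical dependencies get no such wrapper. Their failures should surface as a clear error or a deliberate alternative, such as queuing the order, rather than a silent default.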

What fixed looks like

The payment API gets slow again — because it will. This time, the checkout service's calls to it have a 500ms timeout. After a handful of timeouts, the circuit breaker trips. Checkout stops calling the slow dependency entirely; its requests fail in microseconds, so its slots never fill. The homepage, which depends on checkout, keeps serving because checkout is still responsive — it's just returning a clean "payments temporarily unavailable" for the one path that needs payments. Orders queue for later confirmation instead of vanishing.

The blast radius is one feature, degraded, for the few minutes the provider is slow. Not the whole site, down, for as long as it takes someone to notice and restart things. When the provider recovers, the half-open breaker notices on its own and traffic resumes. Nobody got paged. The incident is a line on a dashboard, not a war room.

This is for you if

You're a founder or engineering leader running a production system with real dependencies — payment providers, third-party APIs, multiple internal services — and you've already had at least one outage where something slow took down something that shouldn't have been affected. You want failure contained by design, not by luck.

A resilience engagement runs $50k+: we map your dependency graph, find the unprotected network calls, and implement timeouts, circuit breakers, properly-shaped retries, and graceful degradation on the paths that matter — then prove it works by deliberately breaking dependencies in a controlled test. For teams running mission-critical systems where the cascading-failure risk is ongoing, we hold the line as a reliability retainer at $15k–$25k/mo.

This is not for pre-launch products with a single dependency and no real traffic — you can add breakers later when there's something to protect. And it's not for teams that already run mature resilience patterns and just want a config review. It's for the team that's had the "one slow thing took down everything" outage and doesn't want a second one.