At 11:40 on a Thursday a customer pushed a new version of their integration with a bug in the retry loop. Instead of one request per webhook it sent the same request in a tight loop — about nine thousand a second. It wasn't malicious. It was a while loop missing a break. But your platform had no per-customer limit, so that one customer's runaway loop consumed the connection pool, the worker queue, and the database's attention. Every other customer's requests started timing out. One buggy integration, and your entire multi-tenant platform was down for everyone.
This is the failure that should keep you up: in a shared system, the absence of limits means any single tenant — through malice, a bug, or a sudden legitimate surge — can consume capacity that belongs to everyone. The platform's stability becomes hostage to the worst-behaved client. Rate limiting isn't a feature you add for security theater. It's the wall that keeps one tenant's bad day from becoming everyone's bad day.
Here's how we build that wall.
The algorithm matters more than people think
"Add rate limiting" usually means "block clients over N requests per minute," and the naive implementation of that is worse than none, because it has its own failure mode. Let's get the algorithm right first.
Fixed window is the obvious version: count requests per calendar minute, reject over the limit, reset at the top of the minute. It's simple and it's wrong at the edges. A client allowed 100/minute can send 100 in the last second of one window and 100 in the first second of the next — 200 requests in two seconds, double the limit you thought you set, right at the boundary. The reset is a synchronized cliff that invites bursts.
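To make the boundary problem concrete, here is a minimal sketch of a fixed-window counter. The class name and the injected `now` timestamps are our own illustration, not a production limiter; time is passed in explicitly so the behaviour is reproducible.

```python
class FixedWindowLimiter:
    """Counts requests per fixed window and resets at each window boundary."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # New calendar window: the counter drops to zero all at once.
            self.current_window = window
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowLimiter(limit=100)
# 100 requests in the last second of minute 0, then 100 in the first second of minute 1.
late = sum(limiter.allow(59.0 + i / 100) for i in range(100))
early = sum(limiter.allow(60.0 + i / 100) for i in range(100))
admitted_in_two_seconds = late + early  # 200, double the nominal limit
```

Every request was individually "within the limit", and the client still got 200 through in two seconds.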
Sliding window fixes the boundary by counting over a rolling trailing period rather than a fixed calendar block. Smoother, fairer, no edge burst. Slightly more to compute, worth it.
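One way to sketch a sliding window is a log of recent timestamps. This is an assumption about implementation, not the only option: the log variant below stores one timestamp per admitted request, and cheaper approximations exist. The class name is hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """Admits a request only if fewer than `limit` were admitted in the trailing window."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_seconds
        # Drop admissions that have aged out of the trailing window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


limiter = SlidingWindowLimiter(limit=100)
late = sum(limiter.allow(59.0 + i / 100) for i in range(100))   # 100 admitted
early = sum(limiter.allow(60.0 + i / 100) for i in range(100))  # 0 admitted
much_later = limiter.allow(200.0)  # True: the earlier burst has aged out
```

The same burst that slipped 200 requests past a fixed window gets exactly 100 here, because the count follows the client instead of the clock.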
Token bucket is what we reach for most. Each client has a bucket that refills at a steady rate — say 100 tokens per minute — up to a cap. Each request spends a token; no token, request rejected. The elegance is that it allows controlled bursts: a client that's been quiet has a full bucket and can briefly spike, which matches how real, legitimate clients behave, while the steady refill rate enforces the long-run average. It absorbs honest bursts and still stops sustained abuse.
token bucket: refill 100/min, cap 100
quiet client -> full bucket -> can burst to 100 immediately
busy client -> empty bucket -> throttled to refill rate
enforces the average, forgives the spike
For a multi-tenant API, token bucket per tenant is the default. It's the algorithm that doesn't punish good clients for occasionally being bursty while still containing the runaway loop.
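A minimal token bucket sketch, under stated assumptions: a single process, an explicit `now` argument instead of a real clock, and names of our own choosing. It uses the 100-per-minute, cap-100 numbers from the example above.

```python
class TokenBucket:
    """Refills at a steady rate up to a cap; each request spends `cost` tokens."""

    def __init__(self, rate_per_second: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity  # a quiet client starts with a full bucket
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazy refill: credit the tokens earned since the last call, capped.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_second=100 / 60, capacity=100)
burst = sum(bucket.allow(now=0.0) for _ in range(150))          # 100: the full bucket
half_minute = sum(bucket.allow(now=30.0) for _ in range(150))   # 50: only what refilled
```

The quiet client gets its spike, and the runaway loop gets the refill rate and nothing more.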
Per-tenant quotas are the actual point
A single global rate limit protects the platform from total collapse but does nothing about fairness, and fairness is the real requirement in a shared system. A global limit means a noisy tenant can still consume most of the shared budget and starve the quiet ones — you've capped the total but not the distribution.
Limits have to be scoped per tenant, and ideally layered:
Per-tenant is the load-bearing layer. Each customer gets their own bucket, so one customer's usage — runaway loop included — can only ever exhaust their own allocation. The Thursday incident is impossible the moment this exists: the buggy integration burns through its own tenant's quota and starts getting throttled, and every other tenant is completely unaffected. The blast radius is exactly one customer, which is the customer who caused it.
Per-endpoint layers on top, because not all requests cost the same. A read from cache and a report-generation request that scans millions of rows are wildly different loads, and a flat per-tenant limit treats them identically. Tiered limits — generous on cheap reads, tight on expensive operations — protect the resources that actually matter.
Tier-aware ties limits to the plan. Enterprise customers get higher ceilings than free-trial accounts, and the limit becomes a product lever as well as a safety control. The free tier's limit is also your first line of defense against scraping and abuse, since abuse overwhelmingly arrives through the cheapest door.
This is where multi-tenant rate limiting stops being a middleware toggle and becomes architecture: the limits have to know who the tenant is, what they're allowed, and what this specific endpoint costs. We build this scoping into the platform's identity layer so every request is attributed and budgeted.
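The layering can be sketched as a lookup keyed by tenant and endpoint class, with the ceiling taken from the tenant's plan. Everything named here is hypothetical: the tier names, the endpoint classes, and the numbers are placeholders, and the window reset is left out to keep the scoping visible.

```python
from dataclasses import dataclass

# Illustrative per-window ceilings by plan tier and endpoint cost class.
TIER_LIMITS = {
    "free": {"cheap_read": 60, "expensive": 2},
    "enterprise": {"cheap_read": 6000, "expensive": 60},
}


@dataclass
class Budget:
    limit: int
    used: int = 0


class TenantLimiter:
    """One budget per (tenant, endpoint class); tenants never share a budget."""

    def __init__(self, tenant_tiers: dict):
        self.tenant_tiers = tenant_tiers  # tenant id -> tier name
        self.budgets = {}

    def allow(self, tenant_id: str, endpoint_class: str) -> bool:
        key = (tenant_id, endpoint_class)
        if key not in self.budgets:
            tier = self.tenant_tiers[tenant_id]
            self.budgets[key] = Budget(limit=TIER_LIMITS[tier][endpoint_class])
        budget = self.budgets[key]
        if budget.used >= budget.limit:
            return False
        budget.used += 1
        return True


limiter = TenantLimiter({"acme": "free", "globex": "enterprise"})
runaway = sum(limiter.allow("acme", "cheap_read") for _ in range(9000))  # 60
neighbour_ok = limiter.allow("globex", "cheap_read")                     # True
```

Nine thousand requests from one tenant admit sixty and leave the neighbour's budget untouched.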
Where the counter lives
A rate limiter is a shared counter, and in a horizontally-scaled system that counter has to be shared across all your instances, or it doesn't work. If each app instance keeps its own in-memory count, a client whose requests are spread across ten instances gets up to ten times the intended limit. In-memory counters silently multiply your limit by your instance count, and because instance count tends to grow as you scale out under load, the limit loosens during the exact traffic where you need it.
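The multiplication is easy to see in a toy simulation. This sketch assumes perfectly even round-robin load balancing, which is the worst case for the limit; the function name and numbers are ours.

```python
def count_admitted(num_requests: int, num_instances: int, limit: int, shared: bool) -> int:
    """Simulate one client's requests spread round-robin across instances."""
    counters = [0] if shared else [0] * num_instances
    admitted = 0
    for i in range(num_requests):
        # With local counters, each instance only sees its own slice of traffic.
        slot = 0 if shared else i % num_instances
        if counters[slot] < limit:
            counters[slot] += 1
            admitted += 1
    return admitted


local_total = count_admitted(5000, num_instances=10, limit=100, shared=False)  # 1000
shared_total = count_admitted(5000, num_instances=10, limit=100, shared=True)  # 100
```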
The counter lives in a fast shared store — Redis, typically — and the increment-and-check has to be atomic so two simultaneous requests can't both read "99" and both proceed to 100. That's a single atomic operation (a Lua script or an atomic increment with expiry), not a read-then-write that races. Get this wrong and your limit leaks precisely under concurrency.
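Here is the shape of an atomic check-and-increment, sketched in-process. This is a stand-in, not Redis code: a `threading.Lock` plays the role that a Lua script or an atomic increment plays in the shared store, and window expiry is omitted. The class and key names are hypothetical.

```python
import threading


class AtomicWindowCounter:
    """Check and increment happen as one step, so concurrent callers cannot overshoot."""

    def __init__(self, limit: int):
        self.limit = limit
        self._counts = {}
        self._lock = threading.Lock()

    def try_acquire(self, key: str) -> bool:
        with self._lock:
            # Read, compare, and write inside one critical section.
            current = self._counts.get(key, 0)
            if current >= self.limit:
                return False
            self._counts[key] = current + 1
            return True


counter = AtomicWindowCounter(limit=100)
results = []


def worker() -> None:
    results.append(counter.try_acquire("tenant-42"))


threads = [threading.Thread(target=worker) for _ in range(300)]
for t in threads:
    t.start()
for t in threads:
    t.join()
admitted = sum(results)  # exactly 100, however the threads interleave
```

Split the read and the write into two separate steps and that guarantee is gone, which is the race the paragraph above describes.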
And the limiter must fail open, deliberately. If the Redis holding your counters becomes unavailable, the limiter should default to allowing traffic, not blocking it — a rate-limiting outage that blocks all requests turns a safety mechanism into a self-inflicted outage. The limiter exists to protect availability; it must never be the thing that destroys it. Fail open, alert loudly, fix fast.
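Failing open is a small wrapper, sketched below. `StoreUnavailable` is a hypothetical exception standing in for whatever your store client raises on connection failure, and the `check` callable is assumed to be your real limiter.

```python
import logging

logger = logging.getLogger("ratelimit")


class StoreUnavailable(Exception):
    """Hypothetical error for an unreachable counter store."""


def allow_fail_open(check, key: str) -> bool:
    """Run the real limit check; if the store is down, allow and alert."""
    try:
        return check(key)
    except StoreUnavailable:
        # Deliberate choice: a limiter outage must not become a platform outage.
        logger.error("rate-limit store unavailable; failing open for %s", key)
        return True


def broken_store(key: str) -> bool:
    raise StoreUnavailable()


during_outage = allow_fail_open(broken_store, "tenant-42")        # True
normal_reject = allow_fail_open(lambda key: False, "tenant-42")   # False
```

Only the store failure is swallowed. A normal "over the limit" answer still rejects, and any other exception still surfaces.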
Abuse detection beyond rate
Rate limiting catches volume. It doesn't catch the patient attacker who stays just under your limits, or the credential-stuffing run that spreads across thousands of IPs at one request each. For that you need behavioral signals layered on top of raw counts.
The signals worth watching: an authentication endpoint seeing a spike in failures (credential stuffing), one account suddenly accessing data at a breadth that doesn't match its history (a compromised or scraping account), traffic from a single source hitting many accounts (password spraying or account enumeration), or a sudden shift in the shape of a tenant's traffic. These are anomalies relative to a baseline, and the response is graduated: log, then challenge, then throttle harder, then block, escalating with confidence rather than reacting to a single data point. You don't lock out a paying customer over one weird minute, and you don't wait for a thousand bad requests before you act on an obvious attack.
The principle is layering: rate limits handle volume, behavioral detection handles patterns, and the two together cover both the loud abuse and the quiet kind.
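The graduated response can be sketched as a ladder. The thresholds below are invented for illustration, and "consecutive anomalous windows" is only one possible measure of confidence; a real system would tune both against its own baseline.

```python
from enum import IntEnum


class Action(IntEnum):
    ALLOW = 0
    LOG = 1
    CHALLENGE = 2
    THROTTLE = 3
    BLOCK = 4


# Hypothetical ladder: consecutive anomalous windows needed to reach each step.
ESCALATION = [
    (1, Action.LOG),
    (3, Action.CHALLENGE),
    (5, Action.THROTTLE),
    (10, Action.BLOCK),
]


def respond(consecutive_anomalous_windows: int, high_confidence_attack: bool = False) -> Action:
    """Escalate with sustained evidence; skip the ladder only for an obvious attack."""
    if high_confidence_attack:
        return Action.BLOCK
    action = Action.ALLOW
    for threshold, step in ESCALATION:
        if consecutive_anomalous_windows >= threshold:
            action = step
    return action


one_weird_minute = respond(1)                                   # Action.LOG
sustained = respond(6)                                          # Action.THROTTLE
obvious = respond(1, high_confidence_attack=True)               # Action.BLOCK
```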
Rejecting gracefully
How you say no matters as much as that you say no, because most rejections are honest clients that bumped a ceiling, not attackers. A good rejection is a clear contract, not a punishment.
Return 429 Too Many Requests with a Retry-After header telling the client exactly when to come back, and rate-limit headers showing their limit, remaining budget, and reset time. A well-behaved client reads those and backs off correctly; a runaway loop is at least told the truth. The rejection itself must be cheap — rejected at the edge, before it touches the database or the worker queue — because a rejection that consumes the resources it was meant to protect is no protection at all. Reject early, reject clearly, reject for almost nothing.
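A sketch of that rejection as a framework-neutral tuple. `Retry-After` is a standard HTTP header; the `X-RateLimit-*` names are a widely used convention rather than a standard, so treat them as an assumption and match whatever your API already documents.

```python
import math


def build_rejection(limit: int, reset_epoch: float, now: float):
    """Return (status, headers, body) for a request that exceeded its limit."""
    # Round up so a client that waits this long is never early.
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    return 429, headers, "Too Many Requests"


status, headers, body = build_rejection(limit=100, reset_epoch=1060.0, now=1047.2)
# status == 429, headers["Retry-After"] == "13"
```

It does no I/O and touches no shared state, which is what keeps the rejection cheap.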
And the limit should never be a silent wall a developer discovers in production. Document it, surface current usage in the dashboard, and warn before the ceiling, not at it.
Protecting the expensive endpoints
The final move is recognizing that a flat limit across all endpoints is the wrong mental model, because your endpoints don't cost the same. The expensive ones — report generation, bulk exports, search across large datasets, anything that fans out to other services or holds a connection for seconds — deserve their own tighter limits and often their own queue.
Push expensive work off the synchronous request path entirely: accept the request, enqueue the job, return a handle, and process at a controlled rate the system can sustain. Now a flood of expensive requests becomes a backlog that drains safely, instead of a synchronous pile-up that exhausts the pool. Concurrency limits on these endpoints — "this tenant may have at most three reports generating at once" — protect the shared resource more precisely than a request-rate limit ever could, because the thing you're actually protecting is concurrent capacity, not request count.
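A per-tenant concurrency cap can be sketched as a counter of in-flight jobs. This version is single-process and the names are hypothetical; across instances the in-flight count would need to live in the shared store, with the same atomicity concerns as the rate counter.

```python
import threading


class TenantConcurrencyLimit:
    """Caps how many expensive jobs one tenant may have running at once."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = {}
        self._lock = threading.Lock()

    def try_start(self, tenant_id: str) -> bool:
        with self._lock:
            running = self._in_flight.get(tenant_id, 0)
            if running >= self.max_in_flight:
                return False  # caller should queue or reject the job
            self._in_flight[tenant_id] = running + 1
            return True

    def finish(self, tenant_id: str) -> None:
        with self._lock:
            running = self._in_flight.get(tenant_id, 0)
            self._in_flight[tenant_id] = max(0, running - 1)


reports = TenantConcurrencyLimit(max_in_flight=3)
started = [reports.try_start("acme") for _ in range(4)]  # [True, True, True, False]
other_tenant = reports.try_start("globex")               # True
reports.finish("acme")
after_finish = reports.try_start("acme")                 # True
```

Call `finish` from a `finally` block, or a crashed job will hold its slot forever.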
What fixed looks like
The buggy integration ships its runaway loop again — because some customer's while loop will always eventually lose its break. This time it burns through that tenant's own token bucket in seconds and starts collecting clean 429s with a Retry-After, rejected cheaply at the edge before touching the database. Every other tenant sees nothing; their buckets are untouched, their requests are fast. The blast radius is exactly one customer, and it's the one with the bug.
Limits are per-tenant, per-endpoint, and tier-aware, counted in a shared store with atomic increments that hold under concurrency and fail open if that store ever blinks. Expensive operations run through their own tighter limits and a queue, so a flood of reports becomes a backlog instead of an outage. Behavioral detection watches for the abuse that stays under the rate ceiling. One bad actor, one bug, one surge — contained to itself, every time. The platform's stability stops being hostage to its worst-behaved client.
This is for you if
You're running a funded multi-tenant platform or a public API where customers share infrastructure, and you've either already had the "one tenant took down everyone" incident or you can see clearly that nothing today would stop it. You want one client's worst day contained to that client.
A rate-limiting and abuse-prevention engagement runs $50k+: we build per-tenant, per-endpoint, tier-aware limits on a shared atomic counter that holds under concurrency, add graceful rejection and expensive-endpoint protection, and layer behavioral abuse detection on top. Then we prove it by simulating a runaway tenant and watching everyone else stay up. For a platform that needs full quota management, billing-integrated limits, and a real abuse-response system, that program runs $100k+.
It's not for you if you're a single-tenant internal tool with a handful of trusted users — there's no shared blast radius to contain, and a basic limit is plenty. It's for the platform where tenants share capacity and one of them, someday, will send nine thousand requests a second by accident.
// contain the blast to the tenant that caused it