The Cost of Downtime, and How to Actually Buy Uptime

Every nine of availability costs real money, and most teams over-buy or under-buy because they never did the math. This is a decision model for buying uptime on purpose.

Pick the right answer: what should your uptime target be?

Most founders answer "as high as possible," which is the one answer that's guaranteed to be wrong. "As high as possible" has no budget, no priority, and no end — it's how you spend $400k chasing a nine that no customer would ever pay for, while the actual outage risk sits untouched in a part of the system nobody thought to harden. The right answer is a number you derived on purpose, and almost nobody derives it.

Here's the uncomfortable truth underneath: every additional "nine" of availability costs real money, and the cost climbs steeply. Going from 99% to 99.9% is comparatively cheap. Going from 99.9% to 99.99% costs considerably more, because it usually demands a different architecture rather than tuning. Going from 99.99% to 99.999% can cost more than the previous two jumps combined. Teams that never do the math end up in one of two failure modes. They under-buy: running mission-critical infrastructure on best-effort reliability, then losing a quarter's revenue to a single outage that a $30k investment would have prevented. Or they over-buy: burning runway on five-nines redundancy for a system whose customers would happily tolerate a 20-minute monthly maintenance window.

Both come from the same root cause: they never priced their own downtime, so they have no idea what reliability is worth to them. Let's fix that, then spend the money where it actually buys uptime.

The nines, in human units

First, internalize what these numbers mean in time, because percentages hide it.

99%      -> ~3 days 15 hours of downtime per year
99.9%    -> ~8 hours 45 minutes per year
99.99%   -> ~52 minutes per year
99.999%  -> ~5 minutes per year
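
Those figures are just the unavailable fraction multiplied by the hours in a year. A minimal sketch of the conversion, assuming a 365-day year and a service that is expected to be up around the clock; the function name `downtime_per_year` is illustrative, not from any library.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 24/7 service


def downtime_per_year(availability_pct: float) -> timedelta:
    """Return the downtime allowed per year at a given availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_per_year(target)
    print(f"{target}% -> {budget.total_seconds() / 60:.1f} minutes per year")
```

Swap in your own target (99.95, say) to see the budget you would actually be committing to.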

Three observations that change how you think about the target.

99% sounds great in a sales deck and is genuinely terrible: more than three and a half days of downtime a year. For anything a business depends on, 99% is a non-starter.

The leap from 99.9% to 99.99% is the leap from "a long, painful incident is fine once a year" to "you essentially cannot have a long incident at all." That's not a tuning change. That's a different architecture: redundancy everywhere, automated failover, no single points of failure, and the operational maturity to detect and recover in minutes without a human in the loop.

And five nines, roughly five minutes a year, means you cannot afford the time it takes a human to wake up, read a page, and log in. Everything must self-heal. Most companies that think they need five nines actually need three and a half nines (99.95%) and a good incident process, and the difference is six or seven figures a year.

The cost model: price your own downtime

Reliability decisions are unmakeable until you can put a dollar figure on an hour of being down. Build it from four inputs.

Direct revenue loss. Revenue that simply doesn't happen while you're dark. Annual revenue divided by the hours you actually transact (8,760 if you sell around the clock) gives a rough per-hour number, but be honest about shape: if you're a checkout flow and you go down during peak hours, the real number is several times the average.

Recovery and response cost. Engineers pulled off roadmap to firefight, the overtime, the days of cleanup and data reconciliation after. An outage doesn't end when the site comes back; it ends when the backlog it created is cleared.

Contractual and SLA penalties. If you've signed SLAs with enterprise customers, downtime past the threshold triggers credits or refunds. Read your own contracts. Some teams discover their downtime cost is dominated by penalties they forgot they agreed to.

Trust and churn. The hardest to quantify and often the largest. A B2B customer who watched your product fail during their critical moment starts evaluating alternatives. The cost isn't the outage hour. It's the contract that doesn't renew nine months later, a loss you'll never trace back to the incident.

Add them up. Now you have a real number — cost per hour of downtime — and reliability spending stops being a feeling and becomes arithmetic. If an hour down costs you $200k and a $50k investment removes your single largest outage risk, that's not a cost. That's the cheapest insurance you'll ever buy. If an hour down costs you $800 because you're early and pre-revenue, then spending $200k on redundancy is lighting runway on fire to solve a problem you don't have yet.
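
The arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions, not a pricing tool: every dollar figure is hypothetical, the per-input split is invented to sum to the $200k example, and `hours_avoided_per_year` is a guess you would have to supply from your own incident history.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Cost of one hour of downtime, built from the four inputs (USD per hour)."""
    direct_revenue_loss: float
    recovery_and_response: float
    sla_penalties: float
    trust_and_churn: float

    def per_hour(self) -> float:
        return (self.direct_revenue_loss + self.recovery_and_response
                + self.sla_penalties + self.trust_and_churn)


def investment_pays_off(cost_per_hour: float,
                        hours_avoided_per_year: float,
                        annual_investment: float) -> bool:
    """True when the downtime cost avoided exceeds what the fix costs per year."""
    return cost_per_hour * hours_avoided_per_year > annual_investment


# Hypothetical split that adds up to the $200k-per-hour example.
established = DowntimeCost(120_000, 20_000, 25_000, 35_000)
print(established.per_hour())                                      # 200000
print(investment_pays_off(established.per_hour(), 1, 50_000))      # True

# Early-stage case: $800 per hour, $200k of redundancy, 8 hours avoided (assumed).
print(investment_pays_off(800, 8, 200_000))                        # False
```

The trust-and-churn input will always be the softest number in the model; a rough range is still better than leaving it at zero.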

Where the next nine actually comes from

Once you know the target, the question is mechanical: what's actually causing your downtime, and what removes the most of it per dollar? This is where over-buyers waste fortunes — they buy redundancy for a layer that was never the problem.

For most teams, the cheapest early nines come from boring places, in roughly this order:

Eliminate single points of failure. One database with no replica, one server, one load balancer, one critical third party with no fallback. Any of these dies and you're down for as long as recovery takes. Adding a standby replica with automated failover is usually the single most productive reliability dollar you'll spend.

Make deploys safe. A large share of outages aren't acts of god — they're self-inflicted, shipped straight to production by the team. Staged rollouts, health checks, and fast automated rollback convert "a bad deploy is a two-hour outage" into "a bad deploy auto-reverts in ninety seconds." Cheap, and it removes a category you're causing yourself.

Detect fast. You cannot recover from what you can't see. If your customers tell you you're down before your monitoring does, your time-to-detect is destroying your uptime number before recovery even starts. Real monitoring and alerting on the metrics that matter is one of the cheapest nines available.

Contain failures. Timeouts, circuit breakers, and graceful degradation so one slow dependency degrades one feature instead of taking down everything. This converts total outages into partial ones — which moves your effective availability more than another layer of redundancy ever will.
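
To make the containment idea concrete, here is a minimal circuit-breaker sketch. It is a teaching example rather than a production implementation: the class, its parameters (`max_failures`, `reset_after`), and the `flaky_recommendations` dependency are all hypothetical, and a real system would more likely use an established resilience library plus per-call timeouts.

```python
import time


class CircuitBreaker:
    """After `max_failures` consecutive failures, stop calling the dependency
    for `reset_after` seconds and serve the fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: degrade without calling
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result


def flaky_recommendations():
    # Hypothetical slow dependency that always fails in this demo.
    raise TimeoutError("dependency too slow")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    # The page still renders, just without recommendations.
    print(breaker.call(flaky_recommendations, fallback=lambda: []))
```

In this demo the third call never reaches the dependency at all: the breaker is already open, so the feature degrades to an empty list instead of holding the request hostage.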

Notice what's not at the top: exotic multi-region active-active architecture. That's a real tool for real five-nines requirements, but it's expensive, it adds operational complexity that itself causes outages, and it's the wrong first dollar for almost everyone. Buy the boring nines first.

Diminishing returns, and where to stop

Each nine costs more than the last and removes less downtime in absolute terms. Going 99% to 99.9% buys back more than three days a year. Going 99.99% to 99.999% buys back about 47 minutes a year, for a price that often exceeds both earlier jumps. At some point the cost of the next nine is wildly more than the downtime it prevents, and chasing it is a vanity project.

The stopping rule is simple: keep buying nines while the cost of the next one is less than the downtime cost it removes, and stop the moment it flips. That crossover point is different for a payments processor and a project-management tool, which is exactly why "as high as possible" is the wrong target. The right target is the nine where the next one stops paying for itself — and you can only find it once you've priced your own downtime.
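
The stopping rule can be sketched as a short loop. Treat this as a simplification under loud assumptions: the annual cost of each step is hypothetical, and the model pretends you consume exactly your downtime budget every year, which real outage patterns will not do. The $150k-per-hour input matches the worked scenario in the next section; `pick_target` and `expected_downtime_hours` are illustrative names.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 24/7 service


def expected_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def pick_target(current_pct: float, steps, cost_per_hour: float) -> float:
    """Walk up the ladder of (next_availability_pct, annual_cost) steps and
    stop at the first step that costs at least as much as the downtime it removes."""
    target = current_pct
    for next_pct, annual_cost in steps:
        hours_saved = expected_downtime_hours(target) - expected_downtime_hours(next_pct)
        if annual_cost >= hours_saved * cost_per_hour:
            break  # the next nine no longer pays for itself
        target = next_pct
    return target


# Hypothetical annual cost of reaching each level from the one before it.
steps = [(99.9, 60_000), (99.95, 150_000), (99.99, 900_000), (99.999, 3_000_000)]

print(pick_target(99.0, steps, cost_per_hour=150_000))  # 99.95
print(pick_target(99.0, steps, cost_per_hour=800))      # 99.9
```

Same ladder, same prices, different answers: the only input that changed is what an hour of downtime costs, which is the whole argument for pricing it first.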

What fixed looks like

You've priced an hour of downtime at roughly $150k. You set a target of 99.95% — derived, not guessed — because the math says the jump to 99.99% would cost more than the downtime it saves at your stage. You spend the first reliability dollars on the boring high-impact things: a database replica with automated failover, staged deploys with fast rollback, real monitoring that pages you before customers do, and circuit breakers on your two flakiest dependencies.

Six months later you have a hardware failure that, a year ago, would have been a four-hour outage and a $600k afternoon. This time failover happens automatically, the blip is ninety seconds, and most customers never notice. You spent a fraction of what an outage costs, you spent it on the right layer, and you can show an investor exactly why your target is the number it is. That's what buying uptime on purpose looks like.

This is for you if

You're a funded founder or engineering leader running a system where outages cost real money, and you've either been burned by an outage you can't afford to repeat or you suspect you're spending on reliability without knowing whether it's the right spending. You want a target you can defend and a roadmap that buys the cheapest nines first.

A reliability-strategy engagement runs $50k+: we price your downtime with you, set a defensible availability target, audit where your outages actually come from, and sequence the work so the cheapest nines land first — then implement them. For teams operating mission-critical systems where the target has to be continuously held and the risk surface keeps moving, we run it as a reliability retainer at $15k–$25k/mo.

This isn't for a pre-revenue product where an hour of downtime costs a rounding error — you don't need to buy nines you can't yet value, and a sensible managed stack is plenty. And it's not for teams that already did this math, set a target, and just want a config tune-up. It's for the team that's been answering "what's your uptime target?" with "as high as possible" and is ready to replace the vibe with a number.