Backups and Disaster Recovery That Actually Work (Not the Kind You Pray Over)

You have backups. The dashboard says so. There's a green checkmark, a nightly job, a retention policy with a number on it. Everybody on the team would tell an investor, with a straight face, that the data is safe.

Now answer one question: when did you last restore one? Not "verify the backup completed." Restore it. Stand up a fresh database from the backup file and confirm the application boots, the data is intact, and the numbers reconcile. For most teams the honest answer is "never." Which means you don't have backups. You have files of unknown contents that you are emotionally attached to.

A backup you've never restored is a prayer, not a strategy. It works exactly as well as praying does — sometimes the universe is kind, and sometimes you find out at 3am that the backup has been silently writing zero-byte files for six weeks because a credential rotated and nobody wired up the alert. The whole point of disaster recovery is that you don't find out on the worst day. You find out on a calm Tuesday, on purpose, in a controlled test.

Here's how we build recovery you can actually bet the company on.

Start with two numbers: RPO and RTO

Before you touch a tool, you write down two numbers, because they drive every decision after.

RPO (Recovery Point Objective) is how much data you can afford to lose: the largest gap you will tolerate between your last good backup and the moment disaster strikes. If you back up nightly at midnight and the database dies at 5pm, you lose everything since midnight: 17 hours of orders, signups, and writes, gone. For a marketing site, fine. For a payments ledger, that number ends the company. RPO is a business decision wearing an engineering costume: how many minutes of lost writes is survivable?

RTO — Recovery Time Objective — is how long you can be down while you recover. It's the wall-clock time from "disaster" to "serving traffic again." If your restore procedure is "page the one person who knows how, hope they're awake, and watch a 400GB dump replay for six hours," your RTO is most of a day. If a thousand B2B customers each lose, say, $4,000 an hour while you're dark, a six-hour RTO is a $24M line item per incident.

You set these two numbers per dataset, deliberately, with the business in the room. Then every choice that follows — backup frequency, replication topology, how much you spend — is just buying your way down to the RPO and RTO you committed to. No numbers, no plan. Just vibes and a green checkmark.
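To make the arithmetic behind those two numbers concrete, here is a minimal sketch that reproduces the figures used above. The function names and the calendar date are illustrative assumptions; the inputs (midnight backup, 5pm failure, a thousand customers at $4,000 an hour, six hours down) come from the examples in this section.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(last_backup: datetime, failure: datetime) -> timedelta:
    """Writes made after the last good backup are lost when you restore from it."""
    if failure < last_backup:
        raise ValueError("failure precedes the last backup")
    return failure - last_backup


def downtime_cost(customers: int, loss_per_customer_hour: float, hours_down: float) -> float:
    """Aggregate customer-side cost of an outage lasting hours_down."""
    return customers * loss_per_customer_hour * hours_down


# Nightly backup at midnight, database dies at 5pm the same day.
# The date itself is arbitrary; only the gap matters.
midnight = datetime(2024, 1, 10, 0, 0)
failure = datetime(2024, 1, 10, 17, 0)
print(worst_case_data_loss(midnight, failure))  # 17:00:00

# A thousand customers losing $4,000 an hour each, dark for six hours.
print(downtime_cost(1_000, 4_000, 6))  # 24000000
```

Plugging in your own backup schedule and your own cost of downtime is usually enough to start the RPO and RTO conversation with the business.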

Tested restores, or it didn't happen

This is the part everyone skips and the part that matters most. A backup is a hypothesis. A tested restore is evidence. The gap between them is where companies die.

We make restores a recurring, automated drill — not a heroic one-time event. On a schedule, a pipeline pulls the latest backup, restores it into a clean, isolated environment, boots the application against it, and runs a battery of checks: does the app start, do row counts match expectations, do a handful of known records reconcile, do foreign keys resolve. If any check fails, it pages someone — now, on a Tuesday, when fixing it is annoying instead of catastrophic.

restore-drill (scheduled):
  pull latest backup
  restore -> isolated env
  boot app, run smoke checks
  assert row counts / checksums / sample records
  on failure -> page on-call (calm Tuesday, not 3am Saturday)
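
The outline above can be sketched in runnable form. This is a toy under stated assumptions, not a production pipeline: an in-memory SQLite database stands in for both production and the isolated environment, and the table names, thresholds, and helpers (make_production, restore_into_clean_env, run_drill) are hypothetical. The shape is the point: restore, time it, assert, and report failures.

```python
import sqlite3
import time


def make_production() -> sqlite3.Connection:
    """Stand-in for production: two tables linked by a foreign key."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            total_cents INTEGER NOT NULL
        );
        INSERT INTO customers VALUES (1, 'a@example.com'), (2, 'b@example.com');
        INSERT INTO orders VALUES (10, 1, 4200), (11, 2, 995);
        """
    )
    return db


def restore_into_clean_env(backup: sqlite3.Connection) -> sqlite3.Connection:
    """Restore the backup into a fresh, isolated database."""
    clean = sqlite3.connect(":memory:")
    backup.backup(clean)  # copies every page of `backup` into `clean`
    return clean


def run_drill(backup, expected_min_rows, known_records):
    """Restore, run checks, and return (failures, restore_seconds)."""
    started = time.perf_counter()
    restored = restore_into_clean_env(backup)
    elapsed = time.perf_counter() - started

    failures = []
    for table, minimum in expected_min_rows.items():
        count = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        if count < minimum:
            failures.append(f"{table}: {count} rows, expected at least {minimum}")
    for sql, expected in known_records:
        if restored.execute(sql).fetchone() != expected:
            failures.append(f"sample record mismatch: {sql}")
    # PRAGMA foreign_key_check returns one row per dangling reference.
    if restored.execute("PRAGMA foreign_key_check").fetchall():
        failures.append("foreign keys do not resolve")
    return failures, elapsed


failures, seconds = run_drill(
    make_production(),
    expected_min_rows={"customers": 2, "orders": 2},
    known_records=[("SELECT total_cents FROM orders WHERE id = 10", (4200,))],
)
if failures:
    print("PAGE ON-CALL:", failures)  # in a real drill this would page someone
else:
    print(f"drill passed, restore took {seconds:.3f}s")
```

The elapsed time is returned alongside the failures on purpose: a drill that passes but takes longer than the RTO is still a finding.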

The drill also produces the one number nobody has until they measure it: how long a restore actually takes. Teams routinely assume their RTO is "an hour or two" and discover, the first time they're forced to do it for real, that a cold restore of their largest table takes five hours and the indexes take three more. Measure it before the disaster, not during.

When we run the first restore drill on an inherited system, it fails more often than it succeeds. Wrong region, corrupt dump, a backup of the schema but not the data, a backup of one database in a system that quietly grew to four. That failure is the deliverable. Finding it on a drill is the entire reason the drill exists.

Point-in-time recovery: rewind to the second before

Nightly snapshots have a ceiling: the best RPO you can commit to is the interval between snapshots, because a failure just before the next one loses nearly a full day. For anything where losing hours of writes is unacceptable, you need point-in-time recovery (PITR), and it changes the math entirely.

PITR works by keeping a base backup plus a continuous stream of every change since — the write-ahead log, the binlog, the transaction log, depending on your engine. To recover, you restore the base and then replay the change log up to a precise moment: 2:47:03pm, one second before the bad migration ran. Your RPO drops from "up to a day" to "seconds," and you gain something snapshots can't give you — the ability to rewind to just before a logical disaster.

Because the real disaster usually isn't hardware. It's a developer running DELETE without a WHERE clause, or a migration that mangles a column across every row, or a bad deploy that corrupts data for three hours before anyone notices. Recovering from a 2pm mistake with last midnight's snapshot throws away 14 hours of good writes along with the bad one. PITR lets you rewind to 1:59pm and lose nothing but the mistake. That capability is the difference between "we had an incident" and "we had an extinction event."
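
The replay idea can be shown with a toy model. This sketch assumes a key-value "database" and a hand-written change log; real engines replay their own write-ahead or binary logs with their own tooling, and the function name restore_to_point_in_time and the log format here are purely illustrative.

```python
from datetime import datetime


def restore_to_point_in_time(base_backup, change_log, target):
    """Rebuild state as of `target` from a base backup plus an ordered change log.

    base_backup: dict of row id -> value, as of the base backup time
    change_log: list of (timestamp, op, row_id, value), oldest first
    """
    state = dict(base_backup)  # never mutate the base backup itself
    for timestamp, op, row_id, value in change_log:
        if timestamp > target:
            break  # stop replaying: everything after the target is unwanted
        if op == "put":
            state[row_id] = value
        elif op == "delete":
            state.pop(row_id, None)
        else:
            raise ValueError(f"unknown op {op!r}")
    return state


day = datetime(2024, 1, 10)  # arbitrary date for the example
base = {1: "alice", 2: "bob"}  # midnight base backup
log = [
    (day.replace(hour=9, minute=15), "put", 3, "carol"),
    (day.replace(hour=13, minute=30), "put", 2, "bob (updated)"),
    # 2:00pm: a DELETE without a WHERE clause wipes every row.
    (day.replace(hour=14), "delete", 1, None),
    (day.replace(hour=14), "delete", 2, None),
    (day.replace(hour=14), "delete", 3, None),
]

# Rewind to 1:59pm: the morning's writes survive, the bad DELETE does not.
recovered = restore_to_point_in_time(base, log, day.replace(hour=13, minute=59))
print(recovered)  # {1: 'alice', 2: 'bob (updated)', 3: 'carol'}
```

Restoring the midnight snapshot alone would return only the base rows, without carol and without the 1:30pm update. The replayed log is what recovers them.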

The runbook: a recovery nobody has to be clever to run

A recovery plan that lives in one engineer's head is not a plan. It's a single point of failure with a pulse, and that person is on a plane the day you need them.

The runbook is the recovery procedure written down so completely that a competent engineer who has never seen your system can execute it under pressure. Exact commands. Exact order. Where the backups live and how to authenticate. How to flip DNS or the connection string to the recovered database. How to verify you're actually recovered before you declare victory. Who to tell, in what order, and what to put on the status page.

Good runbooks are written for the worst version of the person reading them: exhausted, scared, three hours into an outage, with leadership asking for updates every ten minutes. No cleverness required, no judgment calls, no "you'll figure it out." Just steps. And the runbook isn't trustworthy until the restore drill has executed it end to end — because a runbook you've never followed is fiction with good formatting.
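
One way to keep a runbook honest is to give every step an explicit verification, so the reader never has to judge whether a step "worked." The sketch below is an assumption about how that could be modeled, not a prescribed tool: the Step type, execute_runbook, and the placeholder actions that only update a dictionary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    title: str
    action: Callable[[], None]   # the exact command to run
    verify: Callable[[], bool]   # how to know it actually worked


def execute_runbook(steps):
    """Run steps in order; stop at the first one whose verification fails."""
    completed = []
    for number, step in enumerate(steps, start=1):
        step.action()
        if not step.verify():
            return completed, f"step {number} failed verification: {step.title}"
        completed.append(step.title)
    return completed, None


# Placeholder state standing in for real infrastructure.
state = {"restored": False, "dsn": "prod-db"}

steps = [
    Step(
        "Restore backup into the recovery database",
        action=lambda: state.update(restored=True),
        verify=lambda: state["restored"],
    ),
    Step(
        "Point the application at the recovered database",
        action=lambda: state.update(dsn="recovery-db"),
        verify=lambda: state["dsn"] == "recovery-db",
    ),
]

completed, error = execute_runbook(steps)
print(completed, error)
```

Whether the steps live in code or in a document, the same rule applies: no step is done until its check passes.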

The failure modes of untested backups

Here is the catalog of ways "we have backups" turns out to be false, every one of which we've found in the wild:

The silent failure. The job broke weeks ago. The failure alert was never wired, or it was wired, then suppressed during a noisy incident and never re-enabled. The newest "backup" is from last quarter.
The unrestorable file. The backup completes, the file exists, and it's corrupt, truncated, or encrypted with a key nobody can find. Looks like a backup. Restores into garbage.
The partial backup. You backed up the primary database. The system also grew a cache that became a source of truth, a second service with its own store, and a bucket of user uploads — none of which are in the backup. You restore and half the product is missing.
The co-located backup. The backups live in the same account, same region, same blast radius as production. The event that takes out production — a compromised account, a region failure, a fat-fingered delete — takes the backups with it. Backups must live somewhere the disaster can't reach.
The slow restore. The backup is perfect and it takes nine hours to replay. Your RTO was supposed to be one hour. You blew it by 800% and nobody knew until the clock was running.

Every one of these is invisible until you test, and obvious the moment you do.
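
Several of these failure modes can be caught by cheap checks on backup metadata before the restore drill even runs. The sketch below is illustrative only: the BackupRecord fields, the audit function, and the account and region names are assumptions. The size check catches only empty files; corruption and lost encryption keys still need a real restore to detect.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackupRecord:
    created_at: datetime
    size_bytes: int
    account: str
    region: str
    stores: frozenset  # which data stores this backup covers
    last_restore: Optional[timedelta]  # measured by a drill; None = never restored


def audit(backup, now, max_age, prod_account, prod_region, required_stores, rto):
    """Return a list of findings; an empty list means none of these checks fired."""
    findings = []
    if now - backup.created_at > max_age:
        findings.append("silent failure: newest backup is stale")
    if backup.size_bytes == 0:
        findings.append("unrestorable file: backup is empty")
    missing = sorted(set(required_stores) - set(backup.stores))
    if missing:
        findings.append(f"partial backup: not covered: {', '.join(missing)}")
    if backup.account == prod_account or backup.region == prod_region:
        findings.append("co-located backup: shares account or region with production")
    if backup.last_restore is None:
        findings.append("untested: this backup has never been restored")
    elif backup.last_restore > rto:
        over = (backup.last_restore - rto) / rto * 100
        findings.append(f"slow restore: {over:.0f}% over RTO")
    return findings


now = datetime(2024, 1, 10, 9, 0)  # arbitrary timestamp for the example
backup = BackupRecord(
    created_at=now - timedelta(hours=8),
    size_bytes=400 * 10**9,
    account="backup-account",
    region="eu-west-1",
    stores=frozenset({"primary-db"}),
    last_restore=timedelta(hours=9),
)
for finding in audit(backup, now, timedelta(hours=26), "prod-account", "us-east-1",
                     {"primary-db", "user-uploads"}, rto=timedelta(hours=1)):
    print(finding)
# partial backup: not covered: user-uploads
# slow restore: 800% over RTO
```

The nine-hour restore against a one-hour RTO reproduces the 800% overshoot from the list above.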

What fixed looks like

A migration goes wrong at 2:47pm on a Wednesday and silently corrupts a core table. A customer notices first, files a ticket, and your team opens the runbook. They restore the database to 2:46:30pm using point-in-time recovery, following the procedure they've drilled a dozen times, from backups that live in a separate account the migration couldn't touch. The restore takes 40 minutes, the number they measured last month rather than a number they're guessing at. The reconciliation checks pass. They flip the connection string, verify, and announce all-clear.

Total data loss: the thirty seconds of writes between the last clean state and the corruption. Total downtime: well inside the RTO they committed to. Nobody improvised. Nobody was a hero. The disaster was a bad afternoon, not a near-death experience, because the recovery had been rehearsed until it was boring.

This is for you if

You're a founder or engineering leader running a system where the data is the company — a ledger, a system of record, customer data you're contractually on the hook for — and you have backups you've never actually restored. You want to know, with evidence rather than faith, that you can recover.

A disaster-recovery engagement runs $50k+: we set RPO and RTO targets with you per dataset, implement point-in-time recovery, build automated restore drills that prove the backups work, write the runbook, and move backups out of production's blast radius — then run a full recovery against your real data to prove the whole thing end to end. For teams where recoverability is a standing requirement, we hold it as a reliability retainer at $15k–$25k/mo, drilling restores on a schedule so the capability never quietly rots.

This isn't for a pre-launch product with no data worth recovering and no users to lose — a managed database with default snapshots is plenty until you have something to protect. And it's not for teams that already run tested, automated restores and just want a second opinion on retention. It's for the team that has backups, has never restored one, and has finally admitted that the green checkmark is a prayer.