You need to rename a column, split a table, or change a type. The migration tool will do it in one command. The problem is that one command takes a lock, and on a table with eighty million rows that lock means every write blocks for the duration — anywhere from forty seconds to forty minutes — and during that window your application is, for all practical purposes, down. So now you're scheduling a "maintenance window" at 2am Sunday, emailing customers, and praying the migration finishes before traffic returns.
For a B2B product with customers in multiple time zones, there is no 2am. Someone is always working. A maintenance window isn't a neutral inconvenience; it's lost revenue, a broken trust signal, and a precedent that says your platform goes dark when you change it. For software built for the worst day, "we have to take it down to fix it" is the thing you designed away.
The good news: nearly every schema change can be done with zero downtime. It just can't be done in one step. Here's the playbook we run.
The core idea: expand, migrate, contract
The reason a naive migration needs downtime is that it tries to change the schema and the data and the application code all at the same instant. The old code expects the old schema; the new code expects the new schema; and for the moment of the migration, reality has to be exactly one of those. That's a synchronized cutover, and synchronized cutovers need everything to stop.
Expand/contract breaks that single dangerous step into a sequence of individually safe ones, built on one rule: at every point in time, the database schema is compatible with both the currently-running code and the code about to deploy. If old and new code can both run against the schema as it exists right now, you never need to stop anything. You roll forward one compatible step at a time.
It has three phases:
Expand. Add the new structure alongside the old. Additive only — a new nullable column, a new table, a new index. Nothing existing is touched, so all running code keeps working.
Migrate. Move the data and shift the application to use the new structure, while keeping the old one populated. Both shapes are valid and both are maintained.
Contract. Once nothing reads or writes the old structure, remove it. By now it's dead weight, and dropping it is safe because no code depends on it.
The cost is that one logical change becomes three-to-five deploys spread over hours or days. That's the trade: a sequence of boring deploys instead of one terrifying window. We take boring every time.
Worked example: changing a column
Say you need to change users.full_name (one text field) into first_name and last_name. The one-step version locks the table and breaks every running instance the moment it lands. Here's the same change with nobody noticing.
1. expand add nullable first_name, last_name (old code unaffected)
2. dual-write code writes BOTH full_name AND the new pair
3. backfill populate first/last for existing rows in batches
4. migrate-reads code reads from first/last, ignores full_name
5. contract drop full_name (now unreferenced)
Step 2 is the load-bearing one. From the deploy that introduces dual-writes onward, every new and updated row has correct data in both the old and the new columns. That means at no point is there a row the running code can't read correctly, regardless of which version of the code touches it. The backfill handles the rows that existed before dual-writes started; dual-writing handles everything after. Together they guarantee complete, correct data in the new columns before any code starts depending on them.
Each step is independently deployable and independently reversible. If step 4 surfaces a bug, you roll the read path back to full_name, which is still populated and still correct, and nobody outside the team ever knew.
Backfilling without melting the database
The backfill — populating the new structure for rows that already exist — is where teams reintroduce the downtime they just worked to avoid. The naive version is a single UPDATE across the whole table, which locks millions of rows, holds one enormous transaction, and bloats the database while blocking live writes. You replaced a schema lock with a data lock.
Backfill in batches. Process a few thousand rows per transaction, commit, pause briefly, and repeat — driven off the primary key so each batch is an indexed range, not a scan. A batched backfill is interruptible (crash at batch 4,000 of 20,000 and you resume from there), it never holds a long lock, and the pause between batches keeps replication lag and live write latency under control. It takes longer in wall-clock time. It costs nothing in availability, which is the entire point.
The rule: the backfill runs behind live traffic, never in front of it. It should be invisible on the latency graph. If your backfill spikes P99, it's running too aggressively — slow it down. Time is cheap; an outage is not.
Building indexes online
Adding an index is the migration that looks free and isn't. A plain CREATE INDEX takes a lock that blocks writes to the table for the entire build, which on a large table is exactly the downtime you're avoiding everywhere else.
Postgres gives you CREATE INDEX CONCURRENTLY, which builds the index without blocking writes. It's slower and it makes two passes over the table, but it lets the application keep writing the whole time. There are sharp edges — CONCURRENTLY can't run inside a transaction block, and a failed build leaves an invalid index you must drop and rebuild — which is exactly why this belongs in a deliberate playbook and not in whatever your migration framework does by default. A lot of frameworks wrap every migration in a transaction, which silently prevents the concurrent build. Knowing that ahead of time is the difference between a clean rollout and an accidental lock at peak hours.
The constraint trap
The same lock danger hides in constraints, and it's the one that catches careful teams. Adding a NOT NULL or a foreign key the obvious way makes Postgres scan and lock the whole table to validate every existing row. On a big table, that's a long lock at the worst time.
The two-step pattern: add the constraint as NOT VALID first, which is a fast metadata-only change that starts enforcing the rule for new writes immediately without scanning anything. Then run VALIDATE CONSTRAINT separately, which checks the existing rows under a much weaker lock that doesn't block writes. Same end state — a fully enforced constraint — reached without ever locking the table against live traffic. The shape is the same philosophy as everything else here: split the scary atomic operation into a fast safe part and a slow non-blocking part.
The failure modes, named
Knowing where this goes wrong is most of running it safely.
Skipping dual-writes. Backfilling and then cutting reads over without writing to the new structure in between means every row created during the migration is missing from the new columns. You ship, and the newest data — the data that matters most — is silently wrong. Dual-write first, always.
Long-running transactions. A migration that opens a transaction and holds it blocks vacuum and can wedge replication. Keep migration transactions short; batch everything.
Irreversible steps too early. The contract phase — dropping the old column — is the one step you can't cleanly undo. Run it last, days after the new path has proven itself in production, not in the same deploy that introduced the new path. Once the data's gone, your rollback is gone with it.
The framework fighting you. Default migration tooling wraps things in transactions, picks blocking lock modes, and runs backfills inline. Treat every generated migration as a draft to be reviewed against these patterns, not as something safe to run because the tool produced it.
What fixed looks like
A schema change ships as a sequence of small, reversible, individually-deployable steps, and the maintenance email never gets sent. The expand phase adds new structure additively. Dual-writes keep old and new shapes consistent from the first deploy onward. A batched backfill fills the history behind live traffic, invisible on the latency graph. Reads cut over only once the new data is provably complete, with the old path still intact as an instant rollback. The contract phase removes the old structure days later, after the new one has earned trust.
Indexes build concurrently. Constraints validate without locking. No 2am window, no customer email, no held breath. The database changes shape underneath a system that never stops serving — which is the only acceptable answer when someone is always working. We've built migration pipelines like this for multi-tenant systems where a single locked table would have meant every customer down at once. How we architect data layers built to evolve →
This is for you if
You're running a funded production system with real data volume and customers who don't all sleep at the same time, and you've hit the point where schema changes are scary — where "we need a maintenance window" has entered the vocabulary and you don't want it to stay. You want to evolve the database without ever taking it offline.
A migration-safety engagement runs $50k+: we build the expand/contract playbook into your deploy process, retrofit batched backfills and online index/constraint patterns, and run the next few high-risk migrations with you so the team owns the muscle memory. For a system mid-flight toward a re-shape of the core data model — splitting tables, changing keys, large-scale type changes — that program runs $100k+.
It's not for you if you're pre-launch with a small dataset and no live traffic — at that stage you can take the lock, eat the two seconds, and move on. It's for the team where a locked table means lost revenue and a trust hit, and "we have to take it down to change it" is no longer an acceptable sentence.
// migrate live or don't migrate
< transmit >