It's 2am. The site is down, or slow, or doing something wrong with customer money. A founder sees it first (or worse, a customer does) and starts texting engineers one by one. The first two don't answer. The third wakes up, has no idea if it's their problem, and spends forty minutes just figuring out what is broken before anyone can fix it. By the time it's resolved, three people are awake, nobody has slept, and there's no record of what happened, so it'll go exactly this way next time.
This is the default state of nearly every team under ten engineers, and the reason is understandable: you think incident response is a heavyweight thing for big companies with a PagerDuty bill and a dedicated SRE team. It isn't. The heavyweight version is for big companies. The lightweight version, which is mostly a few documents and one agreement, is for you, and having it can be the difference between a 20-minute incident and a 2-hour one.
What the chaos actually costs
The cost of no incident process isn't just the downtime. It's everything stacked on top.
There's the time-to-clarity tax: the first 30 to 60 minutes of every incident spent figuring out who's responding, what's broken, and whether it's even your fault versus a vendor's. For a team that hits, say, one real incident a month, that's a recurring hour of your most expensive people thrashing before any actual fixing starts.
There's the burnout tax: when there's no on-call rotation, everyone is implicitly always on call. Every engineer half-watches Slack on weekends because they're not sure someone else will catch it. That ambient dread is a tax on your whole team's life, paid every day, not just incident days.
And there's the repeat tax: with no postmortem, the same outage recurs. The database fills its disk in March, you scramble, you fix it, and it fills again in July because nobody wrote down "set an alert at 80%." You're paying for the same incident two or three times.
Put a number on a single bad night: three senior engineers awake for three hours, plus the next day's lost productivity from sleep debt, plus the customer who churned over the outage. Call it a $10k night. The fix costs less than one such night and prevents most of the rest.
Severity levels so you stop overreacting and underreacting
Before anything else, agree on what counts as an incident and how bad it is. Without this, everything is either ignored or treated as a five-alarm fire. Three levels is enough for a small team:
SEV1 — customer-facing and severe. The product is down, data is being lost or corrupted, or money is moving wrong. Wake people up. Drop everything.
SEV2 — degraded but functioning. Something's broken but there's a workaround, or it affects a subset of customers. Respond promptly during waking hours; don't necessarily wake anyone at 3am.
SEV3 — annoying, not urgent. A non-critical job failed, a dashboard is wrong. File it, fix it in normal hours.
The point of severity levels isn't bureaucracy. It's permission — permission to not wake up for a SEV3, and permission to absolutely wake up for a SEV1 without second-guessing. That clarity is what lets people actually rest, because they know what will and won't reach them.
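To make the three levels concrete, here is a minimal sketch of how a team might encode them. The `Report` fields and the `classify` function are hypothetical names chosen for illustration; your own criteria will differ, and the point is only that the decision is written down once instead of argued at 2am.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "customer-facing and severe"
    SEV2 = "degraded but functioning"
    SEV3 = "annoying, not urgent"


@dataclass
class Report:
    """Hypothetical facts known when a problem is first reported."""
    product_down: bool = False
    data_lost_or_corrupted: bool = False
    money_moving_wrong: bool = False
    customers_affected: bool = False  # some customers hit, or a workaround exists


def classify(report: Report) -> Severity:
    # Any of the three severe conditions makes it a SEV1.
    if report.product_down or report.data_lost_or_corrupted or report.money_moving_wrong:
        return Severity.SEV1
    # Broken for customers, but survivable: SEV2.
    if report.customers_affected:
        return Severity.SEV2
    # Everything else (failed non-critical job, wrong dashboard) is SEV3.
    return Severity.SEV3
```

A failed nightly report with no customer impact comes out as `Severity.SEV3`, which is exactly the permission not to wake up.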
On-call that one person can hold
For a team under ten, on-call is one engineer at a time, on a weekly rotation, with a clear secondary as backup. That's the whole structure. The primary is the single known answer to "who responds," so nobody texts five people hoping one is awake — the alert goes to one phone, and that person either fixes it or escalates to the secondary.
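A weekly rotation with a secondary is simple enough to compute from a list of names and a start date. The sketch below assumes a hypothetical four-person team and a rotation that began on Monday, 1 January 2024; the secondary is simply next week's primary, which is one common convention and not the only one.

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol", "dan"]  # hypothetical team
ROTATION_START = date(2024, 1, 1)             # assumed start; a Monday


def on_call(day, engineers=ENGINEERS, start=ROTATION_START):
    """Return (primary, secondary) for the week containing `day`."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    weeks_elapsed = (day - start).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    # Secondary is next week's primary, so they warm up before taking the pager.
    secondary = engineers[(weeks_elapsed + 1) % len(engineers)]
    return primary, secondary
```

Calling `on_call(date(2024, 1, 8))` with these defaults gives `("bob", "carol")`. Publishing the output somewhere everyone can see it is what makes the primary the single known answer.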
Two non-negotiables make this humane. First, only SEV1 pages out of hours. If everything pages at 3am, on-call is punishment and people quit. Tune your alerts so the thing that wakes someone is genuinely worth waking for — and treat a false page as a bug to fix, not a cost of doing business. Second, on-call is real work, not extra work. The person holding the pager does lighter feature work that week; you don't expect a full sprint output from someone who might be up at 2am. Treat it as free overtime and your best engineers will route around it.
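The "only SEV1 pages out of hours" rule can be written as a small routing function. This is a sketch under stated assumptions: the waking-hours window (08:00 to 22:00 local time) and the action names `page`, `queue_for_morning`, and `ticket` are made up for illustration, and paging for SEV2 during the day is one reading of "respond promptly."

```python
from datetime import datetime, time

WAKING_START = time(8, 0)   # assumed start of waking hours
WAKING_END = time(22, 0)    # assumed end of waking hours


def route(severity: str, now: datetime) -> str:
    """Decide what an alert of the given severity does at time `now`."""
    waking = WAKING_START <= now.time() < WAKING_END
    if severity == "SEV1":
        return "page"  # always wakes the primary
    if severity == "SEV2":
        # Prompt response in waking hours, but never a 3am phone call.
        return "page" if waking else "queue_for_morning"
    if severity == "SEV3":
        return "ticket"  # file it, fix it in normal hours
    raise ValueError(f"unknown severity: {severity!r}")
```

If an alert that routes to `page` at 3am turns out not to be worth waking for, the fix is to change its severity, not to tolerate the page.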
You don't need expensive tooling to start. You need one alerting integration that can call a phone, a rotation everyone can see, and an escalation rule. The tooling is the easy part. The agreement is the part that matters.
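An escalation rule can be as small as the function below. It is a sketch, not any particular tool's behavior: the 10-minute acknowledgement timeout is an assumed default, and `pages_to_send` is a hypothetical name. Most alerting products let you configure the same idea without writing code.

```python
def pages_to_send(minutes_since_alert: float, acknowledged: bool,
                  primary: str, secondary: str,
                  ack_timeout_minutes: float = 10) -> list:
    """Return who should be getting paged right now for an open alert."""
    if acknowledged:
        return []  # someone has it; stop paging
    if minutes_since_alert < ack_timeout_minutes:
        return [primary]  # one phone, one known responder
    # Primary has not acknowledged in time: bring in the backup as well.
    return [primary, secondary]
```

The value of writing it down is that nobody has to decide, mid-incident, whether it is "okay" to wake the secondary.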
Runbooks: the difference between 20 minutes and 2 hours
A runbook is a short document that says: when this alert fires, here's what it probably means, here's how to confirm, here's how to fix it, here's how to escalate. That's it. Not a novel — a checklist the half-asleep on-call engineer can follow without thinking from scratch.
The highest-leverage runbook is the one for your most common failure. Database connections exhausted. Disk filling up. The third-party API that goes down monthly. Payment processor returning errors. For each, write the four lines: symptom, how to confirm, how to remediate, when to escalate. The on-call person at 2am should not be deriving the fix from first principles — they should be executing a known recovery while the system is on fire.
// runbook: db connections exhausted
symptom: 500s + "too many connections" in logs
confirm: check active connection count
remediate: restart the connection-leaking worker; if no relief, scale the pool
escalate: no recovery in 15 min → page secondary
Build these incrementally. Every time you have an incident, the postmortem produces the next runbook. Within a few months your most common failures all have one, and your incidents go from forensic investigations to checklist executions.
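If you want runbooks to be findable from the alert itself, one option is to keep them as structured data keyed by alert name. The sketch below reuses the runbook text from above; the `Runbook` class, the `RUNBOOKS` dictionary, the alert key, and the fallback message are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    symptom: str
    confirm: str
    remediate: str
    escalate: str


# One entry per alert; each postmortem adds the next one.
RUNBOOKS = {
    "db_connections_exhausted": Runbook(
        symptom='500s + "too many connections" in logs',
        confirm="check active connection count",
        remediate="restart the connection-leaking worker; if no relief, scale the pool",
        escalate="no recovery in 15 min → page secondary",
    ),
}


def render(alert_name: str) -> str:
    """Return the checklist for an alert, or a clear fallback if none exists yet."""
    runbook = RUNBOOKS.get(alert_name)
    if runbook is None:
        return f"No runbook for {alert_name!r} yet: escalate, then write one in the postmortem."
    return "\n".join([
        f"symptom: {runbook.symptom}",
        f"confirm: {runbook.confirm}",
        f"remediate: {runbook.remediate}",
        f"escalate: {runbook.escalate}",
    ])
```

Including the output of `render(...)` in the page itself means the on-call engineer reads the checklist before they have finished waking up.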
Blameless postmortems — the part that compounds
Every SEV1 and SEV2 gets a short writeup, within a day or two, while memory is fresh. Not to assign blame — to extract the lesson. The format is five lines: what happened, the timeline, the root cause, the impact, and the action items with owners and dates.
The word blameless is doing real work. The moment a postmortem becomes about who screwed up, two things happen: people stop volunteering what actually went wrong, and your findings stay shallow ("Dave pushed a bad deploy") instead of deep ("we have no staging environment that catches this, and our deploy has no automated rollback"). The bad deploy is never the root cause. The system that let a bad deploy reach production without a safety net is the root cause, and you only find it in a room where nobody's afraid to say it.
The output that matters is the action items. A postmortem with no action items is a diary entry. A postmortem that produces "add a disk-usage alert at 80%, owner Sarah, by Friday" is what stops the March incident from recurring in July. This is the loop that makes your reliability improve over time instead of staying flat.
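The five-line format maps directly onto a small record type, and the "no action items means diary entry" rule becomes a check you can run. This is an illustrative sketch: the class and field names are assumptions, and a shared document with the same five headings works just as well.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    what_happened: str
    timeline: list
    root_cause: str
    impact: str
    action_items: list = field(default_factory=list)

    def problems(self) -> list:
        """List reasons this postmortem is not yet done; empty means complete."""
        issues = []
        if not self.action_items:
            issues.append("no action items")
        for item in self.action_items:
            if not item.owner.strip():
                issues.append(f"action item without owner: {item.description}")
        return issues
```

A postmortem whose only action item is `ActionItem("add a disk-usage alert at 80%", "Sarah", date(2024, 3, 8))` (the date is made up) passes the check; one with an empty list does not.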
What to automate first
You don't automate everything. You automate the highest-frequency, lowest-judgment toil first.
Automate detection before anything else: the system should tell you it's broken before a customer does. Then automate the safe, repetitive remediations: restart a wedged worker, fail over to a replica, roll back the last deploy. Anything you've done by hand three times following a runbook is a candidate for a one-button action. Leave the high-judgment calls, like "do we fail over the whole database," to humans. The goal isn't a self-healing system; it's removing the routine 2am toil so the human's brain is free for the parts that actually need a brain.
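Detection can start very small. The sketch below is the 80% disk alert from the postmortem example, using Python's standard `shutil.disk_usage`; the function names and the default path are assumptions, and in practice your monitoring agent most likely has this check built in.

```python
import shutil


def disk_usage_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(fraction_used: float, threshold: float = 0.80) -> bool:
    """True once usage reaches the threshold, so you hear about it before it hits 100%."""
    return fraction_used >= threshold


if __name__ == "__main__":
    used = disk_usage_fraction("/")
    if should_alert(used):
        # Replace this print with whatever sends a notification in your setup.
        print(f"disk usage at {used:.0%}, above the 80% threshold")
```

Run on a schedule, this kind of check is what turns "the disk filled again in July" into a daytime ticket.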
What fixed looks like
Fixed is a 2am alert going to one known phone, that person opening the runbook for that specific alert, executing a known recovery, and being back asleep in 20 minutes — with a one-page postmortem the next day that produces an action item so it doesn't recur.
Fixed is severity levels that let people rest, an on-call rotation that's one humane week at a time, and alerts tuned so the only thing that wakes someone is worth waking for. Fixed is your team not half-watching Slack every weekend, because they know exactly what will reach them and what won't.
Fixed is incidents getting rarer and shorter over time, because every one feeds a runbook and a postmortem, and the loop compounds.
This is for you if
You're a funded team under ten engineers carrying real production load with no incident process, running on adrenaline and group texts. We set up lightweight incident response tuned to a small team: severity levels, on-call rotation, alerting, the first runbooks, the postmortem habit. It's typically $25k+ to stand up the process and your highest-priority runbooks, and $50k+ when it comes with an observability and alerting overhaul so the system actually detects its own failures. Ongoing reliability and on-call coverage from a senior team runs $100k+ annually if you want us holding the pager alongside you.
This is not for you if you already have a working rotation and runbooks and just had one bad week — one bad week isn't a process problem. It's not for you if you're 40 engineers and need a real SRE function with SLOs and error budgets, which is a heavier build than this. And it's not for you if the actual issue is that your system fails constantly because it's architecturally fragile — no incident process saves a system that needs ten responses a week. That's a different engagement, and we'll tell you so.