Observability: What to Instrument Before You Have Load

Production is degrading and the team is guessing because nothing is instrumented. The three pillars at startup scale and what to wire up first.

Production is slow this week and your team is in a thread arguing about why. One person thinks it's the database. Another thinks it's a third-party API. A third is "pretty sure" it started after Tuesday's deploy. Nobody knows, because nobody can look. There's no dashboard that answers the question. There's no log you can query. There are just opinions, and the customer who opened the ticket is still waiting.

This is what flying blind feels like, and the trap is that it feels survivable right up until it isn't. At low load, you can SSH into the box and eyeball it. You can grep a log file. You can restart the service and the problem goes away and you tell yourself you'll figure it out later. Then your first real scaling event arrives — a launch, a big customer, a press hit — and the eyeball-and-restart approach stops working at exactly the moment it matters most.

The cost of guessing during your first scaling event

The expensive part of no observability isn't the day-to-day friction. It's the specific, predictable moment when load arrives and you can't see.

Your first scaling event is when latency starts climbing under real traffic. Without instrumentation, debugging it is archaeology — you're reconstructing what happened from customer complaints and gut feel, while it's actively happening and getting worse. A problem that a single trace would have located in ten minutes instead takes two engineers a full day, during the exact window when the product is making its first impression on the customers you fought hardest to get.

Put a number on it. A scaling-event outage during a launch costs you the launch — the conversion you were going to get from that traffic, which you don't get a second shot at. Add the engineering time: two senior engineers debugging blind for a day is roughly $2k in fully-loaded cost, and that's the cheap part. The expensive part is the deal that churned because your demo timed out, or the enterprise prospect who watched your dashboard hang during the eval.

Instrumentation is cheap insurance against an expensive event. The question is what to wire up — because over-instrumenting early is its own failure mode.

The three pillars at startup scale

Observability has three pillars: logs, metrics, and traces. At a 500-person company you run all three with a dedicated platform team. At a startup, you implement a deliberately thin slice of each. The goal is not coverage. The goal is being able to answer three questions fast: is it broken, where is it broken, and what exactly happened.

Logs answer "what exactly happened." The single highest-return change most early teams can make is moving from string logs to structured logs (JSON with consistent fields) and attaching a request ID to every log line in a request's lifecycle. That is the difference between grepping for a substring and running a query. Structured logs let you ask "show me every error for customer 4412 in the last hour, in order" and get an answer. String logs let you scroll.

// good
{ "level": "error", "request_id": "req_8f2a", "user_id": 4412, "msg": "payment_declined", "provider_latency_ms": 4200 }
// useless under load
"ERROR: payment failed"

Metrics answer "is it broken." These are the time-series numbers you watch on a dashboard. At startup scale you need a small, boring set: request rate, error rate, and latency percentiles (p50, p95, p99) per endpoint. Plus the saturation signals for your scarce resources — database connection pool usage, queue depth, memory. That's most of the value. You do not need 200 custom business metrics on day one. You need to know that the error rate jumped and the p99 doubled, and you need to know it without anyone telling you.
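To make the "small, boring set" concrete, here is a minimal sketch of how rate, error rate, and latency percentiles fall out of raw request samples. It assumes a nearest-rank percentile and treats any 5xx status as an error; the function names and the sample data are made up for illustration. In practice a metrics library or backend would compute these for you.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted list (pct is an int, 1-100)."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of pct/100 * n, avoiding float rounding surprises.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def summarize(requests, window_seconds):
    """requests: list of (latency_ms, status_code) tuples for one endpoint."""
    latencies = sorted(latency for latency, _ in requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Made-up sample: 100 requests in a 10 second window, latencies 1..100 ms,
# with the two slowest requests failing.
sample = [(ms, 500 if ms > 98 else 200) for ms in range(1, 101)]
summary = summarize(sample, window_seconds=10)
```

The point of the percentiles is visible even in toy data: the median says everything is fine while the p99 sits right next to the failing requests.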

Traces answer "where is it broken." A trace follows a single request across every service and database call it touches, with timing on each span. This is the pillar teams skip because it's the most setup, and it's also the one that pays off hardest during a scaling event, because it's the only one that points directly at the bottleneck. When a request takes 4 seconds, a trace tells you whether 3.8 of those seconds were one slow query, an external API, or N+1 calls in a loop. The other two pillars tell you something is wrong. The trace tells you where.
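A rough sketch of the idea, not a real tracing library: each span records a name and a duration, and the slowest span is where you look first. The `Trace` class, the span names, and the clock readings below are all hypothetical, and spans here are flat rather than nested. A real setup would normally use a tracing SDK instead of hand-rolled timers.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects flat (name, duration) spans for a single request."""

    def __init__(self, request_id, clock=time.perf_counter):
        self.request_id = request_id
        self.clock = clock
        self.spans = []  # list of (name, duration_seconds)

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped code raised.
            self.spans.append((name, self.clock() - start))

    def slowest(self):
        return max(self.spans, key=lambda s: s[1])


# Made-up clock readings so the example is deterministic; real code would
# keep the default time.perf_counter.
readings = iter([0.0, 0.1, 0.1, 3.9, 3.9, 4.0])
trace = Trace("req_8f2a", clock=lambda: next(readings))

with trace.span("auth_check"):
    pass
with trace.span("orders_query"):
    pass
with trace.span("render_response"):
    pass

slowest_name, slowest_seconds = trace.slowest()
```

With these readings, the 4-second request resolves to one span that took about 3.8 seconds, which is the kind of answer the other two pillars can't give you.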

What to instrument first

Order matters, because attention is the scarce resource. Here's the sequence that gets you the most visibility per hour spent.

First: errors, captured and searchable. Wire up error tracking that captures every unhandled exception with a stack trace, the request context, and a count of how often it's happening. This is the highest-return single thing you can do, and it's an afternoon of work. The moment an error tracker is live, you stop finding out about bugs from customers and start finding out about them from the tool — usually before the customer notices.
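What an error tracker does under the hood can be sketched in a few lines: group exceptions by a fingerprint, count them, and keep the stack trace and request context. This is an illustration only; the `ErrorTracker` class and its fingerprint scheme (exception type plus the innermost function) are assumptions. In practice you'd install a hosted or open-source tracker rather than build one.

```python
import traceback
from collections import Counter


class ErrorTracker:
    """Toy in-memory tracker: counts errors by fingerprint, keeps the latest sample."""

    def __init__(self):
        self.counts = Counter()
        self.latest = {}

    def capture(self, exc, context):
        frames = traceback.extract_tb(exc.__traceback__)
        # Assumption: exception type + innermost frame is a good enough grouping key.
        where = f"{frames[-1].filename}:{frames[-1].name}" if frames else "unknown"
        fingerprint = f"{type(exc).__name__}@{where}"
        self.counts[fingerprint] += 1
        self.latest[fingerprint] = {
            "stack": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
            "context": context,
        }
        return fingerprint


def charge(user_id):
    # Hypothetical handler that always fails, standing in for a real bug.
    raise TimeoutError("provider timed out")


tracker = ErrorTracker()
for request_id in ("req_1", "req_2", "req_3"):
    try:
        charge(4412)
    except Exception as exc:
        fingerprint = tracker.capture(
            exc, {"request_id": request_id, "user_id": 4412}
        )
```

Three failures collapse into one entry with a count of three, which is the difference between an inbox of stack traces and a ranked list of what's actually broken.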

Second: structured logs with request IDs. Make every log line queryable and correlatable. This is what turns a 3am incident from "scroll through 40,000 lines" into "filter to this one request."
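One way this could look with nothing but the Python standard library is sketched below. The `JsonFormatter` class, the `request_id_var` context variable, and the `fields` convention for extra data are illustrative choices, not a standard. A web framework would set the request ID in middleware rather than by hand.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the current request's ID.
request_id_var = contextvars.ContextVar("request_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "request_id": request_id_var.get(),
            "msg": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} is merged into the line.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def lines_for_request(raw_log, request_id):
    """The 'filter to this one request' query, over newline-delimited JSON."""
    records = [json.loads(line) for line in raw_log.splitlines() if line]
    return [r for r in records if r["request_id"] == request_id]


buffer = io.StringIO()  # stands in for a log file or log shipper
logger = build_logger(buffer)

request_id_var.set("req_8f2a")
logger.info("checkout_started", extra={"fields": {"user_id": 4412}})
request_id_var.set("req_77c1")
logger.info("checkout_started", extra={"fields": {"user_id": 9001}})
request_id_var.set("req_8f2a")
logger.error(
    "payment_declined",
    extra={"fields": {"user_id": 4412, "provider_latency_ms": 4200}},
)

incident = lines_for_request(buffer.getvalue(), "req_8f2a")
```

Interleaved lines from two requests come back as one ordered story for the request you care about. At real volume the filter would be a query in your log store, but the shape of the data is what makes that query possible.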

Third: the golden-signal dashboard. One dashboard, rate / errors / latency per endpoint plus your saturation metrics. It should answer "is the system healthy right now" at a glance. If a new engineer can't tell from one screen whether prod is on fire, the dashboard isn't done.

Fourth: tracing on the critical path. You don't need to trace everything. Trace the request paths that matter — checkout, the core API call, whatever the product lives or dies on. That's where the first scaling-event bottleneck will appear.

Notice what's not on this list: custom business analytics, per-feature event tracking, elaborate uptime SLA dashboards. Those are real, and they come later. Instrument for debugging before you instrument for reporting.

SLIs and SLOs that actually matter

An SLI (service level indicator) is a thing you measure. An SLO (service level objective) is the target you hold it to. Early teams either skip these entirely or over-engineer them into a compliance exercise. Neither is right.

Pick two or three SLIs that map to user pain. For most products that's: availability (fraction of requests that succeed), latency (fraction of requests faster than a threshold such as 300ms, with an objective like 95% attached to it), and if you run background work, freshness (how stale the data the user sees can be). That's it. Three numbers that, if they're healthy, mean users are having a good time.
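Each of these SLIs is a simple ratio or difference. The sketch below assumes request records with `status` and `latency_ms` fields, counts only 5xx responses as failures, and uses a 300ms threshold; all of those are illustrative choices you'd adapt to your product.

```python
def availability(requests):
    """Fraction of requests that succeeded (assumption: 5xx means failure)."""
    ok = sum(1 for r in requests if r["status"] < 500)
    return ok / len(requests)


def latency_sli(requests, threshold_ms=300):
    """Fraction of requests faster than the threshold."""
    fast = sum(1 for r in requests if r["latency_ms"] < threshold_ms)
    return fast / len(requests)


def freshness_seconds(now, last_successful_run):
    """How stale background-produced data is, in seconds."""
    return now - last_successful_run


# Made-up window of 20 requests: one server error, two slow responses.
window = (
    [{"status": 200, "latency_ms": 120} for _ in range(17)]
    + [{"status": 200, "latency_ms": 450}]
    + [{"status": 200, "latency_ms": 800}]
    + [{"status": 503, "latency_ms": 90}]
)

sli_availability = availability(window)
sli_latency = latency_sli(window)
sli_freshness = freshness_seconds(now=1_000_600, last_successful_run=1_000_000)
```

Here availability is 19 of 20 requests and the latency SLI is 18 of 20. Whether those numbers are acceptable is the SLO's job, not the SLI's.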

Then set honest objectives. "99.9% availability" sounds nice and is often a lie at startup scale. Pick a target you can actually hold and that you'd actually page someone for missing. The point of an SLO is not to look good in a deck — it's to define the line between "acceptable" and "wake someone up," so alerting has a basis other than vibes.
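A quick bit of arithmetic shows why the number of nines deserves honesty. The sketch below assumes a 30-day window and, for simplicity, treats availability as uptime rather than as a fraction of requests; the helper name is made up.

```python
def allowed_downtime_minutes(target, window_days=30):
    """Minutes of total outage a given availability target permits."""
    window_minutes = window_days * 24 * 60
    return (1 - target) * window_minutes


# Compare a few candidate objectives over a 30-day window.
budgets = {
    target: allowed_downtime_minutes(target)
    for target in (0.99, 0.995, 0.999)
}
```

Under these assumptions, 99.9% leaves roughly 43 minutes a month, while 99% leaves a little over seven hours. One bad deploy with a slow rollback can use up the former on its own, which is worth knowing before you promise it.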

Alert fatigue is a load-bearing problem

The fastest way to make observability useless is to alert on everything. A team that gets 30 alerts a day stops reading alerts, and then the one alert that mattered scrolls past at 2am unread. Alert fatigue isn't an annoyance — it's the mechanism by which a fully-instrumented system fails to catch the outage it was built to catch.

The rule: alert on symptoms, not causes, and only on things a human must act on now. Alert on "error rate is above threshold" and "p99 latency doubled" — user-visible symptoms. Don't alert on "CPU is at 70%," because high CPU is not, by itself, a problem a human needs to fix at 2am. Every alert should be actionable: if the response to an alert is "ignore it, it clears itself," it should not be an alert. It should be a dashboard line, or it should be deleted.
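The rule translates into very little code. This sketch pages on the two symptoms named above and deliberately ignores CPU; the `Snapshot` fields, the 1% error-rate threshold, and the baseline comparison are assumptions for illustration, and a real setup would express this in your alerting tool's rule language.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float   # fraction of requests failing
    p99_ms: float       # 99th percentile latency
    cpu_percent: float  # collected for dashboards, never paged on


def alerts_to_page(current, baseline, max_error_rate=0.01, p99_factor=2.0):
    """Return only the user-visible symptoms that need a human now."""
    pages = []
    if current.error_rate > max_error_rate:
        pages.append("error_rate_above_threshold")
    if current.p99_ms >= p99_factor * baseline.p99_ms:
        pages.append("p99_latency_doubled")
    # cpu_percent is intentionally not checked: high CPU alone is a cause
    # to investigate on a dashboard, not a symptom to wake someone for.
    return pages


baseline = Snapshot(error_rate=0.002, p99_ms=300, cpu_percent=40)
busy_but_healthy = Snapshot(error_rate=0.003, p99_ms=320, cpu_percent=70)
degraded = Snapshot(error_rate=0.05, p99_ms=900, cpu_percent=70)

quiet = alerts_to_page(busy_but_healthy, baseline)
loud = alerts_to_page(degraded, baseline)
```

Both snapshots have CPU at 70%. Only the one where users are actually hurting produces a page.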

Tune aggressively. An alert that fired and didn't need action is a bug in your alerting, and you fix it the same way you fix any bug.

What fixed looks like

The thread arguing about why prod is slow is over, because someone opened the dashboard, saw the p99 spike on one endpoint, pulled a trace for a slow request, and found the query. Ten minutes, not a day. The error tracker caught the regression before the customer did. Every log line is correlated to a request, so reconstructing an incident is a query, not an excavation.

When the scaling event arrives — and it will — you watch it happen on a screen instead of reconstructing it from complaints. The bottleneck announces itself. You fix the thing that's actually broken, not the thing three people are guessing about.

This is for you if

You're a founder or technical lead with a live product, a small team, and the dawning realization that you can't actually see what production is doing. You want to be instrumented before the scaling event, not in the smoking aftermath of one.

An observability foundation engagement runs $25k+: we wire up error tracking, structured logging, the golden-signal dashboard, tracing on your critical paths, and a tuned alerting setup that doesn't cry wolf — sized to your stack and your scale, not a Fortune 500's. The output is a system you can actually see into and a team that knows how to use it.

This is not for teams that already have solid instrumentation and just want a second opinion on their SLOs. And it's not for pre-launch products with no traffic — instrument when there's something to observe. It's for the team that's guessing about production right now and knows the next scaling event will be a lot worse than the last.