Evaluating LLM Output Quality in Production

The model seems fine. That's the entire quality assurance story for most production AI features, and it's a liability. "Seems fine" means someone glanced at a handful of outputs last week and nothing looked broken. It means nobody knows what the actual error rate is. It means when quality regresses — and it will, the day you tweak the prompt, swap the model, change the retrieval, or the input distribution shifts under you — you find out from a customer, not a dashboard.

Traditional software has tests. You change code, the suite runs, red means stop. LLM features ship with none of that, because the output is non-deterministic prose and people assume you can't test prose. You can. You just have to build the harness, and the teams that don't are flying a feature they can't measure into production they can't defend.

The eval set is the foundation

Everything downstream depends on having a fixed set of representative inputs with known-good expectations. No eval set, no measurement — you're just vibing on recent outputs.

Build it from reality, not imagination. Pull real inputs from production: the common cases, the edge cases, the ones that have failed before, the adversarial ones users actually send. For each, define what good looks like: sometimes an exact answer, more often a set of criteria the output must satisfy (contains the right fact, cites the right source, doesn't hallucinate a number, stays in the required format). A few hundred well-chosen cases beat ten thousand random ones. The eval set is a curated asset, and it grows: every production failure becomes a new case so the same bug can't ship twice.

This set is what turns "seems fine" into a number. Run a change against it and you get a score you can compare against the score before the change. That comparison is the whole point.
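As a rough illustration of turning an eval set into a number, here is a minimal Python sketch. Everything in it is an assumption for the example: the `EvalCase` shape, the check names, the sample case, and `fake_generate`, which stands in for whatever function calls your real model.

```python
from dataclasses import dataclass
from typing import Callable

# A check takes the model output and returns True if the criterion is met.
Check = Callable[[str], bool]

@dataclass
class EvalCase:
    case_id: str
    input_text: str
    checks: dict[str, Check]  # criterion name -> check function

def score_case(case: EvalCase, output: str) -> dict[str, bool]:
    """Run every check for one case against one output."""
    return {name: check(output) for name, check in case.checks.items()}

def pass_rate(cases: list[EvalCase], generate: Callable[[str], str]) -> float:
    """Fraction of cases whose output satisfies all of its checks."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = 0
    for case in cases:
        results = score_case(case, generate(case.input_text))
        passed += all(results.values())
    return passed / len(cases)

# Hypothetical case; a real set would be curated from production traffic.
cases = [
    EvalCase(
        case_id="cancel-window-001",
        input_text="How long do I have to cancel?",
        checks={
            "states_window": lambda out: "30 days" in out,
            "cites_source": lambda out: "[policy.md]" in out,
        },
    ),
]

def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call (assumption for this sketch)."""
    return "You can cancel within 30 days [policy.md]."
```

Calling `pass_rate(cases, generate)` before and after a change gives the two scores to compare. Simple string checks like these only cover criteria that can be verified mechanically; the rest is where a judge comes in.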

LLM-as-judge, with guardrails

You can't human-grade hundreds of outputs on every change — too slow, too expensive. So you use a model to grade model outputs: feed the judge the input, the output, and the criteria, and have it score each one. This scales evaluation to the speed of CI. It's also a loaded gun if you point it carelessly.

The guardrails are what separate a useful judge from a number that lies to you.

Grade against explicit, checkable criteria, not vibes. "Is this answer good" produces noise. "Does this answer contain the cancellation window stated in the source document, cite that document, and avoid inventing a fee" produces a signal, because each clause is independently verifiable. Vague rubrics make the judge inconsistent; specific ones make it reliable.
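One way to make each clause independently checkable is to ask the judge for a verdict per criterion rather than one overall score. The sketch below assumes a hypothetical `judge` callable that takes a prompt string and returns the judge model's raw text; the criterion keys and prompt wording are illustrative, not a recommended rubric.

```python
import json
from typing import Callable

# Illustrative criteria, mirroring the cancellation example above.
CRITERIA = {
    "states_window": "States the cancellation window given in the source document.",
    "cites_source": "Cites the source document.",
    "no_invented_fee": "Does not mention any fee absent from the source document.",
}

def build_judge_prompt(source: str, question: str, answer: str) -> str:
    """Assemble a prompt that asks for one true/false verdict per criterion."""
    rubric = "\n".join(f"- {key}: {text}" for key, text in CRITERIA.items())
    return (
        "Grade the ANSWER against each criterion using only the SOURCE.\n"
        f"Criteria:\n{rubric}\n\n"
        f"SOURCE:\n{source}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\n"
        "Reply with a JSON object mapping each criterion key to true or false."
    )

def grade(judge: Callable[[str], str], source: str, question: str, answer: str) -> dict[str, bool]:
    """Call the judge and return a verdict for every criterion."""
    raw = judge(build_judge_prompt(source, question, answer))
    try:
        verdicts = json.loads(raw)
    except json.JSONDecodeError:
        verdicts = {}
    if not isinstance(verdicts, dict):
        verdicts = {}
    # Missing or non-boolean verdicts count as failures rather than passes.
    return {key: verdicts.get(key) is True for key in CRITERIA}
```

Treating an unparseable judge reply as a failure is a deliberate choice in this sketch: a judge that breaks format should lower the score and get noticed, not silently pass the case.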

Give the judge an independent context. The model grading the output should reason fresh against the criteria, not rubber-stamp the generation. A separate evaluation pass with its own framing catches what a self-grade rationalizes.

Validate the judge against humans. This is the step everyone skips and it's non-negotiable. Periodically have a person grade a sample the judge already graded, and measure agreement. If the judge and the human diverge, the judge is miscalibrated and every number it's produced is suspect. The judge is a measurement instrument; an instrument you've never calibrated is a decoration.
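Agreement can be measured in several ways. One common choice for pass/fail labels is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The sketch below is a plain-Python version for binary labels; the function name and the handling of the degenerate case are choices made for this example.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters giving pass/fail labels to the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement, from each rater's own pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical label to every item; kappa is
        # formally undefined here, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the judge tracks the human closely, and a value near 0 means it agrees no more often than chance would. What counts as good enough to trust the judge is a threshold you set for your own feature.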

Know what the judge is bad at. LLM judges are weak on subjective quality, factual correctness outside the provided context, and anything requiring genuine domain expertise. Use the judge for what it's reliable at — format adherence, presence of required elements, obvious failures, criteria you can spell out — and route the rest to humans. A judge applied past its competence produces confident, wrong scores, which is worse than no scores.

Regression detection in the loop

The eval set plus the judge gives you a gate, and the gate belongs in the deployment pipeline. Every change that can affect output (prompt edits, model swaps, retrieval changes, temperature, or any other knob) runs the eval set and compares scores against the current baseline. A meaningful drop blocks the change, the same way a failing unit test blocks a merge.

This is the mechanism that catches the silent regression — the prompt tweak that fixed one case and quietly broke five others, the model upgrade that improved reasoning but changed the output format your parser depends on. Without the gate, those ship and you learn about them from support tickets. With it, they show up as a red number before merge.
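A gate of this kind can be sketched in a few lines. The version below assumes you already have per-case pass/fail results for the baseline and the candidate, keyed by case ID; the `max_drop` default of 0.02 is an arbitrary placeholder, not a recommendation.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    baseline_rate: float
    candidate_rate: float
    newly_failing: list[str]  # cases that passed on baseline and fail now
    blocked: bool

def regression_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.02,
) -> GateResult:
    """Compare two runs of the same eval set and decide whether to block."""
    if not baseline:
        raise ValueError("eval set is empty")
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same cases")
    n = len(baseline)
    baseline_rate = sum(baseline.values()) / n
    candidate_rate = sum(candidate.values()) / n
    newly_failing = sorted(
        case_id for case_id in baseline
        if baseline[case_id] and not candidate[case_id]
    )
    blocked = (baseline_rate - candidate_rate) > max_drop
    return GateResult(baseline_rate, candidate_rate, newly_failing, blocked)
```

In a CI job, `blocked` would map to a non-zero exit code. The `newly_failing` list matters even when the aggregate score holds steady: a change that fixes one case and breaks another leaves the pass rate flat, and only the per-case diff shows it.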

Model swaps deserve special care because the failure is sneaky. A newer model can be genuinely better and still score worse on your evals — because it follows your "only report critical findings" instruction more literally, or calibrates verbosity differently, or escapes JSON another way. That's a harness artifact, not a regression, but you only know which it is because you have an eval set to diff against. Never swap the model in production on the strength of a benchmark someone else ran. Gate it on yours.

Human review where it counts

Automated eval is the wide net; human review is the deep look, and you need both. The trick is sampling intelligently so humans spend their attention where it pays.

Review the cases the judge is unsure about (low or borderline scores). Review a random sample of high-confidence passes, because that's where the judge's blind spots hide: in the outputs everyone assumed were fine. Review every case in high-stakes categories regardless of score: anything touching compliance, money, medical, legal, or any other area where a wrong answer becomes a liability. And close the loop: every human correction goes back into the eval set as a new case, so the boundary the judge missed gets tested on every future run and the system gets harder to fool over time.
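The three sampling rules above could be combined into a single selection pass, roughly as follows. The category names, the 0.7 uncertainty threshold, and the 5% audit rate are all placeholders chosen for this sketch, and the judge score is assumed to be a number between 0 and 1.

```python
import random
from dataclasses import dataclass

@dataclass
class Graded:
    case_id: str
    score: float    # judge score in [0, 1]; higher means a more confident pass
    category: str

# Hypothetical category names; use whatever your product treats as high-stakes.
HIGH_STAKES = {"compliance", "billing", "medical", "legal"}

def select_for_review(items, uncertain_below=0.7, pass_sample_rate=0.05, seed=0):
    """Return (case_id, reason) pairs for the cases a human should look at."""
    rng = random.Random(seed)  # seeded so a given batch selects reproducibly
    selected = []
    for item in items:
        if item.category in HIGH_STAKES:
            # High-stakes categories are always reviewed, whatever the score.
            selected.append((item.case_id, "high_stakes"))
        elif item.score < uncertain_below:
            selected.append((item.case_id, "low_or_borderline"))
        elif rng.random() < pass_sample_rate:
            # Random audit of confident passes to surface judge blind spots.
            selected.append((item.case_id, "random_pass_audit"))
    return selected
```

Recording the reason alongside each case lets you track, over time, which bucket produces the most human corrections, and shift review attention accordingly.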

Drift, because nothing holds still

A system that scored well at launch will not score well forever, and the cause is often nothing you changed. Inputs drift — users ask new things, in new ways, about new topics your eval set never covered. Provider models update under you. The world the data describes moves. Quality erodes without a single deploy.

Monitor the live system, not just the pre-deploy gate. Sample production outputs continuously and run them through the judge, so you catch quality slipping in real traffic before users do. Watch the input distribution — when incoming requests stop looking like your eval set, your evals have gone blind and need new cases. Track refusal rates, format-failure rates, fallback rates, latency: these are leading indicators that move before the explicit quality score does. The goal is the same as everywhere in production — learn from a ./automate quality monitor in an hour, not from a churned customer in a month.
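For the leading indicators, a rolling-window rate compared against a known baseline is one simple starting point. The sketch below tracks a single indicator, such as format failures; the class name, window size, tolerance, and minimum sample count are assumptions for illustration, and a real deployment would wire the alert into whatever paging system you already run.

```python
from collections import deque

class RateMonitor:
    """Rolling failure rate for one indicator, compared against a baseline."""

    def __init__(self, baseline_rate: float, window: int = 500,
                 tolerance: float = 0.05, min_samples: int = 100):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.min_samples = min_samples
        # Only the most recent `window` events are kept.
        self.events = deque(maxlen=window)

    def current_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def record(self, failed: bool) -> bool:
        """Record one sampled output; return True if the rate warrants an alert."""
        self.events.append(failed)
        if len(self.events) < self.min_samples:
            return False  # too few samples to judge
        return self.current_rate() - self.baseline_rate > self.tolerance
```

One monitor per indicator (refusals, format failures, fallbacks) keeps the alerts specific. The same structure works for sampled judge scores if you record a judged failure as the event.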

What fixed looks like

You can state your feature's error rate as a number, not a feeling. A curated eval set built from real traffic gates every output-affecting change, and a regression blocks the merge. An LLM judge scores at CI speed against explicit criteria, and it's calibrated against humans on a regular cadence so you trust it. Humans review the uncertain, the high-stakes, and a random slice of the confident — and every correction sharpens the eval set. Live outputs are sampled for drift, and quality erosion pages someone before it reaches a customer. When you swap the model, you gate on your evals, not a stranger's benchmark.

This is for you if

You're a funded US company with an AI feature carrying real weight — customer-facing, revenue-attached, or compliance-adjacent — and right now your quality story is "seems fine." Evaluation infrastructure of this kind is part of an AI build or audit, typically $50k+; designing the eval harness, judge, regression gate, and drift monitoring into a larger product runs $100k+.

It's not for you if the feature is internal, low-stakes, and a wrong answer costs nothing — full eval infrastructure is overkill for a toy. It's not for you if you won't maintain the eval set; a frozen eval set rots as the product moves and becomes a number that lies. And it's not for you if you're still pre-launch with no real inputs to curate from — build the harness once you have traffic worth measuring.