The demo was a triumph. The agent took a vague request, planned six steps, called four tools, and produced a result that made the room lean forward. Someone said "this changes everything." Two weeks later the same agent is in production. It loops on step three until it hits the iteration cap. It calls a tool that doesn't exist with arguments it invented. It spends $40 of inference deciding it can't do a task it was never able to do. The token budget for the month is gone by the 9th.
The demo and the production system share a name and almost nothing else. The demo ran one curated happy path under supervision. Production runs ten thousand adversarial paths unsupervised. The gap between them is not prompt tuning. It's architecture.
Where the demo lied to you
A demo agent is a closed loop with a benevolent operator. The inputs are clean. The tools always succeed. The model never gets confused because the human running the demo would have reset it if it did. Nobody measured cost because the demo ran four times.
Production removes every one of those conditions. Inputs are malformed, contradictory, and occasionally hostile. Tools time out, return errors, and return success with garbage payloads. The model gets confused constantly, and there's no human standing by to reset it. And every run costs money you can count.
The teardown below is what breaks, why, and what replaces the broken part.
Failure mode one: the agent that won't stop
The single most common production agent failure is the loop. The agent tries an approach, it doesn't work, it tries a slight variation, that doesn't work, it tries the first approach again. Without a hard cap it runs until something external kills it. With a naive cap it burns the full cap on every hard input.
The naive fix is an iteration limit. The real fix is multiple independent stop conditions:
- Iteration cap — a hard maximum on reasoning steps per task. Not 50. Closer to 6–10 for most real tasks. If a task genuinely needs 30 steps, it needs to be decomposed into sub-tasks, not granted a bigger budget.
- Cost cap — a per-task token ceiling enforced in code, not in the prompt. The model cannot be trusted to police its own spend. When the running cost crosses the ceiling, the orchestrator halts the loop regardless of state.
- No-progress detection — track whether each step changes state. Three consecutive steps that produce neither a new tool result nor any new information mean the agent is spinning. Halt and escalate.
- Wall-clock cap — a task that's been running for 90 seconds is a task the user has already given up on.
These are orchestration-layer concerns. They live in your code, around the model, not inside the prompt. A prompt that says "don't loop" is a suggestion. A loop counter that throws at 8 is a guarantee.
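As a minimal sketch of what "enforced in code" can look like, the guard below combines the four stop conditions in one object that the orchestrator consults around every step. The names `StopGuard` and `AgentHalted` and the default limits are illustrative assumptions, not part of any particular framework; the token ceiling in particular is a placeholder you would set from your own budget.

```python
import time
from dataclasses import dataclass, field


class AgentHalted(Exception):
    """Raised by the orchestrator when any stop condition trips."""


@dataclass
class StopGuard:
    max_steps: int = 8          # iteration cap
    max_tokens: int = 20_000    # per-task cost ceiling, counted in tokens (placeholder value)
    max_stale: int = 3          # consecutive no-progress steps tolerated
    max_seconds: float = 90.0   # wall-clock cap
    steps: int = 0
    tokens: int = 0
    stale: int = 0
    started: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Call before every step; raises as soon as any one condition trips."""
        if self.steps >= self.max_steps:
            raise AgentHalted("iteration cap reached")
        if self.tokens >= self.max_tokens:
            raise AgentHalted("cost cap reached")
        if self.stale >= self.max_stale:
            raise AgentHalted("no progress")
        if time.monotonic() - self.started >= self.max_seconds:
            raise AgentHalted("wall-clock cap reached")

    def record(self, tokens_used: int, made_progress: bool) -> None:
        """Call after every step with its token cost and whether state changed."""
        self.steps += 1
        self.tokens += tokens_used
        self.stale = 0 if made_progress else self.stale + 1
```

In use, the orchestrator calls `guard.check()` at the top of the loop and `guard.record(...)` after each step. The model never sees the guard, so it cannot argue its way past it.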
Failure mode two: the hallucinated tool call
The model is handed a set of tools. In production it will call a tool that isn't in the set, call a real tool with arguments that don't match the schema, or call a real tool with plausible-looking arguments that are factually invented (an account ID it made up, a date it guessed).
The defense is validation at the boundary, and it has three layers.
// tool-call validation pipeline
1. schema check — does the tool exist? do args match the JSON schema?
2. semantic check — are the arg VALUES plausible? (IDs exist, dates in range)
3. authorization — is this caller allowed to invoke this tool with these args?
The schema check is non-negotiable and cheap. If the model calls delete_account when only read_account exists, that call never executes — it returns a structured error to the model, which then gets a chance to correct. If the model passes a string where an integer is required, same thing. The model treats validation errors as feedback and re-plans, which is exactly what you want.
The semantic check is where production systems separate from demos. A tool call to refund_order(order_id="ORD-99999") that's schema-valid but references an order that doesn't exist must fail at the boundary, not at the database. The validation layer checks existence and bounds before the side effect fires.
Authorization is the layer most teams forget until a security review finds it. The agent operates with some user's permissions. A tool call must be checked against those permissions every time, in code, not assumed because "the agent wouldn't do that."
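The three layers can be sketched as a single function that runs before any tool executes. This is an illustration under stated assumptions: `ToolSpec`, `TOOLS`, `KNOWN_ORDERS`, and the permission string `orders:refund` are hypothetical, and the type map is a simplified stand-in for a real JSON schema validator.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolSpec:
    arg_types: dict    # arg name -> required Python type (stand-in for a JSON schema)
    semantic_check: Callable[[dict], Optional[str]]  # error message, or None if plausible
    permission: str    # permission the caller must hold


KNOWN_ORDERS = {"ORD-1001", "ORD-1002"}  # hypothetical; a real check would query the order store


def check_refund(args: dict) -> Optional[str]:
    if args["order_id"] not in KNOWN_ORDERS:
        return f"order {args['order_id']} does not exist"
    return None


TOOLS = {
    "refund_order": ToolSpec({"order_id": str}, check_refund, "orders:refund"),
}


def reject(layer: str, message: str) -> dict:
    """Structured error that goes back to the model as feedback."""
    return {"ok": False, "layer": layer, "error": message}


def validate_tool_call(name: str, args: dict, caller_permissions: set) -> dict:
    # Layer 1: schema. Does the tool exist, and do the args have the right names and types?
    spec = TOOLS.get(name)
    if spec is None:
        return reject("schema", f"unknown tool: {name}")
    if set(args) != set(spec.arg_types):
        return reject("schema", f"expected arguments {sorted(spec.arg_types)}")
    for key, expected in spec.arg_types.items():
        if type(args[key]) is not expected:
            return reject("schema", f"{key} must be {expected.__name__}")
    # Layer 2: semantic. Are the values plausible (here: does the order exist)?
    problem = spec.semantic_check(args)
    if problem is not None:
        return reject("semantic", problem)
    # Layer 3: authorization. Is this caller allowed to make this call?
    if spec.permission not in caller_permissions:
        return reject("authorization", f"caller lacks {spec.permission}")
    return {"ok": True}
```

Only a result with `ok` set to true reaches the code that performs the side effect. Everything else is returned to the model as a structured error it can re-plan against.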
Failure mode three: the irreversible action
A read-only agent that hallucinates is annoying. An agent with write access that hallucinates is a liability. The dividing line in production agent design is between actions that can be undone and actions that cannot.
The pattern that works: agents propose, humans dispose — for the actions that matter. Not every action. Reading a record, running a search, summarizing a document — let the agent do those unsupervised. Sending an email to a customer, issuing a refund, modifying a production record, spending money — those route through a human-in-the-loop checkpoint.
The checkpoint is not a modal that says "are you sure." It's a structured approval surface: here's what the agent wants to do, here are the inputs it's acting on, and here's the reversibility class of the action. The human approves, edits, or rejects. The agent continues from the approved state.
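One way to represent that approval surface is as plain data plus a small resolver, sketched below. The type and field names (`ProposedAction`, `Reversibility`, `Decision`, `inputs_summary`) are assumptions made for illustration; the point is only that the proposal, its inputs, and its reversibility class travel together, and that the agent resumes from whatever the human approved.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"


@dataclass(frozen=True)
class ProposedAction:
    tool: str                     # what the agent wants to do
    args: dict                    # the exact arguments it would send
    inputs_summary: str           # what the agent is acting on, shown to the reviewer
    reversibility: Reversibility  # reversibility class of the action


def needs_checkpoint(action: ProposedAction) -> bool:
    """Only irreversible actions wait for a human; the rest run unsupervised."""
    return action.reversibility is Reversibility.IRREVERSIBLE


def resolve(action: ProposedAction, decision: Decision,
            edited_args: Optional[dict] = None) -> Optional[ProposedAction]:
    """Return the action to execute after review, or None if it was rejected."""
    if decision is Decision.REJECT:
        return None
    if decision is Decision.EDIT:
        if edited_args is None:
            raise ValueError("an EDIT decision needs the edited arguments")
        return replace(action, args=edited_args)
    return action
```

The value `resolve` returns is the approved state: the agent continues with exactly that action, not with its original proposal.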
The economics work because the agent does 95% of the work — the planning, the retrieval, the drafting — and the human spends two seconds approving the 5% that's irreversible. You get most of the leverage of full autonomy with almost none of the catastrophic-action risk.
Where agents genuinely earn their keep — and where they don't
Here is the uncomfortable truth most agent enthusiasm skips: a large fraction of "agentic" use cases are better served by a deterministic pipeline.
An agent is the right tool when the task is open-ended, the steps are not knowable in advance, and the input space is genuinely unbounded. Research-and-synthesize across heterogeneous sources. Triage an ambiguous support ticket and decide which of twelve workflows it belongs to. Investigate an anomaly where the next step depends on what the last step found.
A deterministic pipeline is the right tool when the steps are knowable, even if there are many of them. Extract fields from an invoice, validate them, write them to a system. Classify a document and route it. Take a structured input and produce a structured output through known transformations. If you can draw the flowchart, build the flowchart. A flowchart with an LLM call at each decision node is faster, cheaper, more debuggable, and more reliable than an agent improvising the same flow on every run.
The expensive mistake is using an agent for a deterministic task because agents are exciting. You pay for the model to re-derive the same plan ten thousand times, you inherit every failure mode above, and you get worse reliability than a pipeline you could have written in a week.
The architecture that ships most often is a hybrid: a deterministic pipeline for the known path, with an agent invoked only at the genuinely ambiguous decision points. The pipeline is the skeleton; the agent is a specialist consultant called in when the skeleton doesn't know what to do.
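A small sketch of that hybrid shape, using document routing as the example: the known cases are a lookup table, and the agent is only consulted when the table has no answer. The `ROUTES` table, the destination names, and the `agent` callable are hypothetical placeholders for whatever your pipeline and agent actually are.

```python
from typing import Callable, Tuple

# Known path: document types whose destination is decided by rule, not by a model.
ROUTES = {
    "invoice": "accounts_payable",
    "contract": "legal_review",
}


def route_document(doc_type: str, text: str,
                   agent: Callable[[str], str]) -> Tuple[str, str]:
    """Return (destination, decided_by). The agent runs only on the ambiguous path."""
    if doc_type in ROUTES:
        return ROUTES[doc_type], "pipeline"
    # Ambiguous input: hand the text to the agent and let it pick a destination.
    return agent(text), "agent"
```

Recording `decided_by` alongside the result makes it easy to measure how often the agent is actually needed, which is useful evidence when deciding whether a case should be promoted into the deterministic table.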
Cost and latency: the budget nobody set
Agents fan out. One user request becomes a planning call, three tool calls, a synthesis call, a validation call. Six model invocations where a non-agentic feature would have made one. At production volume, this is the line item that surprises the finance team.
Three controls keep it bounded:
- Model routing inside the agent. The planning step might need your strongest model. The step that formats a tool result into prose does not. Route the cheap steps to a cheap model and reserve the expensive model for the steps that actually need reasoning.
- Aggressive context trimming. An agent's context grows every step as tool results accumulate. By step eight the prompt is enormous and you're paying for it on every subsequent call. Summarize or drop stale tool results that are no longer relevant to the current step.
- Caching the stable prefix. The system prompt and tool definitions are identical on every call within a task. Prompt caching turns that repeated prefix from full price into a fraction of it. For multi-step agents this is one of the largest single savings available.
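The first two controls are easy to sketch without reference to any provider. The example below assumes a chat-style message list of dicts with `role` and `content` keys, and the model names and step kinds are placeholders. Prompt caching is left out because how it is enabled depends on the provider's API.

```python
# Hypothetical model names; substitute whatever your provider offers.
MODEL_FOR_STEP = {
    "plan": "strong-model",
    "synthesize": "strong-model",
    "format": "cheap-model",
}

STALE_MARKER = "[stale tool result dropped]"


def pick_model(step_kind: str) -> str:
    """Route by step kind; unknown steps default to the strong model."""
    return MODEL_FOR_STEP.get(step_kind, "strong-model")


def trim_context(messages: list, keep_recent: int = 3) -> list:
    """Return a copy where all but the most recent tool results are replaced by a stub."""
    tool_positions = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    cutoff = max(len(tool_positions) - keep_recent, 0)
    stale = set(tool_positions[:cutoff])
    return [
        {"role": "tool", "content": STALE_MARKER} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Dropping by recency is the crudest possible policy. A summarizing step, or a relevance check against the current sub-goal, would keep more signal at the cost of an extra cheap model call.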
Latency follows the same fan-out problem. Six sequential model calls at 1.5 seconds each is a nine-second wait. Where steps are independent, parallelize them. Where the user is watching, stream the agent's progress — "searching records… found 3… drafting response" — so the wait is legible rather than a frozen spinner.
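To make the parallelization point concrete, here is a small sketch using Python's `asyncio`. `fake_model_call` is a stand-in that only sleeps, so the timings illustrate the shape of the saving, not real model latency. The helper names are assumptions for the example.

```python
import asyncio
import time


async def fake_model_call(label: str, seconds: float) -> str:
    """Stand-in for a model or tool call; it just waits and returns its label."""
    await asyncio.sleep(seconds)
    return label


async def run_sequential(labels: list, seconds: float) -> list:
    # Each call waits for the previous one, so total time is roughly the sum.
    return [await fake_model_call(label, seconds) for label in labels]


async def run_parallel(labels: list, seconds: float) -> list:
    # Independent calls start together, so total time is roughly the slowest call.
    results = await asyncio.gather(*(fake_model_call(label, seconds) for label in labels))
    return list(results)


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

`asyncio.gather` returns results in the order the calls were passed in, so downstream steps can consume them exactly as they would the sequential version. This only applies where steps genuinely do not depend on each other's output; a chain of dependent steps stays sequential.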
What fixed looks like
A production agent that earns its place looks boring from the inside, which is the point. Every tool call is validated at the boundary before any side effect fires. Loops terminate on iteration, cost, no-progress, and wall-clock conditions enforced in code. Irreversible actions route through a human checkpoint; reversible ones run unsupervised. Cheap steps run on cheap models, the system prefix is cached, and stale context is trimmed. Cost per task has a hard ceiling that the model cannot exceed no matter how confused it gets.
The result is an agent that fails safely, costs predictably, and does real work on the open-ended tasks where a pipeline can't. The demo impressed the room. This one survives the month.
This is for you if
You're a funded founder shipping an agentic feature into a product real customers pay for, and the difference between "impressive in the demo" and "trustworthy in production" is now your problem. Building a production agent — orchestration with hard stop conditions, three-layer tool-call validation, human-in-the-loop checkpoints, model routing, and cost ceilings — typically runs $50k–$150k depending on how many tools the agent integrates with and how many actions are irreversible enough to need checkpoints. A hybrid system with a deterministic backbone and agentic decision points runs higher, $100k+, because the deterministic path is real engineering too.
This is not for you if you need to validate that the feature is worth building. Build the demo, run the happy path, get the conviction. Then come back and build the version that doesn't burn the budget by the 9th. It's also not for you if your task is actually deterministic — in that case you don't need an agent at all, and the most senior thing we can tell you is to build the pipeline instead.