LLM Cost Control: Token Budgets and Caching

The feature shipped. Adoption is good. Then the monthly bill arrives and inference is line item number two, right behind payroll. Nobody modeled it. The proof-of-concept ran on a handful of internal users and cost $12 a day, so it never showed up on anyone's radar. Now it's 40,000 real users, each one firing requests you never counted, and the number on the invoice has four more digits than the PoC did. Finance wants a forecast. You don't have one, because the system was built to answer questions, not to account for what answering them costs.

Inference cost is not a tax you pay for using a model. It's an output of architecture decisions — how much context you stuff into each prompt, whether you cache, which model handles which request, how long you let the model talk. Every one of those is a lever. Most teams pull none of them, then act surprised.

Token accounting comes first

You cannot control a number you don't measure. The first move is per-request token accounting: input tokens, output tokens, cache-read tokens, cache-write tokens, tagged by feature and by user. Not aggregate monthly spend — per request, attributable.

Do not estimate tokens with a generic tokenizer. The OpenAI-style tiktoken undercounts Claude tokens by 15 to 20 percent on plain English, and far more on code or non-English text. Count against the model you actually call: the count_tokens endpoint is model-specific and free. Build a thin wrapper that logs usage off every response: input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens. That usage block is the ground truth. If your dashboard says the agent ran for an hour on 4,000 input tokens, you forgot to sum the cache columns. Total prompt size is input_tokens plus cache_read_input_tokens plus cache_creation_input_tokens, not the uncached remainder alone.
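A minimal sketch of that wrapper's bookkeeping, assuming the usage block has already been pulled off the response as a plain dict. The field names are the ones listed above; UsageRecord, record_usage, and prompt_tokens_by_feature are illustrative names, not part of any SDK.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    feature: str
    user_id: str
    input_tokens: int                  # uncached remainder only
    output_tokens: int
    cache_read_input_tokens: int
    cache_creation_input_tokens: int

    @property
    def total_prompt_tokens(self) -> int:
        # Full prompt size is the sum of all three input-side fields.
        return (self.input_tokens
                + self.cache_read_input_tokens
                + self.cache_creation_input_tokens)


def record_usage(feature: str, user_id: str, usage: dict) -> UsageRecord:
    """Build an attributable record from one response's usage block."""
    return UsageRecord(
        feature=feature,
        user_id=user_id,
        input_tokens=usage.get("input_tokens") or 0,
        output_tokens=usage.get("output_tokens") or 0,
        cache_read_input_tokens=usage.get("cache_read_input_tokens") or 0,
        cache_creation_input_tokens=usage.get("cache_creation_input_tokens") or 0,
    )


def prompt_tokens_by_feature(records) -> dict:
    """Aggregate total prompt tokens per feature tag."""
    totals = defaultdict(int)
    for r in records:
        totals[r.feature] += r.total_prompt_tokens
    return dict(totals)


# Made-up numbers for illustration only.
records = [
    record_usage("agent", "u1", {"input_tokens": 4000, "output_tokens": 900,
                                 "cache_read_input_tokens": 180000,
                                 "cache_creation_input_tokens": 6000}),
    record_usage("search", "u2", {"input_tokens": 1200, "output_tokens": 150}),
]
print(prompt_tokens_by_feature(records))
```

In a real system the records would go to a metrics store rather than a list; the point is only that the cache columns are summed before anything is charted.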

Once every request is attributed, the expensive shapes announce themselves. It is almost always a long-context retrieval feature run on every keystroke, or a nightly job re-summarizing and re-embedding the same corpus, or one enterprise customer hammering a bulk endpoint nobody rate-limited.

Caching is the biggest single lever

Prompt caching is a prefix match. The API caches the rendered prompt up to a marked breakpoint, and any byte change anywhere in that prefix invalidates everything after it. Cache reads cost roughly 0.1x base input price. That is a 90 percent discount on every token you can hold stable.

The economics are blunt. A 5-minute cache write costs 1.25x; a read costs 0.1x. Two requests against the same prefix and you're ahead (1.25 + 0.1 = 1.35x versus 2x uncached). For a chatbot replaying a 6,000-token system prompt and growing conversation history on every turn, the cache pays for itself by turn two and every turn after is near-free on the prefix.
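The break-even arithmetic can be sketched as a small function. It assumes a fixed prefix, one cache write on the first request, and a cache that stays warm for every later request; the 1.25x and 0.1x multipliers are the ones quoted above, and the function names are illustrative.

```python
def prefix_cost_multiplier(requests: int, write: float = 1.25, read: float = 0.1) -> float:
    """Cost of the shared prefix over `requests` calls, in units of one uncached pass."""
    if requests <= 0:
        return 0.0
    # First request writes the cache; every later request reads it.
    return write + read * (requests - 1)


def savings_vs_uncached(requests: int) -> float:
    """Fraction of prefix cost saved compared with sending it uncached every time."""
    if requests <= 0:
        return 0.0
    return 1.0 - prefix_cost_multiplier(requests) / float(requests)


for n in (1, 2, 10):
    print(n, round(prefix_cost_multiplier(n), 2), round(savings_vs_uncached(n), 3))
```

A single request is a loss (the write premium with no read to recover it), which is why caching only pays on prefixes that are actually reused inside the cache lifetime.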

The discipline that makes it work: freeze the prefix. The single most common reason cache_read_input_tokens comes back zero across identical-looking requests is a silent invalidator sitting at the front of the prompt — a datetime.now() in the system header, a per-request UUID, a JSON blob serialized without sorted keys, a tool list that varies per user. Render order is tools, then system, then messages. Put everything stable first, everything volatile after the last breakpoint, and verify with cache_read_input_tokens on a live request. If it's zero, you have a bug, not a cache.
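One way to catch silent invalidators before they reach production is to fingerprint the stable prefix locally and compare it across requests. This is a sketch under assumptions: render_prefix and prefix_fingerprint are hypothetical helpers, and the hash is a local debugging aid, not how the API computes its cache key.

```python
import hashlib
import json


def render_prefix(tools: list, system_text: str) -> str:
    """Render the cacheable prefix deterministically: tools first, then system."""
    # Sort tools by name and serialize with sorted keys so that dict or list
    # ordering can never change the bytes.
    stable_tools = sorted(tools, key=lambda t: t["name"])
    tools_blob = json.dumps(stable_tools, sort_keys=True, separators=(",", ":"))
    return tools_blob + "\n" + system_text


def prefix_fingerprint(tools: list, system_text: str) -> str:
    """SHA-256 of the rendered prefix; log it alongside each request."""
    rendered = render_prefix(tools, system_text)
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


# Same tools, different key order and list order: fingerprints should match.
tools_a = [{"name": "search", "description": "Search docs"},
           {"name": "calc", "description": "Do math"}]
tools_b = [{"description": "Do math", "name": "calc"},
           {"description": "Search docs", "name": "search"}]
system = "You are a support assistant."

same = prefix_fingerprint(tools_a, system) == prefix_fingerprint(tools_b, system)
# A timestamp in the system header changes the fingerprint on every call.
drifted = prefix_fingerprint(tools_a, system + " Now: 2024-01-01T00:00:01")
print(same, drifted != prefix_fingerprint(tools_a, system))
```

If two requests that should share a prefix log different fingerprints, the volatile field is in the prefix and belongs after the last breakpoint. The live check is still cache_read_input_tokens on a real response.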

Semantic caching is the second tier and a different mechanism. Prompt caching keys on exact prefix bytes; semantic caching keys on meaning. Embed the incoming query, check a vector store for a prior question within a similarity threshold, and if it hits, return the stored answer without calling the model at all. This works for FAQ-shaped traffic where users ask the same thing a thousand different ways. It is dangerous for anything where freshness or per-user specificity matters — cache "what's the weather" and you'll serve yesterday's forecast. Scope it to queries whose answers don't change, and set the similarity threshold conservatively. A false cache hit is a wrong answer delivered with full confidence.
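The lookup described above can be sketched in a few lines. Everything here is a stand-in: toy_embed is a bag-of-words counter used only so the example runs without a model, the list scan replaces a real vector store, and the 0.9 threshold is an arbitrary illustrative value.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query: str):
        """Return the stored answer for the closest prior query, or None on a miss."""
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer))


cache = SemanticCache(threshold=0.9)
cache.put("How do I reset my password", "Use the reset link on the sign-in page.")
print(cache.get("how do I reset my password?"))   # same words, different casing
print(cache.get("What is your refund policy"))    # unrelated question
```

Only a miss should fall through to the model. Entries for anything time-sensitive or user-specific should not be stored at all, which this sketch leaves to the caller.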

Route cheap-first, escalate on demand

Not every request needs your most capable model. A classification call, an intent detection, a short rewrite — these run fine on a small fast model at a fraction of the cost. A frontier model at $5 input / $25 output per million tokens is the wrong tool for deciding whether a support ticket is about billing.

Build a router. The default path sends a request to the cheapest model that can plausibly handle it. The expensive model is the escalation, not the baseline. Two patterns earn their keep:

Tiered by task type. Map each feature to a model up front. Intent routing and classification go to the small model. Long-form generation and multi-step reasoning go to the large one. This is static, predictable, and trivial to reason about for cost forecasting.

Cascade with a confidence gate. Run the cheap model first. If it returns a low-confidence result — or fails a cheap validator — re-run on the expensive model. Most requests resolve at the cheap tier; only the hard ones pay the premium. The math works when the cheap tier handles the majority, which it usually does, because most production traffic is mundane.

The trap is escalating everything. If your confidence gate fires on 80 percent of requests, most requests pay for two calls instead of one, and you have added latency while erasing most or all of the savings. Measure the escalation rate. If it's high, your cheap tier is wrong for the task or your gate is miscalibrated.
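A sketch of the cascade pattern with the escalation rate tracked alongside it. The model calls are passed in as plain callables, so nothing here depends on a particular SDK; Cascade, expected_cost, and break_even_escalation_rate are illustrative names, and the 0.8 confidence floor is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Cascade:
    cheap: Callable[[str], Tuple[str, float]]   # returns (answer, confidence)
    expensive: Callable[[str], str]
    min_confidence: float = 0.8
    calls: int = 0
    escalations: int = 0

    def run(self, request: str) -> str:
        self.calls += 1
        answer, confidence = self.cheap(request)
        if confidence >= self.min_confidence:
            return answer
        # Low confidence: pay for the second call and count it.
        self.escalations += 1
        return self.expensive(request)

    @property
    def escalation_rate(self) -> float:
        return self.escalations / self.calls if self.calls else 0.0


def expected_cost(cheap_cost: float, expensive_cost: float, escalation_rate: float) -> float:
    # Every request pays the cheap call; escalated ones also pay the expensive one.
    return cheap_cost + escalation_rate * expensive_cost


def break_even_escalation_rate(cheap_cost: float, expensive_cost: float) -> float:
    # Above this rate the cascade costs more than calling the expensive model directly.
    return 1.0 - cheap_cost / expensive_cost


# Stub models: the cheap one is only confident about billing tickets.
def stub_cheap(text: str) -> Tuple[str, float]:
    return ("billing", 0.95) if "invoice" in text else ("unknown", 0.3)


def stub_expensive(text: str) -> str:
    return "technical"


cascade = Cascade(cheap=stub_cheap, expensive=stub_expensive)
results = [cascade.run(t) for t in ("invoice is wrong", "app crashes", "invoice missing")]
print(results, cascade.escalation_rate)
```

With a cheap tier at one fifth the cost of the expensive one, the break-even escalation rate is 80 percent; past that point the cascade is pure overhead.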

Control the output length

Output tokens cost five times input tokens on a frontier model. A response that rambles for 800 tokens when 200 would do is a 4x overcharge on the expensive half of the bill, on every single call.

Cap it. Set max_tokens to what the task actually needs — classification gets 256, not 16,000. Instruct the model to be terse in the system prompt when terse is correct. For structured extraction, constrain the output to a schema so the model returns fields, not prose around fields. The newest models calibrate response length to perceived task complexity and narrate more by default; if that narration is costing you, a one-line "respond with the answer only, no preamble" instruction reclaims it without touching quality.

Watch the reasoning budget too. Models with adaptive thinking decide how much to think per request, and on hard tasks they think a lot. That's correct when correctness matters and wasteful when it doesn't. The effort parameter is your dial — lower it on latency-sensitive, low-stakes routes and reserve high effort for work where a wrong answer costs more than the tokens.
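One way to keep these caps from drifting is to declare them per route in a single table. The sketch below is illustrative: the route names, the cap values other than the 256-token classification example, and the effort labels are assumptions, and the $25 per million output price is the figure used earlier in this piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePolicy:
    max_tokens: int   # hard output cap for the route
    effort: str       # illustrative label: "low", "medium", or "high"


# Hypothetical routes and values; tune against your own traffic.
ROUTE_POLICIES = {
    "classify_ticket": RoutePolicy(max_tokens=256, effort="low"),
    "draft_reply": RoutePolicy(max_tokens=1024, effort="medium"),
    "contract_review": RoutePolicy(max_tokens=8192, effort="high"),
}


def output_cost_usd(output_tokens: int, price_per_million: float = 25.0) -> float:
    """Dollar cost of a given number of output tokens."""
    return output_tokens * price_per_million / 1_000_000


def worst_case_output_cost(route: str) -> float:
    """Upper bound on output spend for one call on this route."""
    return output_cost_usd(ROUTE_POLICIES[route].max_tokens)


# The 800-versus-200-token example from above, priced per call.
print(output_cost_usd(800), output_cost_usd(200))
print(worst_case_output_cost("classify_ticket"))
```

Because the cap bounds the worst case, the same table doubles as an input to the cost forecast: calls per route times worst-case output cost gives a ceiling on the output half of the bill.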

Batch what isn't interactive

If a request doesn't need to come back in the same second, it shouldn't pay interactive prices. Bulk processing — embedding an entire customer corpus, nightly summarization, backfilling classifications — runs through the batch API at half the standard rate. Most jobs complete within an hour, with a 24-hour ceiling.

Separate your volumes. Interactive traffic gets the synchronous path with tight latency budgets. Everything asynchronous gets queued, batched, and run during off-peak windows at 50 percent off. Teams that route bulk work through the interactive endpoint because it was already there are leaving half the bulk bill on the table.

Hard ceilings, not hope

Every cost control above is a steady-state optimization. None of them save you from a runaway loop or an abusive caller. For that you need ceilings that the system enforces, not guidelines it tries to follow.

Per-user and per-tenant token budgets, tracked in real time, enforced at the gateway. When a user crosses their daily allowance, requests are rejected or downgraded — not silently served. Per-feature global ceilings so one misbehaving endpoint can't drain the month. Circuit breakers on the agentic paths: an agent that loops can burn $40 of inference deciding it can't do a task, so cap iterations and cap cumulative spend per task. The newest models expose a task-budget mechanism that tells the model how many tokens it has for a full loop and lets it self-moderate against a running countdown — useful, but it's a suggestion the model sees, not an enforced cap. Keep a hard max_tokens ceiling underneath it that the model never sees.
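A minimal in-memory sketch of the two enforcement pieces: a per-tenant budget gate and a per-task circuit breaker. All names are hypothetical, the daily reset is omitted, and a real gateway would keep these counters in a shared store rather than process memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a request would push a tenant past its allowance."""


@dataclass
class TokenBudgetGate:
    daily_limit: int
    used: dict = field(default_factory=lambda: defaultdict(int))

    def check(self, tenant: str, estimated_tokens: int) -> None:
        # Reject before the model is called, not after the money is spent.
        if self.used[tenant] + estimated_tokens > self.daily_limit:
            raise BudgetExceeded(f"{tenant} would exceed {self.daily_limit} tokens")

    def record(self, tenant: str, actual_tokens: int) -> None:
        self.used[tenant] += actual_tokens


@dataclass
class TaskBreaker:
    max_iterations: int
    max_spend_usd: float
    iterations: int = 0
    spend_usd: float = 0.0

    def step(self, cost_usd: float) -> bool:
        """Record one agent iteration; return False once either ceiling is reached."""
        self.iterations += 1
        self.spend_usd += cost_usd
        return (self.iterations < self.max_iterations
                and self.spend_usd < self.max_spend_usd)


gate = TokenBudgetGate(daily_limit=1000)
gate.record("acme", 900)
gate.check("acme", 50)            # within budget, no exception
try:
    gate.check("acme", 200)       # would cross the limit
    rejected = False
except BudgetExceeded:
    rejected = True

breaker = TaskBreaker(max_iterations=3, max_spend_usd=1.0)
steps = [breaker.step(0.1) for _ in range(3)]
print(rejected, steps)
```

The caller decides what rejection means (an error, a downgrade to a cheaper model, a queue); the gate only guarantees the decision is made before the call.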

And alert on the derivative, not just the level. A bill that doubles overnight is a bug: a retry storm, a cache that stopped hitting, a new client integration that forgot to paginate. You want to know within the hour, from an automated cost monitor that watches per-request spend, not in 30 days from an invoice.
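A sketch of a rate-of-change check, assuming spend has already been rolled up into hourly totals. The function name, the 24-hour window, and the 2x ratio are illustrative choices, not a recommendation for every traffic pattern.

```python
from statistics import median


def spend_spike(hourly_spend: list, window: int = 24, ratio: float = 2.0) -> bool:
    """True when the latest hour costs at least `ratio` times the trailing median."""
    if len(hourly_spend) < 2:
        return False  # no baseline to compare against yet
    *history, latest = hourly_spend
    baseline = median(history[-window:])
    if baseline <= 0:
        # Any spend after a run of zero-cost hours is worth a look.
        return latest > 0
    return latest / baseline >= ratio


print(spend_spike([10.0, 11.0, 9.0, 10.0, 25.0]))
print(spend_spike([10.0, 11.0, 9.0, 10.0, 12.0]))
```

The median keeps one earlier spike from inflating the baseline. Traffic with strong daily cycles would want a comparison against the same hour on previous days instead.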

What fixed looks like

Inference cost is a line in the financial model with a forecast next to it, accurate within 15 percent. Every request is attributed to a feature and a user. The cache hit rate on stable-prefix traffic is above 80 percent and you can prove it from cache_read_input_tokens. Cheap models handle the majority of calls; the frontier model is the escalation. Bulk work runs through batch at half price. Per-tenant ceilings reject abuse instead of absorbing it. A cost spike pages someone within the hour. When traffic triples, the bill scales predictably and nobody is surprised.

This is for you if

You're a funded US company that shipped an AI feature, has real usage, and just watched inference become a top-three operating cost with no forecast to defend it. Cost-architecture work of this kind sits inside a production AI build or audit, typically $50k+; designing token accounting, caching, routing, and enforced ceilings into a larger system build runs $100k+.

It's not for you if you're pre-launch with no traffic — you have no cost shape to optimize, and premature caching is just a cache you'll have to invalidate. It's not for you if your inference bill is a genuine rounding error against revenue; spend your engineering on the thing that's actually expensive. And it's not for you if the real problem is that the feature shouldn't exist — no amount of token discipline fixes a feature nobody uses.