Someone on the team read that the big AI companies fine-tune their models, and now there's a Jira ticket to fine-tune yours. The reasoning goes: our domain is specialized, the base model doesn't know our jargon, so we'll train it on our data and it'll get smart about our world. It sounds rigorous. It's usually wrong, and it's almost always the wrong first move.
Fine-tuning is the heaviest, most expensive, least reversible of the three ways to make a model do what you want. It's where you go when the lighter tools have run out, not where you start. The instinct to reach for it first comes from a misread of what fine-tuning actually does. It doesn't teach a model facts. It teaches a model behavior — a format, a style, a way of responding. If your problem is "the model doesn't know our data," fine-tuning is the wrong solution to the right problem. Most teams who think they need to fine-tune need to prompt better or retrieve better.
Start with prompting
Prompting is the default and it solves more than people expect. A well-constructed system prompt, a few in-context examples, and a clear instruction set will handle a large fraction of what teams reach for fine-tuning to do. The newest models follow instructions closely and literally — closer than the models that shaped the "you must fine-tune" folklore. The behavior you want is often one or two well-chosen examples away.
Prompting wins when: the task is expressible in instructions, the knowledge the model needs is either already in its training or fits in the context window, and you can iterate in minutes instead of training runs. There's no training pipeline, no dataset to curate, no model to host and version. You change a string and you're done. Cost is per-token inference, and prompt caching collapses the cost of a large stable system prompt to near-nothing on repeated calls — a frozen prefix reads at roughly a tenth of base input price.
The reason to exhaust prompting first isn't laziness. It's that prompting is the only one of the three with a fast feedback loop. You learn what the model can and can't do in an afternoon. That knowledge is what tells you whether you even have a fine-tuning problem.
The honest limit: prompting can't fix missing knowledge, and it can't reliably produce a rigid output format under adversarial input no matter how loud the instruction. Hit either wall and you move on — to RAG for the first, fine-tuning for the second.
RAG when the problem is knowledge
If the model doesn't know your facts — your documentation, your policies, your customer records, your product catalog — the answer is retrieval, not training. RAG puts the relevant facts into the context at query time, so the model reasons over current, authoritative data instead of trying to recall something it was never taught.
RAG is right when: the knowledge is large (won't fit in a static prompt), changes over time (training on it would freeze it stale), or must be attributable (you can cite the source document, which matters enormously in regulated domains). A support assistant over a knowledge base, a Q&A tool over contracts, an internal search that answers in prose — all RAG, none fine-tuning.
The critical thing fine-tuning advocates miss: fine-tuning bakes knowledge in at a point in time. The day after you train, your policy doc changes, and now your model confidently cites the old policy. RAG retrieves the current document every time. For anything that changes — which is most business knowledge — retrieval is not just easier, it's correct in a way fine-tuning structurally can't be.
RAG costs more per request than prompting (embedding, retrieval, larger context) and brings real architecture — a vector store, a chunking strategy, an ingestion pipeline, retrieval quality to monitor. But it solves the knowledge problem without the irreversibility of a trained model, and it keeps your facts in a place you can update with a ./automate re-index instead of a training run.
Fine-tuning when the problem is behavior
So when does fine-tuning actually pay off? When the problem is consistent behavior at scale that prompting can't reliably enforce, and the volume justifies the investment.
Fine-tuning earns its place when: you need a specific output format or style held perfectly across millions of calls and prompting gets it right 95 percent of the time but the 5 percent costs you; you've squeezed a long, expensive prompt down and want to bake that behavior into a smaller, cheaper model to cut per-call cost at volume; or the task is a narrow, stable transformation (a particular classification scheme, a house style) that won't change and runs constantly. The signal is behavioral consistency at high volume, not the model doesn't know things.
And the reality nobody puts on the ticket: fine-tuning is a maintenance commitment, not a one-time task. You need a curated, labeled dataset — typically thousands of high-quality examples, and curating them is the real work. You need an eval set to prove the fine-tuned model is actually better and didn't regress on cases the base model handled. You own a training pipeline. When the base model improves — and it improves often — your fine-tuned version is frozen on the old one until you re-run everything. You've traded a string you can edit for a pipeline you have to operate.
That last point is the one that sinks fine-tuning projects. Teams fine-tune, ship, and move on, and 18 months later they're running a stale custom model that's worse than the current base model with a good prompt, because nobody owns re-training and the base models kept getting better.
They combine
This isn't a three-way exclusive choice. The strongest production systems layer them. A fine-tuned model that produces a precise output format, fed retrieved context via RAG, steered by a tight system prompt. Fine-tuning handles the how (format, style, behavior), RAG handles the what (current, attributable facts), prompting handles the task framing on top. Each does the job it's actually good at.
The sequence that works: prompt until you hit a wall. If the wall is missing or changing knowledge, add RAG. If the wall is behavioral consistency at a volume that justifies the maintenance, add fine-tuning on top. Never skip to step three because it sounds the most serious.
And the layering has a cost discipline to it. Each layer you add is per-request cost and operational surface you're committing to. Prompting is a string and prompt caching makes a large stable prefix nearly free on repeated calls. RAG adds embedding, retrieval, and a vector store to operate. Fine-tuning adds a training pipeline and a re-training obligation. Add layers in the order that solves your actual wall, and stop the moment the wall is gone — a system carrying a fine-tuned model it didn't need is paying maintenance for a problem prompting already solved. The strongest architectures use exactly as many layers as the problem demands and not one more, because every layer is something a future engineer has to understand, debug, and keep current as the base models keep improving underneath them.
What fixed looks like
The decision is driven by the problem, not the prestige. "The model doesn't know X" is answered with retrieval. "The model won't hold a format under load" is answered with prompting first, fine-tuning only if prompting provably can't and the volume justifies it. There's an eval set that proves whichever approach you chose is actually better than the simpler one you skipped. If you fine-tuned, someone owns re-training against new base models, and you can name the metric that justified the maintenance burden. Nothing is frozen that should have been retrieved.
This is for you if
You're a funded US company with an AI feature in production, and a fine-tuning project is being scoped that may be solving a knowledge problem with a behavior tool. An architecture review of this kind is part of an AI build or audit, typically $50k+; designing the prompting / RAG / fine-tuning split inside a larger product, with the eval harness to defend it, runs $100k+.
It's not for you if you haven't seriously tried prompting yet — fine-tuning before you understand the base model's actual ceiling is buying a pipeline blind. It's not for you if your knowledge changes weekly; that's RAG, and fine-tuning will freeze you stale. And it's not for you if you can't commit to owning re-training — a fine-tuned model nobody maintains is a depreciating asset that quietly falls behind the base model you could have prompted.