๐๏ธRAG vs Fine-tuning vs Prompt Engineering
Three ways to inject knowledge into an LLM. Picking the right one saves months of wasted infrastructure.
Most teams over-invest in fine-tuning because it sounds sophisticated. The reality: prompting + RAG handles 95% of production cases at lower cost and faster iteration. Knowing when each is right is foundational AI PM judgment.
Default to prompt engineering โ it's the fastest. Add RAG when you have a corpus the model needs to ground in. Use fine-tuning only when prompt + RAG can't achieve consistency or specialized behavior. The decision matters because the infrastructure investment for fine-tuning is 5-10x that of RAG.
The three approaches
Prompt engineering. Craft the system prompt + few-shot examples to elicit the behavior you want. No training. No infrastructure. Iterate in hours.
When it's enough:
- Behavior can be specified in instructions
- Knowledge fits in context window
- Knowledge doesn't change often
When it's not:
- Knowledge is too large for context (whole knowledge base)
- Knowledge changes frequently
- Model needs to learn a specialized style or capability
RAG (Retrieval-Augmented Generation). Store corpus in vector DB. At query time, retrieve relevant chunks. Inject into prompt. Let the LLM answer based on retrieved context.
When to use:
- Large or constantly-updating corpus (docs, support tickets, internal wiki)
- Need fresh data (model's training cutoff doesn't include recent info)
- Need citations / source attribution
- Cost-sensitive (smaller prompts than stuffing everything in)
This is the dominant production AI pattern in 2026.
Fine-tuning. Further-train a base model on your data. Either supervised fine-tuning (input-output pairs) or RLHF (reinforcement learning from human feedback).
When to use:
- Need consistent output format that prompt can't enforce
- Specialized capability the base model lacks
- High volume where smaller fine-tuned model is cheaper than frontier model per call
- Domain language so different that prompting can't bridge
When NOT to use:
- Knowledge injection (RAG is better โ easier to update, cheaper)
- General improvement (frontier models keep improving; your fine-tuned model gets stale)
- Quick iteration (fine-tuning takes days to weeks; prompting takes minutes)
The decision tree
- Can you specify the behavior in <2K tokens of prompt + examples? โ Prompting.
- Do you have a corpus the model needs to reference? โ Add RAG.
- Even with RAG, is the model failing consistently? โ Consider fine-tuning, but try prompt + RAG with frontier model first.
The cost economics
For a moderate AI feature (5K input + 2K output, 100K calls/month):
- Prompting + RAG with frontier model: ~$5K-15K/month
- Fine-tuned smaller model: ~$2K-5K/month + $5-20K one-time fine-tuning cost
- Self-hosted open-source model: ~$2K-10K/month (infra) + engineering time
Fine-tuning's cost advantage only materializes at high call volumes (millions/month). Most products don't get there.
The hybrid that works in 2026
The pattern most production AI products converge on:
- Frontier model + RAG for complex/high-value calls
- Smaller model for high-volume routing and intent classification
- Fine-tuned model for specialized capabilities (very rare)
- Eval-driven optimization at every layer
Real-world examples
Perplexity is essentially a sophisticated RAG product โ search the web, retrieve relevant content, ground the LLM in real sources. The fine-tuning matters less than the retrieval quality. The discipline of investing in retrieval over model-tuning is one reason they out-product competitors.
Go deeper โ recommended reading
Interview questions (1)
Q1I'm building a customer support AI. Should I use RAG or fine-tune?ai-pmseniorโผ
Almost certainly RAG.
Customer support requires three things AI must do well: (1) answer based on accurate, current information (your docs, your product), (2) cite sources so the human escalator can verify, (3) handle the case where the answer isn't in the corpus.
RAG handles all three. Fine-tuning handles none well โ fine-tuned models can't easily cite, don't update when your docs change, and tend to hallucinate when the question is outside the training data.
The right architecture: store your docs + support tickets in a vector DB, retrieve top relevant chunks per query, prompt a frontier model with the chunks + clear instructions to cite or escalate. Build an eval suite with 100+ real customer questions + expected answers. Measure.
You'd consider fine-tuning only if RAG + prompting + frontier model couldn't achieve the quality bar, AND you had millions of queries per month where the cost difference mattered. For most companies, that day never comes.