Large language models are remarkably capable, but they have a fundamental limitation: they only know what was in their training data. Ask about your company's return policy, last quarter's revenue, or the status of a customer's order, and the model either hallucinates an answer or admits it does not know.
Retrieval-Augmented Generation (RAG) solves this problem. It gives language models access to your specific data at query time, combining the model's reasoning ability with your organization's actual knowledge.
How RAG Works
The RAG pipeline has three stages:
1. Indexing
Your documents — help articles, product documentation, internal wikis, PDF reports — are broken into smaller chunks and converted into numerical representations called embeddings. These embeddings capture the semantic meaning of each chunk and are stored in a vector database.
2. Retrieval
When a user asks a question, their query is also converted into an embedding. The vector database finds the document chunks whose embeddings are most similar to the query — the chunks most likely to contain the answer.
3. Generation
The retrieved chunks are injected into the language model's prompt as context. The model generates a response grounded in your actual documents rather than its general training data.
The result is an AI system that gives accurate, specific answers based on your organization's knowledge, with the ability to cite sources.
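Concretely, a minimal version of this pipeline fits in a few dozen lines. The sketch below assumes the OpenAI Python SDK for embeddings and generation; the model names, sample chunks, and in-memory similarity search are illustrative stand-ins for whatever provider and vector database you actually use.

```python
# Minimal RAG pipeline sketch: index, retrieve, generate.
# Assumes the OpenAI Python SDK (v1-style client); any embedding/chat provider works similarly.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

# 1. Indexing: embed document chunks and keep the vectors next to the text.
chunks = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "Refunds are issued to the original payment method within 5 business days.",
]
chunk_vectors = embed(chunks)

# 2. Retrieval: embed the query and rank chunks by cosine similarity.
query = "How long do I have to return an item?"
query_vector = embed([query])[0]
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:2]]

# 3. Generation: answer using only the retrieved context.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
)
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```

In production, the in-memory arrays are replaced by a vector database and the sample chunks by your real documents, but the three stages stay the same.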
Why RAG Beats Fine-Tuning for Most Use Cases
Fine-tuning a language model on your data is another approach to making it domain-specific. However, for most business applications, RAG has significant advantages:
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Data freshness | Instant — update the index | Requires retraining |
| Cost | Low — no training compute | High — GPU hours for training |
| Accuracy | High — grounded in actual docs | Variable — can overfit or forget |
| Transparency | Can cite the retrieved sources | Cannot trace answers back to training data |
| Maintenance | Update documents, re-index | Retrain periodically |
Fine-tuning is better when you need to change the model's behavior or style (e.g., making it follow a specific communication tone). RAG is better when you need the model to know specific facts and information.
Building a Production RAG System
Choosing a Vector Database
The vector database is the core infrastructure component. Popular options:
- Pinecone — fully managed, scales automatically, good for teams that want minimal operational overhead
- Weaviate — open source, supports hybrid search (vector + keyword), self-hostable
- Qdrant — open source, fast, efficient memory usage, good for on-premise deployments
- pgvector — PostgreSQL extension, good if you already use Postgres and want to avoid adding another database
For most business applications, any of these will work well. Choose based on your existing infrastructure and operational preferences.
Chunking Strategy
How you split documents into chunks significantly affects retrieval quality:
- Too large — chunks contain too much information, diluting the relevant parts
- Too small — chunks lack sufficient context for the model to generate useful answers
- Overlap — chunks should overlap slightly (50-100 tokens) to avoid splitting important information across chunk boundaries
A good starting point is 500-800 token chunks with 100 token overlap. Adjust based on your content type and retrieval performance.
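A simple token-based chunker along these lines, assuming the tiktoken package for tokenization (any tokenizer works; the defaults mirror the starting point above):

```python
# Token-window chunking with overlap; assumes the tiktoken package.
import tiktoken

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Fixed windows like this are only a baseline; splitting on headings, paragraphs, or sentence boundaries usually retrieves better than cutting mid-sentence.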
Embedding Model Selection
The embedding model converts text into vectors. Choose one that balances retrieval quality against cost and latency:
- OpenAI text-embedding-3-small — good balance of quality and cost for most applications
- Cohere embed-v3 — strong multilingual support
- Open source (e.g., BGE, E5) — self-hostable, no API costs, good for privacy-sensitive applications
Use the same embedding model for indexing and querying. Mixing models produces poor results because different models place similar concepts at different locations in the vector space.
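One way to make this hard to get wrong is to pin the model name in a single place and route both indexing and querying through it. A sketch, again assuming the OpenAI SDK:

```python
# Pin the embedding model in one place so indexing and querying can never drift apart.
from openai import OpenAI

EMBEDDING_MODEL = "text-embedding-3-small"  # changing this means re-embedding the whole index
client = OpenAI()

def embed_for_index(chunks: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model=EMBEDDING_MODEL, input=chunks)
    return [item.embedding for item in resp.data]

def embed_for_query(query: str) -> list[float]:
    resp = client.embeddings.create(model=EMBEDDING_MODEL, input=[query])
    return resp.data[0].embedding
```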
Retrieval Optimization
Basic vector similarity search works for simple cases. For production systems, consider:
- Hybrid search — combine vector similarity with keyword matching (BM25) for better recall
- Metadata filtering — narrow the search space by document type, date, or category before vector search
- Re-ranking — retrieve more candidates than needed, then use a cross-encoder to re-rank by relevance
- Query expansion — generate multiple phrasings of the user's query to improve recall
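A common way to implement hybrid search is to run the keyword and vector searches separately and fuse the two rankings with reciprocal rank fusion. The sketch below assumes the rank_bm25 package for keyword scoring and fakes the vector ranking, which in practice would come from the similarity search shown earlier:

```python
# Hybrid search sketch: fuse keyword (BM25) and vector rankings with reciprocal rank fusion.
# Assumes the rank_bm25 package; the vector ranking is a stand-in for a real similarity search.
from rank_bm25 import BM25Okapi

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each ranking is a list of chunk ids, best first; RRF rewards chunks that rank well in any list.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

chunks = [
    "Returns are accepted within 30 days of purchase.",
    "International shipping takes 7-14 business days.",
    "Refunds are issued to the original payment method.",
]
query = "how do refunds work"

bm25 = BM25Okapi([c.lower().split() for c in chunks])
keyword_scores = bm25.get_scores(query.lower().split())
keyword_ranking = sorted(range(len(chunks)), key=lambda i: keyword_scores[i], reverse=True)

vector_ranking = [2, 0, 1]  # placeholder: chunk ids returned by your vector database, best first

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print([chunks[i] for i in fused[:2]])  # candidates you would then pass to a re-ranker
```

Re-ranking then scores each (query, chunk) pair with a cross-encoder and keeps only the best few, which is slower per pair but more precise than embedding similarity alone.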
Response Generation
When generating the response, several prompt engineering techniques improve quality:
- Instruction clarity — explicitly tell the model to answer only from provided context
- Source attribution — ask the model to cite which document chunks it used
- Confidence signaling — instruct the model to say "I don't have information about this" rather than guessing
- Format specification — define the expected response format (length, structure, tone)
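Put together, a prompt template covering all four points might look like this; the exact wording is illustrative, not a canonical prompt:

```python
# Prompt template applying the four points above: context-only answers, source citations,
# an explicit "I don't know" path, and a fixed output format.
RAG_PROMPT = """You are a support assistant answering from company documentation.

Rules:
1. Answer using ONLY the context below.
2. Cite the id of every chunk you relied on, like [doc-3].
3. If the context does not contain the answer, reply exactly: "I don't have information about this."
4. Keep the answer under 150 words, in plain prose.

Context:
{context}

Question: {question}
"""

def build_prompt(chunks: list[dict], question: str) -> str:
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return RAG_PROMPT.format(context=context, question=question)
```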
Common RAG Pitfalls
Poor Chunking
The most impactful variable is often the least considered. Tables, code blocks, and structured data need special handling — naively splitting them mid-row or mid-function produces garbage chunks.
Ignoring Metadata
Vector similarity is powerful but not always sufficient. If a user asks "What changed in version 3.2?", vector search might return chunks from version 2.8 that discuss similar concepts. Metadata filtering (version = 3.2) solves this instantly.
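Most vector databases expose this as a filter parameter on the query. In plain Python over chunks that carry metadata, the idea looks like this (a sketch; the chunk structure is assumed):

```python
# Metadata filtering sketch: narrow by exact metadata first, then rank by vector similarity.
import numpy as np

def search(query_vec, chunks, *, version=None, top_k=5):
    # chunks: list of dicts like {"text": ..., "vector": np.ndarray, "version": "3.2"}
    candidates = [c for c in chunks if version is None or c["version"] == version]
    def cosine(c):
        return float(np.dot(c["vector"], query_vec)
                     / (np.linalg.norm(c["vector"]) * np.linalg.norm(query_vec)))
    return sorted(candidates, key=cosine, reverse=True)[:top_k]

# "What changed in version 3.2?"  ->  search(query_vec, chunks, version="3.2")
```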
No Evaluation Pipeline
How do you know if your RAG system is actually returning good answers? Build an evaluation pipeline with:
- A set of test questions with known correct answers
- Automated scoring (retrieval recall, answer accuracy, faithfulness)
- Regular evaluation runs after any system changes
Without systematic evaluation, you are flying blind.
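A minimal retrieval-recall check captures the idea; the test cases and retriever interface are placeholders you would define for your own system:

```python
# Minimal evaluation sketch: measure retrieval recall over a fixed set of test questions.
# `retrieve(question, top_k)` is whatever search function your system exposes.

TEST_CASES = [
    {"question": "How long is the return window?", "expected_chunk_ids": {"returns-001"}},
    {"question": "Do you ship internationally?", "expected_chunk_ids": {"shipping-004"}},
]

def retrieval_recall(retrieve, top_k: int = 5) -> float:
    hits = 0
    for case in TEST_CASES:
        retrieved_ids = {c["id"] for c in retrieve(case["question"], top_k=top_k)}
        if case["expected_chunk_ids"] & retrieved_ids:
            hits += 1
    return hits / len(TEST_CASES)

# Run after every chunking, embedding, or retrieval change; fail the build if recall drops.
```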
Stale Data
If your documents change frequently (product updates, policy changes, pricing adjustments), your index must be updated accordingly. Build automated re-indexing pipelines that trigger when source documents change.
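A lightweight way to do this is to hash each source document and re-embed only the ones whose hash changed. A sketch, where `reindex_document` stands in for your own chunk, embed, and upsert step:

```python
# Re-indexing sketch: hash source files and re-embed only the documents that changed.
import hashlib
import json
from pathlib import Path

HASH_FILE = Path("index_hashes.json")

def reindex_changed(doc_dir: str, reindex_document) -> None:
    # reindex_document(path) is your own function: chunk, embed, and upsert one document.
    old_hashes = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    new_hashes = {}
    for path in Path(doc_dir).glob("**/*.md"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new_hashes[str(path)] = digest
        if old_hashes.get(str(path)) != digest:
            reindex_document(path)
    HASH_FILE.write_text(json.dumps(new_hashes, indent=2))
```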
When NOT to Use RAG
RAG is not always the right answer:
- Real-time data — if the answer requires live data (current stock price, order status), use API calls instead of pre-indexed documents
- Simple lookups — if the answer is always a direct database query (account balance, tracking number), RAG adds unnecessary complexity
- Creative tasks — writing, brainstorming, and code generation rarely benefit from RAG
- Behavioral changes — if you need the model to adopt a specific persona or communication style, fine-tuning is more appropriate
The RAG Maturity Curve
Organizations typically progress through three stages:
Stage 1: Basic RAG — Single document source, basic chunking, vanilla vector search. Gets you 70% of the way.
Stage 2: Optimized RAG — Multiple sources, smart chunking, hybrid search, re-ranking, metadata filtering. Gets you to 90%.
Stage 3: Agentic RAG — The retrieval system is wrapped in an AI agent that decides what to search for, evaluates the results, performs follow-up searches if needed, and synthesizes information from multiple retrievals. Gets you to 95%+.
Most businesses should start at Stage 1 and progress as needed. Stage 1 is often good enough for internal knowledge bases and FAQ-style support.
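To make Stage 3 concrete, here is a sketch of an agentic retrieval loop. The `search` and `ask_llm` helpers are placeholders for whatever retrieval stack and model API you use; real agent frameworks add tool-calling and structured outputs on top of the same idea:

```python
# Agentic RAG sketch: the model decides what to search for, judges the results,
# and searches again until it has enough context to answer. Helper functions are placeholders.

MAX_STEPS = 3

def agentic_answer(question: str, search, ask_llm) -> str:
    # search(query) -> list of {"text": ...} chunks; ask_llm(prompt) -> str.
    gathered = []
    query = question
    for _ in range(MAX_STEPS):
        gathered += search(query)
        context = "\n".join(c["text"] for c in gathered)
        verdict = ask_llm(
            f"Question: {question}\n\nRetrieved so far:\n{context}\n\n"
            "Reply with ENOUGH if this answers the question, "
            "otherwise reply with a better search query."
        )
        if verdict.strip().upper() == "ENOUGH":
            break
        query = verdict  # follow-up search with the refined query
    return ask_llm(
        "Answer the question using only this context:\n"
        + "\n".join(c["text"] for c in gathered)
        + f"\n\nQuestion: {question}"
    )
```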
