# RAG: Retrieval-Augmented Generation (Deep Dive)
In Lesson 1, we introduced RAG as a way to ground LLM responses in external knowledge. Now we go deep: production-grade chunking strategies, embedding models, vector databases, retrieval techniques, and evaluation frameworks.
## Why RAG Over Fine-Tuning?

RAG keeps knowledge external: updating the system means re-indexing documents, not retraining a model, and retrieved passages can be surfaced as citations. Fine-tuning bakes knowledge into the model weights, is expensive to refresh, and offers no attribution. A common rule of thumb: fine-tune for style and task format, use RAG for facts.
## Chunking Strategies
Before embedding, documents must be split into chunks — passages small enough to be individually embedded and retrieved. Chunk size critically affects RAG quality.
### Fixed-Size Chunking

Split text into chunks of N characters (or tokens) with optional overlap.

```python
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step back by `overlap` so context carries across chunk boundaries
        start = end - overlap
    return chunks
```

**Pros:** Simple, predictable chunk sizes
**Cons:** May split sentences or paragraphs mid-thought
### Semantic Chunking

Split at natural boundaries (paragraphs, sections, sentences) and merge small chunks up to a size limit.

```python
import re

def semantic_chunk(text, max_size=500):
    # Split on double newlines (paragraph boundaries)
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) < max_size:
            current += para + "\n\n"
        else:
            if current:
                chunks.append(current.strip())
            current = para + "\n\n"
    if current:
        chunks.append(current.strip())
    return chunks
```

**Pros:** Preserves semantic coherence
**Cons:** Variable chunk sizes
### Recursive Chunking

Used by LangChain's `RecursiveCharacterTextSplitter`. It tries to split on the largest natural boundary first (`"\n\n"`), then falls back to smaller ones (`"\n"`, `". "`, `" "`).
| Strategy | Best For |
|---|---|
| Fixed-size | Uniform content (e.g., product descriptions) |
| Semantic | Structured documents (articles, reports) |
| Recursive | General-purpose, mixed content |
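The recursive strategy can be sketched without any library: try the coarsest separator first, pack pieces up to the size limit, and recurse when an accumulated span is still too large. This is a simplified illustration of the idea behind `RecursiveCharacterTextSplitter`, not its actual implementation:

```python
def recursive_chunk(text, max_size=500, separators=("\n\n", "\n", ". ", " ")):
    # Base case: the text already fits in one chunk.
    if len(text) <= max_size:
        return [text]
    # Try separators from coarsest to finest.
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                piece = part + sep
                if len(current) + len(piece) > max_size and current:
                    # Recurse: the accumulated span may still be too large.
                    chunks.extend(recursive_chunk(current.strip(), max_size, separators))
                    current = ""
                current += piece
            if current.strip():
                chunks.extend(recursive_chunk(current.strip(), max_size, separators))
            return chunks
    # No separator found: fall back to a hard character split.
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]
```

Note the fallback at the end: if no natural boundary exists (e.g., one unbroken string), it degrades to fixed-size splitting.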
### Chunk Size Guidelines

As a rule of thumb, chunks of roughly 200-500 tokens work well for question answering: small enough to be specific, large enough to carry context. Larger chunks (500-1000 tokens) suit summarization-style queries.

### Always Add Overlap

Without overlap, a sentence that spans a chunk boundary is cut in two, and neither half retrieves well. An overlap of roughly 10-20% of the chunk size preserves context across boundaries.
## Embedding Models
Embedding models convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors.
| Model | Dimensions | Speed | Quality | Open Source |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Yes |
| all-mpnet-base-v2 | 768 | Medium | Better | Yes |
| text-embedding-3-small (OpenAI) | 1536 | API | Very Good | No |
| text-embedding-3-large (OpenAI) | 3072 | API | Excellent | No |
| embed-v3 (Cohere) | 1024 | API | Excellent | No |
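Whatever model you choose, retrieval ultimately compares vectors with a similarity measure, most commonly cosine similarity. A minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = orthogonal (unrelated), -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice you would call this on embedding vectors (e.g., the 384-dimensional output of all-MiniLM-L6-v2) rather than toy 2-D examples.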
## Vector Databases
Vector databases are purpose-built to store, index, and search high-dimensional vectors efficiently.
| Database | Type | Best For |
|---|---|---|
| Chroma | Embedded (local) | Prototyping, small datasets |
| Pinecone | Managed cloud | Production, zero-ops |
| Weaviate | Self-hosted / cloud | Hybrid search, GraphQL |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Qdrant | Self-hosted / cloud | High performance, filtering |
| FAISS | Library (Meta) | Research, maximum speed |
### Chroma Example

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add documents (Chroma embeds them automatically)
collection.add(
    documents=["Doc 1 text...", "Doc 2 text..."],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "wiki"}, {"source": "blog"}],
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=3,
)
print(results["documents"])
```
## Retrieval Strategies

### Similarity Search (Basic)
Find the k vectors closest to the query vector using cosine similarity or L2 distance. Simple but may return redundant results.
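A minimal sketch of brute-force top-k cosine search over a small document set (`top_k` is an illustrative name; real vector databases use approximate indexes to scale this):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every document vector at once.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    # Indices of the k most similar documents, best first.
    return np.argsort(-sims)[:k].tolist()
```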
### MMR (Maximal Marginal Relevance)

Balances relevance and diversity. After the most relevant chunk is selected, each subsequent chunk is chosen to be relevant to the query but different from the already-selected chunks.
MMR = argmax_{d ∈ R\S} [ λ · sim(d, q) − (1 − λ) · max_{d' ∈ S} sim(d, d') ]

where q is the query, R the candidate set, S the already-selected set, and λ trades off relevance against diversity.
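This formula is applied greedily, one pick at a time. A sketch (`mmr_select` and `cosine` are illustrative names):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, doc_vecs, k=3, lam=0.5):
    # Greedy MMR: each pick maximizes relevance minus redundancy.
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(doc_vecs[i], query_vec)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` this degenerates to plain similarity search; lowering `lam` pushes later picks away from documents already chosen.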
### Hybrid Search
Combines semantic search (embeddings) with keyword search (BM25/TF-IDF). This handles cases where exact keyword matches matter (e.g., product IDs, technical terms).
final_score = alpha * semantic_score + (1 - alpha) * keyword_score
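Because semantic and keyword scores live on different scales (cosine similarity vs. raw BM25), they are usually normalized before blending. A minimal sketch, assuming min-max normalization:

```python
def minmax_normalize(scores):
    # Map scores to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_scores(semantic, keyword, alpha=0.7):
    sem = minmax_normalize(semantic)
    kw = minmax_normalize(keyword)
    # Weighted blend per the formula above.
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem, kw)]
```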
## Production RAG Patterns

### Re-Ranking
After initial retrieval (fast, approximate), use a cross-encoder model to re-rank the top results more accurately.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial retrieval returns 20 candidates
candidates = [...]

# Re-rank with cross-encoder
scores = reranker.predict(
    [(query, doc) for doc in candidates]
)

# Sort by re-ranked score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in reranked[:5]]
```
### Query Expansion
Use the LLM to rewrite or expand the user's query before retrieval:
Original: "Python web frameworks"
Expanded: "Python web frameworks Django Flask FastAPI comparison"
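One lightweight way to implement this is to wrap the query in a rewrite prompt and send it to your LLM of choice. The prompt wording below is illustrative, and `expansion_prompt` is a hypothetical helper:

```python
def expansion_prompt(query):
    # Illustrative prompt; tune the wording for your own LLM.
    return (
        "Rewrite this search query, adding likely synonyms and related "
        "terms. Return only the expanded query.\n"
        f"Query: {query}\n"
        "Expanded query:"
    )
```

The LLM's response then replaces the original query string at retrieval time.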
### HyDE (Hypothetical Document Embeddings)
Instead of embedding the query directly, ask the LLM to generate a hypothetical answer, then embed that answer for retrieval. The hypothetical answer is closer in embedding space to the actual relevant documents.
Query: "How does photosynthesis work?"
HyDE: "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen using chlorophyll..."

Embed the HyDE text -> search -> retrieve -> generate final answer.
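The pipeline can be sketched with the model, embedder, and vector store passed in as callables; all three interfaces here are hypothetical stand-ins for your actual clients:

```python
def hyde_retrieve(query, llm, embed, search, n_results=3):
    # Generate a hypothetical answer, then retrieve using ITS embedding
    # rather than the raw query's. llm, embed, and search are
    # caller-supplied callables (hypothetical interfaces).
    hypothetical = llm(f"Write a short passage answering: {query}")
    return search(embed(hypothetical), n_results)
```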
## RAG Evaluation
The RAGAS framework provides standardized metrics for evaluating RAG systems:
| Metric | What It Measures |
|---|---|
| Faithfulness | Is the answer supported by the retrieved context? (no hallucination) |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | Are the retrieved documents relevant to the question? |
| Context Recall | Were all necessary documents retrieved? |
### Manual Evaluation Approach
```python
def evaluate_faithfulness(answer: str, context: str) -> float:
    """Check what fraction of answer claims are supported by context."""
    # Split answer into claims
    # Check each claim against context
    # Return supported_claims / total_claims
    pass


def evaluate_context_precision(retrieved_docs: list, relevant_docs: list) -> float:
    """What fraction of retrieved docs are actually relevant?"""
    relevant_set = set(relevant_docs)
    hits = sum(1 for doc in retrieved_docs if doc in relevant_set)
    return hits / len(retrieved_docs) if retrieved_docs else 0.0


def evaluate_context_recall(retrieved_docs: list, relevant_docs: list) -> float:
    """What fraction of relevant docs were actually retrieved?"""
    retrieved_set = set(retrieved_docs)
    hits = sum(1 for doc in relevant_docs if doc in retrieved_set)
    return hits / len(relevant_docs) if relevant_docs else 0.0
```