Beginner
Think of RAG as open-book answering. The model is allowed to consult a library before responding. This is useful when facts change, when the knowledge is private, or when you need the answer tied to a source rather than to the model's internal memory.
Why RAG exists
- Model knowledge gets stale: product docs, policies, prices, and regulations change.
- Many facts are private: the model was never trained on your internal documents.
- Users need verification: grounded answers can cite the supporting evidence.
- Updating retrieval is cheaper than retraining: you can refresh the knowledge base without rebuilding the model.
Classic RAG flow
```
User question
      |
      v
Search the knowledge base
      |
      v
Retrieve top relevant passages
      |
      v
Assemble prompt with evidence
      |
      v
LLM writes answer and cites sources
```
Real-world example: an internal HR assistant should answer from the current employee handbook, not from a general idea of what HR policies often look like.
When RAG is a good fit
- Enterprise search assistants over internal documents.
- Customer support bots that must reference the latest documentation.
- Compliance or policy assistants that need source-backed answers.
- Research copilots that summarize retrieved notes, papers, or reports.
When RAG is not enough by itself
- If the answer requires heavy reasoning over many scattered facts, simple one-shot retrieval may fail.
- If the knowledge source is incomplete or outdated, the answer will still be incomplete or outdated.
- If the task is mostly behavior change rather than knowledge access, fine-tuning may be the better tool.
RAG does not make the model magically truthful. It gives the model a better chance to be truthful by putting evidence in front of it.
```python
documents = [
    "Vacation policy: full-time employees receive 20 days per year.",
    "Parental leave policy: 16 weeks paid leave for primary caregivers."
]

query = "How many vacation days do full-time employees get?"

# Very small stand-in for retrieval.
def retrieve(query, documents):
    keywords = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(keywords & set(doc.lower().split()))
        scored.append((overlap, doc))
    scored.sort(reverse=True)
    return [doc for score, doc in scored if score > 0][:2]

contexts = retrieve(query, documents)

prompt = f"""
Answer the question using only the context below.
If the answer is not supported, say: Not enough evidence.

Question: {query}

Context:
- """ + "\n- ".join(contexts)

print(prompt)
```
Advanced
High-quality RAG depends more on retrieval quality, evidence selection, and answer constraints than on raw model cleverness. Many supposed model failures are actually retrieval failures: wrong passages were fetched, the right passages were ranked too low, or the prompt encouraged unsupported synthesis.
Anatomy of a strong RAG system
- Query understanding: rewrite vague user questions into search-friendly queries, especially when pronouns, abbreviations, or prior chat context matter.
- Retrieval breadth: the first pass should maximize recall so the right evidence is not missed.
- Ranking precision: a second pass such as semantic ranking or reranking improves the final context set.
- Grounded prompting: explicitly instruct the model to answer only from retrieved evidence and to abstain otherwise.
- Provenance: return citations, source IDs, or quoted spans so the user can inspect the basis of the answer.
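As a minimal illustration of the query-understanding step, here is a hypothetical rewrite function. The abbreviation map and the "vague follow-up" heuristic are placeholders invented for this sketch, not a real API; production systems often use an LLM call for this step instead.

```python
# Hypothetical query rewriting before retrieval. The abbreviation map and
# the vagueness heuristic below are illustrative placeholders.
ABBREVIATIONS = {"pto": "paid time off", "wfh": "work from home"}

def rewrite_query(question: str, chat_history: list[str]) -> str:
    # Expand abbreviations the search index may not contain.
    words = [ABBREVIATIONS.get(w.lower(), w) for w in question.split()]
    rewritten = " ".join(words)
    # Follow-ups like "What about part-time?" carry no searchable topic,
    # so append the previous user turn as extra context for the searcher.
    vague = {"it", "that", "this", "they", "those"}
    if chat_history and (len(words) < 4 or vague & {w.lower().strip("?") for w in words}):
        rewritten = f"{rewritten} (context: {chat_history[-1]})"
    return rewritten

print(rewrite_query("How much pto do I get?", []))
print(rewrite_query("What about part-time?", ["How much paid time off do I get?"]))
```

The point is not the specific heuristic but the interface: retrieval sees a self-contained, search-friendly query instead of the raw conversational turn.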
Retrieval patterns worth knowing
| Pattern | Why it helps | Main risk |
|---|---|---|
| Top-k retrieval | Simple baseline for pulling a small evidence set. | Relevant evidence may sit just below the cutoff. |
| Hybrid retrieval | Combines semantic similarity with exact-term matching for codes, names, or policy IDs. | Fusion and weighting can be tuned poorly. |
| Reranking | Improves precision after a broad first-pass search. | Adds latency and extra model calls. |
| Multi-query retrieval | Breaks a complex question into focused subqueries and can recover evidence missed by a single query. | Can retrieve too much context or duplicate evidence. |
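The multi-query pattern from the table can be sketched with a toy keyword retriever. The subquery split is hand-written here; real systems usually generate subqueries with an LLM. Note the deduplication step, which addresses the "duplicate evidence" risk noted above.

```python
# Illustrative multi-query retrieval: run several focused subqueries,
# then merge results while dropping duplicate passages.
def keyword_retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    keywords = set(query.lower().split())
    scored = sorted(
        ((len(keywords & set(d.lower().split())), d) for d in documents),
        reverse=True,
    )
    return [d for score, d in scored[:top_k] if score > 0]

def multi_query_retrieve(subqueries: list[str], documents: list[str]) -> list[str]:
    merged, seen = [], set()
    for sub in subqueries:
        for doc in keyword_retrieve(sub, documents):
            if doc not in seen:  # deduplicate evidence across subqueries
                seen.add(doc)
                merged.append(doc)
    return merged

docs = [
    "Vacation policy: 20 days for full-time employees.",
    "Parental leave: 16 weeks paid for primary caregivers.",
    "Sick leave: 10 days per calendar year.",
]
# A compound question split into two focused subqueries.
print(multi_query_retrieve(["vacation days policy", "parental leave weeks"], docs))
```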
Common RAG failure modes
- Retrieval miss: the answer exists in the corpus but the system never retrieves it.
- Ranking error: useful evidence is found but lower-quality passages occupy the context window.
- Context dilution: too many mediocre passages reduce the signal from the best ones.
- Unsupported synthesis: the model combines fragments into a confident claim not fully justified by the sources.
- Stale grounding: the pipeline retrieves an outdated policy and presents it as current.
- Permission leak: the system retrieves documents the user should not be allowed to see.
```python
def build_rag_prompt(question, contexts):
    joined = "\n\n".join(
        f"Source {i+1}: {ctx['text']}\nCitation: {ctx['source']}"
        for i, ctx in enumerate(contexts)
    )
    return f"""
Answer the question using only the provided context.
If the answer is not supported, say you do not have enough evidence.
Quote the source number you used.

Question: {question}

{joined}
"""

contexts = [
    {"text": "Vacation policy: full-time employees receive 20 days.", "source": "hr_policy_v3.md"},
    {"text": "Part-time employees receive prorated leave.", "source": "hr_policy_v3.md"},
]
print(build_rag_prompt("What is the vacation policy?", contexts))
```
How to evaluate RAG
Do not judge a RAG system only by whether the final answer sounds good. Evaluate retrieval and generation as separate layers.
- Retrieval recall: did the relevant evidence appear in the retrieved set at all?
- Context precision: how much of the supplied context was actually useful?
- Faithfulness: does the answer stay consistent with the evidence?
- Citation accuracy: do cited passages really support the specific claim?
- Abstention quality: when evidence is weak, does the system decline instead of guessing?
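The first two checks can be computed directly once you have hand-labelled which passages are relevant for each query. A minimal sketch, with invented doc IDs and labels:

```python
# Retrieval-layer evaluation against hand-labelled relevant passages.
# Doc IDs and labels below are invented for illustration.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the labelled-relevant passages that appear in the top k.
    top = set(retrieved[:k])
    return len(relevant & top) / len(relevant) if relevant else 0.0

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of the supplied context that was actually useful.
    return sum(1 for d in retrieved if d in relevant) / len(retrieved) if retrieved else 0.0

retrieved = ["doc7", "doc2", "doc9", "doc4"]  # what the system returned
relevant = {"doc2", "doc4"}                   # what the labels say it needed

print(f"recall@4:          {recall_at_k(retrieved, relevant, k=4):.2f}")   # 1.00
print(f"context precision: {context_precision(retrieved, relevant):.2f}")  # 0.50
```

High recall with low precision, as here, suggests the ranking and context-size settings need work even though retrieval itself found the evidence.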
RAG debugging order:
1. Did the corpus contain the answer?
2. Did retrieval fetch the right evidence?
3. Did ranking place that evidence high enough?
4. Did the prompt force grounded answering?
5. Did the model still invent unsupported details?
Modern RAG systems may add query rewriting, metadata filters, conversational memory for follow-up questions, and multi-step retrieval for harder tasks. The core discipline stays the same: retrieve the right evidence, keep the context focused, and require the answer to stay grounded in what was retrieved.
If you cannot explain why a passage was retrieved, why it outranked alternatives, and why it is sufficient evidence for the answer, the RAG system is not mature yet.
Keep this module separate from the mechanics of embeddings, vector databases, and chunking. RAG depends on them, but its own core topic is grounded generation over retrieved evidence.
Search Algorithms
Retrieval quality is the single biggest lever in a RAG system. The algorithm that matches a query to documents determines which evidence the LLM ever sees. There are three families: dense vector search, sparse keyword search, and hybrid methods that combine them.
Cosine Similarity
The most widely used similarity measure for dense embeddings. It measures the angle between two vectors and ignores magnitude, which makes it robust to documents of different lengths.
```
cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

Range: −1 (opposite) to +1 (identical direction)
Typical threshold for "relevant": > 0.75
```
```python
import numpy as np

def cosine_similarity(a, b):
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return dot / norm if norm > 0 else 0.0

query_emb = np.array([0.6, 0.8, 0.0])  # pretend embedding
doc_emb = np.array([0.5, 0.7, 0.5])

score = cosine_similarity(query_emb, doc_emb)
print(f"Similarity: {score:.4f}")  # 0.8643 → above the 0.75 relevance threshold
```
Most vector databases (Pinecone, Weaviate, Qdrant) index on cosine similarity by default. Always normalise your embeddings to unit length first; then cosine similarity equals the dot product, which is faster to compute.
Dot Product
Raw inner product of two vectors. Faster than cosine because no normalisation step is needed, but magnitude matters: a long document whose embedding has high magnitude will always score higher than a short but equally relevant one unless you normalise first.
```python
def dot_product_similarity(a, b):
    return float(np.dot(a, b))

# Unit-normalise first → equivalent to cosine similarity
def normalise(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

score = dot_product_similarity(normalise(query_emb), normalise(doc_emb))
```
Euclidean Distance (L2)
Measures straight-line distance between two vectors in embedding space. Lower distance means higher similarity. Less popular than cosine for text because it is sensitive to embedding magnitude.
```
L2(A, B) = √( Σ (Aᵢ − Bᵢ)² )

Lower is better. Convert to similarity: sim = 1 / (1 + distance)
```
```python
def l2_similarity(a, b):
    distance = np.linalg.norm(a - b)
    return 1.0 / (1.0 + distance)  # map into (0, 1]; identical vectors score 1.0
```
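A quick check of the distance-to-similarity mapping, reusing the example vectors from the cosine section; the numbers are illustrative only:

```python
import numpy as np

def l2_similarity(a, b):
    # sim = 1 / (1 + L2 distance): identical vectors map to 1.0,
    # and similarity shrinks smoothly as vectors drift apart.
    distance = np.linalg.norm(a - b)
    return 1.0 / (1.0 + distance)

query_emb = np.array([0.6, 0.8, 0.0])
doc_emb = np.array([0.5, 0.7, 0.5])

print(f"Pair similarity: {l2_similarity(query_emb, doc_emb):.4f}")
print(f"Self similarity: {l2_similarity(query_emb, query_emb):.4f}")  # 1.0000
```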
BM25 (Sparse / Keyword Search)
Best Matching 25 is the standard sparse retrieval algorithm used by Elasticsearch and Solr. It scores documents based on term frequency (TF) and inverse document frequency (IDF) with length normalisation. Excels at exact-term matching: product codes, names, identifiers, rare technical terms.
```
                                tf(tᵢ, d) × (k₁ + 1)
BM25(q, d) = Σᵢ IDF(tᵢ) × ─────────────────────────────────────
                          tf(tᵢ, d) + k₁ × (1 − b + b × |d| / avgdl)

k₁ ≈ 1.2–2.0  (TF saturation)
b  ≈ 0.75     (length normalisation)
```
```python
from rank_bm25 import BM25Okapi

corpus = [
    "Vacation policy: 20 days for full-time employees.",
    "Parental leave: 16 weeks paid for primary caregivers.",
    "Sick leave: 10 days per calendar year.",
]
tokenised = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenised)

query = "how many vacation days"
scores = bm25.get_scores(query.lower().split())

ranked = sorted(zip(scores, corpus), reverse=True)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```
Approximate Nearest Neighbour (ANN)
Exact nearest-neighbour search over millions of vectors is too slow for production. ANN algorithms trade a small accuracy loss for orders-of-magnitude speed gain. The two dominant approaches are HNSW (graph-based) and IVF (inverted-file / cluster-based). FAISS is the most widely used library for both.
| Algorithm | How it works | Best for |
|---|---|---|
| HNSW | Hierarchical navigable small-world graph; traverses a multi-layer graph to find close neighbours fast. | Low latency, high recall. Default in most vector DBs. |
| IVF (FAISS) | Clusters vectors into Voronoi cells; searches only nearby cells at query time. | Very large corpora; tunable speed / recall trade-off. |
| LSH | Locality-sensitive hashing maps similar vectors to the same buckets with high probability. | Simple to implement; lower recall than HNSW. |
| ScaNN (Google) | Anisotropic quantisation + tree-based search; optimised for inner-product retrieval. | Highest throughput at Google scale. |
```python
import faiss
import numpy as np

dim = 128
n = 10_000
k = 5  # top-k results

# Build index (cosine similarity == inner product after L2 normalisation)
index = faiss.IndexFlatIP(dim)  # inner product on unit vectors
vecs = np.random.randn(n, dim).astype("float32")
faiss.normalize_L2(vecs)
index.add(vecs)

query = np.random.randn(1, dim).astype("float32")
faiss.normalize_L2(query)

scores, indices = index.search(query, k)
print("Top-k indices:", indices[0])
print("Scores:       ", scores[0])
```
Maximal Marginal Relevance (MMR)
Plain top-k retrieval can return several near-duplicate passages that waste context tokens. MMR re-ranks candidates by balancing relevance to the query against redundancy with already-selected passages. The parameter ฮป controls the trade-off.
```
MMR(doc) = λ × sim(doc, query) − (1 − λ) × max sim(doc, already_selected)

λ = 1 → pure relevance (same as top-k)
λ = 0 → pure diversity
λ ≈ 0.5–0.7 is the typical starting point
```
```python
def mmr(query_emb, doc_embs, docs, k=4, lam=0.6):
    selected, remaining = [], list(range(len(docs)))
    while len(selected) < k and remaining:
        mmr_scores = []
        for i in remaining:
            rel = cosine_similarity(query_emb, doc_embs[i])
            if selected:
                red = max(cosine_similarity(doc_embs[i], doc_embs[j])
                          for j in selected)
            else:
                red = 0.0
            mmr_scores.append((lam * rel - (1 - lam) * red, i))
        best = max(mmr_scores)[1]
        selected.append(best)
        remaining.remove(best)
    return [docs[i] for i in selected]
```
Hybrid Search and Score Fusion
Dense search captures semantic intent; sparse search handles exact terms. Combining them almost always outperforms either alone. The standard fusion method is Reciprocal Rank Fusion (RRF), which merges rank lists without needing to normalise raw scores.
```
RRF score(doc) = Σᵢ 1 / (k + rankᵢ(doc))

k = 60 (standard default)
Higher total score → better combined rank
```
```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """
    ranked_lists: list of lists of doc IDs, best-first.
    Returns: list of (doc_id, fused_score) pairs, best-first.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            # enumerate is 0-based, so rank + 1 gives the 1-based rank.
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

dense_results = ["doc3", "doc1", "doc5", "doc2"]   # from cosine search
sparse_results = ["doc1", "doc4", "doc3", "doc6"]  # from BM25

fused = reciprocal_rank_fusion([dense_results, sparse_results])
for doc_id, score in fused:
    print(f"{doc_id}  {score:.4f}")
```
Algorithm Selection Guide
| Scenario | Recommended approach |
|---|---|
| General semantic QA | Cosine similarity with normalised dense embeddings |
| Exact codes, names, IDs | BM25 or keyword filter on metadata |
| Mixed queries | Hybrid (dense + BM25) fused with RRF |
| Diverse context, avoid duplicates | MMR re-ranking after initial retrieval |
| Millions of documents, low latency | HNSW index (FAISS or vector DB) |
| Highest accuracy, small corpus | Exact flat index (brute-force cosine) |
Run retrieval offline on a labelled set before choosing an algorithm. Measure recall@k for your actual queries; intuition about which algorithm is "best" is often wrong for a specific domain.
Manual RAG: End-to-End Concept
The cleanest way to understand RAG is to build it with no external vector database. Four steps, all in plain Python: chunk, embed, retrieve with cosine similarity, generate.
1. Chunk with overlap
Split the document into overlapping word windows so sentences that span a boundary still appear intact in at least one chunk.
```python
def chunk_text(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i : i + size])
        for i in range(0, len(words), step)
        if words[i : i + size]
    ]
```
2. Embed all chunks
Call the embeddings API once for the whole chunk list. Store the vectors alongside their source text.
```python
# `client` is an embeddings API client initialised elsewhere,
# and `document` is the raw text loaded earlier.
def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="voyage-3", input=texts)
    return [item.embedding for item in response.data]

chunks = chunk_text(document)
chunk_embeddings = embed(chunks)  # embed once, reuse for every query
```
3. Retrieve: cosine similarity by hand
Embed the user query, score every chunk, return the top-k.
```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, chunk_embeddings, top_k=3):
    [q_emb] = embed([query])
    scored = [(cosine(q_emb, emb), chunk)
              for emb, chunk in zip(chunk_embeddings, chunks)]
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```
4. Generate with context
Inject the top chunks into the prompt and ask Claude to answer.
```python
def ask(question, context_chunks):
    context = "\n".join(f"- {c}" for c in context_chunks)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text

top_chunks = retrieve("How do I define a function?", chunks, chunk_embeddings)
print(ask("How do I define a function?", top_chunks))
```
See examples/rag_manual.py for the full runnable version.
No vector database needed: just `pip install anthropic`.
To-do list
Learn
- Understand the difference between parametric knowledge and retrieved evidence.
- Learn why retrieval quality often dominates final answer quality in RAG systems.
- Study the major failure modes: retrieval miss, context dilution, unsupported synthesis, and stale evidence.
- Learn how grounding, citations, abstention, recall, and precision relate to trustworthy answers.
Practice
- Build a tiny corpus and inspect the retrieved evidence for at least ten realistic user questions.
- Write follow-up questions that rely on chat history and test whether query rewriting improves retrieval.
- Create adversarial prompts where the answer is absent and verify that the system abstains cleanly.
- Compare answers with and without citations and judge which version is easier to verify.
Build
- Create a source-grounded assistant over a custom document set with citation output.
- Add a retrieval log that records the query, returned passages, and final answer for debugging.
- Implement a small evaluation set with expected evidence and expected abstentions.
- Add a confidence gate that declines answers when retrieved support is weak or contradictory.
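The confidence gate from the Build list can be sketched as a simple threshold on retrieval scores. The threshold value and the abstention message below are placeholder choices to be tuned on a labelled set:

```python
# Placeholder confidence gate: abstain when retrieved support is weak.
MIN_SUPPORT = 0.75  # illustrative cosine-similarity threshold, tune offline

def confidence_gate(scored_passages: list[tuple[float, str]]):
    """Keep only well-supported passages; return None to signal abstention."""
    strong = [p for score, p in scored_passages if score >= MIN_SUPPORT]
    return strong if strong else None

def answer_or_abstain(question: str, scored_passages: list[tuple[float, str]]) -> str:
    evidence = confidence_gate(scored_passages)
    if evidence is None:
        return "Not enough evidence to answer."
    # In a real system this is where the grounded LLM call would happen.
    return f"Answering '{question}' from: {evidence}"

print(answer_or_abstain("How many vacation days?",
                        [(0.91, "Vacation: 20 days."), (0.42, "Office map")]))
print(answer_or_abstain("What is the dress code?", [(0.31, "Office map")]))
```

A margin check between the top score and the runner-up, or a contradiction check across kept passages, are natural next refinements.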