Subject 05

Retrieval-Augmented Generation (RAG)

RAG grounds an LLM in external evidence at question time. Instead of relying only on what the model memorized during training, the system retrieves relevant passages from a knowledge source and asks the model to answer from those passages.

Beginner

Think of RAG as open-book answering. The model is allowed to consult a library before responding. This is useful when facts change, when the knowledge is private, or when you need the answer tied to a source rather than to the model's internal memory.

Why RAG exists

Classic RAG flow

User question
    |
    v
Search the knowledge base
    |
    v
Retrieve top relevant passages
    |
    v
Assemble prompt with evidence
    |
    v
LLM writes answer and cites sources

Real-world example: an internal HR assistant should answer from the current employee handbook, not from a general idea of what HR policies often look like.

When RAG is a good fit

When RAG is not enough by itself

RAG does not make the model magically truthful. It gives the model a better chance to be truthful by putting evidence in front of it.

documents = [
    "Vacation policy: full-time employees receive 20 days per year.",
    "Parental leave policy: 16 weeks paid leave for primary caregivers."
]

query = "How many vacation days do full-time employees get?"

# Very small stand-in for retrieval.
def retrieve(query, documents):
    keywords = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(keywords & set(doc.lower().split()))
        scored.append((overlap, doc))
    scored.sort(reverse=True)
    return [doc for score, doc in scored if score > 0][:2]

contexts = retrieve(query, documents)

prompt = f"""
Answer the question using only the context below.
If the answer is not supported, say: Not enough evidence.

Question: {query}
Context:
- """ + "\n- ".join(contexts)

print(prompt)

Advanced

High-quality RAG depends more on retrieval quality, evidence selection, and answer constraints than on raw model cleverness. Many supposed model failures are actually retrieval failures: wrong passages were fetched, the right passages were ranked too low, or the prompt encouraged unsupported synthesis.

Anatomy of a strong RAG system

Retrieval patterns worth knowing

Pattern               | Why it helps                                                                                          | Main risk
Top-k retrieval       | Simple baseline for pulling a small evidence set.                                                     | Relevant evidence may sit just below the cutoff.
Hybrid retrieval      | Combines semantic similarity with exact-term matching for codes, names, or policy IDs.                | Fusion and weighting can be tuned poorly.
Reranking             | Improves precision after a broad first-pass search.                                                   | Adds latency and extra model calls.
Multi-query retrieval | Breaks a complex question into focused subqueries and can recover evidence missed by a single query.  | Can retrieve too much context or duplicate evidence.
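As a sketch of the multi-query pattern, the toy example below splits a compound question into focused subqueries, retrieves for each, and deduplicates the merged evidence. The keyword-overlap retriever is invented for illustration, not a real search backend.

```python
# Toy multi-query retrieval sketch (illustrative only).
def keyword_retrieve(query, documents, top_k=2):
    # Score documents by keyword overlap with the query.
    keywords = set(query.lower().split())
    scored = sorted(
        ((len(keywords & set(d.lower().split())), d) for d in documents),
        reverse=True,
    )
    return [doc for score, doc in scored[:top_k] if score > 0]

def multi_query_retrieve(subqueries, documents, top_k=2):
    # Union of per-subquery results, preserving first-seen order.
    seen, merged = set(), []
    for sub in subqueries:
        for doc in keyword_retrieve(sub, documents, top_k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

documents = [
    "Vacation policy: full-time employees receive 20 days per year.",
    "Parental leave policy: 16 weeks paid leave for primary caregivers.",
    "Sick leave: 10 days per calendar year.",
]

# A single compound question is split into two focused subqueries.
subqueries = ["vacation days full-time employees", "parental leave weeks"]
print(multi_query_retrieve(subqueries, documents))
```

The deduplication step matters: without it, overlapping subqueries pad the context window with repeated passages, which is exactly the main risk the table above names.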

Common RAG failure modes

def build_rag_prompt(question, contexts):
    joined = "\n\n".join(
        f"Source {i+1}: {ctx['text']}\nCitation: {ctx['source']}"
        for i, ctx in enumerate(contexts)
    )
    return f"""
Answer the question using only the provided context.
If the answer is not supported, say you do not have enough evidence.
Quote the source number you used.

Question: {question}

{joined}
"""

contexts = [
    {"text": "Vacation policy: full-time employees receive 20 days.", "source": "hr_policy_v3.md"},
    {"text": "Part-time employees receive prorated leave.", "source": "hr_policy_v3.md"},
]

print(build_rag_prompt("What is the vacation policy?", contexts))

How to evaluate RAG

Do not judge a RAG system only by whether the final answer sounds good. Evaluate retrieval and generation as separate layers.

RAG debugging order:

1. Did the corpus contain the answer?
2. Did retrieval fetch the right evidence?
3. Did ranking place that evidence high enough?
4. Did the prompt force grounded answering?
5. Did the model still invent unsupported details?
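The debugging order above can be scripted as a layered check over one labelled example. This is an illustrative sketch: the keyword ranker and the assumption that each labelled item names its answering document (`expected_doc`) are made up for the example, not a standard harness.

```python
# Layered RAG debugging sketch: check corpus coverage, retrieval hit,
# and rank position for one labelled question.
def diagnose(question, expected_doc, corpus, retriever, k=3):
    report = {}
    report["in_corpus"] = expected_doc in corpus           # layer 1: corpus
    ranked = retriever(question, corpus)
    report["retrieved"] = expected_doc in ranked[:k]       # layer 2: retrieval
    report["rank"] = (ranked.index(expected_doc) + 1
                      if expected_doc in ranked else None)  # layer 3: ranking
    return report

def keyword_rank(question, corpus):
    # Rank every document by keyword overlap with the question.
    words = set(question.lower().split())
    return [doc for _, doc in sorted(
        ((len(words & set(d.lower().split())), d) for d in corpus),
        reverse=True,
    )]

corpus = [
    "Vacation policy: full-time employees receive 20 days per year.",
    "Parental leave policy: 16 weeks paid leave for primary caregivers.",
]
print(diagnose("how many vacation days per year", corpus[0], corpus, keyword_rank))
```

Layers 4 and 5 (prompt constraints and unsupported details) need the generation side and are harder to automate; the point here is that the first three layers can fail long before the model is involved.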

Modern RAG systems may add query rewriting, metadata filters, conversational memory for follow-up questions, and multi-step retrieval for harder tasks. The core discipline stays the same: retrieve the right evidence, keep the context focused, and require the answer to stay grounded in what was retrieved.
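One of those additions, metadata filtering, fits in a few lines: narrow the candidate pool before any similarity scoring runs. The sketch below is illustrative; the `department` and `year` fields are made-up examples.

```python
# Metadata pre-filtering sketch: similarity search only ever sees
# documents that pass the structured filter.
def filter_by_metadata(documents, **required):
    return [
        doc for doc in documents
        if all(doc.get("meta", {}).get(key) == value
               for key, value in required.items())
    ]

documents = [
    {"text": "Vacation policy: 20 days.",     "meta": {"department": "HR", "year": 2024}},
    {"text": "Firewall change process.",      "meta": {"department": "IT", "year": 2024}},
    {"text": "Old vacation policy: 15 days.", "meta": {"department": "HR", "year": 2019}},
]

# Only current HR documents reach the similarity-search stage.
candidates = filter_by_metadata(documents, department="HR", year=2024)
print([doc["text"] for doc in candidates])   # ['Vacation policy: 20 days.']
```

Filtering first also guards against stale evidence: the 2019 policy never competes with the current one, no matter how semantically similar it is to the query.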

If you cannot explain why a passage was retrieved, why it outranked alternatives, and why it is sufficient evidence for the answer, the RAG system is not mature yet.

Keep this module separate from the mechanics of embeddings, vector databases, and chunking. RAG depends on them, but its own core topic is grounded generation over retrieved evidence.

Search Algorithms

Retrieval quality is the single biggest lever in a RAG system. The algorithm that matches a query to documents determines which evidence the LLM ever sees. There are three families: dense vector search, sparse keyword search, and hybrid methods that combine them.

Cosine Similarity

The most widely used similarity measure for dense embeddings. It measures the angle between two vectors and ignores magnitude, which makes it robust to documents of different lengths.

cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

Range: −1 (opposite) to +1 (identical direction)
Typical threshold for "relevant": > 0.75
import numpy as np

def cosine_similarity(a, b):
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return dot / norm if norm > 0 else 0.0

query_emb   = np.array([0.6, 0.8, 0.0])   # pretend embedding
doc_emb     = np.array([0.5, 0.7, 0.5])

score = cosine_similarity(query_emb, doc_emb)
print(f"Similarity: {score:.4f}")          # ≈ 0.8643 → highly relevant

Most vector databases (Pinecone, Weaviate, Qdrant) index on cosine similarity by default. Always normalise your embeddings to unit length first; cosine similarity then equals the dot product, which is faster to compute.

Dot Product

Raw inner product of two vectors. Faster than cosine because no normalisation step is needed, but magnitude matters: a long document whose embedding has high magnitude will always score higher than a short but equally relevant one unless you normalise first.

def dot_product_similarity(a, b):
    return float(np.dot(a, b))

# Unit-normalise first → equivalent to cosine
def normalise(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

score = dot_product_similarity(normalise(query_emb), normalise(doc_emb))

Euclidean Distance (L2)

Measures straight-line distance between two vectors in embedding space. Lower distance means higher similarity. Less popular than cosine for text because it is sensitive to embedding magnitude.

L2(A, B) = √( Σ (Aᵢ − Bᵢ)² )

Lower is better. Convert to similarity: sim = 1 / (1 + distance)
def l2_similarity(a, b):
    distance = np.linalg.norm(a - b)
    return 1.0 / (1.0 + distance)   # map to [0, 1]

BM25 (Sparse / Keyword Search)

Best Matching 25 is the standard sparse retrieval algorithm used by Elasticsearch and Solr. It scores documents based on term frequency (TF) and inverse document frequency (IDF) with length normalisation. Excels at exact-term matching: product codes, names, identifiers, rare technical terms.

BM25(q, d) = Σᵢ IDF(tᵢ) × tf(tᵢ, d) × (k₁ + 1) / ( tf(tᵢ, d) + k₁ × (1 − b + b × |d| / avgdl) )

k₁ ≈ 1.2–2.0  (TF saturation)
b  ≈ 0.75     (length normalisation)
from rank_bm25 import BM25Okapi

corpus = [
    "Vacation policy: 20 days for full-time employees.",
    "Parental leave: 16 weeks paid for primary caregivers.",
    "Sick leave: 10 days per calendar year.",
]

tokenised = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenised)

query  = "how many vacation days"
scores = bm25.get_scores(query.lower().split())

ranked = sorted(zip(scores, corpus), reverse=True)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")

Approximate Nearest Neighbour (ANN)

Exact nearest-neighbour search over millions of vectors is too slow for production. ANN algorithms trade a small accuracy loss for orders-of-magnitude speed gain. The two dominant approaches are HNSW (graph-based) and IVF (inverted-file / cluster-based). FAISS is the most widely used library for both.

Algorithm      | How it works                                                                                            | Best for
HNSW           | Hierarchical navigable small-world graph; traverses a multi-layer graph to find close neighbours fast. | Low latency, high recall. Default in most vector DBs.
IVF (FAISS)    | Clusters vectors into Voronoi cells; searches only nearby cells at query time.                          | Very large corpora; tunable speed / recall trade-off.
LSH            | Locality-sensitive hashing maps similar vectors to the same buckets with high probability.              | Simple to implement; lower recall than HNSW.
ScaNN (Google) | Anisotropic quantisation + tree-based search; optimised for inner-product retrieval.                    | Highest throughput at Google scale.

import faiss
import numpy as np

dim  = 128
n    = 10_000
k    = 5         # top-k results

# Build index (cosine = inner product after L2 normalisation)
index = faiss.IndexFlatIP(dim)   # inner product on unit vectors
vecs  = np.random.randn(n, dim).astype("float32")
faiss.normalize_L2(vecs)
index.add(vecs)

query = np.random.randn(1, dim).astype("float32")
faiss.normalize_L2(query)

scores, indices = index.search(query, k)
print("Top-k indices:", indices[0])
print("Scores:       ", scores[0])

Maximal Marginal Relevance (MMR)

Plain top-k retrieval can return several near-duplicate passages that waste context tokens. MMR re-ranks candidates by balancing relevance to the query against redundancy with already-selected passages. The parameter λ controls the trade-off.

MMR score = λ × sim(doc, query) − (1 − λ) × max sim(doc, already_selected)

λ = 1  →  pure relevance (same as top-k)
λ = 0  →  pure diversity
λ ≈ 0.5–0.7 is the typical starting point
def mmr(query_emb, doc_embs, docs, k=4, lam=0.6):
    selected, remaining = [], list(range(len(docs)))

    while len(selected) < k and remaining:
        mmr_scores = []
        for i in remaining:
            rel = cosine_similarity(query_emb, doc_embs[i])
            if selected:
                red = max(cosine_similarity(doc_embs[i], doc_embs[j])
                          for j in selected)
            else:
                red = 0.0
            mmr_scores.append((lam * rel - (1 - lam) * red, i))

        best = max(mmr_scores)[1]
        selected.append(best)
        remaining.remove(best)

    return [docs[i] for i in selected]

Hybrid Search and Score Fusion

Dense search captures semantic intent; sparse search handles exact terms. Combining them almost always outperforms either alone. The standard fusion method is Reciprocal Rank Fusion (RRF), which merges rank lists without needing to normalise raw scores.

RRF score(doc) = Σᵢ 1 / (k + rank_in_list_i)

k = 60 (standard default)

Higher total score → better combined rank
def reciprocal_rank_fusion(ranked_lists, k=60):
    """
    ranked_lists: list of lists of doc IDs, best-first.
    Returns: list of (doc_id, fused_score) pairs, best-first.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):   # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

dense_results  = ["doc3", "doc1", "doc5", "doc2"]   # from cosine search
sparse_results = ["doc1", "doc4", "doc3", "doc6"]   # from BM25

fused = reciprocal_rank_fusion([dense_results, sparse_results])
for doc_id, score in fused:
    print(f"{doc_id}  {score:.4f}")

Algorithm Selection Guide

Scenario                           | Recommended approach
General semantic QA                | Cosine similarity with normalised dense embeddings
Exact codes, names, IDs            | BM25 or keyword filter on metadata
Mixed queries                      | Hybrid (dense + BM25) fused with RRF
Diverse context, avoid duplicates  | MMR re-ranking after initial retrieval
Millions of documents, low latency | HNSW index (FAISS or vector DB)
Highest accuracy, small corpus     | Exact flat index (brute-force cosine)

Run retrieval offline on a labelled set before choosing an algorithm. Measure recall@k for your actual queries; intuition about which algorithm is "best" is often wrong for a specific domain.
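A minimal recall@k harness might look like the sketch below. The labelled-example format (a query paired with the IDs of documents that answer it) and the toy keyword retriever are assumptions for illustration.

```python
# recall@k sketch: fraction of labelled examples whose relevant
# document appears in the top-k retrieved results.
def recall_at_k(examples, retriever, k=3):
    hits = 0
    for query, relevant_ids in examples:
        retrieved = retriever(query)[:k]
        # Count a hit if any relevant document made the top k.
        if any(doc_id in retrieved for doc_id in relevant_ids):
            hits += 1
    return hits / len(examples)

# Toy corpus and retriever over a tiny labelled set (IDs are illustrative).
index = {
    "d1": "vacation policy 20 days full-time",
    "d2": "parental leave 16 weeks paid",
    "d3": "sick leave 10 days per year",
}

def toy_retriever(query):
    words = set(query.lower().split())
    # Rank document IDs by keyword overlap, best first.
    return sorted(index, key=lambda i: -len(words & set(index[i].split())))

examples = [
    ("how many vacation days", ["d1"]),
    ("paid parental leave length", ["d2"]),
]
print(recall_at_k(examples, toy_retriever, k=1))   # 1.0 on this toy set
```

Swap the retriever for each candidate algorithm and compare the scores on the same labelled set; that comparison, not the algorithm's reputation, should drive the choice.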

Manual RAG: End-to-End Concept

The cleanest way to understand RAG is to build it with no external vector database. Four steps, all in plain Python: chunk, embed, retrieve with cosine similarity, generate.

1. Chunk with overlap

Split the document into overlapping word windows so sentences that span a boundary still appear intact in at least one chunk.

def chunk_text(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    words = text.split()
    step  = size - overlap
    return [
        " ".join(words[i : i + size])
        for i in range(0, len(words), step)
        if words[i : i + size]
    ]
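A quick way to see the overlap behaviour is to run the function on numbered placeholder words; chunk_text is repeated here so the snippet runs standalone, and the sample text is arbitrary.

```python
# chunk_text repeated from above so this snippet is self-contained.
def chunk_text(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i : i + size])
            for i in range(0, len(words), step)
            if words[i : i + size]]

sample = " ".join(f"w{i}" for i in range(1, 11))   # "w1 w2 ... w10"
print(chunk_text(sample, size=6, overlap=2))
# ['w1 w2 w3 w4 w5 w6', 'w5 w6 w7 w8 w9 w10', 'w9 w10']
```

Words w5 and w6 appear in both of the first two chunks, so a sentence crossing that boundary survives intact in at least one of them.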

2. Embed all chunks

Call the embeddings API once for the whole chunk list. Store the vectors alongside their source text.

# Assumes `client` is an embeddings SDK client initialised elsewhere.
def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="voyage-3", input=texts)
    return [item.embedding for item in response.data]

chunks           = chunk_text(document)
chunk_embeddings = embed(chunks)   # embed once, reuse for every query

3. Retrieve โ€” cosine similarity by hand

Embed the user query, score every chunk, return the top-k.

import math

def cosine(a, b):
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, chunk_embeddings, top_k=3):
    [q_emb] = embed([query])
    scored  = [(cosine(q_emb, emb), chunk)
               for emb, chunk in zip(chunk_embeddings, chunks)]
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

4. Generate with context

Inject the top chunks into the prompt and ask Claude to answer.

def ask(question, context_chunks):
    context = "\n".join(f"- {c}" for c in context_chunks)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text

top_chunks = retrieve("How do I define a function?", chunks, chunk_embeddings)
print(ask("How do I define a function?", top_chunks))

See examples/rag_manual.py for the full runnable version. No vector database needed; just pip install anthropic.

To-do list

Learn

  • Understand the difference between parametric knowledge and retrieved evidence.
  • Learn why retrieval quality often dominates final answer quality in RAG systems.
  • Study the major failure modes: retrieval miss, context dilution, unsupported synthesis, and stale evidence.
  • Learn how grounding, citations, abstention, recall, and precision relate to trustworthy answers.

Practice

  • Build a tiny corpus and inspect the retrieved evidence for at least ten realistic user questions.
  • Write follow-up questions that rely on chat history and test whether query rewriting improves retrieval.
  • Create adversarial prompts where the answer is absent and verify that the system abstains cleanly.
  • Compare answers with and without citations and judge which version is easier to verify.

Build

  • Create a source-grounded assistant over a custom document set with citation output.
  • Add a retrieval log that records the query, returned passages, and final answer for debugging.
  • Implement a small evaluation set with expected evidence and expected abstentions.
  • Add a confidence gate that declines answers when retrieved support is weak or contradictory.