Subject 09

Information retrieval

Information retrieval is the discipline of finding and ranking useful documents for a query. It powers search engines, enterprise search, legal and academic search, recommendation surfaces, and the retrieval layer underneath many modern AI systems.

Beginner

What information retrieval actually does

Information retrieval asks a narrower question than generation: given a query and a collection of stored items, which items should be returned, and in what order? The output is usually a ranked list of documents, passages, products, emails, tickets, or web pages, not a free-form answer.

A good retrieval system does more than find matches: it must surface the most useful material early, because users often inspect only the first few results. That is why ranking is central to IR rather than an afterthought.

Collection of documents -> build searchable representation -> user query -> score candidates -> rank results -> user inspects top hits

Relevance is contextual

Relevance is not an absolute property of a document. It depends on the user intent behind the query. A query like "python" might mean the programming language, the animal, or an internal project codename. IR systems work well only when they handle this ambiguity deliberately.

Real-world example: searching an internal wiki for "maternity leave" should still retrieve the correct policy document even if the official wording now says "parental leave." That is a retrieval problem before it is a generation problem.

Core goals: match, rank, and cover

Three beginner ideas explain most of IR:

  • Match: find the documents that plausibly relate to the query at all.
  • Rank: order them so the most useful material appears first.
  • Cover: make sure the relevant material is actually present and reachable in the searchable collection.

Important boundary: this module is about retrieving and ranking information. It is not the same as chunking documents, storing vectors, training embedding models, or generating grounded answers. Those topics belong in their own modules.

Lexical retrieval intuition

The oldest and still very useful family of IR methods is lexical retrieval: score documents using the words they share with the query. This works especially well for exact phrases, names, numbers, error codes, product IDs, and domain-specific terminology.

query_terms = {"password", "reset"}
documents = {
    "doc_a": {"how", "to", "reset", "your", "password"},
    "doc_b": {"account", "security", "multi", "factor"},
    "doc_c": {"reset", "mfa", "after", "device", "loss"},
}

# Score each document by how many query terms it shares with the query.
scores = {
    doc_id: len(query_terms & doc_terms)
    for doc_id, doc_terms in documents.items()
}

# Rank documents by score, highest first.
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))

This toy example uses raw overlap only, but real systems improve on it by rewarding rare terms more than very common terms and by normalizing for document length.
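
One simple way to layer both refinements onto the toy example (a sketch of the idea, not BM25 itself): weight each shared term by an inverse document frequency so rare terms count for more, and dampen the score by document length so longer documents do not win by default.

import math

documents = {
    "doc_a": {"how", "to", "reset", "your", "password"},
    "doc_b": {"account", "security", "multi", "factor"},
    "doc_c": {"reset", "mfa", "after", "device", "loss"},
}
query_terms = {"password", "reset"}

# Document frequency: how many documents contain each term.
doc_freq = {}
for terms in documents.values():
    for term in terms:
        doc_freq[term] = doc_freq.get(term, 0) + 1

num_docs = len(documents)

def idf(term):
    # Rare terms get a larger weight than common ones.
    return math.log(num_docs / doc_freq[term]) if term in doc_freq else 0.0

scores = {
    # Sum the IDF of the shared terms, dampened by document length.
    doc_id: sum(idf(term) for term in query_terms & terms) / math.sqrt(len(terms))
    for doc_id, terms in documents.items()
}

print(sorted(scores.items(), key=lambda item: item[1], reverse=True))

Here "password" (which appears in one document) now contributes more than "reset" (which appears in two), so the ranking reflects term rarity rather than raw overlap alone.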

What makes retrieval hard

Queries are short and often ambiguous, users and documents frequently use different words for the same thing (the parental-leave example above), collections are too large to compare exhaustively against every query, and relevance shifts with the intent behind the query. The advanced material below is largely about managing these difficulties.

Advanced

Anatomy of a modern retrieval stack

Production retrieval systems are often multi-stage. The first stage aims for strong recall and retrieves a manageable candidate set quickly. Later stages spend more computation on better ranking, filtering, deduplication, diversification, and business rules.

Query -> normalize / rewrite -> first-pass retrieval -> candidate set -> re-rank -> apply filters / tie-breaks -> top-k results

This architecture matters because the system usually cannot compare every document against every query at full precision in real time. Candidate generation narrows the search space; reranking improves the final order.
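
A minimal sketch of the two-stage idea, with stand-in scoring functions (first_pass_score as a cheap lexical scorer and rerank_score as a placeholder for a slower, more accurate model):

def first_pass_score(query_terms, doc_terms):
    # Cheap score applied to every document: raw term overlap.
    return len(query_terms & doc_terms)

def rerank_score(query, doc_text):
    # Stand-in for an expensive reranker (e.g. a cross-encoder): reward an
    # exact phrase match over a partial match over no match.
    if query in doc_text:
        return 2.0
    if any(term in doc_text for term in query.split()):
        return 1.0
    return 0.0

def search(query, corpus, candidates_k=2, final_k=2):
    query_terms = set(query.split())
    # Stage 1: candidate generation over the whole corpus (favors recall).
    candidates = sorted(
        corpus,
        key=lambda doc_id: first_pass_score(query_terms, set(corpus[doc_id].split())),
        reverse=True,
    )[:candidates_k]
    # Stage 2: spend more compute re-ordering only the candidate set.
    reranked = sorted(candidates, key=lambda doc_id: rerank_score(query, corpus[doc_id]), reverse=True)
    return reranked[:final_k]

corpus = {
    "doc_a": "how to reset your password",
    "doc_b": "password policy and account security",
    "doc_c": "reset mfa after device loss",
}
print(search("reset your password", corpus))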

Retrieval methods and when they help

Method                   | Best at                                                     | Main limitation
Boolean retrieval        | Hard filters and must-match conditions                      | Too rigid for partial relevance and paraphrase
TF-IDF / BM25            | Exact terms, rare tokens, identifiers, transparent scoring  | Weak on synonymy and deeper semantics
Dense retrieval          | Paraphrases, semantic similarity, softer lexical mismatch   | Can miss exact IDs, names, or domain strings
Hybrid                   | Combining lexical precision with semantic recall            | Fusion, normalization, and tuning get more complex
Cross-encoder re-ranker  | High-precision ordering on a small candidate list           | Too slow to score an entire corpus directly

Why BM25 still matters

BM25 remains one of the most important IR baselines because it captures several practical truths at once:

  • Rare terms carry more signal than common ones, so they receive higher weight.
  • Repeating a query term in a document helps, but with diminishing returns.
  • Long documents should not win simply because they contain more words, so scores are normalized for length.

That combination makes BM25 extremely strong for operational search problems involving policy titles, part numbers, log messages, legal phrases, medication names, or version strings.
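
As a rough sketch of how those properties show up in the scoring function (using one common positive-valued IDF variant; k1 and b are typical default parameters, not tuned values):

import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    # IDF: terms that appear in few documents (low doc_freq) weigh more.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency helps but saturates, and longer-than-average documents
    # are discounted through the length normalization controlled by b.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# A document's BM25 score for a query is the sum of its per-term scores.
print(bm25_term_score(tf=3, doc_len=120, avg_doc_len=100, num_docs=10_000, doc_freq=50))

In hybrid setups, a lexical score like this is then combined with a dense similarity score, for example through the simple weighted sum below.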

def hybrid_score(bm25_score, dense_score, alpha=0.7):
    # Weighted sum of a lexical (BM25) score and a dense similarity score.
    return alpha * bm25_score + (1 - alpha) * dense_score

bm25 = 12.4
dense = 0.81

# In practice, scores usually need calibration or rank-based fusion.
print(hybrid_score(bm25, dense, alpha=0.7))
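
A common rank-based alternative is reciprocal rank fusion, which ignores the raw scores entirely and combines only rank positions (the document IDs below are purely illustrative):

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of document IDs per retriever.
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); the constant k dampens
            # the influence of any single retriever's top positions.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Illustrative ranked lists from a lexical and a dense retriever.
bm25_ranking = ["doc_a", "doc_c", "doc_b"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))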

Common IR failure modes

Typical failures include vocabulary mismatch between the query and the relevant document, exact identifiers or codes that dense retrieval misses, relevant documents that are retrieved but ranked too low to be inspected, near-duplicate results crowding the top positions, and ambiguous queries answered for the wrong intent. Most of these are easier to diagnose by reading ranked lists for specific queries than by looking at aggregate scores alone.

Evaluation for retrieval systems

IR evaluation depends on relevance judgments: for a set of benchmark queries, you need to know which documents are relevant and, ideally, how relevant they are. Without judgments, you are tuning by intuition rather than evidence.

Evaluation loop:

1. Collect realistic queries
2. Label relevant documents
3. Score the ranked lists
4. Inspect the misses and bad top results
5. Change retrieval or ranking logic
6. Re-run evaluation on the same benchmark

Recall at k, MRR, and nDCG are common IR metrics, but this module should keep the focus on what they mean for ranking decisions rather than turning into a full metrics lesson. The dedicated evaluation module covers the metric details separately.
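
To make step 3 of the loop concrete, here is a minimal sketch of two such metrics over a labeled query set; the full treatment belongs to the evaluation module:

def recall_at_k(ranked_ids, relevant_ids, k=10):
    # Fraction of the relevant documents that appear in the top k results.
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    # Average of 1 / rank of the first relevant hit, one term per query.
    total = 0.0
    for ranked_ids, relevant_ids in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Tiny illustrative inputs: ranked result lists plus labeled relevant sets.
print(recall_at_k(["doc_a", "doc_c", "doc_b"], {"doc_a", "doc_d"}, k=2))
print(mean_reciprocal_rank([["doc_b", "doc_a"]], [{"doc_a"}]))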

Scope boundaries that keep the curriculum clean

Keep adjacent modules separate: embeddings explain representation, vector databases explain storage and ANN infrastructure, chunking explains document segmentation, and RAG explains grounded answer generation. Information retrieval itself is about query understanding, candidate retrieval, ranking, relevance, and evaluation of ranked results.

To-do list

Learn

  • Understand the difference between matching documents and ranking them well.
  • Learn the meaning of precision, recall, and rank-sensitive relevance.
  • Study known-item, topical, and exploratory search behavior.
  • Learn why BM25 is still a strong baseline in modern retrieval systems.
  • Understand candidate generation, reranking, deduplication, and filtering as separate stages.
  • Learn why labeled relevance judgments are required for serious retrieval evaluation.

Practice

  • Implement a toy lexical ranker and inspect why some documents score above others.
  • Create a small labeled query set with exact-match, paraphrase, and ambiguous queries.
  • Compare lexical, dense, and hybrid retrieval on the same benchmark and note where each fails.
  • Test queries with abbreviations, policy numbers, error codes, and synonyms.
  • Measure whether reranking improves the top three results without hurting latency too much.
  • Inspect failure cases where relevant documents exist but appear too low in the ranking.

Build

  • Create a small search demo with a first-pass retriever and a second-pass reranker.
  • Build an evaluation script that reports recall at k, MRR, or nDCG on a labeled query set.
  • Add a debugging view that shows returned documents, scores, and final rank positions.
  • Document which retrieval mode works best for identifiers, paraphrases, and ambiguous queries.
  • Add a benchmark slice for hard cases such as acronyms, typos, and duplicate-looking documents.