Beginner
What information retrieval actually does
Information retrieval asks a narrower question than generation: given a query and a collection of stored items, which items should be returned, and in what order? The output is usually a ranked list of documents, passages, products, emails, tickets, or web pages, not a free-form answer.
A good retrieval system does more than find any match: it must surface the most useful material early, because users often inspect only the first few results. That is why ranking is central to IR rather than an afterthought.
Collection of documents -> build searchable representation -> user query -> score candidates -> rank results -> user inspects top hits
Relevance is contextual
Relevance is not an absolute property of a document. It depends on the user intent behind the query. A query like "python" might mean the programming language, the animal, or an internal project codename. IR systems work well only when they handle this ambiguity deliberately.
- Known-item search: the user wants one specific document, page, or record.
- Topical search: the user wants documents about a subject, not one exact item.
- Exploratory search: the user is still learning what to ask and benefits from broader coverage.
Real-world example: searching an internal wiki for "maternity leave" should still retrieve the correct policy document even if the official wording now says "parental leave." That is a retrieval problem before it is a generation problem.
Core goals: match, rank, and cover
Three beginner ideas explain most of IR:
- Precision: how many returned results are truly relevant.
- Recall: whether the system found most of the relevant material that exists.
- Ranking quality: whether the best results appear near the top where users will see them.
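The first two ideas can be made concrete with a toy example. The sketch below assumes a single query whose ranked result list and relevant-document set are both known; the function name is illustrative.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision@k and recall@k for a single query."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked = ["doc_a", "doc_c", "doc_b", "doc_d"]
relevant = {"doc_a", "doc_d"}
print(precision_recall_at_k(ranked, relevant, k=2))  # (0.5, 0.5)
```

Note how the two measures pull apart: a larger k can only raise recall, while precision often falls as weaker results enter the window.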
Important boundary: this module is about retrieving and ranking information. It is not the same as chunking documents, storing vectors, training embedding models, or generating grounded answers. Those topics belong in their own modules.
Lexical retrieval intuition
The oldest and still very useful family of IR methods is lexical retrieval: score documents using the words they share with the query. This works especially well for exact phrases, names, numbers, error codes, product IDs, and domain-specific terminology.
query_terms = {"password", "reset"}

documents = {
    "doc_a": {"how", "to", "reset", "your", "password"},
    "doc_b": {"account", "security", "multi", "factor"},
    "doc_c": {"reset", "mfa", "after", "device", "loss"},
}

# Score each document by how many query terms it shares with the query.
scores = {
    doc_id: len(query_terms & doc_terms)
    for doc_id, doc_terms in documents.items()
}

print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
This toy example uses raw overlap only, but real systems improve on it by rewarding rare terms more than very common terms and by normalizing for document length.
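As a rough illustration of those two refinements, the sketch below weights each shared term by an inverse-document-frequency score and divides by a square-root length penalty. The exact weighting choices here are illustrative, not the formulas production systems use.

```python
import math

# Same toy collection as above.
documents = {
    "doc_a": {"how", "to", "reset", "your", "password"},
    "doc_b": {"account", "security", "multi", "factor"},
    "doc_c": {"reset", "mfa", "after", "device", "loss"},
}
query_terms = {"password", "reset"}
n_docs = len(documents)

def idf(term):
    # Rare terms score higher; the +1 smoothing avoids division by zero.
    df = sum(1 for terms in documents.values() if term in terms)
    return math.log((n_docs + 1) / (df + 1))

scores = {
    # Sum the weights of shared terms, then penalize longer documents.
    doc_id: sum(idf(t) for t in query_terms & terms) / math.sqrt(len(terms))
    for doc_id, terms in documents.items()
}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

With this weighting, "password" (which appears in one document) counts more than "reset" (which appears in two), so doc_a wins by more than raw overlap alone would suggest.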
What makes retrieval hard
- Vocabulary mismatch: the right document may use different words than the query.
- Ambiguity: one query string may correspond to multiple intents.
- Long-tail identifiers: SKUs, error codes, and names require exactness.
- Ranking trade-offs: broad coverage can help recall but hurt top-result precision.
- User patience: even a decent result set is poor if the best hit is buried too low.
Advanced
Anatomy of a modern retrieval stack
Production retrieval systems are often multi-stage. The first stage aims for strong recall and retrieves a manageable candidate set quickly. Later stages spend more computation on better ranking, filtering, deduplication, diversification, and business rules.
Query -> normalize / rewrite -> first-pass retrieval -> candidate set -> re-rank -> apply filters / tie-breaks -> top-k results
This architecture matters because the system usually cannot compare every document against every query at full precision in real time. Candidate generation narrows the search space; reranking improves the final order.
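A minimal sketch of that two-stage shape, assuming toy term-set documents; Jaccard similarity stands in here for a genuinely expensive scorer such as a cross-encoder, and the function names are hypothetical.

```python
documents = {
    "doc_a": {"how", "to", "reset", "your", "password"},
    "doc_b": {"account", "security", "multi", "factor"},
    "doc_c": {"reset", "mfa", "after", "device", "loss"},
}

def first_pass(query_terms, k=10):
    """Recall-oriented stage: cheap term overlap, keep top-k candidates."""
    scores = {d: len(query_terms & t) for d, t in documents.items()}
    candidates = [d for d in scores if scores[d] > 0]
    return sorted(candidates, key=scores.get, reverse=True)[:k]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def rerank(query_terms, candidates):
    """Precision-oriented stage: a costlier scorer reorders only the
    small candidate set, never the whole corpus."""
    return sorted(candidates,
                  key=lambda d: jaccard(query_terms, documents[d]),
                  reverse=True)

query = {"password", "reset"}
print(rerank(query, first_pass(query)))  # ['doc_a', 'doc_c']
```

The key design point survives even in this toy: the expensive scorer only ever sees what the cheap stage produced, so first-stage recall bounds the quality of everything downstream.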
Retrieval methods and when they help
| Method | Best at | Main limitation |
|---|---|---|
| Boolean retrieval | Hard filters and must-match conditions | Too rigid for partial relevance and paraphrase |
| TF-IDF / BM25 | Exact terms, rare tokens, identifiers, transparent scoring | Weak on synonymy and deeper semantics |
| Dense retrieval | Paraphrases, semantic similarity, softer lexical mismatch | Can miss exact IDs, names, or domain strings |
| Hybrid | Combining lexical precision with semantic recall | Fusion, normalization, and tuning get more complex |
| Cross-encoder re-ranker | High-precision ordering on a small candidate list | Too slow to score an entire corpus directly |
Why BM25 still matters
BM25 remains one of the most important IR baselines because it captures several practical truths at once:
- Rare query terms should count more than very common terms.
- Repeated term matches help, but with diminishing returns.
- Long documents should not win just because they contain more words overall.
That combination makes BM25 extremely strong for operational search problems involving policy titles, part numbers, log messages, legal phrases, medication names, or version strings.
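All three truths are visible directly in the per-term BM25 formula. The sketch below uses a common variant with the usual default parameters k1 = 1.5 and b = 0.75; the helper name is illustrative.

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """BM25 contribution of one query term to one document's score."""
    # Rare terms (low document frequency) get a larger idf weight.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Term frequency saturates: each repeated match adds less than the last.
    # The (1 - b + b * doc_len / avg_len) factor penalizes long documents.
    saturation = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * saturation

# Repeated matches help, but with diminishing returns:
for tf in (1, 2, 3):
    print(tf, bm25_term(tf, df=10, n_docs=1000, doc_len=100, avg_len=120))
```

Varying one argument at a time shows each property in isolation: lowering df raises the score, raising tf raises it with shrinking increments, and raising doc_len above avg_len lowers it.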
def hybrid_score(bm25_score, dense_score, alpha=0.7):
    return alpha * bm25_score + (1 - alpha) * dense_score

bm25 = 12.4
dense = 0.81
# In practice, scores usually need calibration or rank-based fusion.
print(hybrid_score(bm25, dense, alpha=0.7))
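One standard rank-based alternative is Reciprocal Rank Fusion (RRF), which combines lists using only rank positions, so the incompatible score scales above never need calibration. A minimal sketch, with illustrative input rankings:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per
    document; k=60 is the commonly used smoothing constant."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["doc_a", "doc_c", "doc_b"]
dense_ranking = ["doc_c", "doc_d", "doc_a"]
print(rrf_fuse([bm25_ranking, dense_ranking]))
# ['doc_c', 'doc_a', 'doc_d', 'doc_b']
```

Documents that appear high in several lists rise to the top, and a document missing from one list is simply not boosted by it rather than being disqualified.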
Common IR failure modes
- Query mismatch: the wording of the query does not align with the wording of relevant documents.
- Exact-match miss: the system underweights a rare identifier that should dominate ranking.
- Popularity bias: generally popular documents crowd out the most query-specific ones.
- Over-normalization: aggressive stemming or normalization collapses distinct meanings.
- Near-duplicate flooding: many similar documents occupy the top ranks and reduce result diversity.
- Poor evaluation coverage: the benchmark queries do not reflect real traffic or hard cases.
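Near-duplicate flooding, for example, is often countered with a greedy filter over the ranked list. A minimal sketch assuming term-set documents; the 0.8 Jaccard threshold is an illustrative choice, and real systems typically use shingling or MinHash at scale.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dedupe_ranked(ranked_docs, threshold=0.8):
    """Greedy near-duplicate filter: walk the ranked list in order and
    keep a result only if it is not too similar to anything kept so far."""
    kept = []
    for doc_id, terms in ranked_docs:
        if all(jaccard(terms, kept_terms) < threshold for _, kept_terms in kept):
            kept.append((doc_id, terms))
    return [doc_id for doc_id, _ in kept]

ranked = [
    ("a", {"reset", "password", "help"}),
    ("a_copy", {"reset", "password", "help"}),  # near-duplicate of "a"
    ("b", {"mfa", "device"}),
]
print(dedupe_ranked(ranked))  # ['a', 'b']
```

Because the walk is rank-ordered, the highest-ranked member of each duplicate cluster survives and the copies below it are dropped, freeing top slots for more diverse results.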
Evaluation for retrieval systems
IR evaluation depends on relevance judgments: for a set of benchmark queries, you need to know which documents are relevant and, ideally, how relevant they are. Without judgments, you are tuning by intuition rather than evidence.
Evaluation loop:
1. Collect realistic queries.
2. Label relevant documents.
3. Score the ranked lists.
4. Inspect the misses and bad top results.
5. Change retrieval or ranking logic.
6. Re-run evaluation on the same benchmark.
Recall at k, MRR, and nDCG are common IR metrics, but this module should keep the focus on what they mean for ranking decisions rather than turning into a full metrics lesson. The dedicated evaluation module covers the metric details separately.
Scope boundaries that keep the curriculum clean
Keep adjacent modules separate: embeddings explain representation, vector databases explain storage and ANN infrastructure, chunking explains document segmentation, and RAG explains grounded answer generation. Information retrieval itself is about query understanding, candidate retrieval, ranking, relevance, and evaluation of ranked results.
To-do list
Learn
- Understand the difference between matching documents and ranking them well.
- Learn the meaning of precision, recall, and rank-sensitive relevance.
- Study known-item, topical, and exploratory search behavior.
- Learn why BM25 is still a strong baseline in modern retrieval systems.
- Understand candidate generation, reranking, deduplication, and filtering as separate stages.
- Learn why labeled relevance judgments are required for serious retrieval evaluation.
Practice
- Implement a toy lexical ranker and inspect why some documents score above others.
- Create a small labeled query set with exact-match, paraphrase, and ambiguous queries.
- Compare lexical, dense, and hybrid retrieval on the same benchmark and note where each fails.
- Test queries with abbreviations, policy numbers, error codes, and synonyms.
- Measure whether reranking improves the top three results without hurting latency too much.
- Inspect failure cases where relevant documents exist but appear too low in the ranking.
Build
- Create a small search demo with a first-pass retriever and a second-pass reranker.
- Build an evaluation script that reports recall at k, MRR, or nDCG on a labeled query set.
- Add a debugging view that shows returned documents, scores, and final rank positions.
- Document which retrieval mode works best for identifiers, paraphrases, and ambiguous queries.
- Add a benchmark slice for hard cases such as acronyms, typos, and duplicate-looking documents.