Subject 07

Embeddings

Embeddings are dense numerical representations that map text or other objects into vector spaces where useful similarity becomes measurable. They are a core building block for semantic search, retrieval, clustering, recommendations, classification, and many modern LLM application pipelines.

Beginner

What an embedding actually is

An embedding maps an object such as a word, sentence, paragraph, document, query, or item description to a list of numbers. That list is not useful because of any single coordinate by itself. It is useful because the full vector places the object inside a geometric space where distance or angle can reflect relatedness.

raw text -> embedding model -> dense vector -> compare in vector space

"reset my password"        -> [0.18, -0.07, 0.41, ...]
"forgot my login details" -> [0.16, -0.05, 0.39, ...]

nearby vectors -> likely similar intent
farther vectors -> likely different meaning or topic

This matters because exact word overlap is often too weak for real applications. Two texts can mean nearly the same thing while sharing few surface words.
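The weakness of surface overlap is easy to quantify. A set-based word-overlap score (Jaccard similarity, computed from scratch here) rates the two password requests above as barely related, even though their intent is nearly identical:

```python
def jaccard(a: str, b: str) -> float:
    """Fraction of shared words between two texts (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# Near-identical intent, almost no shared words: only "my" overlaps,
# so the score is about 0.17.
print(jaccard("reset my password", "forgot my login details"))
```

An embedding model is trained to place both requests close together despite this low lexical overlap, which is exactly what a keyword score cannot do.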

Where embeddings help

Embeddings are used whenever you need similarity-aware behavior rather than exact matching alone.

Use case                               How embeddings help
-------------------------------------  --------------------------------------------------------------
Semantic search                        Retrieve relevant text even when the wording differs from the query
Clustering                             Group texts or items by meaning instead of by exact shared vocabulary
Recommendations                        Find related items, products, or documents from nearby vectors
Classification features                Represent each input compactly before a downstream classifier makes a decision
Deduplication and anomaly detection    Detect near-duplicates or outliers from neighborhood structure

Similarity in practice

Cosine similarity is common because it compares direction rather than raw magnitude. Many text embedding setups normalize vectors to unit length, so cosine similarity, dot product, and Euclidean-distance ranking become closely related or even identical.

import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector lengths. It measures angle, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no defined direction.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Parallel vectors point the same way, so the score is ~1.0.
print(cosine_similarity([1, 2], [2, 4]))
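The normalization claim above can be checked directly. After scaling two vectors to unit length, cosine similarity equals the plain dot product, and squared Euclidean distance is a fixed function of it (2 - 2·dot), so all three order candidates the same way. The vectors below reuse the illustrative values from the earlier diagram:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([0.18, -0.07, 0.41])
b = normalize([0.16, -0.05, 0.39])

# On unit vectors, cosine similarity IS the dot product, and
# squared Euclidean distance equals 2 - 2 * dot(a, b).
cos_ab = dot(a, b)
euclid_sq = sum((x - y) ** 2 for x, y in zip(a, b))
print(cos_ab, euclid_sq, 2 - 2 * cos_ab)
```

This is why some vector stores only support dot product: with normalized embeddings, nothing is lost.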

Real-world example: "How do I reset my password?" and "I forgot my login credentials" share limited lexical overlap but should still land close together in a support system.

What the vector represents

Not every embedding represents the same kind of object. The vector might stand for a single token, a short query, a sentence, or a whole document. That granularity changes what information is preserved and what is blurred away.

Advanced

Embedding spaces are shaped by the training objective

Embeddings are not generic truth machines. The geometry is learned from an objective. If a model is trained to pull matched query-document pairs together, it may be excellent for retrieval but only average for unsupervised clustering. If it is trained on broad semantic similarity, it may group related topics well but fail to separate very fine-grained distinctions your application needs.

Design choices that affect quality

Embedding granularity
    Why it matters: Sentence vectors and document vectors capture different levels of meaning
    Common mistake: Using one long-document vector when answer-bearing details are local

Symmetric vs asymmetric matching
    Why it matters: A short query and a long answer passage are not always best served by identical training assumptions
    Common mistake: Treating query-to-document search the same as document-to-document similarity

Pooling strategy
    Why it matters: How token-level signals become one vector affects what information survives
    Common mistake: Assuming every mean-pooled embedding preserves key phrases equally well

Truncation
    Why it matters: Long inputs may be cut before embedding, silently dropping important context
    Common mistake: Embedding long documents without checking token limits or lost sections

Metric and normalization
    Why it matters: Similarity ranking depends on how vectors are compared and whether they are normalized
    Common mistake: Mixing cosine, dot-product, and raw unnormalized vectors without validation
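The truncation risk is the easiest of these to guard against before embedding anything. Real limits are counted in model-specific tokens, not words, so the 256-word threshold below is an illustrative stand-in rather than any model's actual limit; the point is to detect silently dropped content rather than discover it later through bad retrieval:

```python
# Rough pre-embedding length check. Replace the word-count proxy with
# your model's tokenizer and real token limit in practice.
APPROX_LIMIT_WORDS = 256  # illustrative stand-in, not a real model limit

def check_truncation_risk(text, limit=APPROX_LIMIT_WORDS):
    """Flag texts that would lose content if cut at `limit` words."""
    words = text.split()
    if len(words) <= limit:
        return {"at_risk": False, "dropped_words": 0}
    return {"at_risk": True, "dropped_words": len(words) - limit}

doc = "word " * 300  # a 300-word document
print(check_truncation_risk(doc))
```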

Symmetric and asymmetric use cases

Symmetric similarity
    sentence A <-> sentence B
    goal: are these semantically close?

Asymmetric retrieval
    short query -> longer document or passage
    goal: does this document answer the query well?

same vector idea, different evaluation behavior

This distinction matters because a model that groups similar descriptions together may still underperform on search tasks where short user queries must find longer, answer-bearing passages.
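Some retrieval-tuned models make this distinction explicit by expecting different role prefixes on the two sides of an asymmetric pair, so queries and passages are encoded differently. The "query: " / "passage: " convention sketched below follows the E5 model family; treat it as an example and check your model's documentation before adopting it:

```python
# Prefixing convention used by some asymmetric retrieval models
# (e.g., the E5 family). Other models need no prefixes at all.
def prepare_asymmetric(query, passages):
    """Attach role prefixes before sending texts to the embedding model."""
    return {
        "query": f"query: {query}",
        "passages": [f"passage: {p}" for p in passages],
    }

print(prepare_asymmetric(
    "reset password",
    ["To change your password, open Settings and choose Security."],
))
```

Forgetting the prefixes, or mixing prefixed and unprefixed texts in one index, is a common cause of mysteriously poor retrieval with such models.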

What good evaluation looks like

Embedding quality should be judged by behavior on realistic nearest-neighbor tasks, not by whether example pairs look intuitively reasonable. Good evaluation uses the actual objects, language, and mistakes your system will face.

pairs = [
    ("reset password", "how to change my password", 1),
    ("reset password", "shipping times for Europe", 0),
    ("invoice download", "billing history and invoice export", 1),
    ("invoice download", "cancel account permanently", 0),
]

for query, text, label in pairs:
    print({"query": query, "candidate": text, "relevant": label})

Useful checks include whether relevant items appear near the top, whether hard negatives are separated well, and whether the model still works after domain-specific terms, abbreviations, or multilingual inputs are introduced.
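A minimal version of the top-ranking check looks like this. The vectors are hand-picked placeholders standing in for real model output; in an actual evaluation they would come from your embedding model, and the assertion would run over the whole labeled set:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Placeholder vectors standing in for real embedding-model output.
VECTORS = {
    "reset password":             [0.90, 0.10, 0.00],
    "how to change my password":  [0.85, 0.15, 0.05],
    "shipping times for Europe":  [0.05, 0.90, 0.20],
}

def rank(query, candidates):
    """Order candidates by similarity to the query, best first."""
    q = VECTORS[query]
    return sorted(candidates, key=lambda c: cosine(q, VECTORS[c]), reverse=True)

top = rank("reset password",
           ["shipping times for Europe", "how to change my password"])[0]
print(top)  # the relevant candidate should rank first
```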

input text -> embedding model -> vector geometry -> task behavior

good geometry for your use case means:
    relevant items rank high
    hard negatives stay separated
    domain terms map sensibly
    long-text compression does not erase key meaning


good geometry is measured by outcomes, not aesthetics

Scope of this module

This module is about embeddings themselves: what they represent, how vector geometry is used, and which practical factors affect quality. Keep index structures and database operations in the vector database module, keep chunk design in the chunking/indexing module, and keep Word2Vec, GloVe, fastText, and static lexical-vector history in the word embeddings module.

To-do list

Learn

  • Understand what an embedding represents and why dense vector geometry is useful.
  • Learn cosine similarity, dot product, and the role of normalization.
  • Study the difference between sentence-level, query-level, and document-level embeddings.
  • Understand why embedding quality depends on training objective, domain, and task.
  • Learn the difference between symmetric similarity and asymmetric retrieval use cases.

Practice

  • Generate embeddings for a small corpus and inspect nearest neighbors for both good matches and bad matches.
  • Compare cosine similarity scores for paraphrases, related-but-not-equivalent pairs, and unrelated pairs.
  • Create hard negatives that share topic words but do not answer the query, then inspect ranking behavior.
  • Test whether short queries behave differently from full-sentence similarity queries.
  • Compare two embedding models on the same labeled retrieval set and note where each one fails.

Build

  • Create a small semantic-similarity explorer that shows top neighbors and similarity scores for a query.
  • Build a duplicate-detection tool that separates exact duplicates, near duplicates, and merely related texts.
  • Add a mini evaluation set with positives and hard negatives, then measure whether relevant items stay near the top.
  • Write a short embedding selection note explaining which model, similarity metric, and text granularity you would use for your task.