Subject 07

Embeddings

Embeddings are dense numerical representations that map text or other objects into vector spaces where useful similarity becomes measurable. They are a core building block for semantic search, retrieval, clustering, recommendations, classification, and many modern LLM application pipelines.

Beginner

What an embedding actually is

An embedding maps an object such as a word, sentence, paragraph, document, query, or item description to a list of numbers. That list is not useful because of any single coordinate by itself. It is useful because the full vector places the object inside a geometric space where distance or angle can reflect relatedness.

raw text -> embedding model -> dense vector -> compare in vector space

"reset my password"        -> [0.18, -0.07, 0.41, ...]
"forgot my login details" -> [0.16, -0.05, 0.39, ...]

nearby vectors -> likely similar intent
farther vectors -> likely different meaning or topic

This matters because exact word overlap is often too weak for real applications. Two texts can mean nearly the same thing while sharing few surface words.
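The weakness of surface overlap is easy to quantify. A set-based word-overlap score (Jaccard similarity, computed from scratch here) rates the two password requests above as barely related, even though their intent is nearly identical:

```python
def jaccard(a: str, b: str) -> float:
    """Fraction of shared words between two texts (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# Near-identical intent, almost no shared words: only "my" overlaps,
# so the score is about 0.17.
print(jaccard("reset my password", "forgot my login details"))
```

An embedding model is trained to place both requests close together despite this low lexical overlap, which is exactly what a keyword score cannot do.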

Where embeddings help

Embeddings are used whenever you need similarity-aware behavior rather than exact matching alone.

Use case                               How embeddings help
-------------------------------------  --------------------------------------------------------------
Semantic search                        Retrieve relevant text even when the wording differs from the query
Clustering                             Group texts or items by meaning instead of by exact shared vocabulary
Recommendations                        Find related items, products, or documents from nearby vectors
Classification features                Represent each input compactly before a downstream classifier makes a decision
Deduplication and anomaly detection    Detect near-duplicates or outliers from neighborhood structure

Similarity in practice

Cosine similarity is common because it compares direction rather than raw magnitude. Many text embedding setups normalize vectors to unit length, so cosine similarity, dot product, and Euclidean-distance ranking become closely related or even identical.

import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector lengths. It measures angle, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no defined direction.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Parallel vectors point the same way, so the score is ~1.0.
print(cosine_similarity([1, 2], [2, 4]))
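The normalization claim above can be checked directly. After scaling two vectors to unit length, cosine similarity equals the plain dot product, and squared Euclidean distance is a fixed function of it (2 - 2·dot), so all three order candidates the same way. The vectors below reuse the illustrative values from the earlier diagram:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([0.18, -0.07, 0.41])
b = normalize([0.16, -0.05, 0.39])

# On unit vectors, cosine similarity IS the dot product, and
# squared Euclidean distance equals 2 - 2 * dot(a, b).
cos_ab = dot(a, b)
euclid_sq = sum((x - y) ** 2 for x, y in zip(a, b))
print(cos_ab, euclid_sq, 2 - 2 * cos_ab)
```

This is why some vector stores only support dot product: with normalized embeddings, nothing is lost.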

Real-world example: "How do I reset my password?" and "I forgot my login credentials" share limited lexical overlap but should still land close together in a support system.

What the vector represents

Not every embedding represents the same kind of object. The vector might stand for a single token, a short query, a sentence, or a whole document. That granularity changes what information is preserved and what is blurred away.

Advanced

Embedding spaces are shaped by the training objective

Embeddings are not generic truth machines. The geometry is learned from an objective. If a model is trained to pull matched query-document pairs together, it may be excellent for retrieval but only average for unsupervised clustering. If it is trained on broad semantic similarity, it may group related topics well but fail to separate very fine-grained distinctions your application needs.

Design choices that affect quality

Embedding granularity
    Why it matters: Sentence vectors and document vectors capture different levels of meaning
    Common mistake: Using one long-document vector when answer-bearing details are local

Symmetric vs asymmetric matching
    Why it matters: A short query and a long answer passage are not always best served by identical training assumptions
    Common mistake: Treating query-to-document search the same as document-to-document similarity

Pooling strategy
    Why it matters: How token-level signals become one vector affects what information survives
    Common mistake: Assuming every mean-pooled embedding preserves key phrases equally well

Truncation
    Why it matters: Long inputs may be cut before embedding, silently dropping important context
    Common mistake: Embedding long documents without checking token limits or lost sections

Metric and normalization
    Why it matters: Similarity ranking depends on how vectors are compared and whether they are normalized
    Common mistake: Mixing cosine, dot-product, and raw unnormalized vectors without validation
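The truncation risk is the easiest of these to guard against before embedding anything. Real limits are counted in model-specific tokens, not words, so the 256-word threshold below is an illustrative stand-in rather than any model's actual limit; the point is to detect silently dropped content rather than discover it later through bad retrieval:

```python
# Rough pre-embedding length check. Replace the word-count proxy with
# your model's tokenizer and real token limit in practice.
APPROX_LIMIT_WORDS = 256  # illustrative stand-in, not a real model limit

def check_truncation_risk(text, limit=APPROX_LIMIT_WORDS):
    """Flag texts that would lose content if cut at `limit` words."""
    words = text.split()
    if len(words) <= limit:
        return {"at_risk": False, "dropped_words": 0}
    return {"at_risk": True, "dropped_words": len(words) - limit}

doc = "word " * 300  # a 300-word document
print(check_truncation_risk(doc))
```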

Symmetric and asymmetric use cases

Symmetric similarity
    sentence A <-> sentence B
    goal: are these semantically close?

Asymmetric retrieval
    short query -> longer document or passage
    goal: does this document answer the query well?

same vector idea, different evaluation behavior

This distinction matters because a model that groups similar descriptions together may still underperform on search tasks where short user queries must find longer, answer-bearing passages.
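Some retrieval-tuned models make this distinction explicit by expecting different role prefixes on the two sides of an asymmetric pair, so queries and passages are encoded differently. The "query: " / "passage: " convention sketched below follows the E5 model family; treat it as an example and check your model's documentation before adopting it:

```python
# Prefixing convention used by some asymmetric retrieval models
# (e.g., the E5 family). Other models need no prefixes at all.
def prepare_asymmetric(query, passages):
    """Attach role prefixes before sending texts to the embedding model."""
    return {
        "query": f"query: {query}",
        "passages": [f"passage: {p}" for p in passages],
    }

print(prepare_asymmetric(
    "reset password",
    ["To change your password, open Settings and choose Security."],
))
```

Forgetting the prefixes, or mixing prefixed and unprefixed texts in one index, is a common cause of mysteriously poor retrieval with such models.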

What good evaluation looks like

Embedding quality should be judged by behavior on realistic nearest-neighbor tasks, not by whether example pairs look intuitively reasonable. Good evaluation uses the actual objects, language, and mistakes your system will face.

pairs = [
    ("reset password", "how to change my password", 1),
    ("reset password", "shipping times for Europe", 0),
    ("invoice download", "billing history and invoice export", 1),
    ("invoice download", "cancel account permanently", 0),
]

for query, text, label in pairs:
    print({"query": query, "candidate": text, "relevant": label})

Useful checks include whether relevant items appear near the top, whether hard negatives are separated well, and whether the model still works after domain-specific terms, abbreviations, or multilingual inputs are introduced.
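A minimal version of the top-ranking check looks like this. The vectors are hand-picked placeholders standing in for real model output; in an actual evaluation they would come from your embedding model, and the assertion would run over the whole labeled set:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Placeholder vectors standing in for real embedding-model output.
VECTORS = {
    "reset password":             [0.90, 0.10, 0.00],
    "how to change my password":  [0.85, 0.15, 0.05],
    "shipping times for Europe":  [0.05, 0.90, 0.20],
}

def rank(query, candidates):
    """Order candidates by similarity to the query, best first."""
    q = VECTORS[query]
    return sorted(candidates, key=lambda c: cosine(q, VECTORS[c]), reverse=True)

top = rank("reset password",
           ["shipping times for Europe", "how to change my password"])[0]
print(top)  # the relevant candidate should rank first
```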

input text -> embedding model -> vector geometry -> task behavior

good geometry for your use case means:
    relevant items rank high
    hard negatives stay separated
    domain terms map sensibly
    long-text compression does not erase key meaning


good geometry is measured by outcomes, not aesthetics

Scope of this module

This module is about embeddings themselves: what they represent, how vector geometry is used, and which practical factors affect quality. Keep index structures and database operations in the vector database module, keep chunk design in the chunking/indexing module, and keep Word2Vec, GloVe, fastText, and static lexical-vector history in the word embeddings module.

To-do list

Learn

  • Understand what an embedding represents and why dense vector geometry is useful.
  • Learn cosine similarity, dot product, and the role of normalization.
  • Study the difference between sentence-level, query-level, and document-level embeddings.
  • Understand why embedding quality depends on training objective, domain, and task.
  • Learn the difference between symmetric similarity and asymmetric retrieval use cases.

Practice

  • Generate embeddings for a small corpus and inspect nearest neighbors for both good matches and bad matches.
  • Compare cosine similarity scores for paraphrases, related-but-not-equivalent pairs, and unrelated pairs.
  • Create hard negatives that share topic words but do not answer the query, then inspect ranking behavior.
  • Test whether short queries behave differently from full-sentence similarity queries.
  • Compare two embedding models on the same labeled retrieval set and note where each one fails.

Build

  • Create a small semantic-similarity explorer that shows top neighbors and similarity scores for a query.
  • Build a duplicate-detection tool that separates exact duplicates, near duplicates, and merely related texts.
  • Add a mini evaluation set with positives and hard negatives, then measure whether relevant items stay near the top.
  • Write a short embedding selection note explaining which model, similarity metric, and text granularity you would use for your task.