Core concept
One-hot vs dense embeddings
One-hot vectors are sparse and encode no similarity: every word is the same distance from every other. Dense embeddings compress words into low-dimensional vectors so words used in similar contexts land near each other.
One-hot   "cat" -> [0, 0, 0, 1, 0, 0, ...]    sparse, orthogonal, no similarity
Embedding "cat" -> [0.21, -0.44, 0.83, ...]   dense, cosine similarity is meaningful

Distributional hypothesis: words with similar contexts get similar vectors.
- Dimensionality: typically 50–300 dimensions vs vocabulary-size one-hot vectors.
- Pre-trained: embeddings trained on large corpora can be reused as features (transfer learning before transformers).
- Static: one vector per word type regardless of sentence context; this is the core limitation compared to contextual models.
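The contrast above can be shown directly. A minimal sketch with toy hand-made vectors (not trained embeddings): one-hot vectors are mutually orthogonal, so cosine similarity carries no signal, while dense vectors for related words point the same way.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# One-hot: every distinct word is orthogonal to every other.
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "car": [0, 0, 1]}
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine(one_hot["cat"], one_hot["car"]))  # 0.0 -- no similarity signal

# Dense (illustrative values): related words get high cosine similarity.
dense = {"cat": [0.9, 0.8, 0.1], "dog": [0.85, 0.75, 0.2], "car": [-0.7, 0.1, 0.9]}
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot encodings "cat" is exactly as far from "dog" as from "car"; the dense space is what makes nearest-neighbor lookups meaningful.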
Word2Vec
Two training objectives
Word2Vec is a shallow two-layer neural network trained on raw text. The weights of the embedding layer become the learned vectors. Two variants exist and interviewers frequently ask you to compare them.
| Variant | Task | Strength |
|---|---|---|
| CBOW (Continuous Bag of Words) | Predict the center word from surrounding context words. | Faster, works better with frequent words and large datasets. |
| Skip-gram | Predict surrounding context words from the center word. | Better for rare words and smaller datasets; slower to train. |
CBOW:      [the, __, sat] -> "cat"    (context -> center)
Skip-gram: "cat" -> [the, sat]        (center -> context)
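A sketch of how the same sliding window produces different training examples for the two variants (the pair-generation logic, not the training itself):

```python
def training_pairs(tokens, window=1):
    """Generate (input, target) examples for CBOW and skip-gram from one sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        # All words within `window` positions of the center, excluding it.
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, center))                 # CBOW: context -> center
        skipgram.extend((center, c) for c in context)  # Skip-gram: center -> each context word
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat"])
print(cbow)  # [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(sg)    # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Note that skip-gram emits one example per (center, context) pair, which is why it sees rare words more often and trains more slowly.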
Negative sampling (key training trick)
Training a softmax over the entire vocabulary is expensive. Negative sampling turns the problem into binary classification: for each real context pair, sample k random "negative" words and train the model to score real pairs higher than random ones. Typical k = 5–20 for small datasets, 2–5 for large.
# Conceptual objective (not exact code)
# Maximize: log sigmoid(v_context · v_center)
#   + sum over k negatives: log sigmoid(-v_neg · v_center)
# Hierarchical softmax is an alternative: a binary tree over the
# vocabulary gives O(log V) updates instead of O(V).
Interview point: negative sampling is the reason Word2Vec scales to billions of words. Interviewers may ask why naive softmax is infeasible.
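The conceptual objective above can be evaluated numerically. A toy sketch with hand-made 2-d vectors and k = 2 negatives (all values illustrative), written as a loss to minimize (the negative of the objective):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(v_center, v_context, v_negatives):
    """Negative-sampling loss for one (center, context) pair: lower is better.
    Real pairs should score high; the k sampled negatives should score low."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = -math.log(sigmoid(dot(v_context, v_center)))
    for v_neg in v_negatives:
        loss -= math.log(sigmoid(-dot(v_neg, v_center)))
    return loss

center = [0.5, 0.5]
good_context = [0.6, 0.4]   # similar direction to center -> low loss
bad_context = [-0.6, -0.4]  # opposite direction -> high loss
negatives = [[-0.5, -0.3], [-0.4, 0.1]]
print(neg_sampling_loss(center, good_context, negatives)
      < neg_sampling_loss(center, bad_context, negatives))  # True
```

The key point for scaling: each update touches only 1 + k vectors instead of all V output weights.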
Hyperparameters that matter
- Window size: small windows (2–5) capture syntactic relationships; large windows (10+) capture more topical/semantic similarity.
- Embedding dimension: 100–300 is common; larger is not always better.
- Subsampling frequent words: words like "the" are randomly dropped during training to balance signal.
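The subsampling rule from the original Word2Vec paper can be sketched directly: each occurrence of word w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's corpus frequency and t is a threshold (the paper suggests around 1e-5; the frequencies below are made up):

```python
import math

def discard_prob(word_freq, t=1e-5):
    """Probability of dropping one occurrence of a word during training,
    where word_freq is the word's fraction of all corpus tokens."""
    return max(0.0, 1.0 - math.sqrt(t / word_freq))

print(round(discard_prob(0.05), 3))  # "the" at 5% of the corpus -> dropped ~98.6% of the time
print(discard_prob(1e-6))            # 0.0 -- rare words are always kept
```

The effect is the one described above: very frequent words contribute far fewer training pairs, so rare-but-informative context words get relatively more signal.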
GloVe
Global co-occurrence factorization
GloVe (Global Vectors) builds an explicit word–word co-occurrence matrix from the whole corpus, then factorizes it so that the dot product of two word vectors approximates the log of their co-occurrence probability.
Step 1: Build co-occurrence matrix X (X_ij = how often word j appears near word i).
Step 2: Minimize: sum_ij f(X_ij) (v_i · v_j + b_i + b_j - log X_ij)^2
        where f(X_ij) is a weighting function that down-weights very frequent pairs.
- Difference from Word2Vec: GloVe uses global corpus statistics directly; Word2Vec learns from local context windows iteratively.
- In practice: both produce similar quality vectors; GloVe is often faster to train on a single machine once the matrix is built.
- Still static: one vector per word, same polysemy limitation as Word2Vec.
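The two steps above can be sketched on a toy corpus. Step 1 is plain counting; for Step 2 only the weighting function f is shown, with the GloVe paper's values x_max = 100 and alpha = 0.75 (the factorization itself is omitted):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Step 1: symmetric co-occurrence counts X_ij from one tokenized text."""
    X = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[(w, tokens[j])] += 1
    return X

def glove_weight(x, x_max=100, alpha=0.75):
    """f(X_ij): grows with the count but caps at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

X = cooccurrence("the cat sat on the mat".split())
print(X[("cat", "sat")])  # 1
print(glove_weight(1))    # small weight for a rare pair
print(glove_weight(500))  # 1.0 -- capped, so "the"-like pairs don't dominate
```

This is the sense in which GloVe uses global statistics: the counts are accumulated once over the whole corpus before any vectors are fit.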
fastText
Subword character n-grams
fastText (Facebook, 2016) represents each word as the sum of its character n-gram vectors, with boundary markers added. For example, with n = 3, "playing" is padded to "<playing>" and decomposed into <pl, pla, lay, ayi, yin, ing, ng> plus the full word token.
- OOV handling: unknown words still get a vector by composing their n-grams, a major advantage over Word2Vec and GloVe.
- Morphology: "run", "runs", "running" share n-grams so their vectors are naturally close.
- Misspellings: "colour" and "color" share most n-grams, reducing sensitivity to spelling variation.
- Tradeoff: larger model size and longer training vs plain Word2Vec.
Interview point: when asked "how would you handle out-of-vocabulary words with static embeddings?", fastText is the standard answer.
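A sketch of the subword idea: extract boundary-marked n-grams, then average the vectors of whichever n-grams are known, so an unseen word still gets a representation. The vector table and its values below are illustrative, not real fastText (which also hashes n-grams and sums over n = 3 to 6).

```python
def char_ngrams(word, n=3):
    """fastText-style character n-grams with boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat"))  # ['<ca', 'cat', 'at>']

def oov_vector(word, ngram_vectors, dim=3):
    """Compose a vector for any word by averaging its known n-gram vectors."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][d] for g in grams) / len(grams) for d in range(dim)]

# Toy n-gram table (made-up values): "cats" is OOV but shares n-grams with "cat".
table = {"<ca": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6], "ats": [0.2, 0.2, 0.2]}
print(oov_vector("cats", table))  # nonzero: composed from '<ca', 'cat', 'ats'
```

Because "cats" inherits most of its n-grams from "cat", its composed vector lands near "cat", which is exactly the morphology and OOV behavior described above.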
Similarity and analogies
Cosine similarity
Vector direction is more informative than magnitude, so embeddings are compared with cosine similarity rather than Euclidean distance.
def cosine_similarity(a, b):
dot = sum(x * y for x, y in zip(a, b))
norm_a = sum(x * x for x in a) ** 0.5
norm_b = sum(x * x for x in b) ** 0.5
return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
king = [0.20, 0.90, -0.10]
queen = [0.19, 0.88, -0.08]
apple = [-0.50, 0.11, 0.72]
print(cosine_similarity(king, queen)) # high (~0.999)
print(cosine_similarity(king, apple)) # low
Vector arithmetic and analogies
Well-trained embedding spaces exhibit linear regularities. The canonical example:
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome     (capital-country relationship)
bigger - big + small ≈ smaller    (comparative morphology)
Interview caveat: analogy arithmetic works on Word2Vec/GloVe benchmarks but is unreliable in practice. It is a useful intuition, not proof of semantic understanding. Bring this up if asked; it shows depth.
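The arithmetic can be sketched with tiny hand-made vectors (real systems search a full vocabulary of trained, normalized embeddings; the 2-d values below are illustrative, with one dimension loosely "royalty" and one "gender"):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 2-d embedding space (made-up values).
vecs = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.1],
}

def analogy(a, b, c, vecs):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("king", "man", "woman", vecs))  # queen
```

Note the standard detail that interviewers probe: the three query words must be excluded from the search, otherwise the nearest neighbor of king - man + woman is usually "king" itself.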
Key limitations (interview must-knows)
| Limitation | Example | Fix |
|---|---|---|
| Polysemy | "bank" gets one vector for river-bank and financial-bank. | Contextual embeddings (ELMo, BERT). |
| OOV words | Unknown words have no vector in Word2Vec/GloVe. | fastText subword vectors. |
| Bias | Gender/racial stereotypes from the training corpus appear in neighbors and analogies. | Debiasing post-processing or curated training data. |
| Frequency imbalance | Rare words get weaker vectors due to less training signal. | Subsampling, subword methods, or more data. |
| No context | Same vector regardless of sentence position. | Contextual models (ELMo, BERT, GPT). |
Quick comparison
| Method | Objective | OOV | Key strength |
|---|---|---|---|
| Word2Vec CBOW | Predict center from context. | No | Fast, good for frequent words. |
| Word2Vec Skip-gram | Predict context from center. | No | Better for rare words. |
| GloVe | Factorize global co-occurrence matrix. | No | Uses corpus-wide statistics. |
| fastText | Skip-gram on character n-grams. | Yes | Handles morphology and OOV. |
Static vs contextual: all four methods above are static, with one vector per word type. ELMo, BERT, and GPT produce contextual embeddings where the vector for a word changes depending on its sentence. This distinction is almost always tested.