Subject 18

Word embeddings

Word embeddings map words into dense numeric vectors so geometry reflects how words are used in text. Understanding them is essential for NLP interviews: you will be asked to compare methods, explain training objectives, and identify failure modes.

Core concept

One-hot vs dense embeddings

One-hot vectors are sparse and encode no similarity: every word is the same distance from every other. Dense embeddings compress words into low-dimensional vectors so words used in similar contexts land near each other.

One-hot  "cat" -> [0, 0, 0, 1, 0, 0, ...]   sparse, orthogonal, no similarity
Embedding "cat" -> [0.21, -0.44, 0.83, ...]  dense, cosine similarity is meaningful

Distributional hypothesis: words with similar contexts get similar vectors.
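A quick numeric sketch of the contrast (the dense values are made up for illustration, not trained):

```python
# One-hot vectors are mutually orthogonal, so cosine similarity carries
# no information; dense vectors of related words can point the same way.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

onehot_cat = [0, 0, 0, 1, 0]
onehot_dog = [0, 0, 0, 0, 1]
print(cosine(onehot_cat, onehot_dog))  # 0.0 -- orthogonal, "no similarity"

dense_cat = [0.21, -0.44, 0.83]        # illustrative values
dense_dog = [0.25, -0.40, 0.79]
print(cosine(dense_cat, dense_dog))    # close to 1.0
```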

Word2Vec

Two training objectives

Word2Vec is a shallow two-layer neural network trained on raw text. The weights of the embedding layer become the learned vectors. Two variants exist and interviewers frequently ask you to compare them.

CBOW (Continuous Bag of Words): predicts the center word from the surrounding context words. Faster to train; works better for frequent words and large datasets.
Skip-gram: predicts the surrounding context words from the center word. Better for rare words and smaller datasets; slower to train.
CBOW:     [the, __, sat]  -> "cat"   (context -> center)
Skip-gram: "cat"          -> [the, sat]  (center -> context)
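Both variants train on the same (center, context) pairs extracted with a sliding window; a minimal sketch (function name and window size are illustrative):

```python
# Extract (center, context) training pairs with a symmetric window.
def training_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((center, context))
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
for center, context in training_pairs(sent):
    # CBOW predicts center from context; skip-gram predicts each
    # context word from the center.
    print(center, "<-", context)
```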

Negative sampling (key training trick)

Training a softmax over the entire vocabulary is expensive. Negative sampling turns the problem into binary classification: for each real context pair, sample k random "negative" words and train the model to score real pairs higher than random ones. Typical k = 5–20 for small datasets, 2–5 for large.

# Conceptual objective (not exact code)
# Maximize: log sigmoid(v_context · v_center)
# +  sum over k negatives: log sigmoid(-v_neg · v_center)

# Hierarchical softmax is an alternative: uses a binary tree
# over the vocabulary, giving O(log V) instead of O(V) per update.

Interview point: negative sampling is the reason Word2Vec scales to billions of words. Interviewers may ask why naive softmax is infeasible.
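The conceptual objective can be made concrete with toy vectors (pure Python; in the original paper negatives are drawn from the unigram distribution raised to the 0.75 power, which is omitted here):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors; a real model learns these by gradient descent on this loss.
v_center  = [0.5, 1.0]                   # e.g. "cat"
v_context = [0.4, 0.9]                   # true context word, e.g. "sat"
negatives = [[-0.8, 0.1], [0.2, -1.1]]   # k = 2 randomly sampled words

# Negative log of the objective: one positive term + k negative terms.
# No O(V) softmax over the vocabulary is ever computed.
loss = -math.log(sigmoid(dot(v_context, v_center)))
for v_neg in negatives:
    loss -= math.log(sigmoid(-dot(v_neg, v_center)))
print(loss)
```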

Hyperparameters that matter

Vector dimension: typically 100-300; larger captures more structure but needs more data.
Window size: small windows (2-5) favor syntactic similarity, large windows favor topical similarity.
Negative samples k: 5-20 for small datasets, 2-5 for large.
Minimum count: discard words seen fewer than n times; very rare words get unreliable vectors anyway.
Subsampling of frequent words: randomly drop very common words (threshold around 1e-5) to speed training and strengthen rare-word vectors.

GloVe

Global co-occurrence factorization

GloVe (Global Vectors) builds an explicit word–word co-occurrence matrix from the whole corpus, then factorizes it so that the dot product of two word vectors approximates the log of their co-occurrence probability.

Step 1: Build co-occurrence matrix X  (X_ij = how often word j appears near word i)
Step 2: Minimize:  sum_ij f(X_ij) (v_i · v_j + b_i + b_j - log X_ij)^2

f(X_ij) = weighting function that down-weights very frequent pairs.
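A numeric sketch of one term of this objective, using the weighting function from the GloVe paper (x_max = 100, alpha = 0.75); the vectors, biases, and count are toy values:

```python
import math

def f(x, x_max=100.0, alpha=0.75):
    # Down-weights very frequent pairs; capped at 1 beyond x_max.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(v_i, v_j, b_i, b_j, x_ij):
    # One summand of: sum_ij f(X_ij) (v_i . v_j + b_i + b_j - log X_ij)^2
    dot = sum(a * b for a, b in zip(v_i, v_j))
    return f(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Toy example: a word pair that co-occurred 20 times in the corpus.
term = glove_term([0.3, 0.8], [0.5, 0.6], 0.1, 0.2, 20.0)
print(term)  # training minimizes the sum of such terms over all pairs
```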

fastText

Subword character n-grams

fastText (Facebook, 2016) represents each word as the sum of its character n-gram vectors. For example, with n = 3, "playing" (padded with boundary markers to "<playing>") decomposes into <pl, pla, lay, ayi, yin, ing, ng> plus the full word token.

Interview point: when asked "how would you handle out-of-vocabulary words with static embeddings?", fastText is the standard answer.
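The decomposition is easy to sketch (trigrams only here; fastText actually uses a range of n, typically 3 to 6, plus the whole-word token):

```python
def char_ngrams(word, n=3):
    # Boundary markers < and > distinguish prefix/suffix n-grams
    # from word-internal ones ("<pl" is not the same as "pl" mid-word).
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# OOV handling: an unseen word still decomposes into n-grams, many of
# which were seen in training, so it still gets a (summed) vector.
print(char_ngrams("playful")[:3])  # shares '<pl', 'pla', 'lay' with "playing"
```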

Similarity and analogies

Cosine similarity

Vector direction is more informative than magnitude, so embeddings are compared with cosine similarity rather than Euclidean distance.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

king  = [0.20,  0.90, -0.10]
queen = [0.19,  0.88, -0.08]
apple = [-0.50, 0.11,  0.72]

print(cosine_similarity(king, queen))   # high (~0.999)
print(cosine_similarity(king, apple))   # low

Vector arithmetic and analogies

Well-trained embedding spaces exhibit linear regularities. The canonical example:

king - man + woman β‰ˆ queen

Paris - France + Italy β‰ˆ Rome   (capital-country relationship)
bigger - big + small  β‰ˆ smaller (comparative morphology)

Interview caveat: analogy arithmetic works on word2vec/GloVe benchmarks but is unreliable in practice. It is a useful intuition, not proof of semantic understanding. Bring this up if asked; it shows depth.
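The standard evaluation procedure can be sketched with a hand-built 2-d space (axis 0 loosely "male vs female", axis 1 "royalty"), constructed so the analogy works exactly; trained spaces are far noisier, per the caveat above:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Toy vectors, hand-built so king - man + woman lands exactly on queen.
vecs = {
    "king":  [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "apple": [0.0, -1.0],
}

def analogy(a, b, c):
    # Answer "a is to b as c is to ?": nearest neighbor to b - a + c,
    # excluding the three input words (the standard benchmark convention).
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("man", "king", "woman"))  # queen
```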

Key limitations: interview must-knows

Polysemy: "bank" gets one vector covering both river-bank and financial-bank. Fix: contextual embeddings (ELMo, BERT).
OOV words: unknown words have no vector in Word2Vec/GloVe. Fix: fastText subword vectors.
Bias: gender/racial stereotypes from the training corpus appear in neighbors and analogies. Fix: debiasing post-processing or curated training data.
Frequency imbalance: rare words get weaker vectors due to less training signal. Fix: subsampling, subword methods, or more data.
No context: same vector regardless of sentence position. Fix: contextual models (ELMo, BERT, GPT).

Quick comparison

Word2Vec CBOW: predicts center from context. No OOV support. Fast; good for frequent words.
Word2Vec Skip-gram: predicts context from center. No OOV support. Better for rare words.
GloVe: factorizes the global co-occurrence matrix. No OOV support. Uses corpus-wide statistics.
fastText: skip-gram over character n-grams. OOV supported via subwords. Handles morphology.

Static vs contextual: all four methods above are static, meaning one vector per word type. ELMo, BERT, and GPT produce contextual embeddings where the vector for a word changes depending on its sentence. This distinction is almost always tested.