Core concept
One-hot vs dense embeddings
One-hot vectors are sparse and encode no similarity: every word is the same distance from every other. Dense embeddings compress words into low-dimensional vectors so words used in similar contexts land near each other.
One-hot   "cat" -> [0, 0, 0, 1, 0, 0, ...]    sparse, orthogonal, no similarity
Embedding "cat" -> [0.21, -0.44, 0.83, ...]   dense, cosine similarity is meaningful

Distributional hypothesis: words with similar contexts get similar vectors.
- Dimensionality: typically 50–300 dimensions vs vocabulary-size one-hot vectors.
- Pre-trained: embeddings trained on large corpora can be reused as features (transfer learning before transformers).
- Static: one vector per word type regardless of sentence context; this is the core limitation compared to contextual models.
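The contrast above can be shown directly. A minimal sketch with toy hand-made vectors (not trained embeddings): one-hot vectors are mutually orthogonal, so cosine similarity carries no signal, while dense vectors for related words point the same way.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# One-hot: every distinct word is orthogonal to every other.
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "car": [0, 0, 1]}
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine(one_hot["cat"], one_hot["car"]))  # 0.0 -- no similarity signal

# Dense (illustrative values): related words get high cosine similarity.
dense = {"cat": [0.9, 0.8, 0.1], "dog": [0.85, 0.75, 0.2], "car": [-0.7, 0.1, 0.9]}
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot encodings "cat" is exactly as far from "dog" as from "car"; the dense space is what makes nearest-neighbor lookups meaningful.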
Word2Vec
Two training objectives
Word2Vec is a shallow two-layer neural network trained on raw text. The weights of the embedding layer become the learned vectors. Two variants exist and interviewers frequently ask you to compare them.
| Variant | Task | Strength |
|---|---|---|
| CBOW (Continuous Bag of Words) | Predict the center word from surrounding context words. | Faster, works better with frequent words and large datasets. |
| Skip-gram | Predict surrounding context words from the center word. | Better for rare words and smaller datasets; slower to train. |
CBOW:      [the, __, sat] -> "cat"    (context -> center)
Skip-gram: "cat" -> [the, sat]        (center -> context)
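A sketch of how the same sliding window produces different training examples for the two variants (the pair-generation logic, not the training itself):

```python
def training_pairs(tokens, window=1):
    """Generate (input, target) examples for CBOW and skip-gram from one sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        # All words within `window` positions of the center, excluding it.
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, center))                 # CBOW: context -> center
        skipgram.extend((center, c) for c in context)  # Skip-gram: center -> each context word
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat"])
print(cbow)  # [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(sg)    # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Note that skip-gram emits one example per (center, context) pair, which is why it sees rare words more often and trains more slowly.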
Negative sampling (key training trick)
Training a softmax over the entire vocabulary is expensive. Negative sampling turns the problem into binary classification: for each real context pair, sample k random "negative" words and train the model to score real pairs higher than random ones. Typical k = 5–20 for small datasets, 2–5 for large.
# Conceptual objective (not exact code)
# Maximize: log sigmoid(v_context · v_center)
#   + sum over k negatives: log sigmoid(-v_neg · v_center)
# Hierarchical softmax is an alternative: a binary tree over the
# vocabulary gives O(log V) updates instead of O(V).
Interview point: negative sampling is the reason Word2Vec scales to billions of words. Interviewers may ask why naive softmax is infeasible.
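The conceptual objective above can be evaluated numerically. A toy sketch with hand-made 2-d vectors and k = 2 negatives (all values illustrative), written as a loss to minimize (the negative of the objective):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(v_center, v_context, v_negatives):
    """Negative-sampling loss for one (center, context) pair: lower is better.
    Real pairs should score high; the k sampled negatives should score low."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = -math.log(sigmoid(dot(v_context, v_center)))
    for v_neg in v_negatives:
        loss -= math.log(sigmoid(-dot(v_neg, v_center)))
    return loss

center = [0.5, 0.5]
good_context = [0.6, 0.4]   # similar direction to center -> low loss
bad_context = [-0.6, -0.4]  # opposite direction -> high loss
negatives = [[-0.5, -0.3], [-0.4, 0.1]]
print(neg_sampling_loss(center, good_context, negatives)
      < neg_sampling_loss(center, bad_context, negatives))  # True
```

The key point for scaling: each update touches only 1 + k vectors instead of all V output weights.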
Hyperparameters that matter
- Window size: small windows (2–5) capture syntactic relationships; large windows (10+) capture more topical/semantic similarity.
- Embedding dimension: 100–300 is common; larger is not always better.
- Subsampling frequent words: words like "the" are randomly dropped during training to balance signal.
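The subsampling rule from the original Word2Vec paper can be sketched directly: each occurrence of word w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's corpus frequency and t is a threshold (the paper suggests around 1e-5; the frequencies below are made up):

```python
import math

def discard_prob(word_freq, t=1e-5):
    """Probability of dropping one occurrence of a word during training,
    where word_freq is the word's fraction of all corpus tokens."""
    return max(0.0, 1.0 - math.sqrt(t / word_freq))

print(round(discard_prob(0.05), 3))  # "the" at 5% of the corpus -> dropped ~98.6% of the time
print(discard_prob(1e-6))            # 0.0 -- rare words are always kept
```

The effect is the one described above: very frequent words contribute far fewer training pairs, so rare-but-informative context words get relatively more signal.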
GloVe
Global co-occurrence factorization
GloVe (Global Vectors) builds an explicit word–word co-occurrence matrix from the whole corpus, then factorizes it so that the dot product of two word vectors approximates the log of their co-occurrence probability.
Step 1: Build co-occurrence matrix X (X_ij = how often word j appears near word i).
Step 2: Minimize: sum_ij f(X_ij) (v_i · v_j + b_i + b_j - log X_ij)^2
        where f(X_ij) is a weighting function that down-weights very frequent pairs.
- Difference from Word2Vec: GloVe uses global corpus statistics directly; Word2Vec learns from local context windows iteratively.
- In practice: both produce similar quality vectors; GloVe is often faster to train on a single machine once the matrix is built.
- Still static: one vector per word, same polysemy limitation as Word2Vec.
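The two steps above can be sketched on a toy corpus. Step 1 is plain counting; for Step 2 only the weighting function f is shown, with the GloVe paper's values x_max = 100 and alpha = 0.75 (the factorization itself is omitted):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Step 1: symmetric co-occurrence counts X_ij from one tokenized text."""
    X = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[(w, tokens[j])] += 1
    return X

def glove_weight(x, x_max=100, alpha=0.75):
    """f(X_ij): grows with the count but caps at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

X = cooccurrence("the cat sat on the mat".split())
print(X[("cat", "sat")])  # 1
print(glove_weight(1))    # small weight for a rare pair
print(glove_weight(500))  # 1.0 -- capped, so "the"-like pairs don't dominate
```

This is the sense in which GloVe uses global statistics: the counts are accumulated once over the whole corpus before any vectors are fit.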
fastText
Subword character n-grams
fastText (Facebook, 2016) represents each word as the sum of its character n-gram vectors, with boundary markers added. For example, with n = 3, "playing" is padded to "<playing>" and decomposed into <pl, pla, lay, ayi, yin, ing, ng> plus the full word token.
- OOV handling: unknown words still get a vector by composing their n-grams, a major advantage over Word2Vec and GloVe.
- Morphology: "run", "runs", "running" share n-grams so their vectors are naturally close.
- Misspellings: "colour" and "color" share most n-grams, reducing sensitivity to spelling variation.
- Tradeoff: larger model size and longer training vs plain Word2Vec.
Interview point: when asked "how would you handle out-of-vocabulary words with static embeddings?", fastText is the standard answer.
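A sketch of the subword idea: extract boundary-marked n-grams, then average the vectors of whichever n-grams are known, so an unseen word still gets a representation. The vector table and its values below are illustrative, not real fastText (which also hashes n-grams and sums over n = 3 to 6).

```python
def char_ngrams(word, n=3):
    """fastText-style character n-grams with boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat"))  # ['<ca', 'cat', 'at>']

def oov_vector(word, ngram_vectors, dim=3):
    """Compose a vector for any word by averaging its known n-gram vectors."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][d] for g in grams) / len(grams) for d in range(dim)]

# Toy n-gram table (made-up values): "cats" is OOV but shares n-grams with "cat".
table = {"<ca": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6], "ats": [0.2, 0.2, 0.2]}
print(oov_vector("cats", table))  # nonzero: composed from '<ca', 'cat', 'ats'
```

Because "cats" inherits most of its n-grams from "cat", its composed vector lands near "cat", which is exactly the morphology and OOV behavior described above.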
Similarity and analogies
Cosine similarity
Vector direction is more informative than magnitude, so embeddings are compared with cosine similarity rather than Euclidean distance.
def cosine_similarity(a, b):
dot = sum(x * y for x, y in zip(a, b))
norm_a = sum(x * x for x in a) ** 0.5
norm_b = sum(x * x for x in b) ** 0.5
return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
king = [0.20, 0.90, -0.10]
queen = [0.19, 0.88, -0.08]
apple = [-0.50, 0.11, 0.72]
print(cosine_similarity(king, queen)) # high (~0.999)
print(cosine_similarity(king, apple)) # low
Vector arithmetic and analogies
Well-trained embedding spaces exhibit linear regularities. The canonical example:
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome     (capital-country relationship)
bigger - big + small ≈ smaller    (comparative morphology)
Interview caveat: analogy arithmetic works on Word2Vec/GloVe benchmarks but is unreliable in practice. It is a useful intuition, not proof of semantic understanding. Bring this up if asked; it shows depth.
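The arithmetic can be sketched with tiny hand-made vectors (real systems search a full vocabulary of trained, normalized embeddings; the 2-d values below are illustrative, with one dimension loosely "royalty" and one "gender"):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 2-d embedding space (made-up values).
vecs = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.1],
}

def analogy(a, b, c, vecs):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("king", "man", "woman", vecs))  # queen
```

Note the standard detail that interviewers probe: the three query words must be excluded from the search, otherwise the nearest neighbor of king - man + woman is usually "king" itself.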
Key limitations (interview must-knows)
| Limitation | Example | Fix |
|---|---|---|
| Polysemy | "bank" gets one vector for river-bank and financial-bank. | Contextual embeddings (ELMo, BERT). |
| OOV words | Unknown words have no vector in Word2Vec/GloVe. | fastText subword vectors. |
| Bias | Gender/racial stereotypes from the training corpus appear in neighbors and analogies. | Debiasing post-processing or curated training data. |
| Frequency imbalance | Rare words get weaker vectors due to less training signal. | Subsampling, subword methods, or more data. |
| No context | Same vector regardless of sentence position. | Contextual models (ELMo, BERT, GPT). |
Quick comparison
| Method | Objective | OOV | Key strength |
|---|---|---|---|
| Word2Vec CBOW | Predict center from context. | No | Fast, good for frequent words. |
| Word2Vec Skip-gram | Predict context from center. | No | Better for rare words. |
| GloVe | Factorize global co-occurrence matrix. | No | Uses corpus-wide statistics. |
| fastText | Skip-gram on character n-grams. | Yes | Handles morphology and OOV. |
Static vs contextual: all four methods above are static, with one vector per word type. ELMo, BERT, and GPT produce contextual embeddings where the vector for a word changes depending on its sentence. This distinction is almost always tested.