Traditional / Sparse Methods
Bag of Words (BoW)
The simplest way to convert unstructured text into a structured numeric format. Each word in the text is treated as a feature, and the number of times a word appears is used as its value. Grammar and word order are disregarded; only frequency is kept.
Example
Consider 3 sentences:
- The cat in the hat
- The dog in the house
- The bird in the sky
| | bird | cat | dog | hat | house | in | sky | the |
|---|---|---|---|---|---|---|---|---|
| Sent 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 |
| Sent 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 |
| Sent 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # renamed from get_feature_names() in scikit-learn 1.0
print(bow.toarray())
Strengths
- Simple and fast to compute
- Works well for keyword-driven tasks (spam detection, topic classification)
Weaknesses
- Loses word order: "dog bites man" and "man bites dog" produce the same vector
- No semantic meaning: all words are treated equally
- Sparse, high-dimensional vectors that grow with vocabulary size
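The same counting can also be done by hand, without scikit-learn. A minimal sketch with `collections.Counter` over the three example sentences, to make the mechanics explicit:

```python
from collections import Counter

sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

# Build a sorted vocabulary over all lowercased tokens
tokens = [s.lower().split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})

# Each sentence becomes a vector of word counts over the vocabulary
vectors = [[Counter(sent)[w] for w in vocab] for sent in tokens]

print(vocab)       # ['bird', 'cat', 'dog', 'hat', 'house', 'in', 'sky', 'the']
print(vectors[0])  # [0, 1, 0, 1, 0, 1, 0, 2]
```

The vector length equals the vocabulary size, which is why BoW representations grow with the corpus.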
N-grams
An N-gram breaks text into contiguous sequences of N words. This preserves some local word order that BoW discards entirely.
Example
Sentence: "The dog in the house"
- Unigrams (N=1): "The", "dog", "in", "the", "house"
- Bigrams (N=2): "The dog", "dog in", "in the", "the house"
- Trigrams (N=3): "The dog in", "dog in the", "in the house"
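A sketch of how these sequences are produced with a sliding window; `ngrams` here is a hypothetical helper, not a library function:

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The dog in the house".split()
print(ngrams(tokens, 1))  # ['The', 'dog', 'in', 'the', 'house']
print(ngrams(tokens, 2))  # ['The dog', 'dog in', 'in the', 'the house']
print(ngrams(tokens, 3))  # ['The dog in', 'dog in the', 'in the house']
```

In practice, scikit-learn's `CountVectorizer(ngram_range=(1, 2))` does this extraction as part of vectorization.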
Strengths
- Captures local context and phrases (e.g. "not good" as a bigram is very different from "good" alone)
- Useful for detecting negation, collocations, and multi-word expressions
Weaknesses
- Feature space explodes as N grows: the vocabulary becomes enormous
- Still no semantic meaning: similar phrases with different wording are unrelated
- Sparse representation
TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF improves on BoW by weighting words based on their importance. Words that appear frequently in one document but rarely across all documents get higher scores. Common words like "the" or "is" are downweighted.
Formula
TF(t, d) = (number of times term t appears in document d) ÷ (total number of terms in document d)
IDF(t) = log(total number of documents ÷ number of documents containing term t)
TF-IDF(t, d) = TF(t, d) × IDF(t)
Intuition
- A word that appears often in ONE document but rarely in others → high TF-IDF (important, distinctive word)
- A word that appears in almost every document → low TF-IDF (common, uninformative word)
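The formula can be checked by hand on a tokenized, lowercased version of the example documents used in this section. This uses the raw definition above; note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ:

```python
import math

docs = [["the", "cat", "jumped"],
        ["the", "white", "tiger", "roared"],
        ["bird", "flying", "in", "the", "sky"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "cat" appears in 1 of 3 documents: distinctive, nonzero score
print(tfidf("cat", docs[0], docs))  # (1/3) * log(3) ≈ 0.366
# "the" appears in every document: IDF = log(3/3) = 0
print(tfidf("the", docs[0], docs))  # 0.0
```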
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat jumped",
        "The white tiger roared",
        "Bird flying in the sky"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # renamed from get_feature_names() in scikit-learn 1.0
print(tfidf.toarray())
Strengths
- Highlights informative, distinctive terms
- Better than raw counts for document ranking and retrieval
Weaknesses
- Still sparse and lexical: no semantic similarity
- Synonyms are treated as completely unrelated words
- Ignores word order
Dense / Modern Methods
Word Embedding
Word embeddings represent each word as a dense vector of real numbers such that semantically similar or related words are closer together in vector space. This is achieved by training a neural network on a large corpus of text, where the model learns to predict surrounding words. The semantic meaning of each word is captured in its vector.
Key Properties
- Dense, low-dimensional vectors (typically 50–300 dimensions for Word2Vec/GloVe, thousands for language models)
- Similar words cluster together in the vector space
- Captures semantic relationships:
King − Man + Woman ≈ Queen
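A toy illustration of this vector arithmetic, using hand-made 2-dimensional vectors (the dimensions loosely encode "royalty" and "maleness"; real embeddings learn hundreds of opaque dimensions):

```python
import numpy as np

# Hypothetical 2-d embeddings: [royalty, maleness]
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# The nearest word to the result (by Euclidean distance) is "queen"
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # queen
```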
Popular Methods
- Word2Vec (Google, 2013) – CBOW and Skip-gram architectures
- GloVe (Stanford) – based on word co-occurrence statistics
- fastText (Facebook) – extends Word2Vec with subword (character n-gram) information
from gensim.models import Word2Vec

corpus = ["The cat jumped",
          "The white tiger roared",
          "Bird flying in the sky"]
corpus = [sent.split(" ") for sent in corpus]

# Train a Word2Vec model
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=2)

# Get the vector for a word
vector = model.wv["cat"]

# Find most similar words
similar_words = model.wv.most_similar("cat", topn=5)
Limitation
One fixed vector per word regardless of context: "bank" (river) and "bank" (money) get the same embedding. Contextual embeddings (BERT, GPT) solve this.
Sentence Embedding
Similar to word embeddings, but an entire sentence is represented as a single numerical vector in a high-dimensional space. The goal is to capture the meaning and semantic relationships between words in a sentence, as well as the context in which the sentence is used.
Popular Methods
- Doc2Vec – extends Word2Vec to learn paragraph/sentence-level vectors
- Sentence-BERT (SBERT) – fine-tunes BERT to produce meaningful sentence embeddings
- Universal Sentence Encoder – pre-trained model from Google for sentence-level tasks
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["The cat jumped",
             "The white tiger roared",
             "Bird flying in the sky"]

tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)])
               for i, sentence in enumerate(sentences)]

model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Infer an embedding for a new sentence
embedding = model.infer_vector("The white tiger roared".split())
print(embedding)
Use Cases
- Semantic similarity and paraphrase detection
- Clustering similar sentences or support tickets
- Retrieval and search systems
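Most of these use cases reduce to comparing embedding vectors with cosine similarity. A minimal sketch with NumPy, using made-up low-dimensional vectors in place of real model outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sentence embeddings (real ones have hundreds of dimensions)
query      = np.array([0.9, 0.1, 0.3])
paraphrase = np.array([0.8, 0.2, 0.4])
unrelated  = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(query, paraphrase))  # high, close to 1
print(cosine_similarity(query, unrelated))   # low
```

In a retrieval system, the query embedding is compared against all stored embeddings and the highest-scoring matches are returned.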
Document Embedding
Document embedding represents an entire document (paragraph, article, or book) as a single vector. It captures not only the meaning and context of individual sentences but also the relationships and coherence between sentences within the document.
The technique is similar to sentence embedding but applied to longer multi-sentence texts.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [
    "This is the first document. It has multiple sentences.",
    "This is the second document. It is shorter than the first.",
    "This is the third document. It is longer and has multiple paragraphs."
]

tagged_data = [TaggedDocument(words=doc.split(), tags=[str(i)])
               for i, doc in enumerate(documents)]

model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Get the embedding for the first document (model.dv replaces model.docvecs in gensim 4.x)
embedding = model.dv[0]
print(embedding)
Use Cases
- Document classification and topic modelling
- Information retrieval across large corpora
- Plagiarism detection and document similarity
Comparison
Evolution of Text Representation
| Traditional (Sparse) | Modern (Dense) |
|---|---|
| BoW – word counts, no order | Word Embedding – dense, semantic, per-word |
| N-grams – local word sequences | Sentence Embedding – dense, per-sentence |
| TF-IDF – weighted word importance | Document Embedding – dense, per-document |
| Technique | Granularity | Captures Semantics? | Captures Order? | Dimensionality |
|---|---|---|---|---|
| Bag of Words | Document | No | No | Sparse, high |
| N-grams | Document | No | Local only | Sparse, very high |
| TF-IDF | Document | No | No | Sparse, high |
| Word Embedding | Word | Yes | No (static) | Dense, low (50–300) |
| Sentence Embedding | Sentence | Yes | Yes (contextual) | Dense, low (256–768) |
| Document Embedding | Document | Yes | Yes (contextual) | Dense, low (50–768) |
Rule of thumb: if the task depends mostly on whether certain terms appear, sparse representations (BoW, TF-IDF) can work very well. If the task depends on similarity, paraphrase, or semantic matching, dense embeddings usually help more. Many production systems blend both: a sparse lexical signal with a compact dense signal.
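One simple way to blend the two signals is a weighted sum of a sparse lexical score (e.g. TF-IDF or BM25) and a dense cosine score. A minimal sketch; the weight `alpha` is a hypothetical tuning parameter, and both scores are assumed pre-normalized to [0, 1]:

```python
def hybrid_score(sparse_score, dense_score, alpha=0.5):
    """Blend a lexical match score with a dense embedding similarity score.

    alpha controls the trade-off: 1.0 = purely lexical, 0.0 = purely semantic.
    """
    return alpha * sparse_score + (1 - alpha) * dense_score

# Document A: strong keyword overlap, weak semantic match
score_a = hybrid_score(sparse_score=0.9, dense_score=0.2)
# Document B: weak keyword overlap, strong semantic match
score_b = hybrid_score(sparse_score=0.1, dense_score=0.8)

print(score_a, score_b)  # 0.55 0.45
```

With `alpha=0.5` the keyword-heavy document still ranks first; lowering `alpha` would favor the semantic match instead.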