Traditional / Sparse Methods
Bag of Words (BoW)
The simplest way to convert unstructured text into a structured numeric format. Each word in the text is treated as a feature, and the number of times a word appears is used as its value. Grammar and word order are disregarded; only frequency is kept.
Example
Consider 3 sentences:
- The cat in the hat
- The dog in the house
- The bird in the sky
| | bird | cat | dog | hat | house | in | sky | the |
|---|---|---|---|---|---|---|---|---|
| Sent 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 |
| Sent 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 |
| Sent 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # renamed from get_feature_names() in scikit-learn 1.0
print(bow.toarray())
Strengths
- Simple and fast to compute
- Works well for keyword-driven tasks (spam detection, topic classification)
Weaknesses
- Loses word order: "dog bites man" and "man bites dog" produce the same vector
- No semantic meaning: all words are treated equally
- Sparse, high-dimensional vectors that grow with vocabulary size
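The same counting can also be done by hand, without scikit-learn. A minimal sketch with `collections.Counter` over the three example sentences, to make the mechanics explicit:

```python
from collections import Counter

sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

# Build a sorted vocabulary over all lowercased tokens
tokens = [s.lower().split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})

# Each sentence becomes a vector of word counts over the vocabulary
vectors = [[Counter(sent)[w] for w in vocab] for sent in tokens]

print(vocab)       # ['bird', 'cat', 'dog', 'hat', 'house', 'in', 'sky', 'the']
print(vectors[0])  # [0, 1, 0, 1, 0, 1, 0, 2]
```

The vector length equals the vocabulary size, which is why BoW representations grow with the corpus.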
N-grams
An N-gram breaks text into contiguous sequences of N words. This preserves some local word order that BoW discards entirely.
Example
Sentence: "The dog in the house"
- Unigrams (N=1): "The", "dog", "in", "the", "house"
- Bigrams (N=2): "The dog", "dog in", "in the", "the house"
- Trigrams (N=3): "The dog in", "dog in the", "in the house"
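A sketch of how these sequences are produced with a sliding window; `ngrams` here is a hypothetical helper, not a library function:

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The dog in the house".split()
print(ngrams(tokens, 1))  # ['The', 'dog', 'in', 'the', 'house']
print(ngrams(tokens, 2))  # ['The dog', 'dog in', 'in the', 'the house']
print(ngrams(tokens, 3))  # ['The dog in', 'dog in the', 'in the house']
```

In practice, scikit-learn's `CountVectorizer(ngram_range=(1, 2))` does this extraction as part of vectorization.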
Strengths
- Captures local context and phrases (e.g. "not good" as a bigram is very different from "good" alone)
- Useful for detecting negation, collocations, and multi-word expressions
Weaknesses
- Feature space explodes as N grows: the vocabulary becomes enormous
- Still no semantic meaning: similar phrases with different wording are unrelated
- Sparse representation
TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF improves on BoW by weighting words based on their importance. Words that appear frequently in one document but rarely across all documents get higher scores. Common words like "the" or "is" are downweighted.
Formula
TF(t, d) = (number of times term t appears in document d) ÷ (total number of terms in document d)
IDF(t) = log(total number of documents ÷ number of documents containing term t)
TF-IDF(t, d) = TF(t, d) × IDF(t)
Intuition
- A word that appears often in ONE document but rarely in others → high TF-IDF (important, distinctive word)
- A word that appears in almost every document → low TF-IDF (common, uninformative word)
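The formula can be checked by hand on a tokenized, lowercased version of the example documents used in this section. This uses the raw definition above; note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ:

```python
import math

docs = [["the", "cat", "jumped"],
        ["the", "white", "tiger", "roared"],
        ["bird", "flying", "in", "the", "sky"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "cat" appears in 1 of 3 documents: distinctive, nonzero score
print(tfidf("cat", docs[0], docs))  # (1/3) * log(3) ≈ 0.366
# "the" appears in every document: IDF = log(3/3) = 0
print(tfidf("the", docs[0], docs))  # 0.0
```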
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat jumped",
        "The white tiger roared",
        "Bird flying in the sky"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # renamed from get_feature_names() in scikit-learn 1.0
print(tfidf.toarray())
Strengths
- Highlights informative, distinctive terms
- Better than raw counts for document ranking and retrieval
Weaknesses
- Still sparse and lexical: no semantic similarity
- Synonyms are treated as completely unrelated words
- Ignores word order
Dense / Modern Methods
Word Embedding
Word embeddings represent each word as a dense vector of real numbers such that semantically similar or related words are closer together in vector space. This is achieved by training a neural network on a large corpus of text, where the model learns to predict surrounding words. The semantic meaning of each word is captured in its vector.
Key Properties
- Dense, low-dimensional vectors (typically 50–300 dimensions for Word2Vec/GloVe, thousands for language models)
- Similar words cluster together in the vector space
- Captures semantic relationships:
King − Man + Woman ≈ Queen
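A toy illustration of this vector arithmetic, using hand-made 2-dimensional vectors (the dimensions loosely encode "royalty" and "maleness"; real embeddings learn hundreds of opaque dimensions):

```python
import numpy as np

# Hypothetical 2-d embeddings: [royalty, maleness]
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# The nearest word to the result (by Euclidean distance) is "queen"
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # queen
```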
Popular Methods
- Word2Vec (Google, 2013) – CBOW and Skip-gram architectures
- GloVe (Stanford) – based on word co-occurrence statistics
- fastText (Facebook) – extends Word2Vec with subword (character n-gram) information
from gensim.models import Word2Vec

corpus = ["The cat jumped",
          "The white tiger roared",
          "Bird flying in the sky"]
corpus = [sent.split(" ") for sent in corpus]

# Train a Word2Vec model
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=2)

# Get the vector for a word
vector = model.wv["cat"]

# Find most similar words
similar_words = model.wv.most_similar("cat", topn=5)
Limitation
One fixed vector per word regardless of context: "bank" (river) and "bank" (money) get the same embedding. Contextual embeddings (BERT, GPT) solve this.
Sentence Embedding
Similar to word embeddings, but an entire sentence is represented as a single numerical vector in a high-dimensional space. The goal is to capture the meaning and semantic relationships between words in a sentence, as well as the context in which the sentence is used.
Popular Methods
- Doc2Vec – extends Word2Vec to learn paragraph/sentence-level vectors
- Sentence-BERT (SBERT) – fine-tunes BERT to produce meaningful sentence embeddings
- Universal Sentence Encoder – pre-trained model from Google for sentence-level tasks
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["The cat jumped",
             "The white tiger roared",
             "Bird flying in the sky"]

tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)])
               for i, sentence in enumerate(sentences)]

model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Infer an embedding for a new sentence
embedding = model.infer_vector("The white tiger roared".split())
print(embedding)
Use Cases
- Semantic similarity and paraphrase detection
- Clustering similar sentences or support tickets
- Retrieval and search systems
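Most of these use cases reduce to comparing embedding vectors with cosine similarity. A minimal sketch with NumPy, using made-up low-dimensional vectors in place of real model outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sentence embeddings (real ones have hundreds of dimensions)
query      = np.array([0.9, 0.1, 0.3])
paraphrase = np.array([0.8, 0.2, 0.4])
unrelated  = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(query, paraphrase))  # high, close to 1
print(cosine_similarity(query, unrelated))   # low
```

In a retrieval system, the query embedding is compared against all stored embeddings and the highest-scoring matches are returned.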
Document Embedding
Document embedding represents an entire document (paragraph, article, or book) as a single vector. It captures not only the meaning and context of individual sentences but also the relationships and coherence between sentences within the document.
The technique is similar to sentence embedding but applied to longer multi-sentence texts.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [
    "This is the first document. It has multiple sentences.",
    "This is the second document. It is shorter than the first.",
    "This is the third document. It is longer and has multiple paragraphs."
]

tagged_data = [TaggedDocument(words=doc.split(), tags=[str(i)])
               for i, doc in enumerate(documents)]

model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Get the embedding for the first document (model.dv replaces model.docvecs in gensim 4.x)
embedding = model.dv[0]
print(embedding)
Use Cases
- Document classification and topic modelling
- Information retrieval across large corpora
- Plagiarism detection and document similarity
Comparison
Evolution of Text Representation
| Traditional (Sparse) | Modern (Dense) |
|---|---|
| BoW – word counts, no order | Word Embedding – dense, semantic, per-word |
| N-grams – local word sequences | Sentence Embedding – dense, per-sentence |
| TF-IDF – weighted word importance | Document Embedding – dense, per-document |
| Technique | Granularity | Captures Semantics? | Captures Order? | Dimensionality |
|---|---|---|---|---|
| Bag of Words | Document | No | No | Sparse, high |
| N-grams | Document | No | Local only | Sparse, very high |
| TF-IDF | Document | No | No | Sparse, high |
| Word Embedding | Word | Yes | No (static) | Dense, low (50–300) |
| Sentence Embedding | Sentence | Yes | Yes (contextual) | Dense, low (256–768) |
| Document Embedding | Document | Yes | Yes (contextual) | Dense, low (50–768) |
Rule of thumb: if the task depends mostly on whether certain terms appear, sparse representations (BoW, TF-IDF) can work very well. If the task depends on similarity, paraphrase, or semantic matching, dense embeddings usually help more. Many production systems blend both: a sparse lexical signal with a compact dense signal.
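One simple way to blend the two signals is a weighted sum of a sparse lexical score (e.g. TF-IDF or BM25) and a dense cosine score. A minimal sketch; the weight `alpha` is a hypothetical tuning parameter, and both scores are assumed pre-normalized to [0, 1]:

```python
def hybrid_score(sparse_score, dense_score, alpha=0.5):
    """Blend a lexical match score with a dense embedding similarity score.

    alpha controls the trade-off: 1.0 = purely lexical, 0.0 = purely semantic.
    """
    return alpha * sparse_score + (1 - alpha) * dense_score

# Document A: strong keyword overlap, weak semantic match
score_a = hybrid_score(sparse_score=0.9, dense_score=0.2)
# Document B: weak keyword overlap, strong semantic match
score_b = hybrid_score(sparse_score=0.1, dense_score=0.8)

print(score_a, score_b)  # 0.55 0.45
```

With `alpha=0.5` the keyword-heavy document still ranks first; lowering `alpha` would favor the semantic match instead.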