Subject 16

Text Representation Methods

Natural Language Processing (NLP) focuses on enabling machines to understand and process human language. Text representation is a crucial aspect of NLP that involves converting raw text data into a machine-readable numeric form. From traditional sparse approaches to modern dense embeddings, the chosen representation controls what information the model can use and what gets discarded.

Traditional / Sparse Methods

Bag of Words (BoW)

Bag of Words is the simplest way to convert unstructured text into a structured numeric format: each unique word becomes a feature, and the number of times it appears represents its importance. Grammar and word order are disregarded; only frequency is kept.

Example

Consider 3 sentences:

  1. The cat in the hat
  2. The dog in the house
  3. The bird in the sky
         bird  cat  dog  hat  house  in  sky  the
Sent 1:   0     1    0    1    0     1    0    2
Sent 2:   0     0    1    0    1     1    0    2
Sent 3:   1     0    0    0    0     1    1    2
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(bow.toarray())

Strengths

  - Simple to implement and easy to interpret
  - Fast to compute; a solid baseline for keyword-driven tasks such as spam filtering

Weaknesses

  - Ignores word order, grammar, and semantics ("good" and "great" are unrelated features)
  - Produces sparse, high-dimensional vectors that grow with vocabulary size

N-grams

An N-gram breaks text into contiguous sequences of N words. This preserves some local word order that BoW discards entirely.

Example

Sentence: "The dog in the house"

Uni-gram (N=1):  "The", "dog", "in", "the", "house"

Bi-gram  (N=2):  "The dog", "dog in", "in the", "the house"

Tri-gram (N=3):  "The dog in", "dog in the", "in the house"

Strengths

  - Preserves local word order and multi-word phrases ("New York", "not good")
  - A simple extension of BoW; unigrams and bigrams are often combined

Weaknesses

  - The feature space explodes as N grows, making vectors even sparser
  - Still no semantics; rare N-grams generalize poorly to unseen text

TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF improves on BoW by weighting words based on their importance. Words that appear frequently in one document but rarely across all documents get higher scores. Common words like "the" or "is" are downweighted.

Formula

TF(t, d) = (Number of times term t appears in document d)
            ÷ (Total number of terms in document d)

IDF(t) = log(Total number of documents ÷ Number of documents containing term t)

TF-IDF(t, d) = TF(t, d) × IDF(t)
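
A tiny worked example of these formulas in plain Python (natural log assumed; scikit-learn's TfidfVectorizer uses a smoothed IDF variant, so its numbers differ slightly):

```python
import math

docs = [["the", "cat", "jumped"],
        ["the", "white", "tiger", "roared"],
        ["bird", "flying", "in", "the", "sky"]]

def tf(term, doc):
    # relative frequency of the term within one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # penalize terms that occur in many documents
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("cat", docs[0], docs))  # rare word -> positive weight
print(tfidf("the", docs[0], docs))  # appears everywhere -> 0.0
```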

Intuition

A word like "the" appears in every document, so its IDF is log(N/N) = 0 and its TF-IDF score vanishes no matter how often it occurs. A word like "tiger" appears in only one document, so it gets a large IDF boost and dominates the vector of the document that contains it.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat jumped",
        "The white tiger roared",
        "Bird flying in the sky"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(tfidf.toarray())

Strengths

  - Automatically downweights ubiquitous words without a stop-word list
  - Highlights the terms that characterize a document; a strong baseline for search and text classification

Weaknesses

  - Still sparse and order-free; no notion of semantics ("car" and "automobile" stay unrelated)
  - IDF values depend on the corpus, so scores shift as documents are added

Dense / Modern Methods

Word Embedding

Word embeddings represent each word as a dense vector of real numbers such that semantically similar or related words are closer together in vector space. This is achieved by training a neural network on a large corpus of text, where the model learns to predict surrounding words. The semantic meaning of each word is captured in its vector.
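
"Closer together" is typically measured with cosine similarity. A minimal sketch with hand-picked 3-dimensional toy vectors (illustrative only, not taken from a trained model):

```python
import numpy as np

def cosine(u, v):
    # cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen so the two animals point in a similar direction
cat   = np.array([0.9, 0.8, 0.1])
tiger = np.array([0.8, 0.9, 0.2])
car   = np.array([0.1, 0.2, 0.9])

print(cosine(cat, tiger))  # high: related words
print(cosine(cat, car))    # low: unrelated words
```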

Key Properties

  - Dense, low-dimensional vectors (typically 50–300 dimensions)
  - Semantically similar words end up close together in vector space
  - Vector arithmetic can capture relations: king - man + woman ≈ queen

Popular Methods

  - Word2Vec (CBOW and Skip-gram)
  - GloVe (trained on global co-occurrence statistics)
  - FastText (subword n-grams; handles out-of-vocabulary words)

from gensim.models import Word2Vec

corpus = ["The cat jumped",
          "The white tiger roared",
          "Bird flying in the sky"]
corpus = [sent.split(" ") for sent in corpus]

# Train a Word2Vec model
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=2)

# Get the vector for a word
vector = model.wv["cat"]

# Find most similar words
similar_words = model.wv.most_similar("cat", topn=5)

Limitation

One fixed vector per word regardless of context: "bank" (river) and "bank" (money) get the same embedding. Contextual embeddings (BERT, GPT) solve this.

Sentence Embedding

Similar to word embeddings, but an entire sentence is represented as a single numerical vector in a high-dimensional space. The goal is to capture the meaning and semantic relationships between words in a sentence, as well as the context in which the sentence is used.

Popular Methods

  - Doc2Vec (an extension of Word2Vec)
  - Sentence-BERT (SBERT)
  - Universal Sentence Encoder (USE)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["The cat jumped",
             "The white tiger roared",
             "Bird flying in the sky"]

tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)])
               for i, sentence in enumerate(sentences)]

model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Infer embedding for a new sentence
embedding = model.infer_vector("The white tiger roared".split())
print(embedding)

Use Cases

  - Semantic search and question matching
  - Paraphrase and duplicate detection
  - Clustering and similarity ranking of short texts

Document Embedding

Document embedding represents an entire document (paragraph, article, or book) as a single vector. It captures not only the meaning and context of individual sentences but also the relationships and coherence between sentences within the document.

The technique is similar to sentence embedding but applied to longer multi-sentence texts.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [
    "This is the first document. It has multiple sentences.",
    "This is the second document. It is shorter than the first.",
    "This is the third document. It is longer and has multiple paragraphs."
]

tagged_data = [TaggedDocument(words=doc.split(), tags=[str(i)])
               for i, doc in enumerate(documents)]

model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Get the embedding for the first document
embedding = model.dv[0]  # docvecs was renamed to dv in gensim 4.0
print(embedding)

Use Cases

  - Document classification and topic clustering
  - Recommendation of related articles
  - Plagiarism and near-duplicate detection

Comparison

Evolution of Text Representation

Traditional (Sparse)                    Modern (Dense)
────────────────────                    ────────────────
BoW         → word counts, no order     Word Embedding  → dense, semantic, per-word
N-grams     → local word sequences      Sentence Embed. → dense, per-sentence
TF-IDF      → weighted word importance  Document Embed. → dense, per-document
Technique           Granularity  Captures Semantics?  Captures Order?   Dimensionality
Bag of Words        Document     No                   No                Sparse, high
N-grams             Document     No                   Local only        Sparse, very high
TF-IDF              Document     No                   No                Sparse, high
Word Embedding      Word         Yes                  No (static)       Dense, low (50–300)
Sentence Embedding  Sentence     Yes                  Yes (contextual)  Dense, low (256–768)
Document Embedding  Document     Yes                  Yes (contextual)  Dense, low (50–768)

Rule of thumb: if the task depends mostly on whether certain terms appear, sparse representations (BoW, TF-IDF) can work very well. If the task depends on similarity, paraphrase, or semantic matching, dense embeddings usually help more. Many production systems blend both: a sparse lexical signal with a compact dense signal.
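
A sketch of that blend: TF-IDF cosine supplies the lexical signal, and a dense cosine supplies the semantic one. The dense vectors below are random stand-ins for a real sentence-embedding model, and the fusion weight alpha is an illustrative assumption:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The cat sat on the mat",
        "A kitten rested on the rug",
        "Stock prices fell sharply today"]
query = "cat on a mat"

# Sparse lexical signal: TF-IDF cosine similarity
vec = TfidfVectorizer()
doc_tfidf = vec.fit_transform(docs)
sparse_scores = cosine_similarity(vec.transform([query]), doc_tfidf)[0]

# Dense semantic signal: placeholder vectors standing in for real embeddings
rng = np.random.default_rng(0)
doc_dense = rng.normal(size=(3, 8))
query_dense = rng.normal(size=(1, 8))
dense_scores = cosine_similarity(query_dense, doc_dense)[0]

# Weighted fusion of the two signals
alpha = 0.5
hybrid = alpha * sparse_scores + (1 - alpha) * dense_scores
best = int(np.argmax(hybrid))
print(docs[best])
```

With real embeddings, the dense signal would also rank the kitten sentence near the cat query even though they share no words.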