Subject 15

Classical NLP Techniques That Still Matter

From BoW to Transformers — a dive into the foundational techniques that shaped modern NLP. While LLMs now handle most heavy lifting, understanding these methods gives you an edge in debugging models, optimising performance, and solving smaller-scale problems efficiently.

NLP Then vs Now

What's Changed in the LLM Era?

LLMs handle most of the heavy lifting now, but foundational techniques still give you an edge — whether it's debugging a model, optimising performance, or implementing pragmatic solutions where full-scale LLMs are overkill.

Task                      | Classical Approach                        | Modern Approach (2025)
--------------------------|-------------------------------------------|-------------------------------------------
Text Classification       | BoW + Logistic Regression / Naive Bayes   | Fine-tuned BERT / LLM zero-shot
Named Entity Recognition  | CRFs / HMMs with handcrafted features     | Transformer-based NER (spaCy, HuggingFace)
Machine Translation       | Seq2Seq with Attention                    | LLMs / Multilingual Transformers
Sentiment Analysis        | TF-IDF + SVM / Lexicon-based              | LLM prompting / Fine-tuned models
Search / Retrieval        | BM25 / TF-IDF                             | Dense retrieval + RAG
Text Generation           | RNN / LSTM language models                | GPT-style autoregressive LLMs

Text Preprocessing: Still Essential

Text Preprocessing Pipeline

Even in the LLM era, data quality matters. Garbage in, garbage out.

Raw text: "<p>I LOVE this product!!! 😊</p>"

Step 1 — Clean:       "I LOVE this product!!!"        (remove HTML)
Step 2 — Normalize:   "i love this product"            (lowercase, strip punctuation)
Step 3 — Tokenize:    ["i", "love", "this", "product"] (split into tokens)
Step 4 — Stopwords:   ["love", "product"]              (remove "i", "this")
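
The four steps above can be sketched with the standard library alone (the stopword list here is a tiny illustrative toy, not a standard one):

```python
import re

STOPWORDS = {"i", "this", "the", "a", "an"}  # toy list for illustration

def preprocess(raw: str) -> list[str]:
    text = re.sub(r"<[^>]+>", "", raw)               # Step 1: strip HTML tags
    text = text.lower()                              # Step 2: lowercase
    text = re.sub(r"[^\w\s]", "", text)              # Step 2: strip punctuation/emoji
    tokens = text.split()                            # Step 3: tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS] # Step 4: remove stopwords

print(preprocess("<p>I LOVE this product!!! 😊</p>"))  # ['love', 'product']
```

Real pipelines usually rely on a proper tokenizer (spaCy, NLTK) rather than whitespace splitting, but the shape of the pipeline is the same.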

Feature Extraction and Representation

Bag of Words (BoW) and TF-IDF

Once text is preprocessed, it needs to be converted into a numerical format that ML models can process.

Bag of Words (BoW)

Represents text by counting word occurrences, ignoring grammar and order. Simple but effective for traditional NLP tasks like text classification.
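A minimal BoW vectoriser from scratch (libraries like scikit-learn's CountVectorizer do the same thing with more options):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})  # fixed vocabulary

def bow_vector(doc: str) -> list[int]:
    # Count each vocabulary word; order and grammar are ignored.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)                # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vector(docs[0]))  # [1, 0, 1, 1, 1, 2]
```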

TF-IDF

Improves upon BoW by assigning higher weights to rare but important words and lower weights to common words.
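A pure-Python sketch of one common TF-IDF variant (tf = count / doc length, idf = log(N / df); libraries such as scikit-learn use smoothed formulas, so exact values differ):

```python
import math

docs = [
    ["love", "product"],
    ["love", "movie"],
    ["terrible", "product"],
]
N = len(docs)

df: dict[str, int] = {}  # document frequency per term
for d in docs:
    for term in set(d):
        df[term] = df.get(term, 0) + 1

def tfidf(term: str, doc: list[str]) -> float:
    tf = doc.count(term) / len(doc)  # how frequent in this document
    idf = math.log(N / df[term])     # rarer across the corpus -> higher weight
    return tf * idf

print(tfidf("love", docs[0]))      # common word -> lower weight
print(tfidf("terrible", docs[2]))  # rare word -> higher weight
```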

When to use: document classification, sentiment analysis, and search applications where deep learning isn't feasible or isn't justified by the problem scale.

Word Embeddings

BoW and TF-IDF treat words as independent entities, ignoring relationships between them. Word embeddings represent words in a continuous vector space, capturing their meanings and relationships.
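Similarity between embeddings is usually measured with cosine similarity. A sketch with made-up 3-dimensional vectors (real embeddings typically have 100–300 dimensions):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    # Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat = [0.9, 0.8, 0.1]  # toy vectors, not learned
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine(cat, dog))  # close to 1: similar meanings
print(cosine(cat, car))  # much lower: unrelated
```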

Word2Vec (Google, 2013)

Learns word meanings based on context using two methods:

CBOW:      context words → [model] → target word
           "the ___ sat" → predicts "cat"

Skip-gram: target word → [model] → context words
           "cat" → predicts "the", "sat", "on", "mat"
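
Skip-gram training data is just (target, context) pairs extracted with a sliding window. A sketch of the pair-generation step (the actual model training is omitted):

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    # For each position, pair the target word with its neighbours.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
for pair in skipgram_pairs(sent, window=1):
    print(pair)  # ('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...
```

CBOW simply reverses the direction: the context words become the input and the target word the prediction.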

GloVe (Stanford)

Instead of predicting words, GloVe learns from word co-occurrence matrices. It captures global statistics of word relationships, making it more stable in some cases.

FastText (Facebook)

Improves Word2Vec by using subword information (character n-grams), making it better at handling rare words or different word forms (e.g. "running" and "runner").
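The subword trick is easy to see: a word's vector is built from its character n-grams, so related forms share most of their pieces. A sketch of the n-gram extraction (boundary markers `<` and `>` as in the FastText paper):

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    padded = f"<{word}>"  # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("running"))  # ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
print(char_ngrams("runner"))   # ['<ru', 'run', 'unn', 'nne', 'ner', 'er>']
```

Because "running" and "runner" share the n-grams `<ru`, `run`, and `unn`, their vectors end up close together even if one of them is rare in the training data.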

When to use: Word2Vec and GloVe are great for smaller tasks where deep models aren't needed. FastText is useful for languages with rich morphology or rare words.

Deep Learning for NLP

Recurrent Neural Networks (RNNs)

RNNs process text sequentially, maintaining memory of previous words. However, they suffer from the vanishing gradient problem — important information from earlier parts of a sequence gets lost as it travels through long chains of computations.

Input:  x₁ → x₂ → x₃ → ... → xₙ
         ↓     ↓     ↓            ↓
State:  h₁ → h₂ → h₃ → ... → hₙ  → output

Each hidden state depends on the previous one
→ sequential, cannot be parallelised
→ gradients shrink over long sequences (vanishing gradient)
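
The vanishing gradient is simple arithmetic: backpropagation multiplies the gradient by roughly the same factor at every time step, and any factor below 1 decays exponentially. A numeric illustration (the factor 0.9 is made up):

```python
factor = 0.9      # illustrative per-step gradient scaling
gradient = 1.0
for step in range(50):
    gradient *= factor  # one multiplication per time step of backprop
print(gradient)   # ≈ 0.005 after 50 steps: the early-token signal is nearly gone
```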

LSTM (Long Short-Term Memory)

LSTMs solve the vanishing gradient problem using gates: a forget gate decides what to discard from the cell state, an input gate decides what new information to store, and an output gate decides what to expose as the hidden state. Because the cell state is updated additively rather than through repeated multiplication, gradients can survive across long sequences.
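
A scalar LSTM cell sketch with all weights fixed to 1 and biases to 0 — purely illustrative, nothing here is learned:

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def lstm_step(x: float, h_prev: float, c_prev: float) -> tuple[float, float]:
    f = sigmoid(x + h_prev)    # forget gate: keep/discard old cell state
    i = sigmoid(x + h_prev)    # input gate: how much new info to store
    g = math.tanh(x + h_prev)  # candidate cell state
    o = sigmoid(x + h_prev)    # output gate: what to expose as hidden state
    c = f * c_prev + i * g     # additive update -> gradients can flow
    h = o * math.tanh(c)
    return h, c

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:     # a short input sequence
    h, c = lstm_step(x, h, c)
print(h, c)
```

In a real LSTM each gate has its own learned weight matrices and the states are vectors, but the gate interactions are exactly these.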

GRU (Gated Recurrent Unit)

GRUs are a simplified version of LSTMs, using fewer parameters (two gates instead of three) while maintaining similar performance. Faster to train.

When to use: LSTMs/GRUs are useful for text generation, speech recognition, and sequence modelling where Transformers may be too heavy.

CNNs for NLP

While CNNs were originally designed for image processing, they are surprisingly effective for NLP tasks like text classification, sentiment analysis, and NER. Instead of processing sequences word by word like RNNs, CNNs detect patterns using filters.

Input: "I love this movie"

Filters slide across the text:
  [I love]     → feature 1
  [love this]  → feature 2
  [this movie] → feature 3

Pool the strongest signals → classification
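
The slide-and-pool idea in miniature — each word gets a scalar score (made up here; real filters operate on word vectors), a width-2 "filter" averages adjacent scores, and max pooling keeps the strongest window:

```python
scores = {"i": 0.0, "love": 0.8, "this": 0.2, "movie": 0.4}  # toy sentiment scores
tokens = ["i", "love", "this", "movie"]

windows = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
features = [(scores[a] + scores[b]) / 2 for a, b in windows]  # one feature per window

print(windows)        # [('i', 'love'), ('love', 'this'), ('this', 'movie')]
print(max(features))  # max pooling: the strongest window wins
```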

CNNs are fast and efficient, which made them a popular choice for NLP before Transformers took over.

Sequence-to-Sequence (Seq2Seq) Models

The Seq2Seq architecture was a game-changer for translation, text summarisation, and chatbots. It introduced an encoder-decoder structure where the encoder processes input text into a hidden representation, and the decoder generates output text step by step.

Encoder                              Decoder
"I love NLP" → [h₁, h₂, h₃] → context vector → "J'adore le TAL" → <EOS>
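
A caricature of the encoder half in plain Python. The "embedding" is a made-up deterministic function, only there to show the key property: however long the input, the decoder receives one fixed-size context vector:

```python
import math

def embed(token: str, dim: int = 4) -> list[float]:
    # Toy deterministic embedding from character codes (not learned).
    seed = sum(ord(c) for c in token)
    return [math.sin(seed * (i + 1)) for i in range(dim)]

def encode(tokens: list[str], dim: int = 4) -> list[float]:
    h = [0.0] * dim
    for t in tokens:  # read the input strictly left to right
        h = [math.tanh(hi + xi) for hi, xi in zip(h, embed(t, dim))]
    return h          # single fixed-size context vector

context = encode(["i", "love", "nlp"])
print(len(context))   # 4 — same size no matter how long the input is
```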

Key Improvements

Adding an attention mechanism (Bahdanau et al., 2014) let the decoder look back at every encoder hidden state instead of relying on a single context vector, which dramatically improved quality on long sentences.

Limitations

Even with attention, encoding and decoding remain strictly sequential, so training cannot be parallelised within a sequence, and very long inputs stay slow to process and hard to model.

These limitations directly motivated the development of the Transformer.

The Transformer: A New Era

In 2017, Transformers changed NLP forever. Introduced in the paper "Attention Is All You Need" (Vaswani et al.), Transformers solved everything Seq2Seq struggled with: self-attention gives every token direct access to every other token, dropping recurrence lets training parallelise across the whole sequence, and long-range dependencies no longer fade with distance.

Evolution of NLP Architectures:

RNNs (sequential)
  ↓ solved vanishing gradients
LSTMs / GRUs (gated sequential)
  ↓ added attention to encoder-decoder
Seq2Seq + Attention (still sequential encoder)
  ↓ replaced recurrence entirely with self-attention
Transformers (fully parallel)
  ↓ scaled up
BERT, GPT, T5, LLMs (modern era)

This led to models like BERT (encoder-only, understanding tasks), GPT (decoder-only, generation tasks), and T5 (encoder-decoder, both), which now power almost every modern NLP system.
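
The mechanism underneath all of these is scaled dot-product attention. A sketch for a single query over a short sequence, with toy 2-dimensional vectors and no learned projection matrices:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query: list[float], keys: list[list[float]],
              values: list[list[float]]) -> list[float]:
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output = weighted average of the values.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention([1.0, 0.0], keys, values))  # mix biased toward the similar keys
```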

Summary

When to Use What

Technique            | Best For                                    | Still Relevant?
---------------------|---------------------------------------------|-----------------------------------------------
BoW / TF-IDF         | Simple classification, search, baselines    | Yes — fast, interpretable, good baselines
Word2Vec / GloVe     | Small-scale semantic tasks, analogy tests   | Somewhat — replaced by contextual embeddings
FastText             | Morphologically rich languages, rare words  | Yes — subword handling still useful
RNN / LSTM / GRU     | Sequence modelling, time series             | Niche — mostly replaced by Transformers
CNNs for NLP         | Fast text classification                    | Rare — Transformers are better
Seq2Seq + Attention  | Translation, summarisation                  | Conceptually — Transformers evolved from this
Transformers / LLMs  | Almost everything                           | Dominant approach in 2025

Key takeaway: understanding classical techniques helps you work better with modern AI. They provide fast baselines, interpretable results, and the conceptual foundation that Transformers and LLMs are built upon.