NLP Then vs Now
What's Changed in the LLM Era?
LLMs handle most of the heavy lifting now, but foundational techniques still give you an edge — whether it's debugging a model, optimising performance, or implementing pragmatic solutions where full-scale LLMs are overkill.
| Task | Classical Approach | Modern Approach (2025) |
|---|---|---|
| Text Classification | BoW + Logistic Regression / Naive Bayes | Fine-tuned BERT / LLM zero-shot |
| Named Entity Recognition | CRFs / HMMs with handcrafted features | Transformer-based NER (spaCy, HuggingFace) |
| Machine Translation | Seq2Seq with Attention | LLMs / Multilingual Transformers |
| Sentiment Analysis | TF-IDF + SVM / Lexicon-based | LLM prompting / Fine-tuned models |
| Search / Retrieval | BM25 / TF-IDF | Dense retrieval + RAG |
| Text Generation | RNN / LSTM language models | GPT-style autoregressive LLMs |
Text Preprocessing: Still Essential
Text Preprocessing Pipeline
Even in the LLM era, data quality matters. Garbage in, garbage out.
- Tokenization — breaking text into words, phrases, or subwords. SentencePiece and HuggingFace FastTokenizers are widely used
- Stopword Removal — removing common words like "is" and "the". Modern models often handle this implicitly, but it still helps in traditional pipelines
- Normalization — converting text to a standard format: lowercasing, stemming, lemmatization, handling special characters
- Text Cleaning — removing unwanted symbols, HTML tags, and noise
- Handling Emojis & Special Characters — some applications replace emojis with text descriptions to retain meaning
Raw text: "<p>I LOVE this product!!! 😊</p>"

Step 1 — Clean:     "I LOVE this product!!!"         (remove HTML)
Step 2 — Normalize: "i love this product"            (lowercase, strip punctuation)
Step 3 — Tokenize:  ["i", "love", "this", "product"] (split into tokens)
Step 4 — Stopwords: ["love", "product"]              (remove "i", "this")
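The four steps above can be sketched with the standard library alone. This is a toy pipeline for illustration: the regexes and the stopword list are placeholders, not what a production tokenizer would use.

```python
import re

STOPWORDS = {"i", "this", "is", "the", "a", "an"}  # toy list for illustration

def preprocess(raw: str) -> list[str]:
    text = re.sub(r"<[^>]+>", "", raw)                # Step 1: strip HTML tags
    text = re.sub(r"[^\w\s]", "", text.lower())       # Step 2: lowercase, drop punctuation/emoji
    tokens = text.split()                             # Step 3: whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # Step 4: remove stopwords

print(preprocess("<p>I LOVE this product!!! 😊</p>"))  # ['love', 'product']
```

In practice you would reach for a library tokenizer (e.g. SentencePiece or Hugging Face's tokenizers) rather than whitespace splitting, but the stages are the same.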
Feature Extraction and Representation
Bag of Words (BoW) and TF-IDF
Once text is preprocessed, it needs to be converted into a numerical format that ML models can process.
Bag of Words (BoW)
Represents text by counting word occurrences, ignoring grammar and order. Simple but effective for traditional NLP tasks like text classification.
TF-IDF
Improves upon BoW by assigning higher weights to rare but important words and lower weights to common words.
Related Concepts
- N-grams & Skip-grams — instead of single words, capture short sequences to provide more context
- Sparse Matrices — since most texts only contain a small subset of all possible words, sparse representations (like CSR format) save memory
- Feature Hashing — converts words into fixed-length numerical features, reducing memory for large datasets
When to use: document classification, sentiment analysis, and search applications where deep learning isn't feasible or isn't justified by the problem scale.
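The TF-IDF weighting can be computed from first principles in a few lines. This is a minimal sketch using raw term frequency and an unsmoothed idf = log(N / df); real libraries (e.g. scikit-learn) apply smoothing and normalisation, so exact values differ.

```python
import math
from collections import Counter

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["stock", "prices", "fell", "today"],
]

def tf_idf(docs):
    n = len(docs)
    # document frequency: how many docs each word appears in
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # tf = count / doc length; idf = log(N / df) — rare words score higher
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in counts.items()})
    return weights

w = tf_idf(docs)
# "the" appears in two docs (low idf); "mat" in only one (higher idf)
print(w[0]["the"] < w[0]["mat"])  # True
```

Note how the frequent word "the" is down-weighted relative to the rarer "mat", exactly the improvement over plain BoW counts described above.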
Word Embeddings
BoW and TF-IDF treat words as independent entities, ignoring relationships between them. Word embeddings represent words in a continuous vector space, capturing their meanings and relationships.
Word2Vec (Google, 2013)
Learns word meanings based on context using two methods:
- Continuous Bag of Words (CBOW) — predicts a word from its surrounding context. Faster to train, works well with frequent words
- Skip-gram — predicts the surrounding context given a word. Works better for infrequent words
CBOW: context words → [model] → target word
"the ___ sat" → predicts "cat"
Skip-gram: target word → [model] → context words
"cat" → predicts "the", "sat", "on", "mat"
GloVe (Stanford)
Instead of predicting words, GloVe learns from word co-occurrence matrices. It captures global statistics of word relationships, making it more stable in some cases.
FastText (Facebook)
Improves Word2Vec by using subword information (character n-grams), making it better at handling rare words or different word forms (e.g. "running" and "runner").
When to use: Word2Vec and GloVe are great for smaller tasks where deep models aren't needed. FastText is useful for languages with rich morphology or rare words.
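To make the CBOW/skip-gram distinction concrete, here is how skip-gram constructs its (target, context) training pairs from a sentence. This sketch only builds the pairs, not the embedding model itself; for actual training you would use a library such as gensim.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs the way skip-gram training does."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:                       # every neighbour within the window
                pairs.append((target, tokens[j]))
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sent, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

CBOW would use the same window but flip the direction: the context words become the input and the target word becomes the prediction.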
Deep Learning for NLP
Recurrent Neural Networks (RNNs)
RNNs process text sequentially, maintaining memory of previous words. However, they suffer from the vanishing gradient problem — gradients shrink as they are propagated back through long chains of computations, so the model struggles to learn dependencies on earlier parts of a sequence.
Input: x₁ → x₂ → x₃ → ... → xₙ
↓ ↓ ↓ ↓
State: h₁ → h₂ → h₃ → ... → hₙ → output
Each hidden state depends on the previous one
→ sequential, cannot be parallelised
→ gradients shrink over long sequences (vanishing gradient)
LSTM (Long Short-Term Memory)
LSTMs solve the vanishing gradient problem using gates:
- Input gate — decides what new information to add to the cell state
- Forget gate — removes irrelevant information from the cell state
- Output gate — decides what information to output from the cell state
GRU (Gated Recurrent Unit)
GRUs are a simplified version of LSTMs, using fewer parameters (two gates instead of three) while maintaining similar performance. Faster to train.
When to use: LSTMs/GRUs are useful for text generation, speech recognition, and sequence modelling where Transformers may be too heavy.
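The three gates can be seen directly in a single LSTM cell's forward pass. This is a bare NumPy sketch with random weights, just to show where each gate acts; a real implementation (e.g. PyTorch's `nn.LSTM`) adds training, batching, and separate bias terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    hid = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:hid])           # input gate: what new information to add
    f = sigmoid(z[hid:2 * hid])    # forget gate: what to discard from the cell state
    o = sigmoid(z[2 * hid:3 * hid])  # output gate: what to expose as the hidden state
    g = np.tanh(z[3 * hid:])       # candidate cell state
    c = f * c_prev + i * g         # cell state: forget old, add new
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
hid, inp = 4, 3
W = rng.normal(size=(4 * hid, inp + hid))
b = np.zeros(4 * hid)
h = c = np.zeros(hid)
for x in rng.normal(size=(5, inp)):  # run the cell over a 5-step sequence
    h, c = lstm_cell(x, h, c, W, b)
print(h.shape)  # (4,)
```

A GRU merges the input and forget gates into a single update gate and drops the separate cell state, which is where its parameter savings come from.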
CNNs for NLP
While CNNs were originally designed for image processing, they are surprisingly effective for NLP tasks like text classification, sentiment analysis, and NER. Instead of processing sequences word by word like RNNs, CNNs detect patterns using filters.
- 1D Convolutions — capture relationships between nearby words
- Character-level CNNs — work directly on characters, useful for noisy data (tweets, misspellings)
- CNN-RNN Hybrid Models — combine CNNs for feature extraction with RNNs for sequential dependencies
Input: "I love this movie"

Filters slide across the text:
  [I love]     → feature 1
  [love this]  → feature 2
  [this movie] → feature 3

Pool the strongest signals → classification
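The sliding-filter idea can be sketched in NumPy: a width-2 filter slides over adjacent word embeddings, and max-pooling keeps the strongest response. The embeddings and filter are random stand-ins here; in a trained model both would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(4, 5))  # toy embeddings for "I love this movie"
filt = rng.normal(size=(2, 5))        # one 1D conv filter spanning 2 words

# Slide the filter: [I love], [love this], [this movie]
features = np.array([
    np.sum(embeddings[i:i + 2] * filt)
    for i in range(embeddings.shape[0] - 1)
])
pooled = features.max()  # max-pooling keeps the strongest signal
print(features.shape)    # (3,)
```

A real CNN classifier uses many filters of several widths, then feeds the pooled features to a dense layer.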
CNNs were fast and efficient, making them a popular choice for NLP before Transformers took over.
Sequence-to-Sequence (Seq2Seq) Models
The Seq2Seq architecture was a game-changer for translation, text summarisation, and chatbots. It introduced an encoder-decoder structure where the encoder processes input text into a hidden representation, and the decoder generates output text step by step.
Encoder: "I love NLP" → [h₁, h₂, h₃] → context vector
Decoder: context vector → "J'adore le TAL" → <EOS>
Key Improvements
- Attention Mechanisms — helped models focus on relevant words in longer sequences instead of compressing everything into one vector
- Beam Search — allowed models to explore multiple output candidates, generating more fluent text
Limitations
- Still suffered from long-term dependency issues on very long sequences
- Sequential processing — difficult to parallelise, making training slow
These limitations directly motivated the development of the Transformer.
The Transformer: A New Era
In 2017, Transformers changed NLP forever. Introduced in the paper "Attention Is All You Need" (Vaswani et al.), Transformers addressed the core weaknesses of Seq2Seq models:
- Self-Attention Mechanism — lets models look at all words in a sentence simultaneously, rather than step by step
- Positional Encoding — preserves word order without requiring sequential processing
- Massive Parallelisation — made training incredibly fast and efficient on GPUs
Evolution of NLP Architectures:

RNNs (sequential)
  ↓ solved vanishing gradients
LSTMs / GRUs (gated sequential)
  ↓ added attention to encoder-decoder
Seq2Seq + Attention (still sequential encoder)
  ↓ replaced recurrence entirely with self-attention
Transformers (fully parallel)
  ↓ scaled up
BERT, GPT, T5, LLMs (modern era)
This led to models like BERT (encoder-only, understanding tasks), GPT (decoder-only, generation tasks), and T5 (encoder-decoder, both), which now power almost every modern NLP system.
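The self-attention step that makes all of this possible fits in a few lines of NumPy. This sketch omits the learned query/key/value projections and multiple heads of a real Transformer — it sets Q = K = V = X to show the core mechanism: every position mixes information from every other position in one parallel matrix operation.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention (single head, no learned projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # every token scores every token at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X             # each output is a weighted mix of all tokens

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))        # 4 tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)  # (4, 8) — same shape, but every position saw the whole sequence
```

Because the score matrix is computed in one shot rather than step by step, the whole sequence can be processed in parallel on a GPU — the property RNNs and Seq2Seq models lacked.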
Summary
When to Use What
| Technique | Best For | Still Relevant? |
|---|---|---|
| BoW / TF-IDF | Simple classification, search, baselines | Yes — fast, interpretable, good baselines |
| Word2Vec / GloVe | Small-scale semantic tasks, analogy tests | Somewhat — replaced by contextual embeddings |
| FastText | Morphologically rich languages, rare words | Yes — subword handling still useful |
| RNN / LSTM / GRU | Sequence modelling, time series | Niche — mostly replaced by Transformers |
| CNNs for NLP | Fast text classification | Rare — Transformers are better |
| Seq2Seq + Attention | Translation, summarisation | Conceptually — Transformers evolved from this |
| Transformers / LLMs | Almost everything | Dominant approach in 2025 |
Key takeaway: understanding classical techniques helps you work better with modern AI. They provide fast baselines, interpretable results, and the conceptual foundation that Transformers and LLMs are built upon.