NLP Then vs Now
What's Changed in the LLM Era?
LLMs handle most of the heavy lifting now, but foundational techniques still give you an edge — whether it's debugging a model, optimising performance, or implementing pragmatic solutions where full-scale LLMs are overkill.
| Task | Classical Approach | Modern Approach (2025) |
|---|---|---|
| Text Classification | BoW + Logistic Regression / Naive Bayes | Fine-tuned BERT / LLM zero-shot |
| Named Entity Recognition | CRFs / HMMs with handcrafted features | Transformer-based NER (spaCy, HuggingFace) |
| Machine Translation | Seq2Seq with Attention | LLMs / Multilingual Transformers |
| Sentiment Analysis | TF-IDF + SVM / Lexicon-based | LLM prompting / Fine-tuned models |
| Search / Retrieval | BM25 / TF-IDF | Dense retrieval + RAG |
| Text Generation | RNN / LSTM language models | GPT-style autoregressive LLMs |
Text Preprocessing: Still Essential
Text Preprocessing Pipeline
Even in the LLM era, data quality matters. Garbage in, garbage out.
- Tokenization — breaking text into words, phrases, or subwords. SentencePiece and HuggingFace FastTokenizers are widely used
- Stopword Removal — removing common words like "is" and "the". Modern models often handle this implicitly, but it still helps in traditional pipelines
- Normalization — converting text to a standard format: lowercasing, stemming, lemmatization, handling special characters
- Text Cleaning — removing unwanted symbols, HTML tags, and noise
- Handling Emojis & Special Characters — some applications replace emojis with text descriptions to retain meaning
Raw text: "<p>I LOVE this product!!! 😊</p>"

Step 1 — Clean:     "I LOVE this product!!!"         (remove HTML)
Step 2 — Normalize: "i love this product"            (lowercase, strip punctuation)
Step 3 — Tokenize:  ["i", "love", "this", "product"] (split into tokens)
Step 4 — Stopwords: ["love", "product"]              (remove "i", "this")
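The four steps above can be sketched with the standard library alone. This is a toy pipeline for illustration: the regexes and the stopword list are placeholders, not what a production tokenizer would use.

```python
import re

STOPWORDS = {"i", "this", "is", "the", "a", "an"}  # toy list for illustration

def preprocess(raw: str) -> list[str]:
    text = re.sub(r"<[^>]+>", "", raw)                # Step 1: strip HTML tags
    text = re.sub(r"[^\w\s]", "", text.lower())       # Step 2: lowercase, drop punctuation/emoji
    tokens = text.split()                             # Step 3: whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # Step 4: remove stopwords

print(preprocess("<p>I LOVE this product!!! 😊</p>"))  # ['love', 'product']
```

In practice you would reach for a library tokenizer (e.g. SentencePiece or Hugging Face's tokenizers) rather than whitespace splitting, but the stages are the same.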
Feature Extraction and Representation
Bag of Words (BoW) and TF-IDF
Once text is preprocessed, it needs to be converted into a numerical format that ML models can process.
Bag of Words (BoW)
Represents text by counting word occurrences, ignoring grammar and order. Simple but effective for traditional NLP tasks like text classification.
TF-IDF
Improves upon BoW by assigning higher weights to rare but important words and lower weights to common words.
Related Concepts
- N-grams & Skip-grams — instead of single words, capture short sequences to provide more context
- Sparse Matrices — since most texts only contain a small subset of all possible words, sparse representations (like CSR format) save memory
- Feature Hashing — converts words into fixed-length numerical features, reducing memory for large datasets
When to use: document classification, sentiment analysis, and search applications where deep learning isn't feasible or isn't justified by the problem scale.
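The TF-IDF weighting can be computed from first principles in a few lines. This is a minimal sketch using raw term frequency and an unsmoothed idf = log(N / df); real libraries (e.g. scikit-learn) apply smoothing and normalisation, so exact values differ.

```python
import math
from collections import Counter

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["stock", "prices", "fell", "today"],
]

def tf_idf(docs):
    n = len(docs)
    # document frequency: how many docs each word appears in
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # tf = count / doc length; idf = log(N / df) — rare words score higher
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in counts.items()})
    return weights

w = tf_idf(docs)
# "the" appears in two docs (low idf); "mat" in only one (higher idf)
print(w[0]["the"] < w[0]["mat"])  # True
```

Note how the frequent word "the" is down-weighted relative to the rarer "mat", exactly the improvement over plain BoW counts described above.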
Word Embeddings
BoW and TF-IDF treat words as independent entities, ignoring relationships between them. Word embeddings represent words in a continuous vector space, capturing their meanings and relationships.
Word2Vec (Google, 2013)
Learns word meanings based on context using two methods:
- Continuous Bag of Words (CBOW) — predicts a word from its surrounding context. Faster to train, works well with frequent words
- Skip-gram — predicts the surrounding context given a word. Works better for infrequent words
CBOW: context words → [model] → target word
"the ___ sat" → predicts "cat"
Skip-gram: target word → [model] → context words
"cat" → predicts "the", "sat", "on", "mat"
GloVe (Stanford)
Instead of predicting words, GloVe learns from word co-occurrence matrices. It captures global statistics of word relationships, making it more stable in some cases.
FastText (Facebook)
Improves Word2Vec by using subword information (character n-grams), making it better at handling rare words or different word forms (e.g. "running" and "runner").
When to use: Word2Vec and GloVe are great for smaller tasks where deep models aren't needed. FastText is useful for languages with rich morphology or rare words.
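To make the CBOW/skip-gram distinction concrete, here is how skip-gram constructs its (target, context) training pairs from a sentence. This sketch only builds the pairs, not the embedding model itself; for actual training you would use a library such as gensim.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs the way skip-gram training does."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:                       # every neighbour within the window
                pairs.append((target, tokens[j]))
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sent, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

CBOW would use the same window but flip the direction: the context words become the input and the target word becomes the prediction.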
Deep Learning for NLP
Recurrent Neural Networks (RNNs)
RNNs process text sequentially, maintaining memory of previous words. However, they suffer from the vanishing gradient problem — gradients shrink as they are propagated back through long chains of computations, so the model struggles to learn dependencies on earlier parts of a sequence.
Input: x₁ → x₂ → x₃ → ... → xₙ
↓ ↓ ↓ ↓
State: h₁ → h₂ → h₃ → ... → hₙ → output
Each hidden state depends on the previous one
→ sequential, cannot be parallelised
→ gradients shrink over long sequences (vanishing gradient)
LSTM (Long Short-Term Memory)
LSTMs solve the vanishing gradient problem using gates:
- Input gate — decides what new information to add to the cell state
- Forget gate — removes irrelevant information from the cell state
- Output gate — decides what information to output from the cell state
GRU (Gated Recurrent Unit)
GRUs are a simplified version of LSTMs, using fewer parameters (two gates instead of three) while maintaining similar performance. Faster to train.
When to use: LSTMs/GRUs are useful for text generation, speech recognition, and sequence modelling where Transformers may be too heavy.
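The three gates can be seen directly in a single LSTM cell's forward pass. This is a bare NumPy sketch with random weights, just to show where each gate acts; a real implementation (e.g. PyTorch's `nn.LSTM`) adds training, batching, and separate bias terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    hid = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:hid])           # input gate: what new information to add
    f = sigmoid(z[hid:2 * hid])    # forget gate: what to discard from the cell state
    o = sigmoid(z[2 * hid:3 * hid])  # output gate: what to expose as the hidden state
    g = np.tanh(z[3 * hid:])       # candidate cell state
    c = f * c_prev + i * g         # cell state: forget old, add new
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
hid, inp = 4, 3
W = rng.normal(size=(4 * hid, inp + hid))
b = np.zeros(4 * hid)
h = c = np.zeros(hid)
for x in rng.normal(size=(5, inp)):  # run the cell over a 5-step sequence
    h, c = lstm_cell(x, h, c, W, b)
print(h.shape)  # (4,)
```

A GRU merges the input and forget gates into a single update gate and drops the separate cell state, which is where its parameter savings come from.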
CNNs for NLP
While CNNs were originally designed for image processing, they are surprisingly effective for NLP tasks like text classification, sentiment analysis, and NER. Instead of processing sequences word by word like RNNs, CNNs detect patterns using filters.
- 1D Convolutions — capture relationships between nearby words
- Character-level CNNs — work directly on characters, useful for noisy data (tweets, misspellings)
- CNN-RNN Hybrid Models — combine CNNs for feature extraction with RNNs for sequential dependencies
Input: "I love this movie"

Filters slide across the text:
  [I love]     → feature 1
  [love this]  → feature 2
  [this movie] → feature 3

Pool the strongest signals → classification
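The sliding-filter idea can be sketched in NumPy: a width-2 filter slides over adjacent word embeddings, and max-pooling keeps the strongest response. The embeddings and filter are random stand-ins here; in a trained model both would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(4, 5))  # toy embeddings for "I love this movie"
filt = rng.normal(size=(2, 5))        # one 1D conv filter spanning 2 words

# Slide the filter: [I love], [love this], [this movie]
features = np.array([
    np.sum(embeddings[i:i + 2] * filt)
    for i in range(embeddings.shape[0] - 1)
])
pooled = features.max()  # max-pooling keeps the strongest signal
print(features.shape)    # (3,)
```

A real CNN classifier uses many filters of several widths, then feeds the pooled features to a dense layer.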
CNNs were fast and efficient, making them a popular choice for NLP before Transformers took over.
Sequence-to-Sequence (Seq2Seq) Models
The Seq2Seq architecture was a game-changer for translation, text summarisation, and chatbots. It introduced an encoder-decoder structure where the encoder processes input text into a hidden representation, and the decoder generates output text step by step.
Encoder: "I love NLP" → [h₁, h₂, h₃] → context vector
Decoder: context vector → "J'adore le TAL" → <EOS>
Key Improvements
- Attention Mechanisms — helped models focus on relevant words in longer sequences instead of compressing everything into one vector
- Beam Search — allowed models to explore multiple output candidates, generating more fluent text
Limitations
- Still suffered from long-term dependency issues on very long sequences
- Sequential processing — difficult to parallelise, making training slow
These limitations directly motivated the development of the Transformer.
The Transformer: A New Era
In 2017, Transformers changed NLP forever. Introduced in the paper "Attention Is All You Need" (Vaswani et al.), Transformers addressed the core weaknesses of Seq2Seq models:
- Self-Attention Mechanism — lets models look at all words in a sentence simultaneously, rather than step by step
- Positional Encoding — preserves word order without requiring sequential processing
- Massive Parallelisation — made training incredibly fast and efficient on GPUs
Evolution of NLP Architectures:

RNNs (sequential)
  ↓ solved vanishing gradients
LSTMs / GRUs (gated sequential)
  ↓ added attention to encoder-decoder
Seq2Seq + Attention (still sequential encoder)
  ↓ replaced recurrence entirely with self-attention
Transformers (fully parallel)
  ↓ scaled up
BERT, GPT, T5, LLMs (modern era)
This led to models like BERT (encoder-only, understanding tasks), GPT (decoder-only, generation tasks), and T5 (encoder-decoder, both), which now power almost every modern NLP system.
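The self-attention step that makes all of this possible fits in a few lines of NumPy. This sketch omits the learned query/key/value projections and multiple heads of a real Transformer — it sets Q = K = V = X to show the core mechanism: every position mixes information from every other position in one parallel matrix operation.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention (single head, no learned projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # every token scores every token at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X             # each output is a weighted mix of all tokens

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))        # 4 tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)  # (4, 8) — same shape, but every position saw the whole sequence
```

Because the score matrix is computed in one shot rather than step by step, the whole sequence can be processed in parallel on a GPU — the property RNNs and Seq2Seq models lacked.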
Summary
When to Use What
| Technique | Best For | Still Relevant? |
|---|---|---|
| BoW / TF-IDF | Simple classification, search, baselines | Yes — fast, interpretable, good baselines |
| Word2Vec / GloVe | Small-scale semantic tasks, analogy tests | Somewhat — replaced by contextual embeddings |
| FastText | Morphologically rich languages, rare words | Yes — subword handling still useful |
| RNN / LSTM / GRU | Sequence modelling, time series | Niche — mostly replaced by Transformers |
| CNNs for NLP | Fast text classification | Rare — Transformers are better |
| Seq2Seq + Attention | Translation, summarisation | Conceptually — Transformers evolved from this |
| Transformers / LLMs | Almost everything | Dominant approach in 2025 |
Key takeaway: understanding classical techniques helps you work better with modern AI. They provide fast baselines, interpretable results, and the conceptual foundation that Transformers and LLMs are built upon.