Subject 25

NLP Evaluation Metrics

The key metrics used to evaluate NLP models — from classification metrics like accuracy, precision, recall, and F1 to generation metrics like BLEU, ROUGE, perplexity, and Word Error Rate.

Classification Metrics

Confusion Matrix

All classification metrics are derived from four counts in the confusion matrix:

                     Predicted Positive    Predicted Negative
Actual Positive           TP                      FN
Actual Negative           FP                      TN

TP = True Positive   — correctly predicted positive
FP = False Positive  — predicted positive, but actually negative (Type I error)
FN = False Negative  — predicted negative, but actually positive (Type II error)
TN = True Negative   — correctly predicted negative

Accuracy

Accuracy measures the fraction of all predictions that are correct.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example

100 emails: 90 legitimate, 10 spam
Model predicts ALL as "legitimate"

TP = 0, TN = 90, FP = 0, FN = 10
Accuracy = (0 + 90) / 100 = 90%

90% accuracy — but it caught ZERO spam!
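The spam example above can be reproduced with a few lines of Python; this is a minimal sketch of the accuracy formula, not a library implementation:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Model predicts every email as "legitimate": TP = 0, TN = 90, FP = 0, FN = 10
print(accuracy(tp=0, tn=90, fp=0, fn=10))  # 0.9, yet zero spam caught
```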

When Accuracy Misleads

Accuracy is misleading with imbalanced datasets. If 99% of transactions are legitimate, a model that always predicts "not fraud" gets 99% accuracy while catching zero fraud. Use precision, recall, and F1 instead.

Use accuracy when: classes are roughly balanced and all errors are equally costly.

Precision

Precision answers: "When the model says positive, how often is it actually correct?"

Precision = TP / (TP + FP)

Example

Model flags 20 emails as spam
15 are actually spam (TP = 15)
5 are legitimate (FP = 5)

Precision = 15 / (15 + 5) = 15 / 20 = 0.75 (75%)

When Precision Matters Most

High precision = few false positives. Prioritise precision when a false positive is costly, for example a spam filter that must not block legitimate email.

Recall

Recall (also called sensitivity or true positive rate) answers: "Of all actual positives, how many did the model find?"

Recall = TP / (TP + FN)

Example

Dataset has 20 actual spam emails
Model catches 15 of them (TP = 15)
Model misses 5 (FN = 5)

Recall = 15 / (15 + 5) = 15 / 20 = 0.75 (75%)

When Recall Matters Most

High recall = few false negatives. Prioritise recall when missing a positive is costly, for example fraud detection or medical screening.
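A quick sketch of both formulas, plugged into the two spam examples above (20 flagged emails with 15 true spam; 20 actual spam with 15 caught):

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged positive, what fraction really is?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, what fraction was found?"""
    return tp / (tp + fn)

print(precision(tp=15, fp=5))  # 0.75
print(recall(tp=15, fn=5))     # 0.75
```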

F1 Score

The F1 score is the harmonic mean of precision and recall. It balances both metrics into a single number.

F1 = 2 Ɨ (Precision Ɨ Recall) / (Precision + Recall)

Why Harmonic Mean?

The harmonic mean penalises extreme imbalances. If precision is 1.0 but recall is 0.01, the arithmetic mean would be 0.505 (looks okay), but the harmonic mean (F1) is 0.02 (correctly shows the model is bad).

Example

Precision = 0.75, Recall = 0.75

F1 = 2 Ɨ (0.75 Ɨ 0.75) / (0.75 + 0.75)
   = 2 Ɨ 0.5625 / 1.5
   = 0.75
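The same calculation in Python, including the extreme case from "Why Harmonic Mean?" to show how F1 punishes imbalance:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f1(0.75, 0.75))  # 0.75
print(f1(1.0, 0.01))   # ~0.0198: the harmonic mean exposes the bad recall
```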

Variants

The general FĪ² score weights recall β times as much as precision:

Fβ = (1 + β²) Ɨ (Precision Ɨ Recall) / (β² Ɨ Precision + Recall)

F2 favours recall, F0.5 favours precision, and F1 is the special case β = 1.

Use F1 when: you need to balance precision and recall, especially on imbalanced datasets where accuracy is misleading.

The Precision-Recall Trade-off

Increasing precision typically decreases recall, and vice versa. You cannot maximise both simultaneously.

High threshold → model only flags when very confident
  → High precision, low recall (misses many positives)

Low threshold → model flags anything slightly suspicious
  → High recall, low precision (many false alarms)

The right balance depends on the cost of each type of error.

Example: Fraud Detection at a Bank

A bank processes 10,000 transactions per day. 50 are actually fraudulent.

Scenario A: High Threshold (Favour Precision)

The model only flags transactions it is very confident about.

Flagged: 20 transactions
  18 are real fraud   (TP = 18)
  2 are legitimate    (FP = 2)
  32 fraud missed     (FN = 32)

Precision = 18 / (18+2)  = 90%   ← very few false alarms
Recall    = 18 / (18+32) = 36%   ← missed 64% of fraud!

Customers are rarely bothered by false blocks, but the bank loses money on 32 undetected fraudulent transactions.

Scenario B: Low Threshold (Favour Recall)

The model flags anything slightly suspicious.

Flagged: 500 transactions
  47 are real fraud   (TP = 47)
  453 are legitimate  (FP = 453)
  3 fraud missed      (FN = 3)

Precision = 47 / (47+453) = 9.4%  ← most flags are false alarms
Recall    = 47 / (47+3)   = 94%   ← catches almost all fraud

Almost all fraud is caught, but 453 legitimate customers have their transactions blocked — causing frustration and support calls.

Scenario C: Balanced Threshold

Flagged: 80 transactions
  40 are real fraud   (TP = 40)
  40 are legitimate   (FP = 40)
  10 fraud missed     (FN = 10)

Precision = 40 / (40+40) = 50%
Recall    = 40 / (40+10) = 80%
F1        = 2Ɨ(0.5Ɨ0.8) / (0.5+0.8) = 0.615

A reasonable balance — catches 80% of fraud while keeping false alarms manageable.

Which Is Best?

Scenario     Precision   Recall   F1      Trade-off
──────────────────────────────────────────────────────────────
A (high)     90%         36%      0.51    Few false alarms, misses most fraud
B (low)      9.4%        94%      0.17    Catches fraud, overwhelmed by false alarms
C (balanced) 50%         80%      0.62    Practical middle ground
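The table's numbers can be recomputed directly from the three scenarios' counts; this sketch just re-applies the formulas above:

```python
# TP / FP / FN counts for the three fraud-detection scenarios
scenarios = {
    "A (high threshold)":     dict(tp=18, fp=2,   fn=32),
    "B (low threshold)":      dict(tp=47, fp=453, fn=3),
    "C (balanced threshold)": dict(tp=40, fp=40,  fn=10),
}

for name, c in scenarios.items():
    p = c["tp"] / (c["tp"] + c["fp"])
    r = c["tp"] / (c["tp"] + c["fn"])
    f1 = 2 * p * r / (p + r)
    print(f"{name}: P={p:.1%}  R={r:.1%}  F1={f1:.2f}")
```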

For fraud: the bank would likely lean toward higher recall (Scenario B or C)
  because missing fraud costs more than investigating false alarms.

For spam filtering: lean toward higher precision (Scenario A)
  because blocking a real email is worse than letting some spam through.

Quick Reference

Metric       Formula                          Question it answers
─────────────────────────────────────────────────────────────────────────
Accuracy     (TP+TN) / (TP+TN+FP+FN)         How often is the model correct overall?
Precision    TP / (TP+FP)                     When it says positive, is it right?
Recall       TP / (TP+FN)                     Did it find all the actual positives?
F1           2Ɨ(PƗR) / (P+R)                  Balance of precision and recall

Generation and NLP-Specific Metrics

BLEU (Bilingual Evaluation Understudy)

BLEU measures how much a generated text overlaps with one or more reference texts. Originally designed for machine translation.

How It Works

  1. Compare n-gram overlaps (unigrams, bigrams, trigrams, 4-grams) between the generated output and the reference
  2. Compute precision for each n-gram level — what fraction of generated n-grams appear in the reference
  3. Apply a brevity penalty to discourage outputs that are too short
  4. Combine into a single score between 0 and 1

Example

Reference:  "The cat is on the mat"
Generated:  "The cat sat on the mat"

Unigram matches:  "The", "cat", "on", "the", "mat" → 5/6 words match
Bigram matches:   "The cat", "on the", "the mat"   → 3/5 bigrams match

BLEU combines these n-gram precisions into one score.
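The per-n precisions in the example can be sketched as clipped n-gram precision. This is only one ingredient of BLEU (the full score combines several n-gram orders with a geometric mean and a brevity penalty), and tokens are lowercased here for matching:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision: fraction of candidate n-grams that
    also appear in the reference, with counts clipped to the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return clipped / max(sum(cand_ngrams.values()), 1)

ref = "The cat is on the mat"
gen = "The cat sat on the mat"
print(ngram_precision(gen, ref, 1))  # 5/6
print(ngram_precision(gen, ref, 2))  # 3/5
```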

Strengths and Weaknesses

Strengths: cheap to compute, language-agnostic, and correlates reasonably with human judgement at the corpus level.
Weaknesses: pure surface overlap. It gives no credit for synonyms or paraphrases, and is unreliable on single sentences.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures overlap between generated text and reference text, but focuses on recall rather than precision. Widely used for text summarisation.

Variants

ROUGE-1: unigram overlap
ROUGE-2: bigram overlap
ROUGE-L: longest common subsequence between generated and reference text

Example

Reference: "The cat is sitting on the mat"
Generated: "The cat is on the mat"

ROUGE-1 (unigram recall):
  Reference words: {The, cat, is, sitting, on, the, mat} → 7 words
  Matched in generated: {The, cat, is, on, the, mat}     → 6 matches
  ROUGE-1 Recall = 6/7 = 0.857

ROUGE-2 (bigram recall):
  Reference bigrams: {The cat, cat is, is sitting, sitting on, on the, the mat}
  Matched: {The cat, cat is, on the, the mat} → 4/6
  ROUGE-2 Recall = 4/6 = 0.667
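A simplified sketch of ROUGE-N recall that reproduces both numbers above; real ROUGE tooling also reports precision and F-measure and may apply stemming, which this ignores (tokens are lowercased for matching):

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, cand_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / max(sum(ref_ngrams.values()), 1)

ref = "The cat is sitting on the mat"
gen = "The cat is on the mat"
print(rouge_n_recall(gen, ref, 1))  # 6/7
print(rouge_n_recall(gen, ref, 2))  # 4/6
```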

BLEU vs ROUGE

BLEU                              ROUGE
─────────────────────             ─────────────────────
Precision-oriented                Recall-oriented
"How much of the output           "How much of the reference
 is in the reference?"             is captured in the output?"
Used for: translation             Used for: summarisation

Perplexity

Perplexity measures how well a language model predicts a sequence of text. It answers: "How surprised is the model by this text?"

Intuition

Perplexity = 2^(cross-entropy loss)   (with cross-entropy measured in bits; use e^loss if the loss is in nats)

Lower perplexity → model is less "surprised" → better predictions
Higher perplexity → model is more "confused" → worse predictions

Example:
  Perplexity = 10 means the model is as uncertain as choosing
  uniformly among 10 options at each token position.
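The uniform-over-10-options intuition checks out numerically. This sketch assumes the model's per-token probabilities for the text are already available:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = 2 ** (average negative log2-probability per token)."""
    cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy

# A model uniformly uncertain over 10 options at every position:
print(perplexity([0.1] * 20))  # ~10.0
```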

What It Tells You

How confidently the model assigns probability to held-out text. Useful for comparing language models and for tracking a single model's progress during training.

Limitations

Perplexity is only comparable between models that share the same tokenisation and vocabulary, and a low perplexity does not guarantee fluent, useful, or factually correct generations.

Word Error Rate (WER)

WER is the standard metric for speech recognition (ASR) and is also used for any task that compares a generated word sequence to a reference.

Formula

WER = (Substitutions + Insertions + Deletions) / Total words in reference

S = words that were replaced with a wrong word
I = extra words added that aren't in the reference
D = words from the reference that were missed

Example

Reference:  "the cat sat on the mat"     (6 words)
Predicted:  "the cat sit on a mat"

Differences:
  "sat" → "sit"     (1 Substitution)
  "the" → "a"       (1 Substitution)

WER = (2 + 0 + 0) / 6 = 2/6 = 0.333 (33.3%)
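The S/I/D counts come from the minimum-edit-distance alignment between the two word sequences; a minimal sketch using word-level Levenshtein distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = min edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on a mat"))  # 2/6
```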

Properties

WER can exceed 100% when the prediction contains many insertions.
All error types (S, I, D) are weighted equally, even though some mistakes matter more than others in practice.
It is computed from the minimum edit-distance alignment of the two word sequences.

When to use: speech-to-text evaluation, OCR quality measurement, or any sequence comparison where edit distance matters.

Summary

Quick Reference Table

Metric       Used For             What It Measures                            Key Limitation
──────────────────────────────────────────────────────────────────────────────────────────────────
Accuracy     Classification       Overall correctness                         Misleading on imbalanced data
Precision    Classification       How many predicted positives are correct    Ignores missed positives (FN)
Recall       Classification       How many actual positives were found        Ignores false alarms (FP)
F1           Classification       Balance of precision and recall             Ignores true negatives
BLEU         Translation          N-gram precision vs reference               No semantic understanding
ROUGE        Summarisation        N-gram recall vs reference                  No semantic understanding
Perplexity   Language modelling   How well the model predicts text            Doesn't measure factual correctness
WER          Speech recognition   Edit distance vs reference                  All errors weighted equally