Subject 25

NLP Evaluation Metrics

The key metrics used to evaluate NLP models — from classification metrics like accuracy, precision, recall, and F1 to generation metrics like BLEU, ROUGE, perplexity, and Word Error Rate.

Classification Metrics

Confusion Matrix

All classification metrics are derived from four counts in the confusion matrix:

                     Predicted Positive    Predicted Negative
Actual Positive           TP                      FN
Actual Negative           FP                      TN

TP = True Positive   — correctly predicted positive
FP = False Positive  — predicted positive, but actually negative (Type I error)
FN = False Negative  — predicted negative, but actually positive (Type II error)
TN = True Negative   — correctly predicted negative

Accuracy

Accuracy measures the fraction of all predictions that are correct.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example

100 emails: 90 legitimate, 10 spam
Model predicts ALL as "legitimate"

TP = 0, TN = 90, FP = 0, FN = 10
Accuracy = (0 + 90) / 100 = 90%

90% accuracy — but it caught ZERO spam!
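The spam example above can be reproduced with a few lines of Python; this is a minimal sketch of the accuracy formula, not a library implementation:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Model predicts every email as "legitimate": TP = 0, TN = 90, FP = 0, FN = 10
print(accuracy(tp=0, tn=90, fp=0, fn=10))  # 0.9, yet zero spam caught
```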

When Accuracy Misleads

Accuracy is misleading with imbalanced datasets. If 99% of transactions are legitimate, a model that always predicts "not fraud" gets 99% accuracy while catching zero fraud. Use precision, recall, and F1 instead.

Use accuracy when: classes are roughly balanced and all errors are equally costly.

Precision

Precision answers: "When the model says positive, how often is it actually correct?"

Precision = TP / (TP + FP)

Example

Model flags 20 emails as spam
15 are actually spam (TP = 15)
5 are legitimate (FP = 5)

Precision = 15 / (15 + 5) = 15 / 20 = 0.75 (75%)

When Precision Matters Most

High precision = few false positives. Prioritise precision when a false positive is costly, for example a spam filter that must not block legitimate email.

Recall

Recall (also called sensitivity or true positive rate) answers: "Of all actual positives, how many did the model find?"

Recall = TP / (TP + FN)

Example

Dataset has 20 actual spam emails
Model catches 15 of them (TP = 15)
Model misses 5 (FN = 5)

Recall = 15 / (15 + 5) = 15 / 20 = 0.75 (75%)

When Recall Matters Most

High recall = few false negatives. Prioritise recall when missing a positive is costly, for example fraud detection or medical screening.
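A quick sketch of both formulas, plugged into the two spam examples above (20 flagged emails with 15 true spam; 20 actual spam with 15 caught):

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged positive, what fraction really is?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, what fraction was found?"""
    return tp / (tp + fn)

print(precision(tp=15, fp=5))  # 0.75
print(recall(tp=15, fn=5))     # 0.75
```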

F1 Score

The F1 score is the harmonic mean of precision and recall. It balances both metrics into a single number.

F1 = 2 Ɨ (Precision Ɨ Recall) / (Precision + Recall)

Why Harmonic Mean?

The harmonic mean penalises extreme imbalances. If precision is 1.0 but recall is 0.01, the arithmetic mean would be 0.505 (looks okay), but the harmonic mean (F1) is 0.02 (correctly shows the model is bad).

Example

Precision = 0.75, Recall = 0.75

F1 = 2 Ɨ (0.75 Ɨ 0.75) / (0.75 + 0.75)
   = 2 Ɨ 0.5625 / 1.5
   = 0.75
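The same calculation in Python, including the extreme case from "Why Harmonic Mean?" to show how F1 punishes imbalance:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f1(0.75, 0.75))  # 0.75
print(f1(1.0, 0.01))   # ~0.0198: the harmonic mean exposes the bad recall
```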

Variants

The general FĪ² score weights recall β times as much as precision:

Fβ = (1 + β²) Ɨ (Precision Ɨ Recall) / (β² Ɨ Precision + Recall)

F2 favours recall, F0.5 favours precision, and F1 is the special case β = 1.

Use F1 when: you need to balance precision and recall, especially on imbalanced datasets where accuracy is misleading.

The Precision-Recall Trade-off

Increasing precision typically decreases recall, and vice versa. You cannot maximise both simultaneously.

High threshold → model only flags when very confident
  → High precision, low recall (misses many positives)

Low threshold → model flags anything slightly suspicious
  → High recall, low precision (many false alarms)

The right balance depends on the cost of each type of error.

Example: Fraud Detection at a Bank

A bank processes 10,000 transactions per day. 50 are actually fraudulent.

Scenario A: High Threshold (Favour Precision)

The model only flags transactions it is very confident about.

Flagged: 20 transactions
  18 are real fraud   (TP = 18)
  2 are legitimate    (FP = 2)
  32 fraud missed     (FN = 32)

Precision = 18 / (18+2)  = 90%   ← very few false alarms
Recall    = 18 / (18+32) = 36%   ← missed 64% of fraud!

Customers are rarely bothered by false blocks, but the bank loses money on 32 undetected fraudulent transactions.

Scenario B: Low Threshold (Favour Recall)

The model flags anything slightly suspicious.

Flagged: 500 transactions
  47 are real fraud   (TP = 47)
  453 are legitimate  (FP = 453)
  3 fraud missed      (FN = 3)

Precision = 47 / (47+453) = 9.4%  ← most flags are false alarms
Recall    = 47 / (47+3)   = 94%   ← catches almost all fraud

Almost all fraud is caught, but 453 legitimate customers have their transactions blocked — causing frustration and support calls.

Scenario C: Balanced Threshold

Flagged: 80 transactions
  40 are real fraud   (TP = 40)
  40 are legitimate   (FP = 40)
  10 fraud missed     (FN = 10)

Precision = 40 / (40+40) = 50%
Recall    = 40 / (40+10) = 80%
F1        = 2Ɨ(0.5Ɨ0.8) / (0.5+0.8) = 0.615

A reasonable balance — catches 80% of fraud while keeping false alarms manageable.

Which Is Best?

Scenario     Precision   Recall   F1      Trade-off
──────────────────────────────────────────────────────────────
A (high)     90%         36%      0.51    Few false alarms, misses most fraud
B (low)      9.4%        94%      0.17    Catches fraud, overwhelmed by false alarms
C (balanced) 50%         80%      0.62    Practical middle ground
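The table's numbers can be recomputed directly from the three scenarios' counts; this sketch just re-applies the formulas above:

```python
# TP / FP / FN counts for the three fraud-detection scenarios
scenarios = {
    "A (high threshold)":     dict(tp=18, fp=2,   fn=32),
    "B (low threshold)":      dict(tp=47, fp=453, fn=3),
    "C (balanced threshold)": dict(tp=40, fp=40,  fn=10),
}

for name, c in scenarios.items():
    p = c["tp"] / (c["tp"] + c["fp"])
    r = c["tp"] / (c["tp"] + c["fn"])
    f1 = 2 * p * r / (p + r)
    print(f"{name}: P={p:.1%}  R={r:.1%}  F1={f1:.2f}")
```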

For fraud: the bank would likely lean toward higher recall (Scenario B or C)
  because missing fraud costs more than investigating false alarms.

For spam filtering: lean toward higher precision (Scenario A)
  because blocking a real email is worse than letting some spam through.

Quick Reference

Metric       Formula                          Question it answers
─────────────────────────────────────────────────────────────────────────
Accuracy     (TP+TN) / (TP+TN+FP+FN)         How often is the model correct overall?
Precision    TP / (TP+FP)                     When it says positive, is it right?
Recall       TP / (TP+FN)                     Did it find all the actual positives?
F1           2Ɨ(PƗR) / (P+R)                  Balance of precision and recall

Generation and NLP-Specific Metrics

BLEU (Bilingual Evaluation Understudy)

BLEU measures how much a generated text overlaps with one or more reference texts. Originally designed for machine translation.

How It Works

  1. Compare n-gram overlaps (unigrams, bigrams, trigrams, 4-grams) between the generated output and the reference
  2. Compute precision for each n-gram level — what fraction of generated n-grams appear in the reference
  3. Apply a brevity penalty to discourage outputs that are too short
  4. Combine into a single score between 0 and 1

Example

Reference:  "The cat is on the mat"
Generated:  "The cat sat on the mat"

Unigram matches:  "The", "cat", "on", "the", "mat" → 5/6 words match
Bigram matches:   "The cat", "on the", "the mat"   → 3/5 bigrams match

BLEU combines these n-gram precisions into one score.
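The per-n precisions in the example can be sketched as clipped n-gram precision. This is only one ingredient of BLEU (the full score combines several n-gram orders with a geometric mean and a brevity penalty), and tokens are lowercased here for matching:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision: fraction of candidate n-grams that
    also appear in the reference, with counts clipped to the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return clipped / max(sum(cand_ngrams.values()), 1)

ref = "The cat is on the mat"
gen = "The cat sat on the mat"
print(ngram_precision(gen, ref, 1))  # 5/6
print(ngram_precision(gen, ref, 2))  # 3/5
```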

Strengths and Weaknesses

Strengths: cheap to compute, language-agnostic, and correlates reasonably with human judgement at the corpus level.
Weaknesses: pure surface overlap. It gives no credit for synonyms or paraphrases, and is unreliable on single sentences.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures overlap between generated text and reference text, but focuses on recall rather than precision. Widely used for text summarisation.

Variants

ROUGE-1: unigram overlap
ROUGE-2: bigram overlap
ROUGE-L: longest common subsequence between generated and reference text

Example

Reference: "The cat is sitting on the mat"
Generated: "The cat is on the mat"

ROUGE-1 (unigram recall):
  Reference words: {The, cat, is, sitting, on, the, mat} → 7 words
  Matched in generated: {The, cat, is, on, the, mat}     → 6 matches
  ROUGE-1 Recall = 6/7 = 0.857

ROUGE-2 (bigram recall):
  Reference bigrams: {The cat, cat is, is sitting, sitting on, on the, the mat}
  Matched: {The cat, cat is, on the, the mat} → 4/6
  ROUGE-2 Recall = 4/6 = 0.667
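A simplified sketch of ROUGE-N recall that reproduces both numbers above; real ROUGE tooling also reports precision and F-measure and may apply stemming, which this ignores (tokens are lowercased for matching):

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, cand_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / max(sum(ref_ngrams.values()), 1)

ref = "The cat is sitting on the mat"
gen = "The cat is on the mat"
print(rouge_n_recall(gen, ref, 1))  # 6/7
print(rouge_n_recall(gen, ref, 2))  # 4/6
```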

BLEU vs ROUGE

BLEU                              ROUGE
─────────────────────             ─────────────────────
Precision-oriented                Recall-oriented
"How much of the output           "How much of the reference
 is in the reference?"             is captured in the output?"
Used for: translation             Used for: summarisation

Perplexity

Perplexity measures how well a language model predicts a sequence of text. It answers: "How surprised is the model by this text?"

Intuition

Perplexity = 2^(cross-entropy loss)   (with cross-entropy measured in bits; use e^loss if the loss is in nats)

Lower perplexity → model is less "surprised" → better predictions
Higher perplexity → model is more "confused" → worse predictions

Example:
  Perplexity = 10 means the model is as uncertain as choosing
  uniformly among 10 options at each token position.
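The uniform-over-10-options intuition checks out numerically. This sketch assumes the model's per-token probabilities for the text are already available:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = 2 ** (average negative log2-probability per token)."""
    cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy

# A model uniformly uncertain over 10 options at every position:
print(perplexity([0.1] * 20))  # ~10.0
```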

What It Tells You

How confidently the model assigns probability to held-out text. Useful for comparing language models and for tracking a single model's progress during training.

Limitations

Perplexity is only comparable between models that share the same tokenisation and vocabulary, and a low perplexity does not guarantee fluent, useful, or factually correct generations.

Word Error Rate (WER)

WER is the standard metric for speech recognition (ASR) and is also used for any task that compares a generated word sequence to a reference.

Formula

WER = (Substitutions + Insertions + Deletions) / Total words in reference

S = words that were replaced with a wrong word
I = extra words added that aren't in the reference
D = words from the reference that were missed

Example

Reference:  "the cat sat on the mat"     (6 words)
Predicted:  "the cat sit on a mat"

Differences:
  "sat" → "sit"     (1 Substitution)
  "the" → "a"       (1 Substitution)

WER = (2 + 0 + 0) / 6 = 2/6 = 0.333 (33.3%)
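The S/I/D counts come from the minimum-edit-distance alignment between the two word sequences; a minimal sketch using word-level Levenshtein distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = min edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on a mat"))  # 2/6
```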

Properties

WER can exceed 100% when the prediction contains many insertions.
All error types (S, I, D) are weighted equally, even though some mistakes matter more than others in practice.
It is computed from the minimum edit-distance alignment of the two word sequences.

When to use: speech-to-text evaluation, OCR quality measurement, or any sequence comparison where edit distance matters.

Summary

Quick Reference Table

Metric       Used For             What It Measures                            Key Limitation
──────────────────────────────────────────────────────────────────────────────────────────────────
Accuracy     Classification       Overall correctness                         Misleading on imbalanced data
Precision    Classification       How many predicted positives are correct    Ignores missed positives (FN)
Recall       Classification       How many actual positives were found        Ignores false alarms (FP)
F1           Classification       Balance of precision and recall             Ignores true negatives
BLEU         Translation          N-gram precision vs reference               No semantic understanding
ROUGE        Summarisation        N-gram recall vs reference                  No semantic understanding
Perplexity   Language modelling   How well the model predicts text            Doesn't measure factual correctness
WER          Speech recognition   Edit distance vs reference                  All errors weighted equally