Classification Metrics
Confusion Matrix
All classification metrics are derived from four counts in the confusion matrix:
|                 | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

- TP = True Positive — correctly predicted positive
- FP = False Positive — predicted positive, but actually negative (Type I error)
- FN = False Negative — predicted negative, but actually positive (Type II error)
- TN = True Negative — correctly predicted negative
Accuracy
Accuracy measures the fraction of all predictions that are correct.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example
- 100 emails: 90 legitimate, 10 spam
- Model predicts ALL as "legitimate"
- TP = 0, TN = 90, FP = 0, FN = 10
- Accuracy = (0 + 90) / 100 = 90%

90% accuracy — but it caught ZERO spam!
When Accuracy Misleads
Accuracy is misleading with imbalanced datasets. If 99% of transactions are legitimate, a model that always predicts "not fraud" gets 99% accuracy while catching zero fraud. Use precision, recall, and F1 instead.
Use accuracy when: classes are roughly balanced and all errors are equally costly.
Precision
Precision answers: "When the model says positive, how often is it actually correct?"
Precision = TP / (TP + FP)
Example
- Model flags 20 emails as spam
- 15 are actually spam (TP = 15)
- 5 are legitimate (FP = 5)

Precision = 15 / (15 + 5) = 15 / 20 = 0.75 (75%)
When Precision Matters Most
- Spam filtering — flagging a legitimate email as spam is very costly (user misses important mail)
- Content moderation — wrongly removing safe content damages user trust
- Medical diagnosis — false positive for a serious disease causes unnecessary stress and procedures
High precision = few false positives.
Recall
Recall (also called sensitivity or true positive rate) answers: "Of all actual positives, how many did the model find?"
Recall = TP / (TP + FN)
Example
- Dataset has 20 actual spam emails
- Model catches 15 of them (TP = 15)
- Model misses 5 (FN = 5)

Recall = 15 / (15 + 5) = 15 / 20 = 0.75 (75%)
When Recall Matters Most
- Fraud detection — missing real fraud (false negative) means direct financial loss
- Disease screening — missing a sick patient is far worse than a false alarm
- Security threats — failing to detect an intrusion can be catastrophic
High recall = few false negatives.
F1 Score
The F1 score is the harmonic mean of precision and recall. It balances both metrics into a single number.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why Harmonic Mean?
The harmonic mean penalises extreme imbalances. If precision is 1.0 but recall is 0.01, the arithmetic mean would be 0.505 (looks okay), but the harmonic mean (F1) is 0.02 (correctly shows the model is bad).
Example
Precision = 0.75, Recall = 0.75
F1 = 2 × (0.75 × 0.75) / (0.75 + 0.75) = 2 × 0.5625 / 1.5 = 0.75
Variants
- Macro F1 — compute F1 per class, then average. Treats all classes equally (good when rare classes matter)
- Micro F1 — aggregate TP, FP, FN across all classes, then compute F1. Weighted toward common classes
- F-beta — generalised version where beta controls the weight. F2 weighs recall higher, F0.5 weighs precision higher
Use F1 when: you need to balance precision and recall, especially on imbalanced datasets where accuracy is misleading.
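The formulas above can be sketched directly from the confusion-matrix counts. This is an illustrative implementation (the helper names are made up), checked against the worked examples and the harmonic-vs-arithmetic comparison:

```python
# Precision, recall, and F1 from confusion-matrix counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p = precision(tp=15, fp=5)      # 0.75
r = recall(tp=15, fn=5)         # 0.75
print(p, r, f1(p, r))           # 0.75 0.75 0.75

# Harmonic vs arithmetic mean on an extreme imbalance:
print((1.0 + 0.01) / 2)         # 0.505 -- arithmetic mean looks okay
print(round(f1(1.0, 0.01), 3))  # 0.02  -- F1 exposes the weak recall
```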
The Precision-Recall Trade-off
Increasing precision typically decreases recall, and vice versa. You cannot maximise both simultaneously.
- High threshold → model only flags when very confident → high precision, low recall (misses many positives)
- Low threshold → model flags anything slightly suspicious → high recall, low precision (many false alarms)

The right balance depends on the cost of each type of error.
Example: Fraud Detection at a Bank
A bank processes 10,000 transactions per day. 50 are actually fraudulent.
Scenario A: High Threshold (Favour Precision)
The model only flags transactions it is very confident about.
- Flagged: 20 transactions
- 18 are real fraud (TP = 18)
- 2 are legitimate (FP = 2)
- 32 fraud cases missed (FN = 32)

Precision = 18 / (18 + 2) = 90% → very few false alarms
Recall = 18 / (18 + 32) = 36% → missed 64% of fraud!
Customers are rarely bothered by false blocks, but the bank loses money on 32 undetected fraudulent transactions.
Scenario B: Low Threshold (Favour Recall)
The model flags anything slightly suspicious.
- Flagged: 500 transactions
- 47 are real fraud (TP = 47)
- 453 are legitimate (FP = 453)
- 3 fraud cases missed (FN = 3)

Precision = 47 / (47 + 453) = 9.4% → most flags are false alarms
Recall = 47 / (47 + 3) = 94% → catches almost all fraud
Almost all fraud is caught, but 453 legitimate customers have their transactions blocked — causing frustration and support calls.
Scenario C: Balanced Threshold
- Flagged: 80 transactions
- 40 are real fraud (TP = 40)
- 40 are legitimate (FP = 40)
- 10 fraud cases missed (FN = 10)

Precision = 40 / (40 + 40) = 50%
Recall = 40 / (40 + 10) = 80%
F1 = 2 × (0.5 × 0.8) / (0.5 + 0.8) = 0.615
A reasonable balance — catches 80% of fraud while keeping false alarms manageable.
Which Is Best?
| Scenario | Precision | Recall | F1 | Trade-off |
|---|---|---|---|---|
| A (high threshold) | 90% | 36% | 0.51 | Few false alarms, misses most fraud |
| B (low threshold) | 9.4% | 94% | 0.17 | Catches fraud, overwhelmed by false alarms |
| C (balanced) | 50% | 80% | 0.62 | Practical middle ground |

For fraud: the bank would likely lean toward higher recall (Scenario B or C) because missing fraud costs more than investigating false alarms.

For spam filtering: lean toward higher precision (Scenario A) because blocking a real email is worse than letting some spam through.
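The three scenarios can be reproduced from their TP/FP/FN counts. This is a small sketch (the `metrics` helper and scenario labels are invented here), useful for sanity-checking the table above:

```python
# Recompute precision, recall, and F1 for the three fraud scenarios.

def metrics(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

scenarios = {
    "A (high threshold)": (18, 2, 32),
    "B (low threshold)": (47, 453, 3),
    "C (balanced)": (40, 40, 10),
}

for name, (tp, fp, fn) in scenarios.items():
    p, r, f = metrics(tp, fp, fn)
    print(f"{name}: precision={p:.1%} recall={r:.1%} f1={f:.2f}")
```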
Quick Reference
| Metric | Formula | Question it answers |
|---|---|---|
| Accuracy | (TP+TN) / (TP+TN+FP+FN) | How often is the model correct overall? |
| Precision | TP / (TP+FP) | When it says positive, is it right? |
| Recall | TP / (TP+FN) | Did it find all the actual positives? |
| F1 | 2×(P×R) / (P+R) | Balance of precision and recall |
Generation and NLP-Specific Metrics
BLEU (Bilingual Evaluation Understudy)
BLEU measures how much a generated text overlaps with one or more reference texts. Originally designed for machine translation.
How It Works
- Compare n-gram overlaps (unigrams, bigrams, trigrams, 4-grams) between the generated output and the reference
- Compute precision for each n-gram level — what fraction of generated n-grams appear in the reference
- Apply a brevity penalty to discourage outputs that are too short
- Combine into a single score between 0 and 1
Example
Reference: "The cat is on the mat" Generated: "The cat sat on the mat" Unigram matches: "The", "cat", "on", "the", "mat" ā 5/6 words match Bigram matches: "The cat", "on the", "the mat" ā 3/5 bigrams match BLEU combines these n-gram precisions into one score.
Strengths and Weaknesses
- Strength: fast, automatic, reproducible, widely used as a baseline
- Weakness: purely lexical — synonyms score zero ("couch" vs "sofa"), ignores meaning and fluency
- Weakness: a factually wrong sentence can score high if it shares many words with the reference
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE measures overlap between generated text and reference text, but focuses on recall rather than precision. Widely used for text summarisation.
Variants
- ROUGE-1 — unigram (single word) overlap between generated and reference
- ROUGE-2 — bigram overlap — captures phrase-level similarity
- ROUGE-L — longest common subsequence (LCS) — captures sentence-level structure without requiring consecutive matches
Example
Reference: "The cat is sitting on the mat"
Generated: "The cat is on the mat"
ROUGE-1 (unigram recall):
Reference words: {The, cat, is, sitting, on, the, mat} → 7 words
Matched in generated: {The, cat, is, on, the, mat} → 6 matches
ROUGE-1 Recall = 6/7 = 0.857

ROUGE-2 (bigram recall):
Reference bigrams: {The cat, cat is, is sitting, sitting on, on the, the mat}
Matched: {The cat, cat is, on the, the mat} → 4/6
ROUGE-2 Recall = 4/6 = 0.667
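The recall calculations above can be sketched as clipped n-gram overlap divided by the number of reference n-grams (the recall-oriented denominator). The `rouge_n_recall` helper is an assumption for illustration, not a library API:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference, generated, n):
    ref_counts = Counter(ngrams(reference, n))
    gen_counts = Counter(ngrams(generated, n))
    # Overlap is clipped, then divided by the *reference* n-gram count.
    overlap = sum(min(c, gen_counts[g]) for g, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "the cat is sitting on the mat".split()
gen = "the cat is on the mat".split()

print(round(rouge_n_recall(ref, gen, 1), 3))  # 0.857 (6/7)
print(round(rouge_n_recall(ref, gen, 2), 3))  # 0.667 (4/6)
```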
BLEU vs ROUGE
| | BLEU | ROUGE |
|---|---|---|
| Orientation | Precision-oriented | Recall-oriented |
| Question | "How much of the output is in the reference?" | "How much of the reference is captured in the output?" |
| Used for | Translation | Summarisation |
Perplexity
Perplexity measures how well a language model predicts a sequence of text. It answers: "How surprised is the model by this text?"
Intuition
Perplexity = 2^(cross-entropy loss)

- Lower perplexity → model is less "surprised" → better predictions
- Higher perplexity → model is more "confused" → worse predictions

Example: Perplexity = 10 means the model is as uncertain as choosing uniformly among 10 options at each token position.
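The relationship can be sketched directly from the formula, using the base-2 convention stated above. The token probabilities below are invented values standing in for "the probability the model assigned to the token that actually occurred":

```python
import math

def perplexity(probs):
    # Cross-entropy in bits (base-2 log), then exponentiate back.
    cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** cross_entropy

# A model that assigns 1/10 to every actual token is exactly as
# uncertain as a uniform choice among 10 options:
print(round(perplexity([0.1] * 5), 6))   # 10.0

# More confident predictions give lower perplexity:
print(round(perplexity([0.9, 0.8, 0.7]), 3))  # ~1.257
```

Note that perplexity is equivalently the inverse geometric mean of the assigned probabilities, which is why the base of the logarithm cancels out as long as the same base is used for the exponentiation.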
What It Tells You
- Good for comparing language models on the same test set — lower is better
- Measures how well the model has learned the language patterns in the data
Limitations
- A model with low perplexity can still hallucinate fluent but false text
- Does not measure factual correctness, usefulness, or coherence
- Cannot compare perplexity across models with different vocabularies or tokenisers
Word Error Rate (WER)
WER is the standard metric for speech recognition (ASR) and is also used for any task that compares a generated word sequence to a reference.
Formula
WER = (Substitutions + Insertions + Deletions) / Total words in reference

- S = words that were replaced with a wrong word
- I = extra words added that aren't in the reference
- D = words from the reference that were missed
Example
Reference: "the cat sat on the mat" (6 words) Predicted: "the cat sit on a mat" Differences: "sat" ā "sit" (1 Substitution) "the" ā "a" (1 Substitution) WER = (2 + 0 + 0) / 6 = 2/6 = 0.333 (33.3%)
Properties
- Lower is better — WER = 0% means perfect transcription
- WER can exceed 100% if there are many insertions
- Does not account for severity — replacing "not" with "now" is counted the same as replacing "the" with "a"
When to use: speech-to-text evaluation, OCR quality measurement, or any sequence comparison where edit distance matters.
Summary
Quick Reference Table
| Metric | Used For | What It Measures | Key Limitation |
|---|---|---|---|
| Accuracy | Classification | Overall correctness | Misleading on imbalanced data |
| Precision | Classification | How many predicted positives are correct | Ignores missed positives (FN) |
| Recall | Classification | How many actual positives were found | Ignores false alarms (FP) |
| F1 | Classification | Balance of precision and recall | Ignores true negatives |
| BLEU | Translation | N-gram precision vs reference | No semantic understanding |
| ROUGE | Summarisation | N-gram recall vs reference | No semantic understanding |
| Perplexity | Language modelling | How well the model predicts text | Doesn't measure factual correctness |
| WER | Speech recognition | Edit distance between predicted and reference | All errors weighted equally |