Beginner
Start with the confusion matrix
Every classification metric is derived from four counts. Before looking at any formula, understand what those four counts mean for your specific task.
|  | Predicted POSITIVE | Predicted NEGATIVE |
|---|---|---|
| **Actual POSITIVE** | TP (hit) | FN (miss) |
| **Actual NEGATIVE** | FP (false alarm) | TN (correct rejection) |

- TP = True Positive: model said yes, answer is yes
- FP = False Positive: model said yes, answer is no (Type I error)
- FN = False Negative: model said no, answer is yes (Type II error)
- TN = True Negative: model said no, answer is no
A concrete example: a spam classifier checks 1000 emails. 100 are actually spam.
Model result:
- Correctly flagged as spam (TP) = 80
- Missed spam, landed in inbox (FN) = 20
- Legitimate email flagged (FP) = 10
- Legitimate email passed (TN) = 890
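As a sketch, the four counts can be tallied directly from label lists. The lists below are synthetic, constructed purely to reproduce the spam example above (1000 emails, 100 actually spam):

```python
# Synthetic labels matching the spam example: 1 = spam, 0 = legitimate.
y_true = [1] * 100 + [0] * 900
# First 100 predictions cover the real spam (80 caught, 20 missed),
# the remaining 900 cover legitimate mail (10 false alarms, 890 passed).
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 890

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fn, fp, tn)  # 80 20 10 890
```

Every metric in the rest of this section is some ratio of these four numbers.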
Accuracy
Question answered: "Out of everything, how often was the model correct?"
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Spam example: (80 + 890) / 1000 = 97.0%
97% looks impressive. But a model that never flags anything at all would score (0 + 900) / 1000 = 90% without catching a single spam email. Accuracy misleads badly when classes are imbalanced.
When accuracy is useful: classes are roughly balanced and all mistake types cost about the same (e.g., handwritten digit recognition where each digit is equally common).
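The do-nothing pitfall is easy to reproduce in a few lines of plain Python (labels synthetic, matching the 10%-spam scenario above):

```python
# A model that never flags anything still scores 90% accuracy here.
y_true = [1] * 100 + [0] * 900   # 10% positive class
never_flag = [0] * 1000          # predicts "not spam" for everything

correct = sum(1 for t, p in zip(y_true, never_flag) if t == p)
accuracy = correct / len(y_true)
print(accuracy)  # 0.9 -- yet it caught zero spam
```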
Precision
Question answered: "When the model says positive, how often is it right?"
Precision = TP / (TP + FP)
Spam example: 80 / (80 + 10) = 88.9%
Precision measures the cost of false alarms. A low-precision spam filter moves many legitimate emails to the spam folder. Precision is the metric to optimize when false positives are expensive: wrongly blocking a transaction, wrongly flagging a user for review, or wrongly rejecting a job application.
Recall (True Positive Rate)
Question answered: "Of all the actual positives, how many did the model find?"
Recall = TP / (TP + FN) (also called Sensitivity or True Positive Rate)
Spam example: 80 / (80 + 20) = 80.0%
Recall measures the cost of misses. A low-recall disease screener fails to flag sick patients. Recall is the metric to optimize when false negatives are expensive: missing a cancer case, letting fraud through, failing to detect an intrusion.
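Both ratios fall straight out of the spam example's counts; a minimal check:

```python
# Counts from the spam example: 80 caught, 10 false alarms, 20 missed.
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # of everything flagged, how much was spam?
recall = tp / (tp + fn)      # of all the spam, how much was flagged?

print(round(precision, 3), round(recall, 3))  # 0.889 0.8
```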
The precision-recall trade-off
Adjusting the decision threshold moves precision and recall in opposite directions. Raising the threshold (only flag when very confident) increases precision and reduces recall. Lowering it catches more positives (higher recall) but also raises more false alarms (lower precision).
Higher threshold -> fewer predictions flagged -> precision UP, recall DOWN
Lower threshold -> more predictions flagged -> precision DOWN, recall UP

You cannot maximize both simultaneously without improving the model itself.
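The trade-off can be seen by sweeping a threshold over model scores. The scores below are made up for illustration, not from any real model:

```python
# (score, true_label) pairs from a hypothetical classifier; 1 = positive.
data = [(0.95, 1), (0.90, 1), (0.85, 0), (0.70, 1), (0.60, 0),
        (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]

def pr_at(threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in data if s >= threshold and y == 1)
    fp = sum(1 for s, y in data if s >= threshold and y == 0)
    fn = sum(1 for s, y in data if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.9, 0.5, 0.1):
    p, r = pr_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold drops from 0.9 to 0.1, recall climbs from 0.40 to 1.00 while precision falls from 1.00 to 0.50: the two metrics move in opposite directions, exactly as described above.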
F1 score
Question answered: "Is there a single number that balances precision and recall?"
F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * TP / (2 * TP + FP + FN)
Spam example: 2 * (0.889 * 0.800) / (0.889 + 0.800) = 84.2%
F1 is the harmonic mean of precision and recall. The harmonic mean punishes extreme imbalances: a model with 100% precision and 1% recall gets an F1 of only 2%, not 50%. This makes F1 a tougher and fairer summary than a simple average.
When F1 is useful: you care about false positives and false negatives roughly equally, and the positive class is rare enough that accuracy would mislead (e.g., named entity recognition, information extraction).
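The harmonic-mean penalty is easy to verify numerically:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance: 100% precision, 1% recall.
print(round(f1(1.0, 0.01), 3))    # 0.02 -- not the 0.505 an arithmetic mean gives
# The spam example from above.
print(round(f1(0.889, 0.800), 3)) # 0.842
```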
Side-by-side comparison
| Metric | Question answered | Formula | What it penalizes | Use when |
|---|---|---|---|---|
| Accuracy | Out of everything, how often was the model correct? | (TP + TN) / total | Any wrong prediction | Classes are balanced and all errors cost the same |
| Precision | When the model says positive, how often is it right? | TP / (TP + FP) | False positives (false alarms) | False alarms are costly (spam filter, fraud flag, content moderation) |
| Recall | Of all the actual positives, how many did the model find? | TP / (TP + FN) | False negatives (misses) | Missing a case is costly (disease screening, safety detection, fraud detection) |
| F1 | Is there a single number that balances precision and recall? | 2·P·R / (P + R) | Imbalance between precision and recall | Both error types matter and the positive class is rare |
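All four metrics are also available off the shelf. A sketch using scikit-learn (assumed installed) on the same synthetic spam labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic labels reproducing the spam example's confusion matrix.
y_true = [1] * 100 + [0] * 900
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 890

print("accuracy :", accuracy_score(y_true, y_pred))              # 0.97
print("precision:", round(precision_score(y_true, y_pred), 3))   # 0.889
print("recall   :", recall_score(y_true, y_pred))                # 0.8
print("f1       :", round(f1_score(y_true, y_pred), 3))          # 0.842
```

Matching the hand calculations is a good sanity check before trusting a library's defaults on a multi-class problem, where averaging choices (macro, micro, weighted) start to matter.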
To-do list
Learn
- Memorize TP, FP, FN, TN and what each cell of the confusion matrix represents.
- Derive accuracy, precision, recall, and F1 from those four counts without looking them up.
- Understand why accuracy can be 99% on a dataset where the model has learned nothing useful.
- Understand the precision-recall trade-off and how the classification threshold controls it.
- Learn the difference between macro, micro, and weighted F1 and when each is appropriate.
Practice
- Given a confusion matrix with specific TP, FP, FN, TN counts, compute all four metrics by hand.
- Construct a 1000-sample imbalanced dataset (5% positive), predict all-negative, and compute each metric to see which ones are misleading.
- Use scikit-learn's `classification_report` on a real dataset and explain every number in the output.
- Plot a precision-recall curve, pick a threshold that achieves at least 90% recall, and state what precision that yields.
- Compare macro and micro F1 on a three-class problem where one class has 10x more samples.
Build
- Write a function that takes `y_true` and `y_pred` and prints a formatted report with all four metrics plus the confusion matrix.
- Train a binary classifier on an imbalanced dataset (e.g., credit card fraud), evaluate with accuracy, then switch to F1 and precision/recall and compare conclusions.
- Build a threshold-selection tool: for a given FP-to-FN cost ratio, find the threshold on the PR curve that minimizes total cost.
- Add per-class metric breakdowns to an existing evaluation pipeline and identify which class is the weakest link.