Beginner
Start with the confusion matrix
Every classification metric is derived from four counts. Before looking at any formula, understand what those four counts mean for your specific task.
|  | Predicted POSITIVE | Predicted NEGATIVE |
|---|---|---|
| **Actual POSITIVE** | TP (hit) | FN (miss) |
| **Actual NEGATIVE** | FP (false alarm) | TN (correct rejection) |

- TP = True Positive: model said yes, answer is yes
- FP = False Positive: model said yes, answer is no (Type I error)
- FN = False Negative: model said no, answer is yes (Type II error)
- TN = True Negative: model said no, answer is no
A concrete example: a spam classifier checks 1000 emails. 100 are actually spam.
Model result:
- Correctly flagged as spam (TP) = 80
- Missed spam, landed in inbox (FN) = 20
- Legitimate email flagged (FP) = 10
- Legitimate email passed (TN) = 890
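As a sketch, the four counts can be tallied directly from label lists. The lists below are synthetic, constructed purely to reproduce the spam example above (1000 emails, 100 actually spam):

```python
# Synthetic labels matching the spam example: 1 = spam, 0 = legitimate.
y_true = [1] * 100 + [0] * 900
# First 100 predictions cover the real spam (80 caught, 20 missed),
# the remaining 900 cover legitimate mail (10 false alarms, 890 passed).
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 890

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fn, fp, tn)  # 80 20 10 890
```

Every metric in the rest of this section is some ratio of these four numbers.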
Accuracy
Question answered: "Out of everything, how often was the model correct?"
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Spam example: (80 + 890) / 1000 = 97.0%
97% looks impressive. But a model that never flags anything at all would score (0 + 900) / 1000 = 90% without catching a single spam email. Accuracy misleads badly when classes are imbalanced.
When accuracy is useful: classes are roughly balanced and all mistake types cost about the same (e.g., handwritten digit recognition where each digit is equally common).
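The do-nothing pitfall is easy to reproduce in a few lines of plain Python (labels synthetic, matching the 10%-spam scenario above):

```python
# A model that never flags anything still scores 90% accuracy here.
y_true = [1] * 100 + [0] * 900   # 10% positive class
never_flag = [0] * 1000          # predicts "not spam" for everything

correct = sum(1 for t, p in zip(y_true, never_flag) if t == p)
accuracy = correct / len(y_true)
print(accuracy)  # 0.9 -- yet it caught zero spam
```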
Precision
Question answered: "When the model says positive, how often is it right?"
Precision = TP / (TP + FP)
Spam example: 80 / (80 + 10) = 88.9%
Precision measures the cost of false alarms. A low-precision spam filter moves many legitimate emails to the spam folder. Precision is the metric to optimize when false positives are expensive: wrongly blocking a transaction, wrongly flagging a user for review, or wrongly rejecting a job application.
Recall (True Positive Rate)
Question answered: "Of all the actual positives, how many did the model find?"
Recall = TP / (TP + FN) (also called Sensitivity or True Positive Rate)
Spam example: 80 / (80 + 20) = 80.0%
Recall measures the cost of misses. A low-recall disease screener fails to flag sick patients. Recall is the metric to optimize when false negatives are expensive: missing a cancer case, letting fraud through, failing to detect an intrusion.
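Both ratios fall straight out of the spam example's counts; a minimal check:

```python
# Counts from the spam example: 80 caught, 10 false alarms, 20 missed.
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # of everything flagged, how much was spam?
recall = tp / (tp + fn)      # of all the spam, how much was flagged?

print(round(precision, 3), round(recall, 3))  # 0.889 0.8
```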
The precision-recall trade-off
Adjusting the decision threshold moves precision and recall in opposite directions. Raising the threshold (only flag when very confident) increases precision and reduces recall. Lowering it catches more positives (higher recall) but also raises more false alarms (lower precision).
Higher threshold -> fewer predictions flagged -> precision UP, recall DOWN
Lower threshold -> more predictions flagged -> precision DOWN, recall UP

You cannot maximize both simultaneously without improving the model itself.
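The trade-off can be seen by sweeping a threshold over model scores. The scores below are made up for illustration, not from any real model:

```python
# (score, true_label) pairs from a hypothetical classifier; 1 = positive.
data = [(0.95, 1), (0.90, 1), (0.85, 0), (0.70, 1), (0.60, 0),
        (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]

def pr_at(threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in data if s >= threshold and y == 1)
    fp = sum(1 for s, y in data if s >= threshold and y == 0)
    fn = sum(1 for s, y in data if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.9, 0.5, 0.1):
    p, r = pr_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold drops from 0.9 to 0.1, recall climbs from 0.40 to 1.00 while precision falls from 1.00 to 0.50: the two metrics move in opposite directions, exactly as described above.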
F1 score
Question answered: "Is there a single number that balances precision and recall?"
F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * TP / (2 * TP + FP + FN)
Spam example: 2 * (0.889 * 0.800) / (0.889 + 0.800) = 84.2%
F1 is the harmonic mean of precision and recall. The harmonic mean punishes extreme imbalances: a model with 100% precision and 1% recall gets an F1 of only 2%, not 50%. This makes F1 a tougher and fairer summary than a simple average.
When F1 is useful: you care about false positives and false negatives roughly equally, and the positive class is rare enough that accuracy would mislead (e.g., named entity recognition, information extraction).
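The harmonic-mean penalty is easy to verify numerically:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance: 100% precision, 1% recall.
print(round(f1(1.0, 0.01), 3))    # 0.02 -- not the 0.505 an arithmetic mean gives
# The spam example from above.
print(round(f1(0.889, 0.800), 3)) # 0.842
```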
Side-by-side comparison
| Metric | Question answered | Formula | What it penalizes | Use when |
|---|---|---|---|---|
| Accuracy | Out of everything, how often was the model correct? | (TP + TN) / total | Any wrong prediction | Classes are balanced and all errors cost the same |
| Precision | When the model says positive, how often is it right? | TP / (TP + FP) | False positives (false alarms) | False alarms are costly (spam filter, fraud flag, content moderation) |
| Recall | Of all the actual positives, how many did the model find? | TP / (TP + FN) | False negatives (misses) | Missing a case is costly (disease screening, safety detection, fraud detection) |
| F1 | Is there a single number that balances precision and recall? | 2·P·R / (P + R) | Imbalance between precision and recall | Both error types matter and the positive class is rare |
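All four metrics are also available off the shelf. A sketch using scikit-learn (assumed installed) on the same synthetic spam labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic labels reproducing the spam example's confusion matrix.
y_true = [1] * 100 + [0] * 900
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 890

print("accuracy :", accuracy_score(y_true, y_pred))              # 0.97
print("precision:", round(precision_score(y_true, y_pred), 3))   # 0.889
print("recall   :", recall_score(y_true, y_pred))                # 0.8
print("f1       :", round(f1_score(y_true, y_pred), 3))          # 0.842
```

Matching the hand calculations is a good sanity check before trusting a library's defaults on a multi-class problem, where averaging choices (macro, micro, weighted) start to matter.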
To-do list
Learn
- Memorize TP, FP, FN, TN and what each cell of the confusion matrix represents.
- Derive accuracy, precision, recall, and F1 from those four counts without looking them up.
- Understand why accuracy can be 99% on a dataset where the model has learned nothing useful.
- Understand the precision-recall trade-off and how the classification threshold controls it.
- Learn the difference between macro, micro, and weighted F1 and when each is appropriate.
Practice
- Given a confusion matrix with specific TP, FP, FN, TN counts, compute all four metrics by hand.
- Construct a 1000-sample imbalanced dataset (5% positive), predict all-negative, and compute each metric to see which ones are misleading.
- Use scikit-learn's `classification_report` on a real dataset and explain every number in the output.
- Plot a precision-recall curve, pick a threshold that achieves at least 90% recall, and state what precision that yields.
- Compare macro and micro F1 on a three-class problem where one class has 10x more samples.
Build
- Write a function that takes `y_true` and `y_pred` and prints a formatted report with all four metrics plus the confusion matrix.
- Train a binary classifier on an imbalanced dataset (e.g., credit card fraud), evaluate with accuracy, then switch to F1 and precision/recall and compare conclusions.
- Build a threshold-selection tool: for a given FP-to-FN cost ratio, find the threshold on the PR curve that minimizes total cost.
- Add per-class metric breakdowns to an existing evaluation pipeline and identify which class is the weakest link.