
Evaluation Metrics for Classification

“You built a cancer detection model. It reports 99% accuracy. Sounds great. But what if only 1% of patients actually have cancer? Your model could predict ‘no cancer’ for everyone and still be 99% accurate. And it would kill people. Accuracy lied. You need better metrics.”

The Confusion Matrix (The Foundation) #

Before metrics, understand this 2×2 table.

| | Predicted: YES | Predicted: NO |
|---|---|---|
| **Actual: YES** | True Positive (TP) ✅ | False Negative (FN) ❌ |
| **Actual: NO** | False Positive (FP) ❌ | True Negative (TN) ✅ |

Simple meanings:

| Term | Meaning | Example (Cancer Detection) |
|---|---|---|
| True Positive | Correctly predicted YES | Correctly said “has cancer” |
| True Negative | Correctly predicted NO | Correctly said “no cancer” |
| False Positive | Wrongly predicted YES (Type I error) | Said “has cancer” but healthy |
| False Negative | Wrongly predicted NO (Type II error) | Said “no cancer” but has it |
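
To make the four cells concrete, here is a minimal sketch that counts them by hand from made-up labels (the arrays are illustrative, not real patient data):

```python
# Toy labels: 1 = has cancer, 0 = healthy (illustrative values)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each cell of the 2x2 confusion matrix
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly said "has cancer"
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # correctly said "no cancer"
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # said "has cancer" but healthy
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # said "no cancer" but has it

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```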

The Four Main Metrics #

| Metric | What It Measures | Formula | Best For |
|---|---|---|---|
| Accuracy | Overall correctness | (TP+TN)/Total | Balanced classes |
| Precision | Trust positive predictions | TP/(TP+FP) | When false positives are costly |
| Recall | Catch all positives | TP/(TP+FN) | When false negatives are costly |
| F1 Score | Balance of precision and recall | 2×(P×R)/(P+R) | Imbalanced classes |

1. Accuracy (The Liar) #

What it is: Proportion of correct predictions (both YES and NO).

The Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example (Cancer Detection):

  • 990 healthy, 10 sick
  • Model predicts “healthy” for everyone
| | Predicted Healthy | Predicted Sick |
|---|---|---|
| Actually Healthy | 990 (TN) | 0 (FP) |
| Actually Sick | 10 (FN) | 0 (TP) |

Accuracy = (990 + 0) / 1000 = 99%

Problem: Model caught ZERO sick patients. But accuracy looks amazing.
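
You can reproduce the trap in a few lines; a minimal sketch assuming the same 990/10 split:

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 990 + [1] * 10   # 990 healthy, 10 sick
y_pred = [0] * 1000             # model predicts "healthy" for everyone

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks amazing
print(recall_score(y_true, y_pred))    # 0.0  -- caught zero sick patients
```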

When to use Accuracy:

  • Classes are balanced (50% Yes, 50% No)
  • False positives and false negatives have the same cost

When NOT to use Accuracy:

  • Imbalanced classes (fraud, disease, rare events)
  • Different costs for different errors

2. Precision (When You Say YES, Are You Right?) #

What it is: Of all the times you predicted YES, how many were actually YES?

The Formula:

Precision = TP / (TP + FP)

The Question: “When my model says something is positive, can I trust it?”

Example (Cancer Detection):

Model makes 100 positive predictions. 90 are correct. 10 are wrong.

Precision = 90 / 100 = 90%

Interpretation: “When my model says you have cancer, it is correct 90% of the time.”

When to care about Precision:

| Scenario | Why Precision Matters |
|---|---|
| Spam detection | Marking good email as spam angers users |
| Recommended videos | Showing bad recommendations loses trust |
| Hiring tool | Rejecting good candidates hurts the company |
| Fraud alert | False alarms annoy customers |

High Precision = Few false positives.
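
The same arithmetic as a quick sketch, using the counts from the example above:

```python
tp, fp = 90, 10                # 100 positive predictions, 90 correct
precision = tp / (tp + fp)
print(f"{precision:.0%}")      # 90%
```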


3. Recall (Did You Catch All the YES?) #

What it is: Of all the actual YES cases, how many did you catch?

The Formula:

Recall = TP / (TP + FN)

The Question: “Did my model miss any real positives?”

Example (Cancer Detection):

There are 100 sick patients. Model catches 80. Misses 20.

Recall = 80 / 100 = 80%

Interpretation: “My model catches 80% of all cancer cases.”

When to care about Recall:

| Scenario | Why Recall Matters |
|---|---|
| Cancer detection | Missing cancer kills people |
| Airport security | Missing a weapon is a disaster |
| Fraud detection | Missing fraud loses money |
| Self-driving car | Missing a pedestrian causes an accident |

High Recall = Few false negatives.
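
Again as a quick sketch, using the counts from the example above:

```python
tp, fn = 80, 20                # 100 sick patients, 80 caught
recall = tp / (tp + fn)
print(f"{recall:.0%}")         # 80%
```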


4. F1 Score (The Balance Keeper) #

What it is: Harmonic mean of precision and recall. Balances both.

The Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why Harmonic Mean? Regular average lies. Harmonic mean punishes extreme differences.

Example:

| Model | Precision | Recall | Regular Average | F1 Score |
|---|---|---|---|---|
| High Precision, Low Recall | 99% | 50% | 74.5% | 66% |
| Balanced | 80% | 80% | 80% | 80% |

Regular average says 74.5% for the unbalanced model. That is misleading. F1 correctly shows it is worse (66%).
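
A short sketch comparing the two averages on the table’s numbers:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

for p, r in [(0.99, 0.50), (0.80, 0.80)]:
    print(f"P={p:.0%} R={r:.0%}  average={(p + r) / 2:.1%}  F1={f1(p, r):.1%}")
# P=99% R=50%  average=74.5%  F1=66.4%
# P=80% R=80%  average=80.0%  F1=80.0%
```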

Interpretation: “My model balances catching positives and being correct when it does.”

When to use F1 Score:

  • Imbalanced datasets (prefer F1 over accuracy)
  • When you care about both precision and recall
  • When you need one number to compare models

The Precision-Recall Trade-off #

You can rarely have both high precision and high recall at once: pushing one up usually pulls the other down. Choose the balance based on your problem.

| If You Increase | Precision Does | Recall Does |
|---|---|---|
| Threshold for positive | ↑ Increases | ↓ Decreases |
| Model complexity | ↓ Decreases | ↑ Increases |

Real Example: Cancer Detection

| Strategy | Precision | Recall | Result |
|---|---|---|---|
| Aggressive (call many sick) | Low (50%) | High (95%) | Many false alarms, but catches most cancers |
| Conservative (only sure cases) | High (99%) | Low (60%) | Few false alarms, but misses many cancers |

Which is better? Depends on cost.

  • Missing cancer (low recall) = patient dies → High recall needed
  • False alarm (low precision) = patient stressed but alive → Lower priority
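
To watch the trade-off move, the sketch below sweeps a decision threshold over synthetic risk scores; the class sizes and score distributions are made up for illustration. Raising the threshold buys precision at the cost of recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Synthetic data: sick patients tend to get higher risk scores (assumed distributions)
y_true = np.array([0] * 900 + [1] * 100)
scores = np.concatenate([rng.normal(0.3, 0.15, 900),   # healthy
                         rng.normal(0.7, 0.15, 100)])  # sick

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# Low threshold  -> aggressive: high recall, low precision
# High threshold -> conservative: high precision, lower recall
```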

The Trade-off Summary Table #

| Metric | What It Rewards | What It Ignores |
|---|---|---|
| Accuracy | Getting both right | Class imbalance |
| Precision | Being right when you say YES | Missing actual YES cases |
| Recall | Catching all YES cases | Being wrong sometimes |
| F1 | Balance of both | Neither extreme |
Putting It All Together in Python #

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Actual and predicted labels
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 10 not spam (0)
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 10 spam (1)

y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 10 correct not spam
          1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 6 correct spam, 4 missed

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
print("\nConfusion Matrix:")
print(f"TN: {cm[0,0]}, FP: {cm[0,1]}")
print(f"FN: {cm[1,0]}, TP: {cm[1,1]}")
```

Output:

```
Accuracy:  0.80
Precision: 1.00
Recall:    0.60
F1 Score:  0.75

Confusion Matrix:
TN: 10, FP: 0
FN: 4, TP: 6
```

Quick Quiz #

Q1: Your model has 99% accuracy but 1% recall. What is happening?

A1: Severe class imbalance. Model predicts majority class only. Catches almost no positives.

Q2: You are building a fraud detection system. False alarms annoy customers. Missing fraud loses money. Which is more costly?

A2: Depends on the business. Usually missing fraud (low recall) costs more, but both matter, so use F1.

Q3: Your precision is 50%, recall is 100%. What does this mean?

A3: You catch every positive (recall=100%). But half your positive predictions are wrong (precision=50%). You are over-predicting.

Q4: Precision = 90%, Recall = 90%. What is F1?

A4: F1 = 90% (same as both). When precision = recall, F1 equals that value.

Q5: Why not use accuracy for cancer detection?

A5: Cancer is rare (imbalanced). A model predicting “no cancer” for everyone gets high accuracy but kills patients.
