
Evaluation Metrics for Classification

“You built a cancer detection model. It reports 99% accuracy. Sounds great. But what if only 1% of patients actually have cancer? Your model could predict ‘no cancer’ for everyone and still be 99% accurate. And it would kill people. Accuracy lied. You need better metrics.”

The Confusion Matrix (The Foundation) #

Before metrics, understand this 2×2 table.

| | Predicted: YES | Predicted: NO |
|---|---|---|
| **Actual: YES** | True Positive (TP) ✅ | False Negative (FN) ❌ |
| **Actual: NO** | False Positive (FP) ❌ | True Negative (TN) ✅ |

Simple meanings:

| Term | Meaning | Example (Cancer Detection) |
|---|---|---|
| True Positive | Correctly predicted YES | Correctly said “has cancer” |
| True Negative | Correctly predicted NO | Correctly said “no cancer” |
| False Positive | Wrongly predicted YES (Type I error) | Said “has cancer” but healthy |
| False Negative | Wrongly predicted NO (Type II error) | Said “no cancer” but has it |
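
To make the four cells concrete, here is a minimal sketch that counts them by hand from made-up labels (the arrays are illustrative, not real patient data):

```python
# Toy labels: 1 = has cancer, 0 = healthy (illustrative values)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each cell of the 2x2 confusion matrix
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly said "has cancer"
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # correctly said "no cancer"
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # said "has cancer" but healthy
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # said "no cancer" but has it

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```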

The Four Main Metrics #

| Metric | What It Measures | Formula | Best For |
|---|---|---|---|
| Accuracy | Overall correctness | (TP+TN)/Total | Balanced classes |
| Precision | Trust positive predictions | TP/(TP+FP) | When false positives are costly |
| Recall | Catch all positives | TP/(TP+FN) | When false negatives are costly |
| F1 Score | Balance of precision and recall | 2×(P×R)/(P+R) | Imbalanced classes |

1. Accuracy (The Liar) #

What it is: Proportion of correct predictions (both YES and NO).

The Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example (Cancer Detection):

  • 990 healthy, 10 sick
  • Model predicts “healthy” for everyone
| | Predicted Healthy | Predicted Sick |
|---|---|---|
| Actually Healthy | 990 (TN) | 0 (FP) |
| Actually Sick | 10 (FN) | 0 (TP) |

Accuracy = (990 + 0) / 1000 = 99%

Problem: Model caught ZERO sick patients. But accuracy looks amazing.
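
You can reproduce the trap in a few lines; a minimal sketch assuming the same 990/10 split:

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 990 + [1] * 10   # 990 healthy, 10 sick
y_pred = [0] * 1000             # model predicts "healthy" for everyone

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks amazing
print(recall_score(y_true, y_pred))    # 0.0  -- caught zero sick patients
```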

When to use Accuracy:

  • Classes are balanced (50% Yes, 50% No)
  • False positives and false negatives have the same cost

When NOT to use Accuracy:

  • Imbalanced classes (fraud, disease, rare events)
  • Different costs for different errors

2. Precision (When You Say YES, Are You Right?) #

What it is: Of all the times you predicted YES, how many were actually YES?

The Formula:

Precision = TP / (TP + FP)

The Question: “When my model says something is positive, can I trust it?”

Example (Cancer Detection):

Model makes 100 positive predictions. 90 are correct. 10 are wrong.

Precision = 90 / 100 = 90%

Interpretation: “When my model says you have cancer, it is correct 90% of the time.”

When to care about Precision:

| Scenario | Why Precision Matters |
|---|---|
| Spam detection | Marking good email as spam angers users |
| Recommended videos | Showing bad recommendations loses trust |
| Hiring tool | Rejecting good candidates hurts the company |
| Fraud alert | False alarms annoy customers |

High Precision = Few false positives.
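
The same arithmetic as a quick sketch, using the counts from the example above:

```python
tp, fp = 90, 10                # 100 positive predictions, 90 correct
precision = tp / (tp + fp)
print(f"{precision:.0%}")      # 90%
```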


3. Recall (Did You Catch All the YES?) #

What it is: Of all the actual YES cases, how many did you catch?

The Formula:

Recall = TP / (TP + FN)

The Question: “Did my model miss any real positives?”

Example (Cancer Detection):

There are 100 sick patients. Model catches 80. Misses 20.

Recall = 80 / 100 = 80%

Interpretation: “My model catches 80% of all cancer cases.”

When to care about Recall:

| Scenario | Why Recall Matters |
|---|---|
| Cancer detection | Missing cancer kills people |
| Airport security | Missing a weapon is a disaster |
| Fraud detection | Missing fraud loses money |
| Self-driving car | Missing a pedestrian causes an accident |

High Recall = Few false negatives.
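
Again as a quick sketch, using the counts from the example above:

```python
tp, fn = 80, 20                # 100 sick patients, 80 caught
recall = tp / (tp + fn)
print(f"{recall:.0%}")         # 80%
```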


4. F1 Score (The Balance Keeper) #

What it is: Harmonic mean of precision and recall. Balances both.

The Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why Harmonic Mean? Regular average lies. Harmonic mean punishes extreme differences.

Example:

| Model | Precision | Recall | Regular Average | F1 Score |
|---|---|---|---|---|
| High Precision, Low Recall | 99% | 50% | 74.5% | 66% |
| Balanced | 80% | 80% | 80% | 80% |

Regular average says 74.5% for the unbalanced model. That is misleading. F1 correctly shows it is worse (66%).
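
A short sketch comparing the two averages on the table’s numbers:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

for p, r in [(0.99, 0.50), (0.80, 0.80)]:
    print(f"P={p:.0%} R={r:.0%}  average={(p + r) / 2:.1%}  F1={f1(p, r):.1%}")
# P=99% R=50%  average=74.5%  F1=66.4%
# P=80% R=80%  average=80.0%  F1=80.0%
```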

Interpretation: “My model balances catching positives and being correct when it does.”

When to use F1 Score:

  • Imbalanced datasets (prefer F1 over accuracy)
  • When you care about both precision and recall
  • When you need one number to compare models

The Precision-Recall Trade-off #

You can rarely have both high precision and high recall at once: pushing one up usually pulls the other down. Choose the balance based on your problem.

| If You Increase | Precision Does | Recall Does |
|---|---|---|
| Threshold for positive | ↑ Increases | ↓ Decreases |
| Model complexity | ↓ Decreases | ↑ Increases |

Real Example: Cancer Detection

| Strategy | Precision | Recall | Result |
|---|---|---|---|
| Aggressive (call many sick) | Low (50%) | High (95%) | Many false alarms, but catches most cancers |
| Conservative (only sure cases) | High (99%) | Low (60%) | Few false alarms, but misses many cancers |

Which is better? Depends on cost.

  • Missing cancer (low recall) = patient dies → High recall needed
  • False alarm (low precision) = patient stressed but alive → Lower priority
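
To watch the trade-off move, the sketch below sweeps a decision threshold over synthetic risk scores; the class sizes and score distributions are made up for illustration. Raising the threshold buys precision at the cost of recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Synthetic data: sick patients tend to get higher risk scores (assumed distributions)
y_true = np.array([0] * 900 + [1] * 100)
scores = np.concatenate([rng.normal(0.3, 0.15, 900),   # healthy
                         rng.normal(0.7, 0.15, 100)])  # sick

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# Low threshold  -> aggressive: high recall, low precision
# High threshold -> conservative: high precision, lower recall
```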

The Trade-off Summary Table #

| Metric | What It Rewards | What It Ignores |
|---|---|---|
| Accuracy | Getting both right | Class imbalance |
| Precision | Being right when you say YES | Missing actual YES cases |
| Recall | Catching all YES cases | Being wrong sometimes |
| F1 | Balance of both | Neither extreme |
Putting It All Together in Python #

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Actual and predicted labels
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 10 not spam (0)
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 10 spam (1)

y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 10 correct not spam
          1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 6 correct spam, 4 missed

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
print("\nConfusion Matrix:")
print(f"TN: {cm[0,0]}, FP: {cm[0,1]}")
print(f"FN: {cm[1,0]}, TP: {cm[1,1]}")
```

Output:

```
Accuracy:  0.80
Precision: 1.00
Recall:    0.60
F1 Score:  0.75

Confusion Matrix:
TN: 10, FP: 0
FN: 4, TP: 6
```

Quick Quiz #

Q1: Your model has 99% accuracy but 1% recall. What is happening?

A1: Severe class imbalance. Model predicts majority class only. Catches almost no positives.

Q2: You are building a fraud detection system. False alarms annoy customers. Missing fraud loses money. Which is more costly?

A2: Depends on the business. Usually missing fraud (low recall) costs more, but both matter, so use F1.

Q3: Your precision is 50%, recall is 100%. What does this mean?

A3: You catch every positive (recall=100%). But half your positive predictions are wrong (precision=50%). You are over-predicting.

Q4: Precision = 90%, Recall = 90%. What is F1?

A4: F1 = 90% (same as both). When precision = recall, F1 equals that value.

Q5: Why not use accuracy for cancer detection?

A5: Cancer is rare (imbalanced). A model predicting “no cancer” for everyone gets high accuracy but kills patients.
