
Handling Imbalanced Datasets in Machine Learning #

Imbalanced data is one of the most common problems in classification tasks. You build a fraud detection model. It reports 99% accuracy. You celebrate. Then you realize: only 1% of transactions are fraud. Your model just predicted “Not Fraud” for everything. It learned nothing. This is the hidden trap of imbalanced datasets.

What is an Imbalanced Dataset? #

Simply put: one class has way more examples than the other.

Real life examples:

  • Fraud detection: 99% normal, 1% fraud
  • Disease detection: 98% healthy, 2% sick
  • Rare event prediction: 99.9% no event, 0.1% event

The problem: Your model becomes lazy. It learns to predict the majority class every time. Accuracy looks great. But the model is useless.


Why Accuracy Lies to You #

Imagine 1000 transactions. 990 normal. 10 fraud.

A dumb model that always says “normal” gets 99% accuracy. But it catches ZERO fraud.

Better metrics for imbalanced data:

  • Precision: Of the ones you called fraud, how many were actually fraud?
  • Recall: Of all actual fraud, how many did you catch?
  • F1 Score: Balance between precision and recall

These tell the real story. Not accuracy.
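The dumb-model trap above is easy to demonstrate with scikit-learn's metric functions (the labels below recreate the 990/10 example):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 990 normal (0) and 10 fraud (1); a "dumb" model predicts 0 for everything
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))                    # 0.99 -- looks great
print(recall_score(y_true, y_pred))                      # 0.0  -- catches zero fraud
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

Accuracy says 99%; recall and F1 say the model is useless. That is the whole argument in four lines.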

Three Ways to Fix Imbalanced Data #

Method        | One-Line Explanation
class_weight  | Tell the model “pay more attention to the rare class”
Undersampling | Throw away some majority examples
SMOTE         | Create fake minority examples

Method 1: class_weight (Easiest) #

You just add one line of code. The model automatically cares more about the minority class.

How it thinks: Missing a fraud counts roughly 99x more than wrongly flagging a normal transaction, because 'balanced' sets each class weight inversely proportional to its frequency.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced')
model.fit(X, y)

When to use: Almost always. Start here. It is the easiest and often works well.

Downside: Does not create new data. Only changes how model learns.
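You can see exactly what 'balanced' does by computing the weights yourself with scikit-learn's helper (labels mirror the 990/10 running example):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 990 normal (0) and 10 fraud (1), as in the running example
y = np.array([0] * 990 + [1] * 10)

# 'balanced' weight = n_samples / (n_classes * class_count)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)  # the fraud weight ends up 99x the normal weight
```

With a 99:1 split, the minority weight works out to 50.0 versus about 0.505 for the majority, a 99x ratio.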

Method 2: Undersampling (Delete Data) #

You randomly delete majority examples until both classes are equal.

Example: 990 normal, 10 fraud → Delete 980 normal → Keep 10 normal, 10 fraud

import pandas as pd
from sklearn.utils import resample

# Keep all fraud rows; sample the same number of normal rows without replacement
normal_under = resample(normal, replace=False, n_samples=len(fraud), random_state=42)
balanced = pd.concat([normal_under, fraud])

When to use: When you have millions of examples. Deleting some is fine.

Downside: You lose data. Model sees less variety.


Method 3: SMOTE (Create Fake Data) #

You create brand new synthetic minority examples.

How it works: Take two real fraud examples. Draw a line between them. Pick a random point on that line. That is your new fake fraud example.

Example: 10 fraud → Create 980 fake fraud → Now 990 fraud, 990 normal

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)

When to use: Small dataset. Cannot afford to lose data.

Downside: Can create unrealistic examples if not careful.
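The “draw a line between two points” idea can be sketched by hand with NumPy. The two feature vectors below are made up purely for illustration:

```python
import numpy as np

# Two hypothetical minority-class (fraud) feature vectors
a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])

rng = np.random.default_rng(42)
lam = rng.uniform(0, 1)           # random position along the line a -> b
synthetic = a + lam * (b - a)     # the new synthetic fraud example
print(synthetic)
```

Because lam is between 0 and 1, the synthetic point always lies between its two parents, feature by feature. This is also why SMOTE can go wrong: if the two parents straddle a region where real examples never occur, the interpolated point is unrealistic.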


Which One Should You Choose? #

Your Situation               | Best Method
You want quick and easy      | class_weight
You have millions of examples| Undersampling
You have very little data    | SMOTE
You are using XGBoost        | scale_pos_weight (built-in)
You are using a neural network | SMOTE + class_weight

Simple rule: Start with class_weight. If that fails, try SMOTE. Use undersampling only for huge datasets.

Complete Picture #

Problem: Detect fraud. 10,000 transactions. 100 fraud (1%).

Step 1: Train model with no balancing.

  • Result: 99% accuracy. Catches 5% of fraud. ❌ Fails.

Step 2: Add class_weight='balanced'

  • Result: Catches 60% of fraud. ✅ Much better.

Step 3: Try SMOTE

  • Result: Catches 75% of fraud. ✅ Even better.

Step 4: Combine SMOTE + class_weight

  • Result: Catches 78% of fraud. ✅ Best.

Important Note #

Balancing data does not guarantee success. If fraud looks exactly like normal transactions, no method will work. You need good features first.

Quick Quiz #

Q1: Your model has 98% accuracy but catches only 10% of fraud. What is wrong?

A1: Imbalanced dataset. Accuracy is lying. Check precision and recall instead.

Q2: You have 1 million normal and 1000 fraud. Which method is fastest?

A2: Undersampling. Delete 999,000 normal. Keep 1000 each.

Q3: You have 500 normal and 500 fraud. Is your dataset balanced?

A3: Yes. Perfect balance 1:1. No need for these methods.

Q4: Which method creates brand new fake data?

A4: SMOTE. It creates synthetic minority examples.
