
Encoding Categorical Variables #

Encoding categorical variables is an essential step in machine learning because models only understand numbers. Your data may contain values like "Red," "Blue," "Paris," or "London," but most algorithms cannot process text directly. Encoding converts these categories into numerical values so models can learn patterns from them accurately.

What Are Categorical Variables? #

Definition: Features that represent categories or groups, not numbers.

| Type | Example | Values |
| --- | --- | --- |
| Nominal | Colors | Red, Blue, Green (no order) |
| Ordinal | Education Level | High School, Bachelor, Master, PhD (has order) |
| Binary | Yes/No | Yes, No |

The Problem: Models cannot understand “Red” or “Paris”. They need numbers.

The Solution: Convert categories to numbers.


Three Main Encoding Methods #

| Method | How it works | Best for |
| --- | --- | --- |
| Label Encoding | Assigns a number to each category (0, 1, 2, 3…) | Ordinal categories (has order) |
| One-Hot Encoding | Creates binary columns for each category | Nominal categories (no order) |
| Target Encoding | Replaces category with the mean of the target | High cardinality (many unique values) |

1. Label Encoding #

What it does: Assigns a unique integer to each category.

Example:

| City | Label Encoded |
| --- | --- |
| Paris | 0 |
| London | 1 |
| Tokyo | 2 |
| New York | 3 |

The Formula:

No formula. Just mapping: Category → Integer

Code:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])

# Note: LabelEncoder assigns integers in alphabetical order, not
# appearance order: London→0, New York→1, Paris→2, Tokyo→3
```

The Problem (Very Important):

Model sees: London (1) is between Paris (0) and Tokyo (2). It thinks London is “halfway” between them. It thinks New York (3) is “greater than” Tokyo (2).

This is wrong. Cities have no order.
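The false order can be seen directly with a few lines of NumPy. This is a sketch using the hypothetical city codes from the table above; a distance-based model treats the integer codes as magnitudes, while one-hot vectors keep every pair of cities equidistant:

```python
import numpy as np

# Hypothetical label codes matching the table: Paris=0, London=1, Tokyo=2, New York=3
paris, london, tokyo, new_york = 0, 1, 2, 3

# A distance-based model treats the codes as magnitudes:
print(abs(paris - london))    # 1 -> "Paris is close to London"
print(abs(paris - new_york))  # 3 -> "Paris is far from New York"

# With one-hot vectors, every pair of distinct cities is equally far apart:
onehot = np.eye(4)
print(np.linalg.norm(onehot[0] - onehot[1]))  # sqrt(2), same for any distinct pair
print(np.linalg.norm(onehot[0] - onehot[3]))  # sqrt(2)
```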

When to use Label Encoding:

| Use Case | Example |
| --- | --- |
| Ordinal categories (natural order) | Education: High School (0), Bachelor (1), Master (2), PhD (3) |
| Tree-based models (Random Forest, XGBoost) | They can handle arbitrary numbers |
| Binary categories | Yes (1), No (0) |
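For ordinal categories, an explicit mapping is usually safer than `LabelEncoder`, because `LabelEncoder` assigns integers alphabetically rather than by rank. A minimal sketch (the column name and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"education": ["High School", "Master", "Bachelor", "PhD"]})

# Explicit rank mapping preserves the natural order. LabelEncoder would
# sort alphabetically instead (Bachelor=0, High School=1, Master=2, PhD=3).
order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
df["education_encoded"] = df["education"].map(order)
print(df["education_encoded"].tolist())  # [0, 2, 1, 3]
```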

When NOT to use Label Encoding:

  • Nominal categories with no order (colors, cities, countries)
  • Linear models (Logistic Regression, Linear Regression, SVM)
  • KNN, Neural Networks (distance-based)

2. One-Hot Encoding #

What it does: Creates a new binary column for each unique category. Only one column is “hot” (1) per row.

Example:

| City | Paris | London | Tokyo | New York |
| --- | --- | --- | --- | --- |
| Paris | 1 | 0 | 0 | 0 |
| London | 0 | 1 | 0 | 0 |
| Tokyo | 0 | 0 | 1 | 0 |
| New York | 0 | 0 | 0 | 1 |

The Formula:

```
For each category k:
Column_k = 1 if category == k else 0
```

Code:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Using pandas
df_encoded = pd.get_dummies(df, columns=['city'])

# Using scikit-learn
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['city']])
```

The Problem: Too many columns!

| Unique Categories | New Columns Created |
| --- | --- |
| 10 | 10 |
| 100 | 100 |
| 1,000 | 1,000 |
| 10,000 | 10,000 (serious memory issue) |
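The explosion is easy to verify: one-hot encoding a column with 1,000 unique values produces 1,000 new columns. A quick sketch with pandas (the synthetic data is illustrative):

```python
import pandas as pd

# Synthetic column with 1,000 unique categories
df = pd.DataFrame({"city": [f"city_{i}" for i in range(1000)]})

encoded = pd.get_dummies(df, columns=["city"])
print(encoded.shape)  # (1000, 1000) -- one binary column per category
```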

When to use One-Hot Encoding:

| Use Case | Example |
| --- | --- |
| Nominal categories (no order) | City, Color, Product Category |
| Small number of categories (< 20) | Days of week (7), Months (12), Colors (10) |
| Linear models, KNN, Neural Networks | Works well with these |

When NOT to use One-Hot Encoding:

  • High cardinality (> 50 unique categories) → Too many columns
  • Tree-based models (they handle label encoding well)

3. Target Encoding (Mean Encoding) #

What it does: Replaces each category with the average target value for that category.

Example: Predicting house price based on neighborhood.

| Neighborhood | Average House Price | Encoded Value |
| --- | --- | --- |
| Beverly Hills | $1,500,000 | 1.5M |
| Downtown | $500,000 | 0.5M |
| Suburbs | $400,000 | 0.4M |

The Formula:

For category c:
Encoded Value = Average(target for all rows with category = c)

Code:

```python
# Manual target encoding
target_means = df.groupby('neighborhood')['price'].mean()
df['neighborhood_encoded'] = df['neighborhood'].map(target_means)

# Using the category_encoders library
import category_encoders as ce

encoder = ce.TargetEncoder()
df['neighborhood_encoded'] = encoder.fit_transform(df['neighborhood'], df['price'])
```

The Problem: Overfitting.

If a category appears only once, its encoded value = that single target value. Model memorizes it.

Solution: Smoothing. Mix category mean with global mean.

```
Smoothed Value = (n * category_mean + m * global_mean) / (n + m)

where:
  n = number of samples in the category
  m = smoothing parameter (higher = more regularization)
```
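The smoothing formula is a few lines of pandas. A sketch with toy data (the neighborhoods, prices, and the choice m=10 are all illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B"],
    "price": [100.0, 200.0, 300.0, 1000.0],
})

m = 10                                   # smoothing parameter
global_mean = df["price"].mean()         # 400.0
stats = df.groupby("neighborhood")["price"].agg(["mean", "count"])

# (n * category_mean + m * global_mean) / (n + m)
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["neighborhood_encoded"] = df["neighborhood"].map(smoothed)

# Rare category "B" (1 row, raw mean 1000) is pulled strongly toward 400:
print(round(smoothed["B"], 2))  # 454.55
print(round(smoothed["A"], 2))  # 353.85
```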

When to use Target Encoding:

| Use Case | Example |
| --- | --- |
| High cardinality (> 50 categories) | Zip codes, User IDs, Product IDs |
| Enough data per category | At least 10-20 samples per category |
| Tree-based models (XGBoost, LightGBM) | They handle this well |

When NOT to use Target Encoding:

  • Small datasets (risk of overfitting)
  • Categories with very few samples
  • When you cannot use cross-validation for encoding
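A common way to use cross-validation for encoding is out-of-fold target encoding: each row is encoded with means computed on the *other* folds, so it never sees its own target value. A sketch (the function name and fold count are illustrative, not a standard API):

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5):
    """Encode `col` using target means computed only on the other folds."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        # Means come from the training fold only -- no leakage from val rows
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
```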

Comparison Table #

| Feature | Label Encoding | One-Hot Encoding | Target Encoding |
| --- | --- | --- | --- |
| Creates new columns | No (replaces) | Yes (N columns) | No (replaces) |
| Preserves order | Yes (introduces false order) | No | No |
| Handles high cardinality | ✅ Yes | ❌ No (too many columns) | ✅ Yes |
| Risk of overfitting | Low | Low | High (needs smoothing) |
| Works with linear models | ❌ No (false order) | ✅ Yes | ⚠️ With caution |
| Works with tree models | ✅ Yes | ✅ Yes | ✅ Yes (often best) |
| Default choice | Ordinal only | Low-cardinality nominal | High cardinality |

Decision Flowchart #

1. Does the category have a natural order? → Label Encoding (with an explicit ordinal mapping).
2. No order, and fewer than ~20 unique values? → One-Hot Encoding.
3. No order, and high cardinality (> 50 values)? → Target Encoding (with smoothing).

Quick Quiz #

Q1: You have a “Country” column with 150 unique countries. You are using Linear Regression. Which encoding?

A1: One-Hot Encoding would create 150 columns (too many). Target Encoding is risky with linear models. Consider using embeddings or feature hashing.
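Feature hashing maps categories into a fixed number of columns regardless of cardinality, at the cost of occasional hash collisions. A sketch with scikit-learn's `FeatureHasher` (the choice of `n_features=32` is illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

countries = [{"country": c} for c in ["France", "Japan", "Brazil"]]

# All 150 countries would still hash into the same 32 columns
hasher = FeatureHasher(n_features=32, input_type="dict")
X = hasher.transform(countries)
print(X.shape)  # (3, 32)
```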

Q2: You have “Education Level” (High School, Bachelor, Master, PhD). You are using Random Forest. Which encoding?

A2: Label Encoding. Education has natural order. Random Forest handles it well.

Q3: You have “Color” (Red, Blue, Green) with no order. You are using KNN. Which encoding?

A3: One-Hot Encoding. KNN is distance-based. Label encoding would incorrectly think Red < Blue < Green.

Q4: You have “Zip Code” with 500 unique values. You are using XGBoost. Which encoding?

A4: Target Encoding (with smoothing) or Label Encoding. XGBoost can handle both. Target encoding often works better.
