
Encoding Categorical Variables #

Encoding categorical variables is an essential step in machine learning because models only understand numbers. Your data may contain values like "Red," "Blue," "Paris," or "London," but most algorithms cannot process text directly. Encoding converts these categories into numerical values so models can learn patterns from them accurately.

What Are Categorical Variables? #

Definition: Features that represent categories or groups, not numbers.

| Type | Example | Values |
| --- | --- | --- |
| Nominal | Colors | Red, Blue, Green (no order) |
| Ordinal | Education Level | High School, Bachelor, Master, PhD (has order) |
| Binary | Yes/No | Yes, No |

The Problem: Models cannot understand “Red” or “Paris”. They need numbers.

The Solution: Convert categories to numbers.


Three Main Encoding Methods #

| Method | How it works | Best for |
| --- | --- | --- |
| Label Encoding | Assigns a number to each category (0, 1, 2, 3…) | Ordinal categories (has order) |
| One-Hot Encoding | Creates binary columns for each category | Nominal categories (no order) |
| Target Encoding | Replaces category with the mean of the target | High cardinality (many unique values) |

1. Label Encoding #

What it does: Assigns a unique integer to each category.

Example:

| City | Label Encoded |
| --- | --- |
| Paris | 0 |
| London | 1 |
| Tokyo | 2 |
| New York | 3 |

The Formula:

No formula. Just mapping: Category → Integer

Code:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])

# Note: LabelEncoder assigns integers in alphabetical order, not
# appearance order: London→0, New York→1, Paris→2, Tokyo→3
```

The Problem (Very Important):

Model sees: London (1) is between Paris (0) and Tokyo (2). It thinks London is “halfway” between them. It thinks New York (3) is “greater than” Tokyo (2).

This is wrong. Cities have no order.
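The false order can be seen directly with a few lines of NumPy. This is a sketch using the hypothetical city codes from the table above; a distance-based model treats the integer codes as magnitudes, while one-hot vectors keep every pair of cities equidistant:

```python
import numpy as np

# Hypothetical label codes matching the table: Paris=0, London=1, Tokyo=2, New York=3
paris, london, tokyo, new_york = 0, 1, 2, 3

# A distance-based model treats the codes as magnitudes:
print(abs(paris - london))    # 1 -> "Paris is close to London"
print(abs(paris - new_york))  # 3 -> "Paris is far from New York"

# With one-hot vectors, every pair of distinct cities is equally far apart:
onehot = np.eye(4)
print(np.linalg.norm(onehot[0] - onehot[1]))  # sqrt(2), same for any distinct pair
print(np.linalg.norm(onehot[0] - onehot[3]))  # sqrt(2)
```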

When to use Label Encoding:

| Use Case | Example |
| --- | --- |
| Ordinal categories (natural order) | Education: High School (0), Bachelor (1), Master (2), PhD (3) |
| Tree-based models (Random Forest, XGBoost) | They can handle arbitrary numbers |
| Binary categories | Yes (1), No (0) |
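For ordinal categories, an explicit mapping is usually safer than `LabelEncoder`, because `LabelEncoder` assigns integers alphabetically rather than by rank. A minimal sketch (the column name and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"education": ["High School", "Master", "Bachelor", "PhD"]})

# Explicit rank mapping preserves the natural order. LabelEncoder would
# sort alphabetically instead (Bachelor=0, High School=1, Master=2, PhD=3).
order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
df["education_encoded"] = df["education"].map(order)
print(df["education_encoded"].tolist())  # [0, 2, 1, 3]
```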

When NOT to use Label Encoding:

  • Nominal categories with no order (colors, cities, countries)
  • Linear models (Logistic Regression, Linear Regression, SVM)
  • KNN, Neural Networks (distance-based)

2. One-Hot Encoding #

What it does: Creates a new binary column for each unique category. Only one column is “hot” (1) per row.

Example:

| City | Paris | London | Tokyo | New York |
| --- | --- | --- | --- | --- |
| Paris | 1 | 0 | 0 | 0 |
| London | 0 | 1 | 0 | 0 |
| Tokyo | 0 | 0 | 1 | 0 |
| New York | 0 | 0 | 0 | 1 |

The Formula:

```
For each category k:
Column_k = 1 if category == k else 0
```

Code:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Using pandas
df_encoded = pd.get_dummies(df, columns=['city'])

# Using scikit-learn
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['city']])
```

The Problem: Too many columns!

| Unique Categories | New Columns Created |
| --- | --- |
| 10 | 10 |
| 100 | 100 |
| 1,000 | 1,000 |
| 10,000 | 10,000 (serious memory issue) |
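The explosion is easy to verify: one-hot encoding a column with 1,000 unique values produces 1,000 new columns. A quick sketch with pandas (the synthetic data is illustrative):

```python
import pandas as pd

# Synthetic column with 1,000 unique categories
df = pd.DataFrame({"city": [f"city_{i}" for i in range(1000)]})

encoded = pd.get_dummies(df, columns=["city"])
print(encoded.shape)  # (1000, 1000) -- one binary column per category
```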

When to use One-Hot Encoding:

| Use Case | Example |
| --- | --- |
| Nominal categories (no order) | City, Color, Product Category |
| Small number of categories (< 20) | Days of week (7), Months (12), Colors (10) |
| Linear models, KNN, Neural Networks | Works well with these |

When NOT to use One-Hot Encoding:

  • High cardinality (> 50 unique categories) → Too many columns
  • Tree-based models (they handle label encoding well)

3. Target Encoding (Mean Encoding) #

What it does: Replaces each category with the average target value for that category.

Example: Predicting house price based on neighborhood.

| Neighborhood | Average House Price | Encoded Value |
| --- | --- | --- |
| Beverly Hills | $1,500,000 | 1.5M |
| Downtown | $500,000 | 0.5M |
| Suburbs | $400,000 | 0.4M |

The Formula:

For category c:
Encoded Value = Average(target for all rows with category = c)

Code:

```python
# Manual target encoding
target_means = df.groupby('neighborhood')['price'].mean()
df['neighborhood_encoded'] = df['neighborhood'].map(target_means)

# Using the category_encoders library
import category_encoders as ce

encoder = ce.TargetEncoder()
df['neighborhood_encoded'] = encoder.fit_transform(df['neighborhood'], df['price'])
```

The Problem: Overfitting.

If a category appears only once, its encoded value = that single target value. Model memorizes it.

Solution: Smoothing. Mix category mean with global mean.

```
Smoothed Value = (n * category_mean + m * global_mean) / (n + m)

where:
  n = number of samples in the category
  m = smoothing parameter (higher = more regularization)
```
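The smoothing formula is a few lines of pandas. A sketch with toy data (the neighborhoods, prices, and the choice m=10 are all illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B"],
    "price": [100.0, 200.0, 300.0, 1000.0],
})

m = 10                                   # smoothing parameter
global_mean = df["price"].mean()         # 400.0
stats = df.groupby("neighborhood")["price"].agg(["mean", "count"])

# (n * category_mean + m * global_mean) / (n + m)
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["neighborhood_encoded"] = df["neighborhood"].map(smoothed)

# Rare category "B" (1 row, raw mean 1000) is pulled strongly toward 400:
print(round(smoothed["B"], 2))  # 454.55
print(round(smoothed["A"], 2))  # 353.85
```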

When to use Target Encoding:

| Use Case | Example |
| --- | --- |
| High cardinality (> 50 categories) | Zip codes, User IDs, Product IDs |
| Enough data per category | At least 10-20 samples per category |
| Tree-based models (XGBoost, LightGBM) | They handle this well |

When NOT to use Target Encoding:

  • Small datasets (risk of overfitting)
  • Categories with very few samples
  • When you cannot use cross-validation for encoding
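A common way to use cross-validation for encoding is out-of-fold target encoding: each row is encoded with means computed on the *other* folds, so it never sees its own target value. A sketch (the function name and fold count are illustrative, not a standard API):

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5):
    """Encode `col` using target means computed only on the other folds."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        # Means come from the training fold only -- no leakage from val rows
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
```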

Comparison Table #

| Feature | Label Encoding | One-Hot Encoding | Target Encoding |
| --- | --- | --- | --- |
| Creates new columns | No (replaces) | Yes (N columns) | No (replaces) |
| Preserves order | Yes (introduces false order) | No | No |
| Handles high cardinality | ✅ Yes | ❌ No (too many columns) | ✅ Yes |
| Risk of overfitting | Low | Low | High (needs smoothing) |
| Works with linear models | ❌ No (false order) | ✅ Yes | ⚠️ With caution |
| Works with tree models | ✅ Yes | ✅ Yes | ✅ Yes (often best) |
| Default choice | Ordinal only | Low-cardinality nominal | High cardinality |

Decision Flowchart #

1. Does the category have a natural order? → Label Encoding (with an explicit ordinal mapping).
2. No order, and fewer than ~20 unique values? → One-Hot Encoding.
3. No order, and high cardinality (> 50 values)? → Target Encoding (with smoothing).

Quick Quiz #

Q1: You have a “Country” column with 150 unique countries. You are using Linear Regression. Which encoding?

A1: One-Hot Encoding would create 150 columns (too many). Target Encoding is risky with linear models. Consider using embeddings or feature hashing.
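Feature hashing maps categories into a fixed number of columns regardless of cardinality, at the cost of occasional hash collisions. A sketch with scikit-learn's `FeatureHasher` (the choice of `n_features=32` is illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

countries = [{"country": c} for c in ["France", "Japan", "Brazil"]]

# All 150 countries would still hash into the same 32 columns
hasher = FeatureHasher(n_features=32, input_type="dict")
X = hasher.transform(countries)
print(X.shape)  # (3, 32)
```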

Q2: You have “Education Level” (High School, Bachelor, Master, PhD). You are using Random Forest. Which encoding?

A2: Label Encoding. Education has natural order. Random Forest handles it well.

Q3: You have “Color” (Red, Blue, Green) with no order. You are using KNN. Which encoding?

A3: One-Hot Encoding. KNN is distance-based. Label encoding would incorrectly think Red < Blue < Green.

Q4: You have “Zip Code” with 500 unique values. You are using XGBoost. Which encoding?

A4: Target Encoding (with smoothing) or Label Encoding. XGBoost can handle both. Target encoding often works better.
