Feature Engineering for ML

Feature Engineering for ML #
Are the most important part of building high-performing models. Your model is only as good as the features you give it. A bad feature is like giving a detective the wrong clues. A good feature makes the answer obvious. Feature engineering is the art of creating those good clues from raw data.

What is Feature Engineering? #

Simple Definition: Creating new features from existing data that help your model predict better.

Raw data is rarely ready to use. You must transform it. You must combine it. You must extract hidden information from it.

Example: You have a “date” column. Raw date is useless. But from it you can create:

Day of week (Monday=1, Tuesday=2)
Is weekend? (0 or 1)
Month of year
Days since last holiday
Quarter of year

These new features help your model find patterns.

Why Feature Engineering Matters #

Here is the truth:

Factor	Impact on Model Performance
Algorithm choice	10-20% impact
Hyperparameter tuning	5-10% impact
Amount of data	20-30% impact
Feature engineering	40-60% impact

Yes. Feature engineering matters more than which algorithm you pick. A simple model with great features beats a complex model with bad features.

The Golden Rule #

“Know your data. Domain knowledge is feature engineering superpower.”

A doctor knows which medical measurements matter. A finance expert knows which ratios predict default. No algorithm can replace that knowledge.

Common Feature Engineering Techniques #

1. Handling Date and Time #

Raw date is almost useless to a model. Extract these instead:

Raw Date	Engineered Features
2024-03-15	Year=2024, Month=3, Day=15, DayOfWeek=Friday, IsWeekend=True, Quarter=1, DaysSinceStartOfYear=74

Example: Predicting store sales.

Bad feature: “Date” (model cannot understand)
Good feature: “IsWeekend” (sales are higher on weekends)
Good feature: “MonthsSinceLastHoliday” (sales drop after holidays)

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
df['year'] = df['date'].dt.year

2. Creating Ratios and Combinations #

Two features together often tell a better story than each alone.

Example: Predicting house prices.

Bad Features	Engineered Feature
Total Rooms, Households	Rooms per Household
Total Bedrooms, Total Rooms	Bedroom Ratio
Population, Households	People per Household

Why it works: A house with 10 rooms is big. A house with 10 rooms and 1 household is a mansion. A house with 10 rooms and 10 households is a crowded apartment. Same rooms. Different meaning.

df['rooms_per_house'] = df['total_rooms'] / df['households']
df['bedroom_ratio'] = df['total_bedrooms'] / df['total_rooms']
df['people_per_house'] = df['population'] / df['households']

3. Binning (Grouping Similar Values) #

Instead of using exact numbers, group them into categories.

Example: Age.

Raw age: 25, 32, 18, 65, 70, 22
Binned age: Young (18-30), Middle (31-50), Senior (51+)

Why it works: Model sees patterns more clearly. Buying habits differ by age group, not exact age.

df['age_group'] = pd.cut(df['age'], 
                          bins=[0, 18, 30, 50, 100], 
                          labels=['Child', 'Young', 'Middle', 'Senior'])

4. Polynomial Features (Capturing Non-Linearity) #

Sometimes the relationship is curved, not straight. Polynomial features capture curves.

Example: House price vs house size.

A 1000 sq ft house costs $100 k . A 2000 s q f t h o u s e c o s t s$ 100k.A2000sqfthousecosts200k. A 3000 sq ft house costs $450 k (n o t$ 450k(not300k). The relationship is curved. Adding “size squared” helps capture this curve.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X[['size']])
# Creates: size, size^2

Warning: Too many polynomial features can overfit. Use carefully.

Quick Quiz #

Q1: You have a “timestamp” column. What are three features you can extract?

A1: Hour of day, day of week, is weekend, month, quarter, year (any three).

Q2: You have “total_bathrooms” and “total_bedrooms”. What combination feature could you create?

A2: Bathroom to bedroom ratio. Or total_rooms = bathrooms + bedrooms.

Q3: You are predicting house price. Why is “neighborhood average price” a good feature?

A3: Because houses in expensive neighborhoods tend to be expensive themselves. The neighborhood matters.

Q4: You add polynomial features and validation score drops. What happened?

A4: Overfitting. Polynomial features added too much complexity. Reduce degree or add regularization.

Handling Imbalanced Datasets Feature Selection