Feature Selection in Machine Learning #
“More features do not mean a better model. Extra features add noise. They slow down training. They cause overfitting. Feature selection is about keeping what matters and throwing away the rest.”
Why Feature Selection? #
| Problem | Without Selection | With Selection |
|---|---|---|
| Training time | Slow | Fast |
| Overfitting | High risk | Low risk |
| Model complexity | High | Low |
| Interpretability | Hard | Easy |
Simple truth: A model with 10 good features beats a model with 100 average features.

The Three Main Methods #
| Method | How it works | Speed | Accuracy |
|---|---|---|---|
| Filter | Statistical tests before training | Very Fast | Moderate |
| Wrapper | Train model repeatedly to test features | Slow | High |
| Embedded | Model selects features during training | Medium | High |
Method 1: Filter Methods (Fastest) #
How it works: You rank features by a statistical score and keep the top k. No model training is involved.
Common filter techniques:
| Technique | What it measures | Best for |
|---|---|---|
| Correlation | Linear relationship with target | Regression |
| Chi-Square | Dependence between variables | Classification |
| Mutual Information | Any relationship (linear or not) | Both |
| Variance Threshold | Remove constant or near-constant features | Both |
```python
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Keep top 10 features based on ANOVA F-test
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Keep features with variance above threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```
Pros: Very fast. Works as a preprocessing step.
Cons: Ignores feature interactions. A feature might be useless alone but powerful when combined.
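A classic illustration is an XOR target: each feature alone tells you nothing about the label, so a univariate filter scores both near zero even though together they determine the label exactly. A minimal sketch with synthetic data:
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# XOR: y depends on the combination of the two features, not on either alone
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]

# Both per-feature scores come out near zero, so a filter would drop both
print(mutual_info_classif(X, y, discrete_features=True))
```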

Method 2: Wrapper Methods (Most Accurate) #
How it works: You train a model multiple times with different feature combinations and keep the combination that gives the best score.
Common wrapper techniques:
| Technique | How it works | Complexity |
|---|---|---|
| Forward Selection | Start empty, add best feature one by one | O(n²) |
| Backward Elimination | Start with all, remove worst feature one by one | O(n²) |
| Recursive Elimination (RFE) | Train model, remove weakest features, repeat | O(n) |
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Recursive Feature Elimination
estimator = RandomForestClassifier()
selector = RFE(estimator, n_features_to_select=10)
selector.fit(X, y)

# Selected features (boolean mask)
selected_mask = selector.support_
```
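Forward selection is available out of the box as well. A minimal sketch using scikit-learn's SequentialFeatureSelector (the LogisticRegression estimator and the feature count are illustrative choices):
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Start from an empty set; greedily add the feature that improves CV score most
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
)
sfs.fit(X, y)
selected_mask = sfs.get_support()
```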
Pros: Finds the best combination. Considers feature interactions.
Cons: Slow. Can overfit if not careful.
Warning: Do not use wrapper methods with 1000+ features. It will take forever.
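
Method 3: Embedded Methods (Best Balance) #
How it works: The model selects features as part of its own training. Lasso's L1 penalty shrinks some coefficients exactly to zero; tree ensembles expose importance scores you can threshold.
A minimal sketch of both variants via SelectFromModel (the penalty settings and threshold are illustrative, not tuned):
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# L1 (Lasso-style) penalty drives some coefficients to exactly zero;
# for a regression target, sklearn.linear_model.Lasso works the same way
l1_model = LogisticRegression(penalty="l1", solver="liblinear")
selector = SelectFromModel(l1_model)
X_selected = selector.fit_transform(X, y)

# Random Forest: keep features whose importance is above the mean
rf = RandomForestClassifier()
selector = SelectFromModel(rf, threshold="mean")
X_selected = selector.fit_transform(X, y)
```
Pros: One training run. Good balance of speed and accuracy.
Cons: Selection is tied to one model family. Tree importances can be biased toward high-cardinality features.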

Which Method Should You Choose? #
| Your Situation | Best Method | Why |
|---|---|---|
| Very large dataset (10k+ features) | Filter | Fastest, good enough |
| Small dataset (< 500 features) | Wrapper (RFE) | Most accurate |
| You want simplicity | Embedded (Random Forest importance) | Easy, works well |
| You have linear data | Embedded (Lasso) | Forces zeros, removes features |
| You need quick baseline | Filter (SelectKBest) | Fast, no training |
Default recommendation: Start with Random Forest feature importance (Embedded). If you still have too many features, use SelectKBest (Filter) to reduce further.

The Golden Rule #
Feature selection must be done on training data only. Then apply the same selection to validation and test.
Correct way:
```python
# Fit the selector on training data only
selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
X_val_selected = selector.transform(X_val)
X_test_selected = selector.transform(X_test)
```
Wrong way:
```python
# DO NOT DO THIS - data leakage
selector.fit(X_train, y_train)
selector.fit(X_val, y_val)  # WRONG!
```
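One way to make the wrong way impossible is to put the selector inside a Pipeline, so cross-validation refits it on each training fold automatically. A minimal sketch (the k=10 and the classifier are illustrative):
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# The selector is fit only on the training portion of each split
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("model", RandomForestClassifier()),
])
pipe.fit(X_train, y_train)
```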
How Many Features to Keep? #
No fixed answer. General guidelines:
| Total Features | Keep (roughly) |
|---|---|
| < 50 | Keep all, no selection needed |
| 50 – 500 | Keep top 50-100 |
| 500 – 5000 | Keep top 100-200 |
| 5000+ | Keep top 200-300 |
Better approach: Plot the feature importances. Look for the “elbow”. Keep the features before the importance drops sharply.
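A minimal sketch of that plot, assuming a fitted model that exposes feature_importances_ (the rf variable is illustrative):
```python
import matplotlib.pyplot as plt
import numpy as np

# Sort importances from largest to smallest and look for the elbow
importances = np.sort(rf.feature_importances_)[::-1]
plt.plot(importances, marker="o")
plt.xlabel("Feature rank")
plt.ylabel("Importance")
plt.title("Keep features to the left of the elbow")
plt.show()
```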
Quick Example (Complete Picture) #
Problem: 500 features. Want to reduce to 50.
Step 1 (Filter – Fast reduction):
```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(mutual_info_classif, k=100)
X_selected = selector.fit_transform(X, y)
```
Step 2 (Embedded – Refine further):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_selected, y)
importances = rf.feature_importances_

# Keep top 50 based on importance
top_50 = np.argsort(importances)[::-1][:50]
X_final = X_selected[:, top_50]
```
Result: 500 features → 100 → 50 good features.
Quick Quiz #
Q1: You have 10,000 features and 5,000 samples. Which method should you use first?
A1: A filter method (like SelectKBest). It is fast and will quickly reduce the feature set to a manageable size.
Q2: You want features that are important for a Random Forest model. Which method?
A2: Embedded method using Random Forest’s feature_importances_.
Q3: Which method actually sets some feature weights to zero?
A3: Lasso regression (L1 regularization) from Embedded methods.
Q4: You ran RFE and it took 2 days to complete. What went wrong?
A4: Too many features. Use a filter method first to reduce the set, then run RFE.
Key Takeaways (5 Lines) #
- Filter = Statistical test before training. Fastest. Good for large datasets.
- Wrapper = Train model repeatedly. Most accurate. Use for small datasets.
- Embedded = Model selects during training. Best balance. Start here.
- Always fit selector on training data only. Never on validation or test.
- Fewer good features > Many bad features. Feature selection prevents overfitting.
