
Feature Selection in Machine Learning #

“More features do not mean a better model. Extra features add noise. They slow down training. They cause overfitting. Feature selection is about keeping what matters and throwing away the rest.”

Why Feature Selection? #

| Problem | Without Selection | With Selection |
|---|---|---|
| Training time | Slow | Fast |
| Overfitting | High risk | Low risk |
| Model complexity | High | Low |
| Interpretability | Hard | Easy |

Simple truth: A model with 10 good features beats a model with 100 average features.


The Three Main Methods #

| Method | How it works | Speed | Accuracy |
|---|---|---|---|
| Filter | Statistical tests before training | Very fast | Moderate |
| Wrapper | Train the model repeatedly to test features | Slow | High |
| Embedded | Model selects features during training | Medium | High |

Method 1: Filter Methods (Fastest) #

How it works: You rank features by a statistical score. Keep top k features. No model training involved.

Common filter techniques:

| Technique | What it measures | Best for |
|---|---|---|
| Correlation | Linear relationship with the target | Regression |
| Chi-Square | Dependence between variables | Classification |
| Mutual Information | Any relationship (linear or not) | Both |
| Variance Threshold | Removes constant or near-constant features | Both |

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Keep top 10 features based on ANOVA F-test
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Keep features with variance above threshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
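
The table above also lists Chi-Square and Mutual Information; a minimal sketch of both, assuming the features are non-negative (the chi-square test requires counts or frequencies):

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Chi-square works only on non-negative features (e.g. counts)
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)

# Mutual information also catches non-linear relationships
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)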

Pros: Very fast. Works as a preprocessing step.
Cons: Ignores feature interactions. A feature might be useless alone but powerful when combined.
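
A quick illustration of that last point, on made-up XOR data (purely for demonstration): each feature alone tells you almost nothing about y, yet together they determine it completely.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Two random binary features; the target is their XOR
rng = np.random.default_rng(0)
X_xor = rng.integers(0, 2, size=(1000, 2))
y_xor = X_xor[:, 0] ^ X_xor[:, 1]

# Per-feature scores are near zero, so a filter would discard both features
print(mutual_info_classif(X_xor, y_xor, discrete_features=True))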


Method 2: Wrapper Methods (Most Accurate) #

How it works: You train a model multiple times with different feature combinations. Keep the combination that gives the best score.

Common wrapper techniques:

| Technique | How it works | Complexity |
|---|---|---|
| Forward Selection | Start empty, add the best feature one by one | O(n²) |
| Backward Elimination | Start with all features, remove the worst one by one | O(n²) |
| Recursive Feature Elimination (RFE) | Train a model, remove the weakest features, repeat | O(n²) |

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Recursive Feature Elimination
estimator = RandomForestClassifier()
selector = RFE(estimator, n_features_to_select=10)
selector.fit(X, y)

# Selected features
selected_mask = selector.support_

Pros: Finds best combination. Considers feature interactions.
Cons: Slow. Can overfit if not careful.

Warning: Do not use wrapper methods with 1000+ features. It will take forever.
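
Forward Selection and Backward Elimination from the table above are available in scikit-learn as SequentialFeatureSelector; a minimal sketch, assuming a logistic regression base model:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# direction="forward" adds features one by one; "backward" removes them
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
)
selector.fit(X, y)
selected_mask = selector.support_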

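Method 3: Embedded Methods (Best Balance) #

How it works: The model selects features while it trains. Lasso (L1 regularization) pushes the weights of useless features to exactly zero. Tree models like Random Forest score every feature by importance.

A minimal sketch of both, assuming a binary classification target; the top-10 cutoff is arbitrary. For regression you would use Lasso itself; its classification analogue is L1-regularized logistic regression.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# L1 penalty zeroes out the coefficients of weak features
logreg = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
logreg.fit(X, y)
kept_l1 = np.where(logreg.coef_[0] != 0)[0]

# Random Forest assigns every feature an importance score during training
rf = RandomForestClassifier()
rf.fit(X, y)
kept_rf = np.argsort(rf.feature_importances_)[::-1][:10]  # top 10 (arbitrary cutoff)

Pros: Good balance of speed and accuracy. Selection comes free with training.
Cons: Tied to one model type. Correlated features can distort importance scores.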

Which Method Should You Choose? #

| Your Situation | Best Method | Why |
|---|---|---|
| Very large dataset (10k+ features) | Filter | Fastest, good enough |
| Small dataset (< 500 features) | Wrapper (RFE) | Most accurate |
| You want simplicity | Embedded (Random Forest importance) | Easy, works well |
| You have linear data | Embedded (Lasso) | Forces zeros, removes features |
| You need a quick baseline | Filter (SelectKBest) | Fast, no training |

Default recommendation: Start with Random Forest feature importance (Embedded). If you still have too many features, use SelectKBest (Filter) to reduce further.


The Golden Rule #

Feature selection must be done on training data only. Then apply the same selection to the validation and test sets.

Correct way:

selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
X_val_selected = selector.transform(X_val)
X_test_selected = selector.transform(X_test)

Wrong way:

# DO NOT DO THIS - data leakage
selector.fit(X_train, y_train)
selector.fit(X_val, y_val)  # WRONG!
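
An easy way to enforce this rule is to put the selector inside a scikit-learn Pipeline; cross-validation then refits the selector on each training fold automatically. A minimal sketch:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# The selector is fit inside each CV fold, so leakage cannot happen
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)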

How Many Features to Keep? #

No fixed answer. General guidelines:

| Total Features | Keep (roughly) |
|---|---|
| < 50 | All; no selection needed |
| 50 – 500 | Top 50–100 |
| 500 – 5,000 | Top 100–200 |
| 5,000+ | Top 200–300 |

Better approach: Plot the feature importances. Look for the “elbow”. Keep the features before the importance drops sharply.
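
One way to find that elbow is to sort the importances and plot them; a minimal matplotlib sketch, assuming rf is a Random Forest fitted as in the embedded example above:

import numpy as np
import matplotlib.pyplot as plt

# Sort importances from highest to lowest and look for the sharp drop
importances = np.sort(rf.feature_importances_)[::-1]
plt.plot(importances, marker="o")
plt.xlabel("Feature rank")
plt.ylabel("Importance")
plt.show()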

Quick Example (Complete Picture) #

Problem: 500 features. Want to reduce to 50.

Step 1 (Filter – Fast reduction):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(mutual_info_classif, k=100)
X_selected = selector.fit_transform(X, y)

Step 2 (Embedded – Refine further):

from sklearn.ensemble import RandomForestClassifier
import numpy as np

rf = RandomForestClassifier()
rf.fit(X_selected, y)
importances = rf.feature_importances_

# Keep the indices of the top 50 features by importance
top_50 = np.argsort(importances)[::-1][:50]
X_final = X_selected[:, top_50]

Result: 500 features → 100 → 50 good features.

Quick Quiz #

Q1: You have 10,000 features and 5,000 samples. Which method should you use first?

A1: Filter method (like SelectKBest). It is fast and will reduce to manageable size quickly.

Q2: You want features that are important for a Random Forest model. Which method?

A2: Embedded method using Random Forest’s feature_importances_.

Q3: Which method actually sets some feature weights to zero?

A3: Lasso regression (L1 regularization) from Embedded methods.

Q4: You ran RFE and it took 2 days to complete. What went wrong?

A4: Too many features. Use Filter method first to reduce, then RFE.

Key Takeaways (5 Lines) #

  1. Filter = Statistical test before training. Fastest. Good for large datasets.
  2. Wrapper = Train model repeatedly. Most accurate. Use for small datasets.
  3. Embedded = Model selects during training. Best balance. Start here.
  4. Always fit selector on training data only. Never on validation or test.
  5. Fewer good features > Many bad features. Feature selection prevents overfitting.
