In the previous articles, we learned about:
- Curse of Dimensionality
- PCA
- t-SNE
- UMAP
All of these techniques help deal with high-dimensional data.
However, there are two different approaches to reducing dimensionality:
Feature Selection
Feature Extraction
PCA, t-SNE, and UMAP are:
Feature Extraction
techniques.
Now we will study:
Feature Selection
which focuses on choosing the most useful features from the original dataset.
Why Feature Selection?
Suppose a dataset contains:
100 Features
but only:
15 Features
actually influence the target variable.
The remaining features may be:
- Irrelevant
- Redundant
- Noisy
Keeping them can hurt model performance.
What is Feature Selection?
Feature Selection is the process of selecting the most relevant features while removing irrelevant or redundant ones.
Goal:
Keep Useful Features
Remove Unnecessary Features
Example
Dataset:
| Age | Income | Favorite Color | Salary |
|---|---|---|---|
| 25 | 40000 | Blue | 45000 |
Suppose we are predicting:
Salary
Age and Income may be useful.
Favorite Color may contribute little.
Feature Selection removes:
Favorite Color
Why Feature Selection Matters
Feature selection helps:
Reduce Overfitting
Fewer noisy inputs for the model to fit.
Improve Accuracy
The model is not distracted by irrelevant features.
Faster Training
Fewer columns to process.
Better Interpretability
Models become easier to explain.
Real-Life Analogy
Imagine packing for a trip.
You have:
100 Items
but only:
20 Items
are necessary.
Taking everything:
More Weight
without additional benefit.
Feature selection works similarly.
Feature Selection vs Feature Extraction
Many beginners confuse these concepts.
Feature Selection
Chooses existing features.
Example:
Age
Income
Education
Keep:
Age
Income
Feature Extraction
Creates new features.
Example:
PC1
PC2
generated by PCA.
Comparison
| Feature Selection | Feature Extraction |
|---|---|
| Keeps Original Features | Creates New Features |
| More Interpretable | Less Interpretable |
| Simpler | More Complex |
| No Transformation | Transformation Required |
Categories of Feature Selection
Most methods belong to three categories:
Filter Methods
Wrapper Methods
Embedded Methods
1. Filter Methods
Filter methods evaluate features independently of any machine learning model.
Idea:
Feature
↓
Score
↓
Keep Best Ones
Advantages
- Fast
- Simple
- Scalable
Disadvantages
- Ignore feature interactions
Common Filter Methods
Correlation
Chi-Square Test
ANOVA
Mutual Information
Correlation-Based Selection
Measures the linear relationship between each feature and the target.
Example:
| Feature | Correlation |
|---|---|
| Income | 0.85 |
| Age | 0.40 |
| Favorite Color | 0.01 |
Income is highly informative.
Favorite Color can be removed.
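A minimal sketch of correlation-based ranking with pandas follows. The data is synthetic, and the column names (`income`, `age`, `random_noise`, `salary`) and the 0.3 cutoff are assumptions chosen for illustration, not values from a real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): salary depends on income and age, not on random_noise.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, n),
    "age": rng.integers(20, 65, n).astype(float),
    "random_noise": rng.normal(size=n),
})
df["salary"] = 0.9 * df["income"] + 400 * df["age"] + rng.normal(0, 3_000, n)

# Absolute Pearson correlation of every feature with the target, strongest first.
scores = df.corr()["salary"].drop("salary").abs().sort_values(ascending=False)

# Keep the features whose score clears an illustrative threshold.
threshold = 0.3
selected = scores[scores > threshold].index.tolist()

print(scores.round(2))
print("Selected:", selected)
```

The threshold is a judgment call. In practice it is usually tuned, or replaced by keeping the top k features.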
Correlation Formula
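The Pearson correlation coefficient is:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $x_i$ is a feature value, $y_i$ is the matching target value, and $\bar{x}$, $\bar{y}$ are their means.
Values close to:
+1
or
-1
indicate strong linear relationships.
Values close to 0 indicate little or no linear relationship.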
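To make the coefficient concrete, this sketch computes Pearson's r from its definition (the covariance of the two variables divided by the product of their standard deviations) and checks the result against `np.corrcoef`. The helper name `pearson_r` and the numbers are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the feature mean
    dy = y - y.mean()          # deviations from the target mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy values (in thousands): salary rises with income.
income = [30, 40, 50, 60, 70]
salary = [35, 45, 52, 66, 75]

r = pearson_r(income, salary)
r_numpy = float(np.corrcoef(income, salary)[0, 1])
r_negated = pearson_r(income, [-s for s in salary])

print(round(r, 3), round(r_numpy, 3), round(r_negated, 3))
```

Negating the target flips the sign of r but not its magnitude, which is why feature ranking usually uses the absolute value.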
Chi-Square Test
Used primarily for:
Categorical Features
Measures dependence between a categorical feature and a categorical target.
Large Chi-Square value:
Feature and target are likely dependent, so the feature is likely important.
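As a rough sketch, scikit-learn's `SelectKBest` with `chi2` scores each feature against the class labels and keeps the top k. The Iris dataset and `k=2` are assumptions made for convenience. Iris measurements are continuous rather than categorical, but `chi2` only requires non-negative values, so the example still runs. Counts or one-hot encoded categories are the more typical input.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load features as a DataFrame so the column names are available.
X, y = load_iris(return_X_y=True, as_frame=True)

# Score every feature against the class labels and keep the two best.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

kept = list(X.columns[selector.get_support()])
for name, score in zip(X.columns, selector.scores_):
    print(f"{name}: {score:.1f}")
print("Kept:", kept)
```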
Mutual Information
Measures information shared between:
Feature
Target
Higher information:
More Useful Feature
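A small sketch with scikit-learn's `mutual_info_classif` illustrates the idea. The data is synthetic (an assumption): the feature `signal` agrees with the target about 90% of the time, while `noise` is unrelated to it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
signal = rng.integers(0, 2, size=n)        # binary feature tied to the target
noise = rng.normal(size=n)                 # continuous feature unrelated to the target
flip = rng.random(n) < 0.1                 # roughly 10% label noise
y = np.where(flip, 1 - signal, signal)     # target mostly equals signal

X = np.column_stack([signal, noise])

# Tell the estimator which columns are discrete (first) and continuous (second).
mi = mutual_info_classif(X, y, discrete_features=[True, False], random_state=0)

for name, score in zip(["signal", "noise"], mi):
    print(f"{name}: {score:.3f}")
```

Unlike Pearson correlation, mutual information can also pick up nonlinear relationships. The scores are estimates and vary somewhat with sample size.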
2. Wrapper Methods
Wrapper methods evaluate feature subsets using an actual machine learning model.
Idea:
Select Features
↓
Train Model
↓
Evaluate Performance
Advantages
- Often highly accurate
- Considers feature interactions
Disadvantages
- Computationally expensive
Common Wrapper Methods
Forward Selection
Backward Elimination
Recursive Feature Elimination (RFE)
Forward Selection
Start with:
No Features
Add features one at a time.
At each step, keep the feature that improves performance the most.
Stop when adding more features no longer helps.
Workflow:
None
↓
Feature A
↓
Feature A+B
↓
Feature A+B+C
Backward Elimination
Start with:
All Features
Remove features one at a time.
At each step, drop the feature whose removal hurts performance the least.
Workflow:
A+B+C+D
↓
A+B+C
↓
A+B
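Both forward selection and backward elimination can be sketched with scikit-learn's `SequentialFeatureSelector`. The diabetes dataset, the linear model, and the target of 3 features are assumptions picked to keep the example small.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression()

# Forward: start with no features and add the best one at each step.
forward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)
forward_features = list(X.columns[forward.get_support()])

# Backward: start with all features and drop the least useful one at each step.
backward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="backward", cv=5
)
backward.fit(X, y)
backward_features = list(X.columns[backward.get_support()])

print("Forward :", forward_features)
print("Backward:", backward_features)
```

The two directions do not always agree, because each makes greedy choices from a different starting point. Every step retrains the model with cross-validation, which is where the computational cost of wrappers comes from.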
Recursive Feature Elimination (RFE)
A popular wrapper technique.
Process:
Train Model
↓
Remove Weakest Feature
↓
Retrain
↓
Repeat
Example
Features:
Age
Income
Education
Zip Code
RFE may remove:
Zip Code
if it contributes little.
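The loop above can be sketched with scikit-learn's `RFE`. The dataset (diabetes), the estimator (`LinearRegression`), and the choice of 5 features are assumptions for illustration. The diabetes features come pre-scaled, which matters here because RFE judges the "weakest" feature by coefficient size.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Train, drop the feature with the smallest coefficient, retrain, and repeat
# until only 5 features remain (step=1 removes one feature per round).
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5, step=1)
rfe.fit(X, y)

kept = list(X.columns[rfe.support_])
ranking = dict(zip(X.columns, rfe.ranking_))  # 1 = kept; larger = removed earlier

print("Kept:", kept)
print("Ranking:", ranking)
```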
3. Embedded Methods
Embedded methods perform feature selection during model training.
Idea:
Train Model
↓
Select Features Automatically
Advantages
- Usually faster than wrappers
- Often more accurate than simple filters
Examples
Lasso Regression
Decision Trees
Random Forest
XGBoost
Lasso Feature Selection
Lasso uses:
L1 Regularization
Some coefficients become:
Exactly Zero
Features with zero coefficients are removed.
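A minimal sketch of Lasso-based selection follows. The data is synthetic (an assumption): only the first two of five columns influence the target. The value `alpha=0.1` is also illustrative. In practice it is usually tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))

# Only columns 0 and 1 drive the target; columns 2-4 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Features whose coefficient was shrunk exactly to zero are dropped.
selected = np.flatnonzero(lasso.coef_ != 0)

print("Coefficients:", lasso.coef_.round(3))
print("Selected column indices:", selected.tolist())
```

A larger `alpha` zeroes out more coefficients, so it controls how aggressive the selection is.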
Decision Tree Feature Importance
Trees score each feature by how much its splits reduce impurity.
Example:
| Feature | Importance |
|---|---|
| Income | 0.45 |
| Age | 0.30 |
| Education | 0.20 |
| Zip Code | 0.05 |
Zip Code may be removed.
Random Forest Feature Importance
Random Forest averages feature importances over many trees.
This usually gives more stable estimates than a single tree.
XGBoost Feature Importance
Measures:
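- Gain: average improvement in the loss from splits that use the feature
- Weight: number of times the feature is used to split
- Cover: roughly how many samples are affected by those splits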
to rank features.
Feature Importance Example
Dataset:
House Prices
Features:
| Feature | Importance |
|---|---|
| Area | 0.50 |
| Location | 0.30 |
| Age | 0.15 |
| Color of Door | 0.05 |
The last feature contributes little.
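A table like the one above can be produced from a fitted tree ensemble. The sketch below uses synthetic house data (an assumption), so the importance values it prints will not match the table. The names `area`, `house_age`, and `door_color` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
area = rng.uniform(50, 250, n)            # square metres
house_age = rng.uniform(0, 50, n)         # years
door_color = rng.integers(0, 5, n)        # encoded color, unrelated to price

# Price depends on area and age only; door color plays no role.
price = 2_000 * area - 1_500 * house_age + rng.normal(0, 10_000, n)

feature_names = ["area", "house_age", "door_color"]
X = np.column_stack([area, house_age, door_color])

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, price)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor features with many distinct values, so it is worth cross-checking with another method before dropping a feature.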
Filter vs Wrapper vs Embedded
| Method | Speed | Accuracy | Complexity |
|---|---|---|---|
| Filter | Fast | Moderate | Low |
| Wrapper | Slow | High | High |
| Embedded | Medium | High | Medium |
Example: Customer Churn
Features:
- Age
- Income
- Tenure
- Customer ID
Feature Selection may remove:
Customer ID
because it is only a unique identifier and carries no predictive information.
Example: Healthcare
Dataset:
500 Medical Features
Feature Selection may reduce the dataset to its most important features:
500 → 50
Example: Text Classification
The vocabulary may contain thousands of words.
Only a subset contributes meaningfully.
Feature selection reduces dimensionality significantly.
Python Example: Correlation
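# Pairwise Pearson correlations; numeric_only skips text columns such as Favorite Color
correlation = df.corr(numeric_only=True)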
Chi-Square Selection
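from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# X (non-negative features) and y are assumed to exist; k=5 is illustrative
X_selected = SelectKBest(chi2, k=5).fit_transform(X, y)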
Recursive Feature Elimination
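from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # assumption: any estimator exposing coef_ or feature_importances_ works
rfe = RFE(model, n_features_to_select=5).fit(X, y)  # X and y are assumed to exist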
Random Forest Importance
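# model is assumed to be an already fitted RandomForestClassifier or RandomForestRegressor
importances = model.feature_importances_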