Feature Selection Methods in Machine Learning

Last updated: Jun 16, 2026

Author: Christy Harshitha Dakarapu

In the previous articles, we learned about:

Curse of Dimensionality
PCA
t-SNE
UMAP

All of these techniques help deal with high-dimensional data.

However, there are two different approaches to reducing dimensionality:


Feature Selection

Feature Extraction

PCA, t-SNE, and UMAP are:


Feature Extraction

techniques.

Now we will study:


Feature Selection

which focuses on choosing the most useful features from the original dataset.

Why Feature Selection?

Suppose a dataset contains:


100 Features

but only:


15 Features

actually influence the target variable.

The remaining features may be:

Irrelevant
Redundant
Noisy

Keeping them can hurt model performance.

What is Feature Selection?

Feature Selection is the process of selecting the most relevant features while removing irrelevant or redundant ones.

Goal:


Keep Useful Features

Remove Unnecessary Features

Example

Dataset:

Age	Income	Favorite Color	Salary
25	40000	Blue	45000

Suppose we are predicting:


Salary

Age and Income may be useful.

Favorite Color may contribute little.

Feature Selection removes:


Favorite Color

Why Feature Selection Matters

Feature selection helps:

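Reduce Overfitting

With fewer noisy features, the model has less noise to memorize.

Improve Accuracy

The model can focus on the features that actually carry signal.

Faster Training

Fewer columns mean less data to process.

Better Interpretability

Models with fewer inputs are easier to explain.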

Real-Life Analogy

Imagine packing for a trip.

You have:


100 Items

but only:


20 Items

are necessary.

Taking everything means:


More Weight

without additional benefit.

Feature selection works similarly.

Feature Selection vs Feature Extraction

Many beginners confuse these concepts.

Feature Selection

Chooses existing features.

Example:


Age

Income

Education

Keep:


Age

Income

Feature Extraction

Creates new features.

Example:


PC1

PC2

generated by PCA.

Comparison

Feature Selection	Feature Extraction
Keeps Original Features	Creates New Features
More Interpretable	Less Interpretable
Simpler	More Complex
No Transformation	Transformation Required
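
The difference can be seen side by side in a minimal sketch. It assumes scikit-learn is installed and uses its bundled iris dataset; the choice of `SelectKBest` with `f_classif` as the selector and of 2 output columns is illustrative, not something the comparison above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y = data.data, data.target
feature_names = list(data.feature_names)

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kept_names = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
X_selected = selector.transform(X)

# Feature extraction: build 2 new columns (PC1, PC2) from all 4 originals.
X_extracted = PCA(n_components=2).fit_transform(X)

print("Selected original features:", kept_names)
print("Selected shape:", X_selected.shape, "Extracted shape:", X_extracted.shape)
```

Both outputs have two columns, but only the selected ones can still be read as original measurements.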

Categories of Feature Selection

Most methods belong to three categories:


Filter Methods

Wrapper Methods

Embedded Methods

1. Filter Methods

Filter methods evaluate features independently of any machine learning model.

Idea:


Feature
     ↓
Score
     ↓
Keep Best Ones

Advantages

Fast
Simple
Scalable

Disadvantages

Ignore feature interactions

Common Filter Methods

Correlation

Chi-Square Test

ANOVA

Mutual Information

Correlation-Based Selection

Measures the linear relationship between each feature and the target.

Example:

Feature	Correlation
Income	0.85
Age	0.40
Favorite Color	0.01

Income is highly informative.

Favorite Color can be removed.

Correlation Formula

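The Pearson correlation coefficient between a feature x and the target y is:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Values close to:

+1

-1

indicate strong linear relationships. Values close to 0 indicate little or no linear relationship.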
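
The Pearson coefficient can be computed directly from its definition: the sum of products of the centered values, divided by the product of their spreads. The sketch below assumes NumPy is available. The helper name `pearson_r` and the income and salary numbers are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # centered feature values
    dy = y - y.mean()          # centered target values
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative, made-up values (not from a real dataset).
income = [30, 40, 50, 60, 70]
salary = [33, 41, 56, 62, 75]

r = pearson_r(income, salary)
print(round(r, 3))
```

The result should agree with `np.corrcoef(income, salary)[0, 1]`, which is the usual way to get the same number in practice.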

Chi-Square Test

Used primarily for:


Categorical Features

with a categorical target.

Measures how strongly a feature and the target depend on each other.

Large Chi-Square value:


Stronger Dependence (likely an Important Feature)

Mutual Information

Measures information shared between:


Feature

Target

Higher mutual information:


More Useful Feature
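
Unlike correlation, mutual information can also detect non-linear dependence. The following sketch assumes scikit-learn and NumPy, and uses synthetic data invented for this illustration: the target is the square of one feature, so the relationship is strong but not linear.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
x_useful = rng.uniform(-1, 1, size=n)   # related to the target, but not linearly
x_noise = rng.uniform(-1, 1, size=n)    # unrelated to the target
y = x_useful ** 2 + rng.normal(0, 0.01, size=n)

X = np.column_stack([x_useful, x_noise])

# Estimated mutual information between each column and the target.
mi = mutual_info_regression(X, y, random_state=0)

# Pearson correlation of each column with the target, for comparison.
pearson = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

print("Mutual information:", mi)
print("Pearson correlation:", pearson)
```

The useful feature should receive a clearly higher mutual information score than the noise feature, even though its Pearson correlation with the target stays near zero.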

2. Wrapper Methods

Wrapper methods evaluate feature subsets using an actual machine learning model.

Idea:


Select Features
       ↓
Train Model
       ↓
Evaluate Performance

Advantages

Often highly accurate
Considers feature interactions

Disadvantages

Computationally expensive

Common Wrapper Methods

Forward Selection

Backward Elimination

Recursive Feature Elimination (RFE)

Forward Selection

Start with:


No Features

Add features one at a time.

Keep features that improve performance.

Workflow:


None
 ↓
Feature A
 ↓
Feature A+B
 ↓
Feature A+B+C
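
The workflow above can be sketched as a greedy loop. This is one possible implementation under stated assumptions, not a canonical one: it uses scikit-learn, scores subsets with 5-fold cross-validated accuracy, and stops when no remaining feature improves the score. The function name `forward_selection` and the `max_features` parameter are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, model, max_features):
    """Greedily add the feature that most improves mean cross-validated accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate subset formed by adding one more feature.
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        if scores[best_j] <= best_score:
            break  # no remaining feature improves the score
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

X, y = load_iris(return_X_y=True)
selected, score = forward_selection(
    X, y, LogisticRegression(max_iter=1000), max_features=3
)
print("Selected column indices:", selected, "CV accuracy:", round(score, 3))
```

Each round retrains the model once per remaining feature, which is why wrapper methods become expensive as the number of features grows.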

Backward Elimination

Start with:


All Features

Remove features one at a time, each time dropping the feature whose removal hurts performance the least.

Workflow:


A+B+C+D
     ↓
A+B+C
     ↓
A+B
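
scikit-learn's `SequentialFeatureSelector` supports this direction via `direction="backward"`. The sketch below assumes scikit-learn 0.24 or later. The iris dataset, the logistic regression model, and the target of 2 features are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target

# Start from all 4 features and drop one at a time until 2 remain.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
    cv=5,
)
sfs.fit(X, y)

kept = [name for name, keep in zip(data.feature_names, sfs.get_support()) if keep]
print("Features kept:", kept)
```

Setting `direction="forward"` on the same class gives forward selection instead.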

Recursive Feature Elimination (RFE)

RFE is a popular wrapper technique. It ranks features using the model's coefficients or importance scores.

Process:


Train Model
      ↓
Remove Weakest Feature
      ↓
Retrain
      ↓
Repeat

Example

Features:


Age

Income

Education

Zip Code

RFE may remove:


Zip Code

if it contributes little.

3. Embedded Methods

Embedded methods perform feature selection during model training.

Idea:


Train Model
      ↓
Select Features Automatically

Advantages

Usually faster than wrappers
Often more accurate than simple filters

Examples

Lasso Regression

Decision Trees

Random Forest

XGBoost

Lasso Feature Selection

Lasso uses:


L1 Regularization

Some coefficients become:


Exactly Zero

Features with zero coefficients are removed.
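
A minimal sketch of this behaviour, assuming scikit-learn and NumPy. The synthetic dataset (10 features, 3 of them informative) and the regularization strength `alpha=1.0` are arbitrary choices for illustration; in practice `alpha` is usually tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: only 3 of the 10 features influence the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

# Scale features so the L1 penalty treats them comparably.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

# Features whose coefficient was not shrunk to exactly zero.
selected = np.flatnonzero(lasso.coef_ != 0)
print("Coefficients:", np.round(lasso.coef_, 2))
print("Selected feature indices:", selected)
```

Increasing `alpha` drives more coefficients to zero, while decreasing it keeps more features.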

Decision Tree Feature Importance

Trees naturally identify useful features: each feature is scored by how much its splits reduce impurity.

Example:

Feature	Importance
Income	0.45
Age	0.30
Education	0.20
Zip Code	0.05

Zip Code may be removed.

Random Forest Feature Importance

Random Forest combines multiple trees.

Averaging importance scores across the trees gives more stable estimates than a single tree.

XGBoost Feature Importance

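XGBoost can use several measures:

Gain (average loss reduction from splits that use the feature)
Weight (number of times the feature is used in a split)
Cover (average number of samples affected by splits that use the feature)

to rank features.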

Feature Importance Example

Dataset:


House Prices

Features:

Feature	Importance
Area	0.50
Location	0.30
Age	0.15
Color of Door	0.05

The last feature contributes little.

Filter vs Wrapper vs Embedded

Method	Speed	Accuracy	Complexity
Filter	Fast	Moderate	Low
Wrapper	Slow	High	High
Embedded	Medium	High	Medium

Example: Customer Churn

Features:

Age
Income
Tenure
Customer ID

Feature Selection may remove:


Customer ID

because it is a unique identifier and carries no predictive information.

Example: Healthcare

Dataset:


500 Medical Features

Feature Selection may reduce this to:


500 → 50

of the most important features.

Example: Text Classification

A vocabulary may contain thousands of words, each treated as a feature.

Only a subset contributes meaningfully.

Feature selection reduces dimensionality significantly.

Python Example: Correlation


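# numeric_only=True skips non-numeric columns such as Favorite Color (pandas 1.5+)
correlation = df.corr(numeric_only=True)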
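
To turn the correlation matrix into a feature ranking, one option is to take the target's column and sort by absolute value. The sketch below assumes pandas 1.5 or later, for the `numeric_only` argument. The DataFrame values are made up, with `Salary` as the target.

```python
import pandas as pd

# Illustrative, made-up values.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "Income": [40000, 52000, 81000, 90000, 61000, 45000],
    "Favorite Color": ["Blue", "Red", "Green", "Blue", "Red", "Green"],
    "Salary": [45000, 56000, 85000, 96000, 66000, 49000],
})

# Pairwise Pearson correlations between the numeric columns only.
correlation = df.corr(numeric_only=True)

# Absolute correlation of each feature with the target, strongest first.
target_corr = (
    correlation["Salary"].drop("Salary").abs().sort_values(ascending=False)
)
print(target_corr)
```

A categorical column such as `Favorite Color` is skipped here. It would need to be encoded, or scored with a test suited to categorical data, before it could be ranked.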

Chi-Square Selection


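from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Keep the k features with the highest chi-square scores (k=5 is an arbitrary choice)
selector = SelectKBest(score_func=chi2, k=5)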
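
A fuller sketch of chi-square selection follows, assuming scikit-learn and NumPy. scikit-learn's `chi2` expects non-negative feature values such as counts, so the example builds synthetic count data. The column layout (one informative column, two noise columns) is invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)  # binary target, 100 samples per class

# Count-style (non-negative) features, e.g. word counts.
informative = rng.poisson(lam=np.where(y == 1, 6.0, 1.0))  # rate depends on class
noise_a = rng.poisson(lam=3.0, size=y.size)                # same rate for both classes
noise_b = rng.poisson(lam=3.0, size=y.size)
X = np.column_stack([informative, noise_a, noise_b])

# Keep the single feature with the highest chi-square score.
selector = SelectKBest(score_func=chi2, k=1)
X_new = selector.fit_transform(X, y)

print("Chi-square scores:", np.round(selector.scores_, 2))
print("Kept column index:", selector.get_support(indices=True))
```

The column whose counts differ between classes should receive a much larger score than the two noise columns.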

Recursive Feature Elimination


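from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # any estimator exposing coef_ or feature_importances_
rfe = RFE(model, n_features_to_select=5)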
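
An end-to-end sketch of RFE, assuming scikit-learn. The synthetic dataset and the logistic regression estimator are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 5 are informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=0, random_state=0
)

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)  # repeatedly fits the model and drops the weakest feature

print("Kept mask:", rfe.support_)   # True for the 5 retained features
print("Ranking:", rfe.ranking_)     # 1 = kept; larger = eliminated earlier
X_reduced = rfe.transform(X)        # data restricted to the retained columns
```

The `ranking_` attribute also records the order in which features were eliminated, which can help when choosing a different number of features to keep.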

Random Forest Importance


# model must be a fitted tree-based estimator, e.g. RandomForestClassifier
model.feature_importances_
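
A runnable sketch of reading importances from a random forest, assuming scikit-learn and NumPy. The synthetic dataset and the feature names `f0` to `f5` are hypothetical placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = model.feature_importances_

# Print features from most to least important.
order = np.argsort(importances)[::-1]
for i in order:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```

Features at the bottom of the ranking are candidates for removal. It is worth confirming on held-out data that dropping them does not reduce performance.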