In the previous articles, we learned about:

  • Curse of Dimensionality
  • PCA
  • t-SNE
  • UMAP

All of these techniques help deal with high-dimensional data.

However, there are two different approaches to reducing dimensionality:

  • Feature Selection
  • Feature Extraction

PCA, t-SNE, and UMAP are feature extraction techniques.

Now we will study feature selection, which focuses on choosing the most useful features from the original dataset.

Why Feature Selection?

Suppose a dataset contains 100 features, but only 15 of them actually influence the target variable.

The remaining features may be:

  • Irrelevant
  • Redundant
  • Noisy

Keeping them can hurt model performance.

What is Feature Selection?

Feature Selection is the process of selecting the most relevant features while removing irrelevant or redundant ones.

Goal:

Keep Useful Features

Remove Unnecessary Features

Example

Dataset:

| Age | Income | Favorite Color | Salary |
|-----|--------|----------------|--------|
| 25  | 40000  | Blue           | 45000  |

Suppose we are predicting Salary.

Age and Income may be useful, while Favorite Color may contribute little.

Feature selection removes Favorite Color.

Why Feature Selection Matters

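Feature selection helps in four ways:

  • Reduce overfitting: with fewer noisy features, the model has less spurious signal to memorize.
  • Improve accuracy: the model can focus on the features that actually carry information about the target.
  • Faster training: fewer columns mean less data to process.
  • Better interpretability: models with fewer inputs are easier to explain.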

Real-Life Analogy

Imagine packing for a trip.

You have 100 items, but only 20 are necessary.

Taking everything adds weight without additional benefit.

Feature selection works similarly.

Feature Selection vs Feature Extraction

Many beginners confuse these concepts.

Feature Selection

Chooses existing features.

Example: from Age, Income, and Education, keep Age and Income.

Feature Extraction

Creates new features.

Example: PC1 and PC2, generated by PCA.

Comparison

| Feature Selection       | Feature Extraction      |
|-------------------------|-------------------------|
| Keeps Original Features | Creates New Features    |
| More Interpretable      | Less Interpretable      |
| Simpler                 | More Complex            |
| No Transformation       | Transformation Required |

Categories of Feature Selection

Most methods belong to three categories:

  • Filter Methods
  • Wrapper Methods
  • Embedded Methods

1. Filter Methods

Filter methods evaluate features independently of any machine learning model.

Idea:

Feature → Score → Keep Best Ones

Advantages

  • Fast
  • Simple
  • Scalable

Disadvantages

  • Ignore feature interactions

Common Filter Methods

  • Correlation
  • Chi-Square Test
  • ANOVA
  • Mutual Information

Correlation-Based Selection

Measures the strength of the linear relationship between each feature and the target.

Example:

| Feature        | Correlation |
|----------------|-------------|
| Income         | 0.85        |
| Age            | 0.40        |
| Favorite Color | 0.01        |

Income is highly informative.

Favorite Color can be removed.

Correlation Formula

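The Pearson correlation coefficient between a feature x and the target y, over n samples, is:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where \bar{x} and \bar{y} are the sample means of x and y.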

Values close to +1 or -1 indicate strong linear relationships. Values close to 0 indicate little or no linear relationship.
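The Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. As a minimal sketch, it can be computed directly from that definition with NumPy. The function name `pearson_r` and the toy numbers are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy data: income tracks salary closely, age much less so.
income = [40000, 52000, 61000, 75000, 90000]
age = [25, 47, 31, 52, 38]
salary = [45000, 55000, 66000, 80000, 98000]

r_income = pearson_r(income, salary)
r_age = pearson_r(age, salary)
print(f"income vs salary: {r_income:.2f}")
print(f"age vs salary:    {r_age:.2f}")
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, which is what you would normally call in practice.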

Chi-Square Test

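Used primarily for categorical features with a categorical target.

Measures dependence between feature and target.

A large Chi-Square value suggests that the feature and the target are dependent, so the feature is likely to be important.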

Mutual Information

Measures the information shared between a feature and the target. Unlike correlation, it can also capture non-linear relationships.

Higher mutual information means a more useful feature.
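As a rough illustration, the sketch below scores two synthetic features with scikit-learn's `mutual_info_classif`. The data is invented for the example: the target depends on the first feature in a non-linear way (whether its absolute value exceeds 1), and the second feature is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)

# Synthetic target: depends on |informative|, not on its sign,
# so the linear correlation with the target is expected to be weak.
y = (np.abs(informative) > 1).astype(int)
X = np.column_stack([informative, noise])

mi = mutual_info_classif(X, y, random_state=0)
pearson = np.corrcoef(informative, y)[0, 1]

print("mutual information:", mi)
print("Pearson r of informative feature:", round(pearson, 3))
```

In a setup like this, a correlation filter would likely discard the first feature, while mutual information should rank it clearly above the noise column.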

2. Wrapper Methods

Wrapper methods evaluate feature subsets using an actual machine learning model.

Idea:

Select Features → Train Model → Evaluate Performance

Advantages

  • Often highly accurate
  • Considers feature interactions

Disadvantages

  • Computationally expensive

Common Wrapper Methods

  • Forward Selection
  • Backward Elimination
  • Recursive Feature Elimination (RFE)

Forward Selection

Start with no features.

Add features one at a time, keeping each addition only if it improves performance.

Workflow:

None → Feature A → Feature A+B → Feature A+B+C
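The forward selection workflow can be sketched as a greedy loop. This is one possible implementation, not a canonical one: the helper `forward_selection`, the synthetic dataset, and the choice of 5-fold cross-validated R² with linear regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 8 features, of which 3 actually drive the target.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves the cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate subset: current selection plus one new feature.
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        if scores[best_j] <= best_score:
            break  # no candidate improves the score, so stop early
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

selected, score = forward_selection(X, y, max_features=3)
print("selected columns:", selected, "CV R^2:", round(score, 3))
```

Each round retrains the model once per remaining feature, which is why wrapper methods become expensive as the number of features grows.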

Backward Elimination

Start with all features.

Remove features one at a time.

Workflow:

A+B+C+D → A+B+C → A+B
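scikit-learn ships a `SequentialFeatureSelector` that supports this direction. The sketch below assumes a synthetic classification dataset and a logistic regression model, with the target of 3 features chosen arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,   # stop once 3 features remain
    direction="backward",     # start from all features and remove one per step
    cv=3,
)
sfs.fit(X, y)

kept = sfs.get_support()      # boolean mask over the original columns
X_reduced = sfs.transform(X)
print("kept columns:", kept.nonzero()[0])
```

Setting `direction="forward"` on the same class gives forward selection instead.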

Recursive Feature Elimination (RFE)

RFE is a popular wrapper technique.

Process:

Train Model → Remove Weakest Feature → Retrain → Repeat

Example

Features: Age, Income, Education, Zip Code.

RFE may remove Zip Code if it contributes little.

3. Embedded Methods

Embedded methods perform feature selection during model training.

Idea:

Train Model → Select Features Automatically

Advantages

  • Usually faster than wrappers
  • Often more accurate than simple filters

Examples

  • Lasso Regression
  • Decision Trees
  • Random Forest
  • XGBoost

Lasso Feature Selection

Lasso uses L1 regularization, which pushes some coefficients to exactly zero.

Features with zero coefficients are effectively removed from the model.
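A minimal sketch of Lasso-based selection with scikit-learn follows. The dataset is synthetic and `alpha=5.0` is an arbitrary value picked for the example. In practice the regularization strength is usually tuned, for instance with `LassoCV`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Scale first so the L1 penalty treats every feature comparably.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=5.0)  # alpha is an assumed setting, not a recommendation
lasso.fit(X_scaled, y)

kept = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print("coefficients:", np.round(lasso.coef_, 2))
print("kept columns:", kept)
```

A larger `alpha` zeroes out more coefficients, and a smaller one keeps more features.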

Decision Tree Feature Importance

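Trees naturally identify useful features: every split is made on the feature that best separates the data, and a feature's importance reflects how much its splits contribute overall.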

Example:

| Feature   | Importance |
|-----------|------------|
| Income    | 0.45       |
| Age       | 0.30       |
| Education | 0.20       |
| Zip Code  | 0.05       |

Zip Code may be removed.
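One way to turn importances into an actual selection is a threshold. The sketch below uses scikit-learn's `SelectFromModel` around a decision tree. The synthetic data and the cutoff of 0.10 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 features, 2 of them informative.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)

# Keep only features whose importance is at least 0.10 (an arbitrary cutoff).
selector = SelectFromModel(tree, threshold=0.10)
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
X_reduced = selector.transform(X)
print("importances:", importances.round(2))
print("kept columns:", selector.get_support().nonzero()[0])
```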

Random Forest Feature Importance

Random Forest combines multiple trees and averages their feature importances.

This gives more stable importance estimates than a single tree.

XGBoost Feature Importance

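XGBoost can rank features by several measures:

  • Gain: average improvement in the objective from splits that use the feature
  • Weight: number of times the feature is used in a split
  • Cover: roughly, how many samples are affected by splits that use the feature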

Feature Importance Example

Dataset:

House Prices

Features:

| Feature       | Importance |
|---------------|------------|
| Area          | 0.50       |
| Location      | 0.30       |
| Age           | 0.15       |
| Color of Door | 0.05       |

The last feature contributes little.

Filter vs Wrapper vs Embedded

| Method   | Speed  | Accuracy | Complexity |
|----------|--------|----------|------------|
| Filter   | Fast   | Moderate | Low        |
| Wrapper  | Slow   | High     | High       |
| Embedded | Medium | High     | Medium     |

Example: Customer Churn

Features:

  • Age
  • Income
  • Tenure
  • Customer ID

Feature selection may remove Customer ID, because an arbitrary identifier contains little predictive information.

Example: Healthcare

Dataset: 500 medical features.

Feature selection may reduce these to the 50 most important features (500 → 50).

Example: Text Classification

The vocabulary may contain thousands of words, each one a feature.

Only a subset contributes meaningfully to the prediction.

Feature selection reduces dimensionality significantly.

Python Example: Correlation

# Pairwise Pearson correlations; non-numeric columns are excluded first
correlation = df.select_dtypes("number").corr()
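To go from a correlation matrix to a selection, one option is to rank features by the absolute value of their correlation with the target and keep those above a cutoff. The DataFrame below is made-up data loosely based on the salary example, and the 0.6 threshold is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical data in the spirit of the salary example.
df = pd.DataFrame({
    "age": [25, 47, 31, 52, 38, 29],
    "income": [40000, 52000, 61000, 75000, 90000, 43000],
    "favorite_color": ["blue", "red", "green", "blue", "red", "green"],
    "salary": [45000, 55000, 66000, 80000, 98000, 47000],
})

# Correlation of every numeric feature with the target column.
numeric = df.select_dtypes("number")
target_corr = numeric.corr()["salary"].drop("salary")

# Rank by strength (sign does not matter) and apply an assumed cutoff.
ranked = target_corr.abs().sort_values(ascending=False)
selected = ranked[ranked >= 0.6].index.tolist()

print(ranked)
print("selected:", selected)
```

Categorical columns such as `favorite_color` are skipped here. They would need encoding, or a different test such as Chi-Square, before they can be scored.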

Chi-Square Selection

# SelectKBest keeps the k highest-scoring features; chi2 is the scoring function
from sklearn.feature_selection import SelectKBest, chi2
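A fuller sketch of how these two imports fit together is shown below. The count data is synthetic, and `k=1` is chosen only to keep the example small. Note that scikit-learn's `chi2` expects non-negative feature values, such as counts or one-hot encodings.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)

# Synthetic non-negative count features:
# column 0 depends on the class, columns 1 and 2 do not.
f0 = rng.poisson(lam=2 + 4 * y)
f1 = rng.poisson(lam=3, size=n)
f2 = rng.poisson(lam=3, size=n)
X = np.column_stack([f0, f1, f2])

selector = SelectKBest(score_func=chi2, k=1)
X_new = selector.fit_transform(X, y)

print("chi2 scores:", selector.scores_.round(1))
print("kept columns:", selector.get_support().nonzero()[0])
```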

Recursive Feature Elimination

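from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # example estimator; RFE needs coef_ or feature_importances_
rfe = RFE(model, n_features_to_select=5)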
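For completeness, here is a self-contained version that fits the selector and inspects the result. The synthetic dataset and the logistic regression estimator are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)

# support_ is a boolean mask of kept features;
# ranking_ assigns 1 to kept features and higher numbers to those dropped earlier.
print("kept mask:", rfe.support_)
print("ranking:  ", rfe.ranking_)

X_reduced = rfe.transform(X)
```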

Random Forest Importance

# Available after fitting a Random Forest model: one importance score per feature
model.feature_importances_
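The attribute only exists on a fitted model. A minimal end-to-end sketch, using synthetic data and placeholder feature names `f0` to `f5`, might look like this:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]  # placeholder feature names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Pair each score with its feature name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based scores sum to 1, so each value can be read as a share of the total importance.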