In many real-world Machine Learning problems, some classes appear much more frequently than others. This phenomenon is known as Class Imbalance or Imbalanced Datasets.

Imbalanced datasets are one of the most common challenges in Machine Learning because standard algorithms tend to favor the majority class while ignoring the minority class.

Examples include:

  • Fraud Detection
  • Disease Diagnosis
  • Network Intrusion Detection
  • Credit Default Prediction
  • Manufacturing Defect Detection
  • Spam Detection

In these applications, the minority class is often the most important class, even though it represents only a small fraction of the dataset.

In this article, we will explore imbalanced datasets, understand why they are problematic, learn evaluation metrics beyond accuracy, and study techniques such as oversampling, undersampling, and SMOTE.

What is an Imbalanced Dataset?

A dataset is considered imbalanced when one class contains significantly more samples than another.

Example:

Class     | Samples
Non-Fraud | 9,900
Fraud     | 100

Fraudulent transactions represent only:

$$\frac{100}{10000} = 1\%$$

of the dataset.

This creates a severe imbalance.

Balanced vs Imbalanced Dataset

Balanced Dataset:

Class | Samples
Yes   | 500
No    | 500

Imbalanced Dataset:

Class | Samples
Yes   | 950
No    | 50

Many real-world datasets are imbalanced to some degree.

Why Imbalanced Data is a Problem

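Most Machine Learning algorithms minimize an average error over all training samples, which in practice rewards overall accuracy. Because the majority class contributes most of those samples, it dominates what the model learns.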

In an imbalanced dataset, predicting the majority class all the time may still produce high accuracy.

Example:

Dataset:

Class    | Samples
Healthy  | 990
Diseased | 10

Suppose a model predicts Healthy for every patient.

Accuracy Calculation

Correct Predictions: 990

Total Predictions: 1,000

Accuracy:

$$\text{Accuracy} = \frac{990}{1000} = 99\%$$

The model achieves 99% accuracy.

However, it never identifies a single diseased patient.

This model is practically useless.
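The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch: the label encoding (0 for healthy, 1 for diseased) and the variable names are assumptions made for illustration.

```python
# Labels from the example: 990 healthy (0) and 10 diseased (1) patients.
y_true = [0] * 990 + [1] * 10

# A "model" that predicts healthy for every patient.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of diseased patients that the model actually detects.
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
detection_rate = detected / sum(t == 1 for t in y_true)

print(f"accuracy       = {accuracy:.2%}")        # 99.00%
print(f"detection rate = {detection_rate:.2%}")  # 0.00%
```

The two numbers describe the same predictions, yet only the second one reveals that the model is of no use for finding diseased patients.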

The Accuracy Paradox

The Accuracy Paradox occurs when a model achieves high accuracy but performs poorly on the minority class.

Example:

Metric               | Value
Accuracy             | 99%
Fraud Detection Rate | 0%

The model appears excellent but completely fails its actual purpose.

Majority Class and Minority Class

In imbalanced datasets:

Majority Class: the class with the most observations

Minority Class: the class with few observations

Example:

Class     | Type
Non-Fraud | Majority
Fraud     | Minority

Usually, the minority class is the class of interest.

Real-World Examples

Fraud Detection

Class      | Percentage
Legitimate | 99.8%
Fraudulent | 0.2%

Disease Detection

Class    | Percentage
Healthy  | 95%
Diseased | 5%

Manufacturing

Class             | Percentage
Good Product      | 99%
Defective Product | 1%

Spam Detection

Class        | Percentage
Normal Email | 90%
Spam Email   | 10%

Understanding the Confusion Matrix

For imbalanced datasets, confusion matrices become essential.

Example:

                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN

Where:

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative

Example Confusion Matrix

                 | Predicted Fraud | Predicted Not Fraud
Actual Fraud     | 80              | 20
Actual Not Fraud | 50              | 9,850

This matrix provides far more information than accuracy alone.

Precision

Precision measures how many predicted positives are actually correct.

Formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Example:

$$\text{Precision} = \frac{80}{80 + 50} \approx 0.615$$

Precision: 61.5%

Why Precision Matters

Precision is important when false positives are costly.

Examples:

  • Spam Detection
  • Fraud Detection

Recall

Recall measures how many actual positives were detected.

Formula:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Example:

$$\text{Recall} = \frac{80}{80 + 20} = 0.8$$

Recall: 80%

Why Recall Matters

Recall is important when missing a positive case is dangerous.

Examples:

  • Cancer Detection
  • Fraud Detection
  • Security Systems

Precision vs Recall

Metric    | Focus
Precision | Minimize False Positives
Recall    | Minimize False Negatives

F1 Score

F1 Score combines Precision and Recall.

Formula:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Advantages:

  • Balances precision and recall
  • More informative than accuracy on imbalanced datasets
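As a worked check, the metrics can be computed directly from the example confusion matrix shown earlier (TP = 80, FN = 20, FP = 50, TN = 9,850). This is a small sketch in plain Python; the variable names are chosen here for illustration.

```python
# Counts taken from the example fraud confusion matrix.
tp, fn = 80, 20     # actual fraud: caught / missed
fp, tn = 50, 9850   # actual non-fraud: wrongly flagged / correctly passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision = {precision:.3f}")  # 0.615
print(f"recall    = {recall:.3f}")     # 0.800
print(f"f1        = {f1:.3f}")         # 0.696
print(f"accuracy  = {accuracy:.3f}")   # 0.993
```

Accuracy is 99.3% even though more than a third of the fraud alerts are false alarms and one in five frauds is missed, which is why F1 gives a more honest summary here.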

Why Accuracy is Not Enough

Consider:

Metric   | Value
Accuracy | 99%
Recall   | 0%

The model completely misses the minority class.

Accuracy hides this failure.

Detecting Class Imbalance

Python:

df["Target"].value_counts()

Output:

0    9500
1     500

Percentage:

df["Target"].value_counts(normalize=True)

Output:

0    0.95
1    0.05
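The same check can be sketched without pandas, which also makes it easy to derive a single imbalance ratio. The `labels` list below is a hypothetical stand-in for the `Target` column, built to match the counts above.

```python
from collections import Counter

# Hypothetical label column matching the counts above.
labels = [0] * 9500 + [1] * 500

counts = Counter(labels)
total = sum(counts.values())
proportions = {cls: n / total for cls, n in counts.items()}

# Imbalance ratio: how many majority samples exist per minority sample.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)           # Counter({0: 9500, 1: 500})
print(proportions)      # {0: 0.95, 1: 0.05}
print(imbalance_ratio)  # 19.0
```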

Visualizing Class Imbalance

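import matplotlib.pyplot as plt
import seaborn as sns

# One bar per class in the "Target" column; bar height is the row count
sns.countplot(x="Target", data=df)
plt.show()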

Bars of very different heights indicate class imbalance.

Solutions for Imbalanced Datasets

Common approaches include:

  1. Oversampling
  2. Undersampling
  3. Synthetic Sampling
  4. Cost-Sensitive Learning
  5. Ensemble Methods

Oversampling

Oversampling increases minority class samples.

Example:

Before:

Class    | Samples
Majority | 900
Minority | 100

After Oversampling:

Class    | Samples
Majority | 900
Minority | 900

Random Oversampling

Random oversampling duplicates randomly chosen minority samples until the classes are balanced.

Python:

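from imblearn.over_sampling import RandomOverSampler

# random_state makes the random duplication reproducible
ros = RandomOverSampler(random_state=42)

# Returns the original rows plus randomly duplicated minority rows
X_resampled, y_resampled = ros.fit_resample(X, y)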

Advantages of Oversampling

  • Simple
  • Retains all data

Disadvantages of Oversampling

  • Risk of overfitting
  • Duplicate observations
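To make the mechanics and the duplicate problem concrete, here is a minimal pure-Python sketch of random oversampling for a binary problem. The function `random_oversample` and the toy data are assumptions made for illustration; it is not how imbalanced-learn implements `RandomOverSampler`.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_count = len(y) - len(minority_idx)
    # Draw with replacement: the same minority row can be copied many times.
    extra = rng.choices(minority_idx, k=majority_count - len(minority_idx))
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 900 majority rows and 100 minority rows with one feature each.
X = [[float(i)] for i in range(1000)]
y = [0] * 900 + [1] * 100

X_res, y_res = random_oversample(X, y, minority_label=1)
distinct_minority = len({tuple(row) for row, label in zip(X_res, y_res) if label == 1})

print(y_res.count(0), y_res.count(1))  # 900 900
print(distinct_minority)               # 100
```

The minority class now has 900 rows but still only 100 distinct ones, so the model sees the same points repeatedly. That repetition is the source of the overfitting risk listed above.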

Undersampling

Undersampling reduces majority class samples.

Example:

Before:

Class    | Samples
Majority | 9,000
Minority | 1,000

After:

Class    | Samples
Majority | 1,000
Minority | 1,000

Random Undersampling

Python:

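from imblearn.under_sampling import RandomUnderSampler

# random_state makes the random selection reproducible
rus = RandomUnderSampler(random_state=42)

# Returns all minority rows plus a random subset of the majority rows
X_resampled, y_resampled = rus.fit_resample(X, y)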

Advantages of Undersampling

  • Faster training
  • Smaller dataset

Disadvantages of Undersampling

  • Information loss
  • Potential underfitting
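The information loss is easy to see in a minimal pure-Python sketch of random undersampling. The function `random_undersample` and the toy data are illustrative assumptions, not the imbalanced-learn implementation.

```python
import random

def random_undersample(X, y, majority_label, seed=0):
    """Keep every minority row and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    # Sample without replacement so that no majority row is repeated.
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(kept_majority + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 9,000 majority rows and 1,000 minority rows with one feature each.
X = [[float(i)] for i in range(10_000)]
y = [0] * 9000 + [1] * 1000

X_res, y_res = random_undersample(X, y, majority_label=0)
discarded = len(y) - len(y_res)

print(y_res.count(0), y_res.count(1))  # 1000 1000
print(discarded)                       # 8000
```

Eight thousand of the nine thousand majority rows never reach the model, so undersampling fits best when the majority class is large and somewhat redundant.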

What is SMOTE?

SMOTE stands for Synthetic Minority Over-sampling Technique.

Instead of duplicating minority samples, SMOTE generates new synthetic samples by interpolating between existing minority samples.

How SMOTE Works

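Suppose we have this minority sample:

Age | Income
25  | 50000

One of its nearest minority-class neighbors:

Age | Income
28  | 55000

SMOTE creates a new point on the line segment between the two. The position along the segment is chosen at random for each synthetic sample; at the midpoint the result is:

Age  | Income
26.5 | 52500

Synthetic samples improve diversity compared with exact duplicates.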
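The interpolation step can be sketched as follows. The helper `interpolate` and the factor name `lam` are assumptions made for this example; a full SMOTE implementation also has to find the nearest minority neighbors, which is left out here.

```python
import random

def interpolate(sample, neighbor, lam):
    """Return the point that lies a fraction lam of the way from sample to neighbor."""
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

sample = [25.0, 50000.0]    # Age, Income of the minority sample
neighbor = [28.0, 55000.0]  # Age, Income of its minority neighbor

# lam = 0.5 reproduces the midpoint from the worked example.
midpoint = interpolate(sample, neighbor, lam=0.5)
print(midpoint)  # [26.5, 52500.0]

# In SMOTE the factor is drawn uniformly at random for each new sample.
rng = random.Random(0)
synthetic = interpolate(sample, neighbor, lam=rng.random())
print(synthetic)
```

Every synthetic point lies between two real minority points, which is why SMOTE adds variety but can also place points in regions where the classes overlap.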

SMOTE Example

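from imblearn.over_sampling import SMOTE

# random_state makes the synthetic samples reproducible
smote = SMOTE(random_state=42)

# Returns the original rows plus synthetic minority rows
X_resampled, y_resampled = smote.fit_resample(X, y)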

Advantages of SMOTE

  • Less prone to overfitting than exact duplication
  • Creates new, plausible samples instead of copies
  • Widely used

Disadvantages of SMOTE

  • May create noisy observations
  • Less effective when classes overlap heavily or the data is high-dimensional

ADASYN

ADASYN (Adaptive Synthetic Sampling) is an extension of SMOTE.

It generates more synthetic samples in difficult regions, that is, around minority samples whose nearest neighbors mostly belong to the majority class.

Advantages:

  • Better handling of hard-to-learn cases
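The adaptive part of ADASYN can be sketched as an allocation step: each minority sample receives a share of the synthetic samples in proportion to the fraction of majority points among its k nearest neighbors. The function `adasyn_allocation`, the one-dimensional toy data, and the rounding are assumptions made for illustration; generating the samples themselves would then use SMOTE-style interpolation.

```python
def adasyn_allocation(X, y, minority_label, k, n_synthetic):
    """Return {minority row index: number of synthetic samples to generate for it}."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    ratios = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # k nearest neighbors of xi among all other points, regardless of class.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: dist2(xi, X[j]))
        majority_neighbors = sum(y[j] != minority_label for j in others[:k])
        ratios[i] = majority_neighbors / k

    total = sum(ratios.values())
    if total == 0:
        return {i: 0 for i in ratios}  # no minority point is near the majority
    # Rounded shares; in general they may not add up exactly to n_synthetic.
    return {i: round(n_synthetic * r / total) for i, r in ratios.items()}

# Majority points at 0..9; minority points inside the majority (4.5),
# at its edge (11.0, 11.5), and in a separate cluster (20..23).
X = [[float(v)] for v in range(10)] + [[4.5], [11.0], [11.5], [20.0], [21.0], [22.0], [23.0]]
y = [0] * 10 + [1] * 7

allocation = adasyn_allocation(X, y, minority_label=1, k=3, n_synthetic=7)
print(allocation)  # {10: 3, 11: 2, 12: 2, 13: 0, 14: 0, 15: 0, 16: 0}
```

The isolated minority cluster gets no synthetic samples, while the point surrounded by majority samples gets the most.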

Class Weighting

Many algorithms allow assigning higher importance to minority classes.

Example:

from sklearn.linear_model import LogisticRegression

# "balanced" weights each class inversely to its frequency in the training labels
model = LogisticRegression(class_weight="balanced")

The algorithm penalizes mistakes on minority classes more heavily.
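In scikit-learn, the `"balanced"` setting is documented as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the 950/50 label list are made up for this example.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

y = [0] * 950 + [1] * 50
weights = balanced_class_weights(y)

print(weights[0])  # about 0.526
print(weights[1])  # 10.0
```

With these weights, a mistake on one minority sample costs the model about 19 times as much as a mistake on one majority sample, which matches the 19:1 imbalance.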

Cost-Sensitive Learning

Different mistakes have different costs.

Example (fraud detection):

Missing a fraudulent transaction is usually more expensive than investigating a legitimate one.

Cost-sensitive learning incorporates these penalties.
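One simple way to reason about such penalties is to attach a cost to each error type and compare models by total cost. The cost values, the two error profiles, and the function names below are hypothetical. The threshold rule assumes calibrated probabilities and zero cost for correct decisions.

```python
# Hypothetical costs in arbitrary currency units (assumptions, not from the article).
COST_FALSE_NEGATIVE = 500.0  # a fraud that goes undetected
COST_FALSE_POSITIVE = 5.0    # a legitimate transaction that gets investigated

def total_cost(fp, fn):
    """Total cost of a model's errors under the costs above."""
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

def cost_threshold(cost_fp, cost_fn):
    """Fraud probability above which flagging is cheaper in expectation than not flagging."""
    return cost_fp / (cost_fp + cost_fn)

# Two hypothetical models evaluated on the same test set.
cost_a = total_cost(fp=50, fn=20)   # few false alarms, more missed frauds
cost_b = total_cost(fp=300, fn=5)   # many false alarms, few missed frauds

threshold = cost_threshold(COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)

print(cost_a)               # 10250.0
print(cost_b)               # 4000.0
print(round(threshold, 4))  # 0.0099
```

Model B makes six times as many false alarms yet is far cheaper overall. Under these costs, a transaction is worth investigating once its estimated fraud probability exceeds roughly 1%, well below the usual 0.5 cut-off.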

Ensemble Methods for Imbalanced Data

Popular approaches:

  • Balanced Random Forest
  • EasyEnsemble
  • XGBoost with class weights

These methods often perform very well on imbalanced datasets.

Stratified Train-Test Split

When splitting imbalanced datasets, class proportions should be preserved.

Incorrect:

A purely random split may distort the class distribution. With very few minority samples, the test set can end up with almost none of them.

Correct:

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2,
    random_state=42,
)
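What stratification does can be sketched in plain Python: split each class separately, then combine the pieces. The function `stratified_split` is a simplified illustration written for this example, not the scikit-learn implementation.

```python
import random
from collections import defaultdict

def stratified_split(y, test_size=0.2, seed=0):
    """Return (train indices, test indices) with the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

y = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(y)

train_minority_share = sum(y[i] for i in train_idx) / len(train_idx)
test_minority_share = sum(y[i] for i in test_idx) / len(test_idx)

print(len(train_idx), len(test_idx))              # 800 200
print(train_minority_share, test_minority_share)  # 0.05 0.05
```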

Imbalanced Data and Cross Validation

Use Stratified K-Fold.

Benefits:

  • Preserves class distribution
  • Produces reliable evaluation

Example:

from sklearn.model_selection import StratifiedKFold

# Each of the 5 folds keeps roughly the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5)
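The idea behind stratified folds can be illustrated by dealing the shuffled indices of each class round-robin across the folds. The function `stratified_folds` is a simplified sketch made up for this example; scikit-learn's algorithm differs in its details.

```python
import random
from collections import defaultdict

def stratified_folds(y, n_splits=5, seed=0):
    """Assign every index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class like cards, one fold at a time.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

y = [0] * 90 + [1] * 10
folds = stratified_folds(y)

fold_sizes = [len(fold) for fold in folds]
minority_per_fold = [sum(y[i] for i in fold) for fold in folds]

print(fold_sizes)         # [20, 20, 20, 20, 20]
print(minority_per_fold)  # [2, 2, 2, 2, 2]
```

Each fold serves once as the validation set, so every validation score is computed on data that contains minority samples in the original proportion.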

Evaluation Metrics for Imbalanced Data

Metric    | Recommended
Accuracy  | No (misleading on its own)
Precision | Yes
Recall    | Yes
F1 Score  | Yes
ROC-AUC   | Yes (can look optimistic under heavy imbalance)
PR-AUC    | Yes

ROC Curve

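The ROC curve shows classification performance across all decision thresholds.

Axes:

  • True Positive Rate (y-axis): the same quantity as Recall
  • False Positive Rate (x-axis): FP / (FP + TN)

A larger area under the curve (AUC) indicates better performance; 0.5 corresponds to random guessing and 1.0 to a perfect ranking.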
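The area under the ROC curve has a useful interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example; the function `roc_auc` and the scores are assumptions for illustration, and the pairwise loop is only practical for small data.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and model scores: four negatives, two positives.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]

auc = roc_auc(y_true, scores)
print(auc)  # 0.875
```

Seven of the eight positive-negative pairs are ordered correctly; the one error is the negative scored 0.7 outranking the positive scored 0.6.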

Precision-Recall Curve

For highly imbalanced datasets, Precision-Recall curves are often more informative than ROC curves.

They focus on the positive (usually minority) class, because neither Precision nor Recall uses the true negatives that dominate an imbalanced dataset.
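A short calculation shows why. It takes the True Positive Rate and False Positive Rate from the example fraud confusion matrix (TP = 80, FN = 20, FP = 50, TN = 9,850) and keeps them fixed while the positive class becomes rarer. The second scenario, with 0.1% positives, and the helper `precision_at` are hypothetical.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a given TPR and FPR on a dataset with n_pos and n_neg samples."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp)

# Rates from the example confusion matrix.
tpr = 80 / (80 + 20)    # 0.8
fpr = 50 / (50 + 9850)  # about 0.005

# Same classifier quality in ROC terms, two different class balances.
precision_1pct = precision_at(tpr, fpr, n_pos=100, n_neg=9900)  # 1% positives
precision_01pct = precision_at(tpr, fpr, n_pos=10, n_neg=9990)  # 0.1% positives (hypothetical)

print(f"{precision_1pct:.3f}")   # 0.615
print(f"{precision_01pct:.3f}")  # 0.137
```

The point on the ROC curve is identical in both cases, but precision falls from about 62% to about 14%. The Precision-Recall curve exposes this drop; the ROC curve does not.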

Real-World Applications

Industry      | Minority Class
Banking       | Fraud Transactions
Healthcare    | Disease Cases
Cybersecurity | Intrusions
Manufacturing | Defects
Insurance     | Fraud Claims

Common Mistakes

Using Accuracy as Primary Metric

Accuracy often hides poor minority class performance.

Applying SMOTE Before Train-Test Split

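Incorrect:

  1. Apply SMOTE
  2. Train-Test Split

This causes data leakage: synthetic samples derived from rows that later land in the test set end up in the training data, so test scores look better than they really are.

Correct:

  1. Train-Test Split
  2. Apply SMOTE only on the training data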
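The leakage can be demonstrated with a toy experiment. To keep the sketch free of dependencies it uses plain duplication as a stand-in for SMOTE, and integer ids as a stand-in for feature vectors; both are simplifying assumptions. With SMOTE the leaked points are interpolated near-copies instead of exact ones, but the effect on the test score is of the same kind.

```python
import random

rng = random.Random(0)

# 20 distinct minority rows, represented by ids (a toy stand-in for feature vectors).
minority_rows = list(range(20))

def oversample(rows, target_size):
    """Pad rows with random duplicates until there are target_size of them."""
    return rows + rng.choices(rows, k=target_size - len(rows))

def split(rows, test_fraction=0.2):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Incorrect order: resample first, then split.
train_bad, test_bad = split(oversample(minority_rows, 200))
train_bad_ids = set(train_bad)
leaked = [row for row in test_bad if row in train_bad_ids]

# Correct order: split first, then resample only the training part.
train_raw, test_ok = split(minority_rows)
train_ok = oversample(train_raw, 160)
overlap_ok = set(test_ok) & set(train_ok)

print(len(leaked), "of", len(test_bad), "test rows also appear in training (incorrect order)")
print(len(overlap_ok), "test rows appear in training (correct order)")
```

In the incorrect order the model is tested on rows it has already seen during training, while the correct order keeps the test set untouched by resampling.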

Ignoring Business Requirements

Different applications prioritize different metrics.

Example:

Cancer detection should prioritize Recall over Precision.

Best Practices

  • Analyze class distribution first
  • Use confusion matrices
  • Prefer Precision, Recall, and F1 Score
  • Apply Stratified Splitting
  • Use SMOTE carefully
  • Consider class weighting
  • Evaluate business impact of errors

Imbalanced Dataset Workflow

A typical workflow is:

  1. Analyze class distribution
  2. Perform train-test split
  3. Apply resampling only on training data
  4. Train model
  5. Evaluate using Precision, Recall, and F1 Score
  6. Compare multiple balancing techniques
  7. Deploy best-performing model

Why Imbalanced Datasets Matter

Many of the most important Machine Learning applications involve rare events. Fraud, disease diagnosis, cybersecurity attacks, equipment failures, and manufacturing defects often represent only a tiny fraction of the data, yet they are precisely the cases we care about most.

Understanding how to identify, evaluate, and handle imbalanced datasets is essential for building reliable Machine Learning systems that perform well not only on common cases but also on the rare events that often have the greatest real-world impact.