Imbalanced Datasets in Machine Learning

Last updated: Jun 11, 2026

Author: Christy Harshitha Dakarapu

In many real-world Machine Learning problems, some classes appear much more frequently than others. This phenomenon is known as Class Imbalance, and datasets that exhibit it are called Imbalanced Datasets.

Imbalanced datasets are a common challenge in Machine Learning: standard algorithms tend to favor the majority class and can largely ignore the minority class.

Examples include:

Fraud Detection
Disease Diagnosis
Network Intrusion Detection
Credit Default Prediction
Manufacturing Defect Detection
Spam Detection

In these applications, the minority class is often the most important class, even though it represents only a small fraction of the dataset.

In this article, we will explore imbalanced datasets, understand why they are problematic, learn evaluation metrics beyond accuracy, and study techniques such as oversampling, undersampling, and SMOTE.

What is an Imbalanced Dataset?

A dataset is considered imbalanced when one class contains significantly more samples than another.

Example:

Class	Samples
Non-Fraud	9,900
Fraud	100

Fraudulent transactions represent only:

$\frac{100}{10000}=1\%$

of the dataset.

This creates a severe imbalance.

Balanced vs Imbalanced Dataset

Balanced Dataset:

Class	Samples
Yes	500
No	500

Imbalanced Dataset:

Class	Samples
Yes	950
No	50

Many real-world datasets are imbalanced to some degree.

Why Imbalanced Data is a Problem

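By default, most Machine Learning models are trained to minimize the average error over all samples, so every sample counts equally and the majority class dominates the objective.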

In an imbalanced dataset, predicting the majority class all the time may still produce high accuracy.

Example:

Dataset:

Class	Samples
Healthy	990
Diseased	10

Suppose a model predicts:

Healthy for every patient.

Accuracy Calculation

Correct Predictions:

990

Total Predictions:

1000

Accuracy:

$Accuracy=\frac{990}{1000}=99\%$

The model achieves 99% accuracy.

However:

It never identifies diseased patients.

This model is practically useless.
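The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch: the label encoding (0 for Healthy, 1 for Diseased) and the variable names are assumptions made for illustration.

```python
# Ground truth from the example above: 990 healthy (0) and 10 diseased (1) patients.
y_true = [0] * 990 + [1] * 10

# A "model" that predicts Healthy for every patient.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Detection rate on the diseased class: detected diseased / all diseased.
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
detection_rate = detected / sum(y_true)

print(f"accuracy       = {accuracy:.2%}")        # 99.00%
print(f"detection rate = {detection_rate:.2%}")  # 0.00%
```

The same model scores 99% on one number and 0% on the other, which is why accuracy alone is misleading here.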

The Accuracy Paradox

The Accuracy Paradox occurs when a model achieves high accuracy but performs poorly on the minority class.

Example:

Metric	Value
Accuracy	99%
Fraud Detection Rate	0%

The model appears excellent but completely fails its actual purpose.

Majority Class and Minority Class

In imbalanced datasets:

Majority Class:

Most observations

Minority Class:

Few observations

Example:

Class	Type
Non-Fraud	Majority
Fraud	Minority

Usually, the minority class is the class of interest.

Real-World Examples

Fraud Detection

Class	Percentage
Legitimate	99.8%
Fraudulent	0.2%

Disease Detection

Class	Percentage
Healthy	95%
Diseased	5%

Manufacturing

Class	Percentage
Good Product	99%
Defective Product	1%

Spam Detection

Class	Percentage
Normal Email	90%
Spam Email	10%

Understanding the Confusion Matrix

For imbalanced datasets, confusion matrices become essential.

Example:

	Predicted Positive	Predicted Negative
Actual Positive	TP	FN
Actual Negative	FP	TN

Where:

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

Example Confusion Matrix

	Predicted Fraud	Predicted Not Fraud
Actual Fraud	80	20
Actual Not Fraud	50	9850

This matrix provides far more information than accuracy alone.
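As a rough sketch, the four cells can be counted directly from true and predicted labels. The helper name `confusion_counts` and the label lists are hypothetical; the lists are arranged only so that they reproduce the example matrix (fraud encoded as 1).

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary classification problem."""
    # Each pair is (actually positive?, predicted positive?).
    counts = Counter((t == positive, p == positive) for t, p in zip(y_true, y_pred))
    tp = counts[(True, True)]
    fn = counts[(True, False)]
    fp = counts[(False, True)]
    tn = counts[(False, False)]
    return tp, fn, fp, tn


# Hypothetical labels: 100 actual frauds followed by 9,900 legitimate transactions.
y_true = [1] * 100 + [0] * 9900
y_pred = [1] * 80 + [0] * 20 + [1] * 50 + [0] * 9850

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 80 20 50 9850
```

In practice a library routine such as scikit-learn's `confusion_matrix` would normally be used; the manual version is shown only to make the definitions concrete.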

Precision

Precision measures how many predicted positives are actually correct.

Formula:

$Precision=\frac{TP}{TP+FP}$

Example:

$Precision=\frac{80}{80+50}\approx 0.615$

Precision:

61.5%

Why Precision Matters

Precision is important when false positives are costly.

Examples:

Spam Detection
Fraud Detection

Recall

Recall measures how many actual positives were detected.

Formula:

$Recall=\frac{TP}{TP+FN}$

Example:

$Recall=\frac{80}{80+20}=0.8$

Recall:

80%

Why Recall Matters

Recall is important when missing a positive case is dangerous.

Examples:

Cancer Detection
Fraud Detection
Security Systems

Precision vs Recall

Metric	Focus
Precision	Minimize False Positives
Recall	Minimize False Negatives

F1 Score

F1 Score combines Precision and Recall.

Formula:

$F1=2\times\frac{Precision\times Recall}{Precision+Recall}$

Advantages:

Balances precision and recall
More informative than accuracy on imbalanced datasets
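Applying the three formulas to the example confusion matrix (TP = 80, FN = 20, FP = 50, TN = 9850) gives the following. This is a small worked sketch; the variable names are chosen for illustration.

```python
# Cell counts from the example confusion matrix.
tp, fn, fp, tn = 80, 20, 50, 9850

precision = tp / (tp + fp)                            # 80 / 130
recall = tp / (tp + fn)                               # 80 / 100
f1 = 2 * precision * recall / (precision + recall)    # equals 2*TP / (2*TP + FP + FN)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.615 recall=0.800 f1=0.696 accuracy=0.993
```

Accuracy is above 99% while F1 is below 0.70, because F1 ignores the 9,850 true negatives that dominate the accuracy figure.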

Why Accuracy is Not Enough

Consider:

Metric	Value
Accuracy	99%
Recall	0%

The model completely misses the minority class.

Accuracy hides this failure.

Detecting Class Imbalance

Python:


# df is assumed to be a pandas DataFrame whose label column is named "Target"
df["Target"].value_counts()

Output:


0    9500
1     500

Percentage:


df["Target"].value_counts(normalize=True)

Output:


0    0.95
1    0.05
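When the labels are held in a plain Python list rather than a DataFrame, the same check can be sketched with the standard library. The `labels` list below is hypothetical and mirrors the counts shown above; the "imbalance ratio" (majority count divided by minority count) is a common summary figure, not something the pandas calls compute for you.

```python
from collections import Counter

# Hypothetical labels matching the output above: 9,500 zeros and 500 ones.
labels = [0] * 9500 + [1] * 500

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, equivalent to value_counts(normalize=True).
shares = {cls: n / total for cls, n in counts.items()}

# Majority count divided by minority count.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)           # {0: 0.95, 1: 0.05}
print(imbalance_ratio)  # 19.0
```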

Visualizing Class Imbalance


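import matplotlib.pyplot as plt
import seaborn as sns

# One bar per class; bar height is the number of samples in that class
sns.countplot(x="Target", data=df)
plt.show()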

Bars of very different heights indicate class imbalance.

Solutions for Imbalanced Datasets

Common approaches include:

Oversampling
Undersampling
Synthetic Sampling
Cost-Sensitive Learning
Ensemble Methods

Oversampling

Oversampling increases minority class samples.

Example:

Before:

Class	Samples
Majority	900
Minority	100

After Oversampling:

Class	Samples
Majority	900
Minority	900

Random Oversampling

Duplicates minority samples.

Python:


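from imblearn.over_sampling import RandomOverSampler

# Fix random_state so the duplicated rows are reproducible
ros = RandomOverSampler(random_state=42)

# By default, minority rows are duplicated until the classes are equal in size
X_resampled, y_resampled = ros.fit_resample(
    X,
    y
)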
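To show what random oversampling does conceptually, here is a minimal standard-library sketch. It is not the imbalanced-learn implementation: the function `random_oversample` is hypothetical, handles only the binary case, and the toy data is made up to match the 900/100 example.

```python
import random


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_majority = len(y) - len(minority_idx)
    n_extra = n_majority - len(minority_idx)
    # Sampling WITH replacement: the same minority row may be copied many times.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 900 majority rows (label 0) and 100 minority rows (label 1).
X = [[i] for i in range(1000)]
y = [0] * 900 + [1] * 100

X_res, y_res = random_oversample(X, y)
print(y_res.count(0), y_res.count(1))  # 900 900
```

Every added row is an exact copy of an existing minority row, which is the source of the overfitting risk listed below.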

Advantages of Oversampling

Simple
Retains all data

Disadvantages of Oversampling

Risk of overfitting
Duplicate observations

Undersampling

Undersampling reduces majority class samples.

Example:

Before:

Class	Samples
Majority	9000
Minority	1000

After:

Class	Samples
Majority	1000
Minority	1000

Random Undersampling

Python:


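from imblearn.under_sampling import RandomUnderSampler

# Fix random_state so the same majority rows are dropped on every run
rus = RandomUnderSampler(random_state=42)

# By default, majority rows are removed until the classes are equal in size
X_resampled, y_resampled = rus.fit_resample(
    X,
    y
)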
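Conceptually, random undersampling keeps every minority row and a random subset of majority rows. The sketch below uses only the standard library; `random_undersample` is a hypothetical helper for the binary case, not the library's implementation.

```python
import random


def random_undersample(X, y, majority=0, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority]
    minority_idx = [i for i, label in enumerate(y) if label != majority]
    # Sampling WITHOUT replacement: each kept majority row appears once.
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(kept_majority + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 9,000 majority rows (label 0) and 1,000 minority rows (label 1).
X = [[i] for i in range(10000)]
y = [0] * 9000 + [1] * 1000

X_res, y_res = random_undersample(X, y)
print(y_res.count(0), y_res.count(1))  # 1000 1000
```

In this example 8,000 of the 9,000 majority rows are discarded, which illustrates the information loss listed below.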

Advantages of Undersampling

Faster training
Smaller dataset

Disadvantages of Undersampling

Information loss
Potential underfitting

What is SMOTE?

SMOTE stands for:

Synthetic Minority Over-sampling Technique

Instead of duplicating minority samples, SMOTE generates new synthetic samples.

How SMOTE Works

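Suppose a minority sample is:

Age	Income
25	50000

One of its nearest minority-class neighbors is:

Age	Income
28	55000

SMOTE creates a new sample at a random point on the line segment between the two, for example at the midpoint:

Age	Income
26.5	52500

Because the synthetic samples are new points rather than copies, they add variety to the minority class.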
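The interpolation step can be sketched as `new = sample + lam * (neighbor - sample)`, where `lam` is drawn uniformly from [0, 1]. The helper name `interpolate` is hypothetical, and the sketch covers only this one step; the full algorithm also searches for the k nearest minority neighbors and picks one of them at random.

```python
import random


def interpolate(sample, neighbor, lam):
    """Return sample + lam * (neighbor - sample), with 0 <= lam <= 1."""
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


sample = [25, 50000]    # (Age, Income) of a minority sample
neighbor = [28, 55000]  # one of its nearest minority-class neighbors

# lam = 0.5 reproduces the midpoint shown above.
print(interpolate(sample, neighbor, 0.5))  # [26.5, 52500.0]

# SMOTE draws lam at random, so each synthetic point lands somewhere on the segment.
rng = random.Random(0)
synthetic = interpolate(sample, neighbor, rng.random())
```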

SMOTE Example


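from imblearn.over_sampling import SMOTE

# Fix random_state so the synthetic samples are reproducible
smote = SMOTE(random_state=42)

# Synthetic minority rows are added until the classes are equal in size
X_resampled, y_resampled = smote.fit_resample(
    X,
    y
)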

Advantages of SMOTE

Less prone to overfitting than plain duplication
Creates plausible samples by interpolating between real ones
Widely used

Disadvantages of SMOTE

May create noisy observations
Less effective for complex distributions

ADASYN

ADASYN is an extension of SMOTE.

Full Form:

Adaptive Synthetic Sampling

It generates more synthetic samples for minority points that are surrounded by majority-class neighbors, that is, in regions that are harder to learn.

Advantages:

Better handling of hard-to-learn cases

Class Weighting

Many algorithms allow assigning higher importance to minority classes.

Example:


from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    class_weight="balanced"
)

With class_weight="balanced", each class is weighted inversely to its frequency, so mistakes on the minority class are penalized more heavily.
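The scikit-learn documentation describes the "balanced" heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below recomputes that formula with the standard library; the helper name `balanced_class_weights` and the 900/100 label list are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * samples_in_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical labels: 900 majority (0) and 100 minority (1).
y = [0] * 900 + [1] * 100

weights = balanced_class_weights(y)
print({cls: round(w, 3) for cls, w in weights.items()})  # {0: 0.556, 1: 5.0}
```

Here one minority error weighs as much as nine majority errors, which matches the 9:1 class ratio.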

Cost-Sensitive Learning

Different mistakes have different costs.

Example:

Fraud Detection:

Missing a fraud transaction is more expensive than investigating a legitimate transaction.

Cost-sensitive learning incorporates these penalties.
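One simple way to make the idea concrete is to attach a cost to each error type and compare models by total cost instead of accuracy. The two cost values below are hypothetical and chosen only for illustration; real costs come from the business.

```python
# Hypothetical costs per error, chosen only for illustration.
COST_FALSE_NEGATIVE = 500  # a fraud that was missed
COST_FALSE_POSITIVE = 5    # a legitimate transaction that was investigated


def total_cost(fn, fp):
    """Total cost of a model's errors given its false negatives and false positives."""
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


# Model A flags nothing: all 100 frauds are missed (accuracy 99.0% on 10,000 rows).
cost_a = total_cost(fn=100, fp=0)

# Model B misses 20 frauds and raises 50 false alarms (accuracy 99.3%).
cost_b = total_cost(fn=20, fp=50)

print(cost_a, cost_b)  # 50000 10250
```

The accuracies differ by only 0.3 percentage points, but under these assumed costs Model A is almost five times as expensive.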

Ensemble Methods for Imbalanced Data

Popular approaches:

Balanced Random Forest
EasyEnsemble
XGBoost with class weights

These methods often perform very well on imbalanced datasets.

Stratified Train-Test Split

When splitting imbalanced datasets, class proportions should be preserved.

Incorrect:

A purely random split may distort the class distribution, especially when the minority class is small.

Correct:


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2
)
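To make visible what stratification does, the sketch below splits each class separately with the standard library. The helper `stratified_split` is hypothetical and simplified; it is not how `train_test_split` is implemented, only an illustration of the idea.

```python
import random
from collections import defaultdict


def stratified_split(y, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 950 majority (0) and 50 minority (1), i.e. 5% minority.
y = [0] * 950 + [1] * 50

train_idx, test_idx = stratified_split(y)
print(len(train_idx), sum(y[i] for i in train_idx))  # 800 40
print(len(test_idx), sum(y[i] for i in test_idx))    # 200 10
```

Both parts keep the original 5% minority share, whereas an unstratified split of such a small minority could put noticeably more or fewer than 10 minority rows in the test set.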

Imbalanced Data and Cross Validation

Use Stratified K-Fold.

Benefits:

Preserves class distribution
Produces reliable evaluation

Example:


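from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(
    n_splits=5
)

# Each fold keeps roughly the same class proportions as y
for train_idx, test_idx in skf.split(X, y):
    print(len(train_idx), len(test_idx))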

Evaluation Metrics for Imbalanced Data

Metric	Recommended
Accuracy	No
Precision	Yes
Recall	Yes
F1 Score	Yes
ROC-AUC	Yes
PR-AUC	Yes

ROC Curve

The ROC curve shows classification performance across all decision thresholds.

Axes:

True Positive Rate (y-axis)
False Positive Rate (x-axis)

A larger area under the curve (ROC-AUC) indicates better performance: 0.5 corresponds to random guessing and 1.0 to a perfect ranking.
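The curve can be traced by hand: lower the threshold step by step, record the false positive rate and true positive rate at each step, and sum the area with the trapezoid rule. The labels, scores, and helper names below are hypothetical; library routines such as scikit-learn's `roc_curve` and `roc_auc_score` would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strict to lenient."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical true labels and predicted scores for six samples.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(y_true, scores)
print(round(auc(points), 3))  # 0.889
```

The result 8/9 is also the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is the usual interpretation of ROC-AUC.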

Precision-Recall Curve

For highly imbalanced datasets, Precision-Recall curves are often more informative than ROC curves.

They focus on the positive (minority) class and do not use true negatives, so a large majority class cannot inflate the result.
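A short calculation illustrates the difference. Take a hypothetical classifier that, at some threshold, has a true positive rate of 0.80 and a false positive rate of 0.05. That point sits at the same place on the ROC curve regardless of class balance, but the precision it implies depends strongly on how rare the positive class is. The helper `precision_at` and all numbers are assumptions for illustration.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a (TPR, FPR) operating point and the class counts."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp)


# Same operating point (TPR = 0.80, FPR = 0.05) on two class distributions.
balanced = precision_at(0.80, 0.05, n_pos=500, n_neg=500)
imbalanced = precision_at(0.80, 0.05, n_pos=100, n_neg=9900)

print(round(balanced, 3), round(imbalanced, 3))  # 0.941 0.139
```

On the imbalanced data, roughly six out of seven alerts are false alarms even though the ROC point is unchanged, and the Precision-Recall curve makes that visible.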

Real-World Applications

Industry	Minority Class
Banking	Fraud Transactions
Healthcare	Disease Cases
Cybersecurity	Intrusions
Manufacturing	Defects
Insurance	Fraud Claims

Common Mistakes

Using Accuracy as Primary Metric

Accuracy often hides poor minority class performance.

Applying SMOTE Before Train-Test Split

Incorrect:


Apply SMOTE
      ↓
Train-Test Split

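This causes data leakage: synthetic samples derived from rows that later land in the test set influence training, so test scores look better than they should.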

Correct:


Train-Test Split
      ↓
Apply SMOTE Only on Training Data
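The correct order can be sketched as follows. To keep the example dependency-free, it swaps SMOTE for plain random duplication and works on row ids instead of feature vectors; the point is only the ordering of the two steps. All data here is hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical dataset: row ids 0..999, where the last 50 rows are the minority class.
labels = [0] * 950 + [1] * 50
majority = [i for i, label in enumerate(labels) if label == 0]
minority = [i for i, label in enumerate(labels) if label == 1]
rng.shuffle(majority)
rng.shuffle(minority)

# Step 1: stratified 80/20 split, done BEFORE any resampling.
test_rows = majority[:190] + minority[:10]
train_majority, train_minority = majority[190:], minority[10:]

# Step 2: oversample the minority class using training rows only.
n_extra = len(train_majority) - len(train_minority)
extra = [rng.choice(train_minority) for _ in range(n_extra)]
train_rows = train_majority + train_minority + extra

# The test set keeps its original 5% minority share and shares no rows with training.
print(len(train_rows), len(test_rows))  # 1520 200
print(set(train_rows) & set(test_rows))  # set()
```

If the resampling step came first, copies of the same minority row (or, with SMOTE, points interpolated from it) could end up on both sides of the split.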

Ignoring Business Requirements

Different applications prioritize different metrics.

Example:

Cancer detection usually prioritizes Recall over Precision, because a missed case costs far more than a false alarm.

Best Practices

Analyze class distribution first
Use confusion matrices
Prefer Precision, Recall, and F1 Score
Apply Stratified Splitting
Use SMOTE carefully
Consider class weighting
Evaluate business impact of errors

Imbalanced Dataset Workflow

A typical workflow is:

Analyze class distribution
Perform train-test split
Apply resampling only on training data
Train model
Evaluate using Precision, Recall, and F1 Score
Compare multiple balancing techniques
Deploy best-performing model

Why Imbalanced Datasets Matter

Many of the most important Machine Learning applications involve rare events. Fraud, disease diagnosis, cybersecurity attacks, equipment failures, and manufacturing defects often represent only a tiny fraction of the data, yet they are precisely the cases we care about most.

Understanding how to identify, evaluate, and handle imbalanced datasets is essential for building reliable Machine Learning systems that perform well not only on common cases but also on the rare events that often have the greatest real-world impact.