In many real-world Machine Learning problems, some classes appear much more frequently than others. This is known as class imbalance, and a dataset that shows it is called an imbalanced dataset.
Imbalanced datasets are one of the most common challenges in Machine Learning because standard algorithms tend to favor the majority class at the expense of the minority class.
Examples include:
- Fraud Detection
- Disease Diagnosis
- Network Intrusion Detection
- Credit Default Prediction
- Manufacturing Defect Detection
- Spam Detection
In these applications, the minority class is often the most important class, even though it represents only a small fraction of the dataset.
In this article, we will explore imbalanced datasets, understand why they are problematic, learn evaluation metrics beyond accuracy, and study techniques such as oversampling, undersampling, and SMOTE.
What is an Imbalanced Dataset?
A dataset is considered imbalanced when one class contains significantly more samples than another.
Example:
| Class | Samples |
|---|---|
| Non-Fraud | 9,900 |
| Fraud | 100 |
Fraudulent transactions represent only:
$$\frac{100}{9{,}900 + 100} = \frac{100}{10{,}000} = 1\%$$
of the dataset.
This creates a severe imbalance.
Balanced vs Imbalanced Dataset
Balanced Dataset:
| Class | Samples |
|---|---|
| Yes | 500 |
| No | 500 |
Imbalanced Dataset:
| Class | Samples |
|---|---|
| Yes | 950 |
| No | 50 |
Many real-world datasets are imbalanced.
Why Imbalanced Data is a Problem
Most Machine Learning models are trained to minimize an average error over all samples, so errors on the majority class dominate the objective.
In an imbalanced dataset, predicting the majority class every time can therefore still produce high accuracy.
Example:
Dataset:
| Class | Samples |
|---|---|
| Healthy | 990 |
| Diseased | 10 |
Suppose a model predicts:
Healthy for every patient.
Accuracy Calculation
Correct Predictions:
990
Total Predictions:
1000
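Accuracy:
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} = \frac{990}{1000} = 99\%$$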
The model achieves 99% accuracy.
However:
It never identifies diseased patients.
This model is practically useless.
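The arithmetic above can be checked with a minimal sketch in plain Python. It rebuilds the 990/10 example as label lists; the variable names are illustrative, and the always-Healthy predictor stands in for any model that ignores the minority class.

```python
# Labels from the example above: 990 healthy (0) and 10 diseased (1) patients.
y_true = [0] * 990 + [1] * 10

# A "model" that predicts Healthy for every patient.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the diseased class: detected diseased / all diseased.
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_diseased = detected / sum(y_true)

print(f"accuracy = {accuracy:.1%}")                # accuracy = 99.0%
print(f"diseased recall = {recall_diseased:.1%}")  # diseased recall = 0.0%
```

The two numbers side by side show why accuracy alone cannot be trusted here.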
The Accuracy Paradox
The Accuracy Paradox occurs when a model achieves high accuracy but performs poorly on the minority class.
Example:
| Metric | Value |
|---|---|
| Accuracy | 99% |
| Fraud Detection Rate | 0% |
The model appears excellent but completely fails its actual purpose.
Majority Class and Minority Class
In imbalanced datasets:
Majority Class:
- Most observations
Minority Class:
- Few observations
Example:
| Class | Type |
|---|---|
| Non-Fraud | Majority |
| Fraud | Minority |
Usually, the minority class is the class of interest.
Real-World Examples
Fraud Detection
| Class | Percentage |
|---|---|
| Legitimate | 99.8% |
| Fraudulent | 0.2% |
Disease Detection
| Class | Percentage |
|---|---|
| Healthy | 95% |
| Diseased | 5% |
Manufacturing
| Class | Percentage |
|---|---|
| Good Product | 99% |
| Defective Product | 1% |
Spam Detection
| Class | Percentage |
|---|---|
| Normal Email | 90% |
| Spam Email | 10% |
Understanding the Confusion Matrix
For imbalanced datasets, confusion matrices become essential.
Example:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Where:
- TP = True Positive
- TN = True Negative
- FP = False Positive
- FN = False Negative
Example Confusion Matrix
| | Predicted Fraud | Predicted Not Fraud |
|---|---|---|
| Actual Fraud | 80 | 20 |
| Actual Not Fraud | 50 | 9850 |
This matrix provides far more information than accuracy alone.
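As a rough illustration, the four cells can be counted directly from paired labels. The sketch below first rebuilds label lists that match the example matrix (the encoding 1 = fraud, 0 = not fraud is an assumption made for the example) and then tallies every (actual, predicted) pair.

```python
from collections import Counter

# Rebuild label lists that match the example matrix (1 = fraud, 0 = not fraud).
# Each (actual, predicted) pair is repeated as many times as its cell count.
cells = {(1, 1): 80, (1, 0): 20, (0, 1): 50, (0, 0): 9850}
y_true, y_pred = [], []
for (actual, predicted), n in cells.items():
    y_true += [actual] * n
    y_pred += [predicted] * n

# Count every (actual, predicted) combination in one pass.
counts = Counter(zip(y_true, y_pred))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

accuracy = (tp + tn) / len(y_true)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=80 FN=20 FP=50 TN=9850
print(f"accuracy = {accuracy:.2%}")       # accuracy = 99.30%
```

Accuracy is 99.3% here even though 20 of the 100 frauds are missed and 50 legitimate transactions are flagged, which only the individual cells reveal.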
Precision
Precision measures how many predicted positives are actually correct.
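Formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Example (using the confusion matrix above, with TP = 80 and FP = 50):
$$\text{Precision} = \frac{80}{80 + 50} = \frac{80}{130} \approx 61.5\%$$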
Why Precision Matters
Precision matters most when false positives are costly.
Examples:
- Spam Detection
- Fraud Detection
Recall
Recall measures how many actual positives were detected.
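Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Example (using the confusion matrix above, with TP = 80 and FN = 20):
$$\text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = 80\%$$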
Why Recall Matters
Recall matters most when missing a positive case is dangerous.
Examples:
- Cancer Detection
- Fraud Detection
- Security Systems
Precision vs Recall
| Metric | Focus |
|---|---|
| Precision | Minimize False Positives |
| Recall | Minimize False Negatives |
F1 Score
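The F1 Score combines Precision and Recall into a single number: their harmonic mean.
Formula:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$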
Advantages:
- Balances precision and recall
- Better metric than accuracy for imbalanced datasets
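The three metrics can be computed from the confusion-matrix counts in a few lines. The following is a minimal sketch; the helper name `precision_recall_f1` is made up for this example, and returning 0.0 for an empty denominator is one common convention rather than the only one.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts from the fraud confusion matrix: TP = 80, FP = 50, FN = 20.
precision, recall, f1 = precision_recall_f1(tp=80, fp=50, fn=20)
print(f"precision = {precision:.1%}")  # precision = 61.5%
print(f"recall    = {recall:.1%}")     # recall    = 80.0%
print(f"F1        = {f1:.1%}")         # F1        = 69.6%
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well by maximizing only one of them.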
Why Accuracy is Not Enough
Consider:
| Metric | Value |
|---|---|
| Accuracy | 99% |
| Recall | 0% |
The model completely misses the minority class.
Accuracy hides this failure.
Detecting Class Imbalance
Python:
df["Target"].value_counts()
Output:
0 9500
1 500
Percentage:
df["Target"].value_counts(normalize=True)
Output:
0 0.95
1 0.05
Visualizing Class Imbalance
import seaborn as sns
sns.countplot(x="Target", data=df)
A highly uneven bar chart often indicates imbalance.
Solutions for Imbalanced Datasets
Common approaches include:
- Oversampling
- Undersampling
- Synthetic Sampling
- Cost-Sensitive Learning
- Ensemble Methods
Oversampling
Oversampling increases minority class samples.
Example:
Before:
| Class | Samples |
|---|---|
| Majority | 900 |
| Minority | 100 |
After Oversampling:
| Class | Samples |
|---|---|
| Majority | 900 |
| Minority | 900 |
Random Oversampling
Random oversampling duplicates randomly chosen minority samples until the classes are balanced.
Python:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_resampled, y_resampled = ros.fit_resample(X, y)
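To show what happens behind a call like the one above, here is a minimal standard-library sketch of random oversampling. The function `random_oversample` and the toy data are hypothetical; this is not how imbalanced-learn is implemented, only an illustration of the idea.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original samples that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw the missing samples with replacement.
        for i in rng.choices(idx, k=target - n):
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy data: 900 majority (0) and 100 minority (1) rows with one feature each.
X = [[float(i)] for i in range(1000)]
y = [0] * 900 + [1] * 100

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # Counter({0: 900, 1: 900})
```

Every added row is an exact copy of an existing minority row, which is why the technique is simple but prone to overfitting.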
Advantages of Oversampling
- Simple
- Retains all data
Disadvantages of Oversampling
- Risk of overfitting
- Duplicate observations
Undersampling
Undersampling reduces majority class samples.
Example:
Before:
| Class | Samples |
|---|---|
| Majority | 9000 |
| Minority | 1000 |
After:
| Class | Samples |
|---|---|
| Majority | 1000 |
| Minority | 1000 |
Random Undersampling
Python:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)
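The mechanism can be sketched without any library. The function `random_undersample` below is hypothetical and only illustrates the idea: keep a random subset of each class, sized to the smallest class.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement so no row is kept twice.
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data matching the table above: 9000 majority and 1000 minority rows.
X = [[float(i)] for i in range(10000)]
y = [0] * 9000 + [1] * 1000

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # Counter({0: 1000, 1: 1000})
```

In this toy run 8,000 of the 9,000 majority rows are discarded, which makes the information loss concrete.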
Advantages of Undersampling
- Faster training
- Smaller dataset
Disadvantages of Undersampling
- Information loss
- Potential underfitting
What is SMOTE?
SMOTE stands for:
Synthetic Minority Oversampling Technique
Instead of duplicating minority samples, SMOTE generates new synthetic samples.
How SMOTE Works
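Suppose a minority sample is:
| Age | Income |
|---|---|
| 25 | 50000 |
One of its nearest minority-class neighbors is:
| Age | Income |
|---|---|
| 28 | 55000 |
SMOTE creates a new point at a random position on the line segment between the two. With an interpolation factor of 0.5 (the midpoint), the result is:
| Age | Income |
|---|---|
| 26.5 | 52500 |
Because each synthetic sample is a new point rather than an exact copy, the minority class becomes more diverse.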
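The interpolation step can be written in a few lines. This sketch covers only that step: the helper `smote_point` is hypothetical, and the nearest-neighbor search that a full SMOTE implementation performs is assumed to have already produced `neighbor`.

```python
import random

def smote_point(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

sample = [25.0, 50000.0]    # (Age, Income) of a minority sample
neighbor = [28.0, 55000.0]  # one of its nearest minority neighbors

rng = random.Random(42)
synthetic = [smote_point(sample, neighbor, rng) for _ in range(3)]
for age, income in synthetic:
    print(f"Age={age:.2f}, Income={income:.0f}")

# With the factor fixed at 0.5 the result is the midpoint from the table.
midpoint = [s + 0.5 * (n - s) for s, n in zip(sample, neighbor)]
print(midpoint)  # [26.5, 52500.0]
```

Every generated point lies between the two originals, so the synthetic values always stay within the range spanned by real minority samples.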
SMOTE Example
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
Advantages of SMOTE
- Lower overfitting risk than duplicating samples
- Creates new, plausible samples by interpolation
- Widely used
Disadvantages of SMOTE
- May create noisy samples, especially where classes overlap
- Less effective for complex or high-dimensional distributions
ADASYN
ADASYN is an extension of SMOTE.
Full Form:
Adaptive Synthetic Sampling
It generates more synthetic samples for minority points that are harder to learn, that is, points whose nearest neighbors are mostly majority-class samples.
Advantages:
- Better handling of hard-to-learn cases
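The adaptive part of ADASYN can be illustrated with a small sketch. It computes only how many synthetic samples each minority point would receive; the function `adasyn_allocation`, its parameters, and the one-dimensional toy data are assumptions for this example, and the generation step (SMOTE-style interpolation) is left out.

```python
import math

def adasyn_allocation(X, y, minority=1, k=3, n_synthetic=8):
    """Split n_synthetic across minority samples by local difficulty.

    Difficulty of a minority sample = share of majority-class points among
    its k nearest neighbors (excluding itself).
    """
    minority_idx = [i for i, lab in enumerate(y) if lab == minority]
    ratios = []
    for i in minority_idx:
        # Sort all other points by Euclidean distance to sample i.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        majority_neighbors = sum(y[j] != minority for j in others[:k])
        ratios.append(majority_neighbors / k)
    total = sum(ratios)
    if total == 0:
        # No minority sample has majority neighbors: spread evenly.
        return {i: n_synthetic / len(minority_idx) for i in minority_idx}
    return {i: n_synthetic * r / total for i, r in zip(minority_idx, ratios)}

# One feature: majority points at 0..4, one isolated minority point at 2.5,
# and a minority cluster at 5.0..6.5 that touches the majority region.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [2.5], [5.0], [5.5], [6.0], [6.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

allocation = adasyn_allocation(X, y)
for i, share in allocation.items():
    print(f"x = {X[i][0]}: {share:.1f} synthetic samples")
# The isolated point at 2.5 receives 6 of the 8 samples, the border point at
# 5.0 receives 2, and the three points inside the cluster receive none.
```

Plain SMOTE would treat all five minority points alike; the weighting above is what concentrates new samples where the classes overlap.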
Class Weighting
Many algorithms allow assigning higher importance to minority classes.
Example:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight="balanced")
With class_weight="balanced", each class is weighted inversely to its frequency, so mistakes on the minority class are penalized more heavily.
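For reference, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the toy labels are made up for this example.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Toy labels: 9,500 majority (0) and 500 minority (1) samples.
y = [0] * 9500 + [1] * 500
weights = balanced_class_weights(y)
print({label: round(w, 3) for label, w in weights.items()})  # {0: 0.526, 1: 10.0}
```

With a 95/5 split, an error on a minority sample counts 19 times as much as an error on a majority sample.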
Cost-Sensitive Learning
Different mistakes have different costs.
Example:
Fraud Detection:
Missing a fraud transaction is more expensive than investigating a legitimate transaction.
Cost-sensitive learning incorporates these penalties.
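One simple way to reason about such penalties is to price each error type and compare models by total cost. The sketch below does this; the cost figures and the second model's error counts are hypothetical and chosen only for illustration.

```python
def total_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost of a model's errors under a per-error cost assignment."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical costs: reviewing a flagged legitimate transaction costs 5,
# while a missed fraud costs 500 (illustrative figures only).
COST_FP, COST_FN = 5.0, 500.0

# Model A: the confusion matrix used earlier (FP = 50, FN = 20).
# Model B: a hypothetical model with a lower threshold that flags more
# transactions (FP = 300) but misses fewer frauds (FN = 5).
cost_a = total_cost(fp=50, fn=20, cost_fp=COST_FP, cost_fn=COST_FN)
cost_b = total_cost(fp=300, fn=5, cost_fp=COST_FP, cost_fn=COST_FN)

print(f"Model A cost: {cost_a:.0f}")  # Model A cost: 10250
print(f"Model B cost: {cost_b:.0f}")  # Model B cost: 4000
```

Model B makes far more errors in total (305 versus 70) and so has lower accuracy, yet it is the cheaper model under these assumed costs.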
Ensemble Methods for Imbalanced Data
Popular approaches:
- Balanced Random Forest
- EasyEnsemble
- XGBoost with class weights
These methods often perform very well on imbalanced datasets.
Stratified Train-Test Split
When splitting imbalanced datasets, class proportions should be preserved.
Incorrect:
A purely random split may distort the class distribution and can leave the test set with very few minority samples.
Correct:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
Imbalanced Data and Cross Validation
Use Stratified K-Fold.
Benefits:
- Preserves class distribution
- Produces reliable evaluation
Example:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
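To make the idea of stratification concrete, here is a minimal standard-library sketch that deals the indices of each class round-robin across folds. The function `stratified_folds` is hypothetical and is not scikit-learn's algorithm; it only shows why every fold ends up with the same class mix.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(y, n_splits=5, seed=0):
    """Assign each sample index to a fold so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    return folds

# Toy labels: 950 majority (0) and 50 minority (1) samples.
y = [0] * 950 + [1] * 50
for k, test_idx in enumerate(stratified_folds(y)):
    print(k, Counter(y[i] for i in test_idx))  # each fold: 190 zeros, 10 ones
```

Each fold keeps the original 95/5 ratio, so every validation score is computed on the same number of minority samples.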
Evaluation Metrics for Imbalanced Data
| Metric | Recommended |
|---|---|
| Accuracy | Not on its own |
| Precision | Yes |
| Recall | Yes |
| F1 Score | Yes |
| ROC-AUC | Yes, but can look optimistic under heavy imbalance |
| PR-AUC | Yes |
ROC Curve
The ROC curve shows classification performance across all decision thresholds.
Axes:
- True Positive Rate (y-axis)
- False Positive Rate (x-axis)
Higher area under the curve indicates better performance.
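The area under the ROC curve has a useful interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way; the function `roc_auc` and the scores are hypothetical, and the pairwise loop is meant for illustration rather than large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
print(f"AUC = {roc_auc(y_true, scores):.3f}")  # AUC = 0.867
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.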
Precision-Recall Curve
For highly imbalanced datasets, Precision-Recall curves are often more informative than ROC curves.
Both axes are computed from the positive (usually minority) class, so the large number of true negatives cannot inflate the curve.
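A short calculation shows the difference. The sketch below fixes one ROC operating point and computes the precision it implies at different class ratios; the function `precision_at` and the chosen rates are assumptions for illustration.

```python
def precision_at(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed (TPR, FPR) point at a given prevalence."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)

# The same ROC operating point: 80% recall (TPR) at a 5% false positive rate.
TPR, FPR = 0.80, 0.05

for prevalence in (0.5, 0.05, 0.002):
    p = precision_at(TPR, FPR, prevalence)
    print(f"positives = {prevalence:.1%} of data -> precision = {p:.1%}")
# positives = 50.0% of data -> precision = 94.1%
# positives = 5.0% of data -> precision = 45.7%
# positives = 0.2% of data -> precision = 3.1%
```

The ROC point is identical in all three cases, but precision collapses as positives become rarer, which is exactly what a Precision-Recall curve makes visible.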
Real-World Applications
| Industry | Minority Class |
|---|---|
| Banking | Fraud Transactions |
| Healthcare | Disease Cases |
| Cybersecurity | Intrusions |
| Manufacturing | Defects |
| Insurance | Fraud Claims |
Common Mistakes
Using Accuracy as Primary Metric
Accuracy often hides poor minority class performance.
Applying SMOTE Before Train-Test Split
Incorrect:
Apply SMOTE
↓
Train-Test Split
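This causes data leakage: synthetic samples are interpolated from original rows, so near-copies of test rows end up in the training set and test scores look better than they should.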
Correct:
Train-Test Split
↓
Apply SMOTE Only on Training Data
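The correct ordering can be sketched with the standard library alone. In this example random oversampling stands in for SMOTE, since the ordering rule is the same for any resampler; the toy labels and variable names are illustrative.

```python
import random
from collections import Counter

rng = random.Random(0)

# Toy data: 950 majority (0) and 50 minority (1) rows, identified by index.
y = [0] * 950 + [1] * 50
indices = list(range(len(y)))

# Step 1: stratified 80/20 split, done per class so proportions are preserved.
train_idx, test_idx = [], []
for label in (0, 1):
    members = [i for i in indices if y[i] == label]
    rng.shuffle(members)
    cut = len(members) * 4 // 5  # 80% of each class goes to training
    train_idx += members[:cut]
    test_idx += members[cut:]

# Step 2: resample the training set only. Random oversampling stands in for
# SMOTE here; the test set is left exactly as it was split.
minority_train = [i for i in train_idx if y[i] == 1]
majority_train = [i for i in train_idx if y[i] == 0]
extra = rng.choices(minority_train, k=len(majority_train) - len(minority_train))
train_resampled = train_idx + extra

print(Counter(y[i] for i in train_resampled))  # Counter({0: 760, 1: 760})
print(Counter(y[i] for i in test_idx))         # Counter({0: 190, 1: 10})
# No test row (or a copy of one) appears in the resampled training set.
print(set(train_resampled).isdisjoint(test_idx))  # True
```

The test set keeps the original 95/5 distribution, so the evaluation reflects the data the model will see in practice.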
Ignoring Business Requirements
Different applications prioritize different metrics.
Example:
Cancer detection should prioritize Recall over Precision.
Best Practices
- Analyze class distribution first
- Use confusion matrices
- Prefer Precision, Recall, and F1 Score
- Apply Stratified Splitting
- Apply SMOTE only to the training data
- Consider class weighting
- Evaluate business impact of errors
Imbalanced Dataset Workflow
A typical workflow is:
- Analyze class distribution
- Perform train-test split
- Apply resampling only on training data
- Train model
- Evaluate using Precision, Recall, and F1 Score
- Compare multiple balancing techniques
- Deploy best-performing model
Why Imbalanced Datasets Matter
Many of the most important Machine Learning applications involve rare events. Fraud, disease diagnosis, cybersecurity attacks, equipment failures, and manufacturing defects often represent only a tiny fraction of the data, yet they are precisely the cases we care about most.
Understanding how to identify, evaluate, and handle imbalanced datasets is essential for building reliable Machine Learning systems that perform well not only on common cases but also on the rare events that often have the greatest real-world impact.