Outlier Detection and Treatment

Last updated: Jun 11, 2026
Author: Christy Harshitha Dakarapu

Outliers are one of the most common data quality issues encountered in Machine Learning projects. They are observations that differ significantly from the majority of the data and can negatively affect statistical analysis, model training, and prediction performance.

Real-world datasets often contain outliers due to:

  • Data entry mistakes
  • Sensor failures
  • Fraudulent transactions
  • Measurement errors
  • Rare events
  • Genuine extreme observations

Before training Machine Learning models, it is important to identify whether these unusual observations should be removed, transformed, or retained.

Organizations such as Google, Amazon, Netflix, Tesla, and financial institutions spend considerable effort detecting and handling outliers because they can dramatically impact business decisions and model accuracy.

In this article, we look at what outliers are, how to detect them, how to treat them, and how to implement each step in Python.

What is an Outlier?

An outlier is a data point that is significantly different from other observations in the dataset.

Example:

Age
21
23
24
22
25
120

Here:

120 is clearly much larger than the other observations and is considered an outlier.

Why Outliers Matter

Outliers can:

  • Distort statistical measures
  • Affect model accuracy
  • Bias predictions
  • Mislead analysis
  • Increase variance

For example:

Dataset:

[10, 12, 15, 18, 20]

Mean:

15

Now add an outlier:

[10, 12, 15, 18, 20, 500]

New Mean:

95.83

A single outlier dramatically changes the average.
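The arithmetic above can be checked with a short sketch using only the Python standard library. The median is included as an assumed point of comparison (it is not part of the example above) to show a statistic that moves far less when the same value is added.

```python
from statistics import mean, median

values = [10, 12, 15, 18, 20]
with_outlier = values + [500]

# The mean is pulled strongly toward the extreme value.
mean_before = mean(values)           # 75 / 5
mean_after = mean(with_outlier)      # 575 / 6

# The median depends only on the middle of the sorted data.
median_before = median(values)
median_after = median(with_outlier)  # (15 + 18) / 2

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

The mean jumps from 15 to about 95.83, while the median only moves from 15 to 16.5.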

Are Outliers Always Bad?

No.

Outliers can sometimes contain valuable information.

Examples:

  • Fraudulent credit card transactions
  • Rare diseases
  • Stock market crashes
  • Cybersecurity attacks
  • Equipment failures

Removing such observations may eliminate important business insights.

Types of Outliers

Outliers can be categorized into three major types.

Type | Description
Global Outlier | Extreme compared to the entire dataset
Contextual Outlier | Extreme in a specific context
Collective Outlier | Group of unusual observations

Global Outliers

These are obvious outliers when compared to all observations.

Example:

A salary dataset where one employee earns 100 times more than everyone else.

Contextual Outliers

These are abnormal only in specific situations.

Example:

30°C temperature is normal during summer but unusual during winter.
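A minimal sketch of the temperature example, assuming made-up daily readings and using the 1.5 × IQR rule described later in this article. The helper name `iqr_bounds` and the data are hypothetical. The same 30°C reading passes a pooled check but is flagged once the bounds are computed within its own season.

```python
from statistics import quantiles

def iqr_bounds(values):
    """Return (lower, upper) bounds from the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical daily temperatures in degrees Celsius, grouped by season.
readings = {
    "winter": [2, 4, 3, 5, 30, 1, 4],
    "summer": [28, 31, 30, 29, 32, 30, 27],
}

# Global check: pool every reading and ignore the season.
pooled = [t for temps in readings.values() for t in temps]
low, high = iqr_bounds(pooled)
global_flags = [t for t in pooled if t < low or t > high]

# Contextual check: compute the bounds separately inside each season.
contextual_flags = {}
for season, temps in readings.items():
    low, high = iqr_bounds(temps)
    contextual_flags[season] = [t for t in temps if t < low or t > high]

print(global_flags)      # []
print(contextual_flags)  # {'winter': [30], 'summer': []}
```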

Collective Outliers

A group of observations appears abnormal together.

Example:

A sequence of unusual network activities indicating a cyber attack.

Causes of Outliers

Outliers may arise from:

  • Human errors
  • Data entry mistakes
  • Instrument errors
  • Data corruption
  • Sampling errors
  • Natural variation
  • Rare events

Identifying Outliers

The first step is visual inspection.

Common visualization techniques:

  • Box plots
  • Scatter plots
  • Histograms
  • Density plots

Box Plot for Outlier Detection

Box plots are among the most popular techniques.

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df["Salary"])

plt.show()

Outliers appear as points beyond the whiskers.

Histogram Analysis

Histograms can reveal unusual observations.

df["Salary"].hist()

Extreme values often appear isolated from the main distribution.

Scatter Plot Analysis

Scatter plots help identify multivariate outliers.

plt.scatter(df["Age"], df["Salary"])

Points far from clusters may be outliers.

Statistical Methods for Outlier Detection

Several statistical approaches exist.

Popular methods include:

  1. Z-Score Method
  2. IQR Method
  3. Modified Z-Score
  4. Percentile Method

Z-Score Method

The Z-Score measures how many standard deviations a value is from the mean.

Formula:

$Z = \frac{X - \mu}{\sigma}$

Where:

  • X = observation
  • μ = mean
  • σ = standard deviation

Z-Score Interpretation

Z-Score | Interpretation
0 | Mean value
±1 | Near average
±2 | Unusual
±3 or more | Potential outlier

Common rule:

$|Z| > 3$

indicates an outlier.

Z-Score Example

from scipy import stats

z_scores = stats.zscore(df["Salary"])

outliers = df[abs(z_scores) > 3]

Limitations of Z-Score

Z-Score assumes:

  • approximately normal distribution

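It performs poorly for highly skewed datasets. The mean and standard deviation are also pulled by the extreme values themselves, so in a small sample a large outlier can inflate σ enough to keep its own |Z| below 3.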
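A small sketch of this limitation, reusing the age list from the start of this article and only the standard library. It assumes the population standard deviation (ddof = 0), which is also the default used by `scipy.stats.zscore`. With six points and a population standard deviation, no |Z| can exceed √5 ≈ 2.24, so the |Z| > 3 rule cannot flag anything here.

```python
from statistics import mean, pstdev

ages = [21, 23, 24, 22, 25, 120]

mu = mean(ages)
sigma = pstdev(ages)  # population standard deviation (ddof = 0)

# Z-score of every observation, then apply the |Z| > 3 rule.
z_scores = [(x - mu) / sigma for x in ages]
flagged = [x for x, z in zip(ages, z_scores) if abs(z) > 3]

print(f"mean={mu:.2f}, std={sigma:.2f}")
print(f"Z-score of 120: {z_scores[-1]:.2f}")
print(flagged)  # []
```

The value 120 inflates both the mean (about 39.17) and the standard deviation (about 36.17), so its own Z-score is only about 2.23 and it goes undetected.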

Interquartile Range (IQR) Method

IQR is one of the most widely used methods.

It is robust against extreme values.

Understanding Quartiles

Quartile | Meaning
Q1 | 25th percentile
Q2 | Median (50th percentile)
Q3 | 75th percentile

IQR Formula

$IQR = Q3 - Q1$

Outlier Boundaries

Lower Bound:

$Q1 - 1.5(IQR)$

Upper Bound:

$Q3 + 1.5(IQR)$

Values outside these bounds are considered outliers.

IQR Example

Q1 = df["Salary"].quantile(0.25)

Q3 = df["Salary"].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR

upper = Q3 + 1.5 * IQR

outliers = df[
(df["Salary"] < lower) |
(df["Salary"] > upper)
]
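To make the bounds concrete, here is the same calculation on the small age list from the start of this article, sketched with the standard library. `method="inclusive"` is chosen on the assumption that it matches the linear interpolation pandas uses by default in `quantile`.

```python
from statistics import quantiles

ages = [21, 23, 24, 22, 25, 120]

# Quartiles with linear interpolation between sorted data points.
q1, _, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in ages if x < lower or x > upper]

print(q1, q3, iqr)   # 22.25 24.75 2.5
print(lower, upper)  # 18.5 28.5
print(outliers)      # [120]
```

Because the quartiles come from the middle of the sorted data, the value 120 does not widen the bounds that are used to judge it.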

Why IQR is Popular

Advantages:

  • Simple
  • Effective
  • Robust to extreme values
  • No normality assumption

Modified Z-Score

Modified Z-Score uses Median Absolute Deviation (MAD).

Formula:

$MZ = 0.6745 \cdot \frac{X - \text{Median}}{\text{MAD}}$

Advantages:

  • More robust than standard Z-score
  • Works better with skewed distributions
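A minimal sketch of the formula above in plain Python. The helper name `modified_z_scores` is hypothetical, and the cutoff of 3.5 is an assumption: it is a commonly quoted threshold for the modified Z-score, but this article does not specify one.

```python
from statistics import median

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    # MAD = median of the absolute deviations from the median.
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

ages = [21, 23, 24, 22, 25, 120]
scores = modified_z_scores(ages)

THRESHOLD = 3.5  # assumed cutoff, not taken from this article
flagged = [x for x, s in zip(ages, scores) if abs(s) > THRESHOLD]

print(f"score of 120: {scores[-1]:.2f}")
print(flagged)  # [120]
```

On this list the median is 23.5 and the MAD is 1.5, so 120 scores about 43.4 and is flagged, even though the standard Z-score rule misses it on the same data.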

Percentile-Based Detection

Outliers may be defined using percentiles.

Example:

  • Below 1st percentile
  • Above 99th percentile

lower = df["Salary"].quantile(0.01)

upper = df["Salary"].quantile(0.99)

Machine Learning Methods for Outlier Detection

Traditional methods are not always sufficient.

Machine Learning algorithms can detect complex anomalies.

Popular methods:

  • Isolation Forest
  • Local Outlier Factor (LOF)
  • One-Class SVM
  • DBSCAN

Isolation Forest

Isolation Forest is one of the most popular anomaly detection algorithms.

Idea:

Outliers are easier to isolate than normal observations.

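from sklearn.ensemble import IsolationForest

# Fit on numeric feature columns only, so that text columns and a previously
# added "Outlier" column are not fed into the model.
features = df.select_dtypes(include="number").drop(columns="Outlier", errors="ignore")

# A fixed random_state makes the result repeatable.
model = IsolationForest(random_state=42)

df["Outlier"] = model.fit_predict(features)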

Output:

Value
1 = Normal
-1 = Outlier

Advantages of Isolation Forest

  • Fast
  • Scalable
  • Works well for large datasets
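A self-contained sketch of the idea, assuming scikit-learn and NumPy are installed. The data is hypothetical: a 5 × 5 grid of "normal" points plus one far-away point. The `contamination` value is an assumption about the share of outliers (about 1 in 26 here), not a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 25 grid points plus one distant point at (50, 50).
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
X = np.vstack([grid, [[50.0, 50.0]]])

model = IsolationForest(
    n_estimators=100,
    contamination=0.04,  # assumed share of outliers in the data
    random_state=0,      # fixed seed so the run is repeatable
)

labels = model.fit_predict(X)    # 1 = normal, -1 = outlier
scores = model.score_samples(X)  # lower score = easier to isolate

print(labels[-1])            # label of the distant point
print(int(scores.argmin()))  # index of the most anomalous point
```

A random split on either feature usually separates the distant point from the grid in one step, so it ends up with the shortest average path length and the lowest score.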

Local Outlier Factor (LOF)

LOF compares local density.

Outliers typically have much lower density than neighboring points.

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor()

labels = lof.fit_predict(df)

One-Class SVM

One-Class SVM learns normal patterns and identifies anomalies.

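from sklearn.svm import OneClassSVM

# nu is an upper bound on the fraction of training points treated as outliers.
# The default (0.5) flags roughly half the data; 0.05 is an illustrative value.
model = OneClassSVM(nu=0.05)

predictions = model.fit_predict(df)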

DBSCAN for Outlier Detection

DBSCAN is a clustering algorithm.

Points not belonging to any cluster are treated as outliers.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN()

labels = dbscan.fit_predict(df)

Noise points receive label:

-1
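A self-contained sketch on hypothetical data, assuming scikit-learn and NumPy are installed. The `eps` and `min_samples` values are chosen to fit this toy grid and are not general recommendations; on real data the features usually need scaling first, because `eps` is a distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a 5 x 5 grid of points plus one distant point.
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
X = np.vstack([grid, [[50.0, 50.0]]])

# eps=1.5 reaches horizontal, vertical, and diagonal grid neighbours.
# min_samples=4 counts the point itself, so even a corner point is a core point.
dbscan = DBSCAN(eps=1.5, min_samples=4)
labels = dbscan.fit_predict(X)

noise = X[labels == -1]  # points that belong to no cluster

print(sorted(set(labels.tolist())))  # [-1, 0]
print(noise.tolist())                # [[50.0, 50.0]]
```

All 25 grid points join a single cluster (label 0), while the distant point has no neighbours within `eps` and is reported as noise.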

Handling Outliers

After detection, appropriate treatment is required.

Common approaches:

  1. Removal
  2. Capping (Winsorization)
  3. Transformation (log, square root)
  4. Imputation (median replacement)

Method 1: Removing Outliers

Simply remove abnormal observations.

df = df[
(df["Salary"] >= lower) &
(df["Salary"] <= upper)
]

Advantages:

  • Simple

Disadvantages:

  • Loss of information

Method 2: Winsorization (Capping)

Replace extreme values with boundary values.

Example:

Values above upper bound become:

Upper Bound

Python:

df["Salary"] = df["Salary"].clip(lower, upper)

Advantages:

  • Preserves dataset size
  • Reduces outlier impact
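The difference between removal and capping can be sketched on the age list from the start of this article. The bounds 18.5 and 28.5 are the 1.5 × IQR bounds for that list (Q1 = 22.25, Q3 = 24.75), hard-coded here to keep the example short.

```python
from statistics import mean

ages = [21, 23, 24, 22, 25, 120]
lower, upper = 18.5, 28.5  # IQR bounds for this list

# Removal: drop every value outside the bounds.
removed = [x for x in ages if lower <= x <= upper]

# Capping: pull values outside the bounds back to the nearest bound.
capped = [min(max(x, lower), upper) for x in ages]

print(len(removed), removed)
print(len(capped), capped)
print(f"mean: original={mean(ages):.2f}, "
      f"removed={mean(removed):.2f}, capped={mean(capped):.2f}")
```

Removal leaves five rows with a mean of 23, while capping keeps all six rows and turns 120 into 28.5, giving a mean of about 23.92 instead of the original 39.17.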

Method 3: Log Transformation

Useful for highly skewed distributions.

Formula:

$y = \log(x)$

Python:

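import numpy as np

# np.log needs strictly positive values; np.log1p (log(1 + x)) is a common
# alternative when the column contains zeros.
df["Salary"] = np.log(df["Salary"])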

Why Log Transformation Works

It compresses large values and reduces skewness.

Common applications:

  • Income
  • Revenue
  • House prices
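A small sketch of the compression effect, using made-up income values and the natural logarithm from the standard library. The figures are hypothetical and only illustrate the change in scale.

```python
import math

# Hypothetical incomes: three typical values and one very large one.
incomes = [30_000, 45_000, 60_000, 5_000_000]
logged = [math.log(x) for x in incomes]

# Ratio of the largest to the smallest value, before and after the transform.
ratio_raw = max(incomes) / min(incomes)
ratio_log = max(logged) / min(logged)

print(f"raw: largest is {ratio_raw:.1f}x the smallest")
print(f"log: largest is {ratio_log:.2f}x the smallest")
```

On the raw scale the largest income is about 167 times the smallest; after the log transform it is only about 1.5 times larger, so it no longer dominates the scale.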

Method 4: Square Root Transformation

Formula:

$y = \sqrt{x}$

Useful for moderately skewed distributions.

Method 5: Replace Using Median

Extreme values can be replaced by median.

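median = df["Salary"].median()

# Replace only the values outside the bounds (lower, upper) computed earlier.
is_outlier = (df["Salary"] < lower) | (df["Salary"] > upper)

df["Salary"] = df["Salary"].mask(is_outlier, median)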

Outlier Detection in Multivariate Data

Some observations may appear normal individually but abnormal when considering multiple variables together.

Example:

Age = 5

Salary = ₹1,00,00,000

Individually:

  • Age might be valid
  • Salary might be valid

Together:

  • Highly suspicious

Machine Learning methods are often preferred for multivariate detection.
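A minimal sketch of this situation, assuming scikit-learn and NumPy are installed. The data is hypothetical: two features that rise together, plus one point whose coordinates are each inside the normal range but do not fit the pattern. A per-column IQR check finds nothing, while Local Outlier Factor, one of the methods listed earlier, flags the point. `n_neighbors=3` is chosen for this tiny dataset only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Ten points on the line y = x, plus the point (1, 10).
line = np.array([[i, i] for i in range(1, 11)], dtype=float)
X = np.vstack([line, [[1.0, 10.0]]])

def iqr_flags(column):
    """Boolean mask of values outside the 1.5 * IQR bounds."""
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    return (column < q1 - 1.5 * iqr) | (column > q3 + 1.5 * iqr)

# Univariate view: check each feature on its own.
univariate_flags = iqr_flags(X[:, 0]) | iqr_flags(X[:, 1])

# Multivariate view: compare each point's local density with its neighbours.
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)  # 1 = normal, -1 = outlier

print(bool(univariate_flags.any()))  # False
print(labels[-1])                    # -1
```

Each coordinate of (1, 10) is a value that also occurs among the normal points, so only a method that looks at both features together can detect it.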

Outliers and Machine Learning Models

Different models react differently.

Model | Sensitivity
Linear Regression | Very High
Logistic Regression | High
KNN | High
Neural Networks | Moderate
Decision Trees | Low
Random Forest | Low

Real-World Applications

Industry | Outlier Example
Banking | Fraudulent transactions
Healthcare | Abnormal patient readings
Manufacturing | Defective products
Cybersecurity | Network intrusions
Retail | Unusual purchasing behavior

Best Practices for Outlier Handling

  • Investigate outliers before removing
  • Use visualization first
  • Consider business context
  • Compare multiple methods
  • Avoid blindly deleting records
  • Document treatment decisions

Outlier Detection Workflow

A typical workflow is:

  1. Visualize data
  2. Calculate descriptive statistics
  3. Detect outliers
  4. Analyze business significance
  5. Select treatment strategy
  6. Train Machine Learning model
  7. Validate model performance

Future of Outlier Detection in AI

Modern AI systems increasingly use:

  • Deep Learning anomaly detection
  • Real-time monitoring systems
  • Automated data quality pipelines
  • Self-learning anomaly detectors

As datasets continue growing in scale and complexity, effective outlier detection and treatment will remain a critical component of data preprocessing and Machine Learning pipelines.

