Missing Value Handling in Machine Learning

Last updated: Jun 11, 2026

Author: Christy Harshitha Dakarapu

Missing values are one of the most common challenges encountered in real-world Machine Learning projects. Almost every dataset collected from surveys, sensors, databases, APIs, web applications, or business systems contains some form of incomplete data.

Before building Machine Learning models, it is essential to identify and handle missing values properly because most algorithms cannot work directly with incomplete datasets.

Poor handling of missing values can lead to:

Reduced model accuracy
Biased predictions
Incorrect statistical analysis
Training failures
Poor business decisions

Organizations such as Google, Amazon, Netflix, Meta, Microsoft, and OpenAI spend significant effort cleaning and preparing data before model training.

In this article, we will explore missing values in detail, understand why they occur, learn different handling techniques, and implement practical examples using Python.

What are Missing Values?

Missing values refer to data points that are unavailable, unknown, or not recorded.

Example dataset:

Name	Age	Salary
John	25	50000
Alice	28	NULL
Bob	NULL	45000
Sarah	30	55000

Here:

Alice's salary is missing
Bob's age is missing

These missing entries must be handled before training Machine Learning models.

Why Missing Values Occur

Missing values can arise for many reasons.

Common causes include:

Human data entry errors
Survey respondents skipping questions
Sensor malfunctions
Database migration issues
Data corruption
API failures
Network interruptions

Why Missing Values are a Problem

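Most Machine Learning algorithms expect complete data, although a few (for example, histogram-based gradient boosting in scikit-learn) can handle NaN values natively.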

Missing values may cause:

Model training failures
Incorrect statistical calculations
Biased predictions
Reduced generalization

Example:

Suppose salary values are missing mainly for high-income individuals.

Removing those records may create a biased dataset.
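The bias described above can be reproduced on synthetic data. The following is a minimal sketch, not a real dataset: the salary distribution, the 80th percentile cutoff, and the missingness probabilities are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic salaries (hypothetical data, fixed seed for reproducibility)
rng = np.random.default_rng(0)
salary = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=10_000))

# Assumption: salaries above the 80th percentile go missing 70% of the time,
# all other salaries go missing 5% of the time
high = salary > salary.quantile(0.8)
p_missing = np.where(high, 0.7, 0.05)
observed = salary.mask(rng.random(len(salary)) < p_missing)

true_mean = salary.mean()
mean_after_deletion = observed.dropna().mean()

print(f"True mean:           {true_mean:,.0f}")
print(f"Mean after deletion: {mean_after_deletion:,.0f}")
```

Because the deleted records are mostly high earners, the mean of the remaining data underestimates the true mean.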

Types of Missing Data

Understanding why data is missing is important.

Missing data is generally classified into three categories.

Type	Description
MCAR	Missing Completely At Random
MAR	Missing At Random
MNAR	Missing Not At Random

Missing Completely At Random (MCAR)

Missingness has no relationship with any variable, observed or unobserved.

Example:

A survey sheet accidentally gets damaged during transportation.

The missing data occurs randomly.

Advantages:

Simplest case to handle
Causes minimal bias

Missing At Random (MAR)

Missingness depends on other observed variables, but not on the missing value itself.

Example:

Older users are less likely to provide email addresses.

Missing email data depends on age.

Many real-world datasets are assumed to fall into this category, and most imputation methods rely on that assumption.

Missing Not At Random (MNAR)

Missingness depends on the missing value itself.

Example:

People with very high salaries may choose not to disclose income.

Missing salary depends on salary itself.

This is the most difficult type to handle.
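The three mechanisms can be told apart by what the probability of missingness depends on. The sketch below simulates each one on synthetic data; the column names, group cutoffs, and probabilities are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000
df_sim = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.lognormal(mean=10.5, sigma=0.6, size=n),
})

older = (df_sim["age"] >= 60).to_numpy()
top_income = (df_sim["income"] > df_sim["income"].quantile(0.75)).to_numpy()

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.20

# MAR: missingness depends on the observed age column, not on income
mar_mask = rng.random(n) < np.where(older, 0.40, 0.10)

# MNAR: missingness depends on the income value that goes missing
mnar_mask = rng.random(n) < np.where(top_income, 0.50, 0.10)

print("MCAR rate, older vs younger:     ", mcar_mask[older].mean(), mcar_mask[~older].mean())
print("MAR rate, older vs younger:      ", mar_mask[older].mean(), mar_mask[~older].mean())
print("MNAR rate, top income vs others: ", mnar_mask[top_income].mean(), mnar_mask[~top_income].mean())
```

In real data the MNAR comparison cannot be computed, because the values that drive the missingness are exactly the ones that were not recorded. That is why MNAR is the hardest case to diagnose.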

Identifying Missing Values

Common missing value representations include:

NULL
NaN
None
Empty strings
Special placeholders

Example:


import pandas as pd

data = {
    "Age": [22, 25, None, 30]
}

df = pd.DataFrame(data)

print(df)

Output:


    Age
0  22.0
1  25.0
2   NaN
3  30.0
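The example above uses None, which pandas converts to NaN automatically. Empty strings and special placeholders are not detected by default. A minimal sketch of normalizing them first, assuming a hypothetical sentinel of -999 for Age and the strings "" and "N/A" for City:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Age": [22, -999, 30, 41],             # -999 is a hypothetical sentinel value
    "City": ["Delhi", "", "N/A", "Pune"],  # "" and "N/A" stand in for missing
})

# Map the assumed placeholders to NaN so pandas can detect them
cleaned = raw.copy()
cleaned["Age"] = cleaned["Age"].replace(-999, np.nan)
cleaned["City"] = cleaned["City"].replace(["", "N/A"], np.nan)

print(raw.isnull().sum())      # placeholders are not counted as missing
print(cleaned.isnull().sum())  # after replacement they are
```

Which values count as placeholders depends on the data source, so the list should come from the dataset's documentation or from inspecting the raw values.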

Detecting Missing Values

Pandas provides built-in functions for detecting missing values.


df.isnull()

Output:


     Age
0  False
1  False
2   True
3  False

Counting Missing Values


df.isnull().sum()

Output:


Age    1
dtype: int64

Calculating Missing Value Percentage


missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Output:


Age    25.0
dtype: float64

This indicates that 25% of the values are missing.
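With more than one column, the count and the percentage are easier to read side by side. A small sketch using a hypothetical three-column DataFrame (the name df_example and its values are assumptions):

```python
import pandas as pd

df_example = pd.DataFrame({
    "Age": [22, 25, None, 30],
    "Salary": [50000, None, None, 55000],
    "City": ["Delhi", "Mumbai", "Pune", "Delhi"],
})

# One row per column, sorted so the worst columns appear first
summary = pd.DataFrame({
    "missing_count": df_example.isnull().sum(),
    "missing_percent": df_example.isnull().mean() * 100,
}).sort_values("missing_percent", ascending=False)

print(summary)
```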

Visualizing Missing Values

Missing values can be visualized using heatmaps.


import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull())

plt.show()

Visualization often helps identify patterns of missingness.

Strategies for Handling Missing Values

There is no single best solution.

The appropriate strategy depends on:

Dataset size
Missing percentage
Business requirements
Feature importance

Common approaches include:

Deletion
Imputation
Model-based methods

Method 1: Removing Missing Values

The simplest approach is deleting rows or columns containing missing values.

Row Deletion


df.dropna()

Example:

Age	Salary
25	50000
NaN	45000

After deletion:

Age	Salary
25	50000

Advantages of Row Deletion

Simple
Fast
Easy implementation

Disadvantages of Row Deletion

Loss of information
Smaller dataset
Potential bias

When to Use Row Deletion

Suitable when:

Missing percentage is very low
Dataset is large
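Row deletion does not have to be all or nothing. The sketch below shows the main dropna options on a hypothetical DataFrame (df_rows and its values are assumptions):

```python
import numpy as np
import pandas as pd

df_rows = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "Salary": [50000, 45000, np.nan, np.nan],
    "City": ["Delhi", "Mumbai", "Pune", np.nan],
})

any_missing = df_rows.dropna()                  # drop rows with at least one NaN
all_missing = df_rows.dropna(how="all")         # drop rows where every value is NaN
key_column = df_rows.dropna(subset=["Salary"])  # only require Salary to be present
at_least_two = df_rows.dropna(thresh=2)         # keep rows with >= 2 non-missing values

print(len(df_rows), len(any_missing), len(all_missing), len(key_column), len(at_least_two))
```

Restricting deletion with subset or thresh usually keeps far more rows than a plain dropna().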

Column Deletion

Columns with excessive missing values may be removed.


df.drop(columns=["Salary"])

When to Remove Columns

Typically considered when:

More than 60–70% of the values are missing
Feature has low business importance
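Instead of naming columns by hand, the columns to remove can be selected from their missing fraction. A minimal sketch, assuming a cutoff of 0.6 and a hypothetical DataFrame df_cols:

```python
import numpy as np
import pandas as pd

df_cols = pd.DataFrame({
    "Age": [25, 30, np.nan, 41, 37],
    "Salary": [50000, np.nan, np.nan, np.nan, np.nan],
    "City": ["Delhi", "Mumbai", "Pune", "Delhi", "Pune"],
})

threshold = 0.6  # assumed cutoff; tune it per project
missing_fraction = df_cols.isnull().mean()
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

df_reduced = df_cols.drop(columns=to_drop)
print(to_drop)
print(df_reduced.columns.tolist())
```

The threshold is a rule of thumb, not a fixed standard. A mostly empty column can still be worth keeping if the feature matters to the business problem.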

Method 2: Mean Imputation

Missing numerical values are replaced with the mean.

Formula:

$\text{Mean}=\frac{\sum_{i=1}^{n} x_i}{n}$

Example:

Age
20
25
NaN
35

Mean:

$\frac{20+25+35}{3}\approx 26.67$

Replace NaN with 26.67.

Python implementation:


# fillna returns a new Series, so assign the result back
df["Age"] = df["Age"].fillna(df["Age"].mean())

Advantages of Mean Imputation

Easy implementation
Fast computation

Disadvantages of Mean Imputation

Reduces variance
Sensitive to outliers
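The variance reduction can be seen on the Age example. The sketch below compares the standard deviation before and after filling (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 25, np.nan, 35], name="Age")

mean_age = ages.mean()          # NaN is skipped: (20 + 25 + 35) / 3
filled = ages.fillna(mean_age)

print(round(mean_age, 2))
print("Std before:", round(ages.std(), 2))
print("Std after: ", round(filled.std(), 2))
```

Every imputed value sits exactly at the mean, so the spread of the column shrinks. The effect grows with the share of missing values.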

Method 3: Median Imputation

Median is often preferred when outliers exist.

Example:

Income
30000
35000
40000
500000

Median:

37500 (the mean, 151250, is pulled far upward by the single 500000 value)

Python implementation:


# fillna returns a new Series, so assign the result back
df["Income"] = df["Income"].fillna(df["Income"].median())

Why Median is Often Better

Median is robust against extreme values.

Widely used for:

Salary
Income
House prices
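The difference between the two statistics is easy to check on the Income example. In this sketch a NaN entry is appended to the series as an assumption, so that there is something to fill:

```python
import numpy as np
import pandas as pd

income = pd.Series([30000, 35000, 40000, 500000, np.nan], name="Income")

print(income.mean())    # 151250.0, dominated by the single extreme value
print(income.median())  # 37500.0

# Fill the gap with the median so the outlier does not distort the imputed value
filled_median = income.fillna(income.median())
print(filled_median.tolist())
```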

Method 4: Mode Imputation

Used for categorical variables.

Example:

City
Delhi
Mumbai
Delhi
NaN

Mode = Delhi

Python:


# mode() returns a Series (several values on a tie), so take the first entry
df["City"] = df["City"].fillna(df["City"].mode()[0])
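A self-contained version of the City example may help. This is a sketch only, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Delhi", np.nan], name="City")

# mode() ignores NaN by default and returns all tied values in sorted order
modes = cities.mode()
filled_cities = cities.fillna(modes.iloc[0])

print(modes.tolist())
print(filled_cities.tolist())
```

When two categories are tied, taking the first entry is an arbitrary choice, so it is worth checking how many modes were returned.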

Method 5: Constant Value Imputation

Missing values are replaced with fixed values.

Examples:


df.fillna(0)


df.fillna("Unknown")

Common applications:

Missing categories
Survey responses
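Calling fillna with a single constant applies it to every column, which can put 0 into text columns or "Unknown" into numeric ones. A sketch of per-column constants, using a hypothetical DataFrame df_const:

```python
import numpy as np
import pandas as pd

df_const = pd.DataFrame({
    "City": ["Delhi", np.nan, "Pune"],
    "Visits": [3, np.nan, 7],
})

# A dict maps each column to its own fill value
filled_const = df_const.fillna({"City": "Unknown", "Visits": 0})
print(filled_const)
```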

Method 6: Forward Fill

Previous value is propagated forward.


# fillna(method="ffill") is deprecated in recent pandas versions; use ffill()
df.ffill()

Example:

Temperature
25
NaN
NaN
28

Result:

Temperature
25
25
25
28
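The Temperature example can be reproduced as follows. The dates in the index are hypothetical, and the limit argument is shown as one way to avoid carrying a stale value across long gaps:

```python
import numpy as np
import pandas as pd

temperature = pd.Series(
    [25, np.nan, np.nan, 28],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),  # hypothetical dates
    name="Temperature",
)

filled_all = temperature.ffill()              # 25, 25, 25, 28
filled_limited = temperature.ffill(limit=1)   # 25, 25, NaN, 28

print(filled_all.tolist())
print(filled_limited.tolist())
```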

Method 7: Backward Fill

Next value is propagated backward.


# fillna(method="bfill") is deprecated in recent pandas versions; use bfill()
df.bfill()
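Backward fill is useful where forward fill has nothing to propagate, such as a missing first value. A minimal sketch with illustrative values:

```python
import numpy as np
import pandas as pd

readings = pd.Series([np.nan, 25, np.nan, 28], name="Temperature")

forward = readings.ffill()            # leading NaN stays: nothing precedes it
backward = readings.bfill()           # 25, 25, 28, 28
combined = readings.ffill().bfill()   # 25, 25, 25, 28

print(forward.tolist())
print(backward.tolist())
print(combined.tolist())
```

Backward fill copies information from the future into the past. In forecasting tasks this can leak future values into the training data, so it should be applied with care.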

Time Series Missing Value Handling

Time-series datasets often use:

Forward Fill
Backward Fill
Interpolation

Interpolation

Interpolation estimates missing values from neighboring values.


df.interpolate()

Example:

Value
10
NaN
30

Interpolated value (linear): 20
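The interpolated value for this example can be checked directly. The sketch below also shows time-weighted interpolation on an unevenly spaced index. The dates are hypothetical.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, np.nan, 30], name="Value")
linear = values.interpolate()   # default is linear: midpoint of 10 and 30

# With uneven timestamps, method="time" weights by the elapsed time
timed = pd.Series(
    [10, np.nan, 30],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)
by_time = timed.interpolate(method="time")   # 1 of 4 days elapsed: 10 + 20 * 1/4

print(linear.tolist())
print(by_time.tolist())
```

The default linear method treats all rows as equally spaced, so method="time" is usually the better fit for irregular time series.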

Advanced Missing Value Imputation

Simple methods are not always sufficient.

Advanced techniques include:

KNN Imputation
Regression Imputation
Multiple Imputation

KNN Imputation

Uses neighboring observations to estimate missing values.

Example:


from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)

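# fit_transform returns a NumPy array, so wrap it back into a DataFrame (df must be all numeric)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)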

Advantages of KNN Imputation

Preserves relationships
Often more accurate than mean or median imputation when features are correlated

Disadvantages of KNN Imputation

Computationally expensive
Slower on large datasets
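A small worked example shows how the neighbors determine the imputed value. The DataFrame below is hypothetical and built so that the rows form two clear groups:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df_knn = pd.DataFrame({
    "Age": [25, 27, 26, 52, 50, 51],
    "Experience": [2, 4, 3, 28, 26, 27],
    "Salary": [40000, 44000, np.nan, 120000, np.nan, 118000],
})

# Each missing Salary becomes the average Salary of its 2 nearest rows,
# where distance is computed on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df_knn), columns=df_knn.columns, index=df_knn.index
)

print(imputed)
```

Row 2 is filled from the two younger employees and row 4 from the two older ones. KNN relies on distances, so features on very different scales are usually standardized first. Otherwise the feature with the largest range dominates the choice of neighbors.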

Regression Imputation

A predictive model estimates missing values.

Example:

Predict salary using:

Age
Experience
Education

Advantages:

Usually more accurate than mean or median imputation when the predictors are informative

Disadvantages:

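Can introduce bias: imputed values fall exactly on the fitted line, which understates variance and can overstate relationships between features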
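Regression imputation can be sketched with a linear model. The data below is hypothetical, Education is left out to keep the example small, and LinearRegression is just one possible choice of predictive model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_reg = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Experience": [2, 6, 10, 15, 20, 25],
    "Salary": [40000, 52000, np.nan, 80000, np.nan, 110000],
})

features = ["Age", "Experience"]
known = df_reg["Salary"].notna()

# Fit on rows where Salary is observed, then predict it for the other rows
model = LinearRegression().fit(df_reg.loc[known, features], df_reg.loc[known, "Salary"])

df_reg_imputed = df_reg.copy()
df_reg_imputed.loc[~known, "Salary"] = model.predict(df_reg.loc[~known, features])

print(df_reg_imputed)
```

The predictor columns must themselves be complete for the rows being imputed. If they also contain gaps, an iterative approach is needed.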

Multiple Imputation

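Creates several plausible imputed versions of the dataset, analyzes each one, and pools the results so that the uncertainty of the imputation is reflected.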

Popular method:

MICE (Multiple Imputation by Chained Equations)

Advantages:

Statistically robust

Disadvantages:

Computationally intensive
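In scikit-learn, the closest tool is IterativeImputer. It is inspired by MICE but is still marked experimental, and it returns a single imputation by default. The sketch below approximates multiple imputation by setting sample_posterior=True and varying the seed. The synthetic data and the choice of five imputations are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: Salary depends linearly on Age plus noise
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 60, size=n)
salary = 1500 * age + rng.normal(0, 5000, size=n)
df_mi = pd.DataFrame({"Age": age, "Salary": salary})
df_mi.loc[rng.random(n) < 0.2, "Salary"] = np.nan

# Draw several imputed datasets by sampling from the predictive posterior
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputed_sets.append(pd.DataFrame(imputer.fit_transform(df_mi), columns=df_mi.columns))

# Pool an estimate (here the mean salary) across the imputed datasets
estimates = [d["Salary"].mean() for d in imputed_sets]
pooled_mean = float(np.mean(estimates))

print("Per-dataset estimates:", np.round(estimates, 0))
print("Pooled mean:", round(pooled_mean, 0))
```

The spread of the per-dataset estimates gives a rough sense of how much uncertainty the imputation adds. A full multiple imputation analysis would also combine the variances, not only the point estimates.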

Missing Value Handling Using Scikit-Learn

SimpleImputer is widely used for basic imputation strategies.


from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")

# fit_transform returns a 2D array; ravel() flattens it to one column
df["Age"] = imputer.fit_transform(df[["Age"]]).ravel()

Supported strategies:

Strategy	Description
mean	Mean replacement
median	Median replacement
most_frequent	Mode replacement
constant	Fixed value
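Different column types usually need different strategies, and the fill values should be learned from the training data only. A minimal sketch using ColumnTransformer, with hypothetical train and test DataFrames:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    "Age": [22, 25, np.nan, 30],
    "City": ["Delhi", "Mumbai", "Delhi", np.nan],
})
test = pd.DataFrame({"Age": [np.nan, 40], "City": [np.nan, "Pune"]})

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["City"]),
])

# Learn the median and the mode from the training data only
train_filled = preprocess.fit_transform(train)
# Reuse the learned values on the test data
test_filled = preprocess.transform(test)

print(test_filled)
```

Fitting the imputer on the full dataset before splitting would leak information from the test set into training.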

Choosing the Right Strategy

Scenario	Recommended Method
Few missing rows	Row deletion
Numerical data	Mean or Median
Categorical data	Mode
Time-series data	Forward Fill
Complex patterns	KNN Imputation
High accuracy required	Multiple Imputation

Impact on Machine Learning Models

Poor missing value handling can cause:

Underfitting
Overfitting
Biased predictions
Reduced accuracy

Proper imputation often improves model performance significantly.

Real-World Examples

Industry	Missing Data Example
Healthcare	Missing patient records
Finance	Missing transaction data
Retail	Missing customer demographics
IoT	Sensor failures
Education	Missing student information

Best Practices for Handling Missing Values

Understand why data is missing
Measure missing percentage first
Avoid blindly deleting data
Use median for skewed distributions
Use domain knowledge
Compare multiple imputation strategies
Validate model performance after imputation

Missing Value Handling Workflow

A typical workflow is:

Detect missing values
Calculate missing percentages
Analyze missing patterns
Choose imputation strategy
Apply transformation
Validate results
Train Machine Learning model
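The workflow above can be sketched end to end. The example uses synthetic regression data with 10% of the cells removed at random, and compares three SimpleImputer strategies by cross-validated R². The dataset, the model, and the metric are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with values removed completely at random
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Detect and measure
print("Missing % per feature:", np.round(np.isnan(X_missing).mean(axis=0) * 100, 1))

# Choose, apply, and validate: the imputer is refit inside every CV fold
scores = {}
for strategy in ["mean", "median", "most_frequent"]:
    pipeline = make_pipeline(SimpleImputer(strategy=strategy), LinearRegression())
    scores[strategy] = cross_val_score(pipeline, X_missing, y, cv=5, scoring="r2").mean()

for strategy, score in scores.items():
    print(f"{strategy}: mean R^2 = {score:.3f}")
```

Placing the imputer inside the pipeline keeps each validation fold out of the statistics used to fill the training folds.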

Future of Missing Value Handling in AI

Modern Machine Learning systems increasingly use:

Automated data cleaning
Intelligent imputation
Deep Learning-based imputation
Data quality monitoring systems

As datasets continue growing in size and complexity, robust missing value handling will remain one of the most important preprocessing steps in the Machine Learning lifecycle.