Pandas Cleaning Data

Last updated: Jan 25, 2026

Author: Lingeshk K V

What Is Data Cleaning in Pandas?

Data cleaning is the process of identifying and fixing errors, inconsistencies, and missing values in a dataset. Clean data is essential for accurate analysis and reliable results.

Pandas provides powerful and easy-to-use functions to clean real-world data efficiently.

Why Data Cleaning Is Important

Dirty data can cause:

Incorrect analysis results
Model errors in machine learning
Misleading insights
Application failures

Cleaning data ensures:

Accuracy
Consistency
Reliability

Load Sample Data

Python

import pandas as pd

df = pd.read_csv("data.csv")
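Before cleaning anything, it usually helps to look at what was loaded. The sketch below is a minimal illustration that reads a small in-memory CSV instead of data.csv, so it runs without any file on disk. The CSV content and the name `sample` are assumptions made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = "Name,Age,City\nAsha,28,Pune\nRavi,,Delhi\n"

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows of the data
print(sample.shape)    # (number of rows, number of columns)
sample.info()          # column dtypes and non-null counts
```

The empty Age field in the second row is read as a missing value, which is the kind of gap the following sections deal with.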

1. Handling Missing Values

Detect Missing Values

Python

print(df.isnull())
print(df.isnull().sum())

# Output:
# Boolean DataFrame
# Count of missing values per column
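To see what these checks return without a CSV on hand, here is a minimal sketch on a small hand-made DataFrame. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; None and NaN mark the missing entries
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", None],
    "Age": [28, np.nan, 35],
})

missing_per_column = sample.isnull().sum()    # count of missing values per column
missing_share = sample.isnull().mean()        # fraction of missing values per column
rows_with_gaps = sample[sample.isnull().any(axis=1)]  # rows with at least one gap

print(missing_per_column)
print(rows_with_gaps)
```

Looking at the share of missing values per column can help decide whether to drop rows, fill values, or drop the column entirely.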

Remove Rows with Missing Values

Python

df_clean = df.dropna()
print(df_clean)

# Output:
# DataFrame without missing rows
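Calling dropna() with no arguments removes every row that has any missing value, which can discard more data than intended. The sketch below shows three narrower options using the `subset`, `how`, and `thresh` parameters. The sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a decreasing number of filled-in values per row
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", None, None],
    "Age": [28, np.nan, np.nan, np.nan],
    "City": ["Pune", "Delhi", "Chennai", None],
})

age_required = sample.dropna(subset=["Age"])   # drop rows where Age is missing
not_all_empty = sample.dropna(how="all")       # drop rows where every column is missing
at_least_two = sample.dropna(thresh=2)         # keep rows with 2 or more non-missing values

print(len(age_required), len(not_all_empty), len(at_least_two))
```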

Fill Missing Values

Python

df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)

# Output:
# Missing Age values replaced with mean
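The mean is only one possible fill value. As a rough sketch of two alternatives, the example below fills a numeric column with its median and a text column with its most frequent value. The data and the choice of strategy are assumptions for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one extreme age and one missing city
sample = pd.DataFrame({
    "Age": [22.0, 25.0, np.nan, 90.0],
    "City": ["Pune", None, "Pune", "Delhi"],
})

# The median is less sensitive to extreme values than the mean
sample["Age"] = sample["Age"].fillna(sample["Age"].median())

# For text columns, the most frequent value (mode) is one common choice
sample["City"] = sample["City"].fillna(sample["City"].mode()[0])

print(sample)
```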

2. Removing Duplicate Data

Detect Duplicates

Python

print(df.duplicated())
print(df.duplicated().sum())

# Output:
# Boolean series
# Number of duplicate rows

Remove Duplicates

Python

df = df.drop_duplicates()
print(df)

# Output:
# Duplicate rows removed
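By default, two rows count as duplicates only when every column matches. When a single column identifies a record, the `subset` and `keep` parameters give finer control. The following is a minimal sketch with made-up column names and values.

```python
import pandas as pd

# Hypothetical data: the same email appears twice with different visit counts
sample = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com"],
    "Visits": [1, 4, 7],
})

# The rows differ in Visits, so a full-row check finds no duplicates
full_row_dupes = sample.duplicated().sum()

# Treat rows with the same Email as duplicates and keep the last one seen
deduped = sample.drop_duplicates(subset=["Email"], keep="last")

print(full_row_dupes)
print(deduped)
```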

3. Fixing Data Types

Incorrect data types can break or distort calculations. For example, a column of numbers stored as text will not sum or average correctly.

Python

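# astype(int) raises an error if Age still contains missing values,
# so fill or drop them first. Fractional values are truncated.
df["Age"] = df["Age"].astype(int)
print(df.dtypes)

# Output:
# Corrected data types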
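Real-world columns often mix numbers with text such as "unknown", which makes a direct integer conversion fail. One possible approach, sketched below with made-up data, is to parse with pd.to_numeric and errors="coerce", then store the result in the nullable "Int64" dtype so missing entries can stay missing.

```python
import pandas as pd

# Hypothetical column of ages stored as text, with one bad value and one gap
sample = pd.DataFrame({"Age": ["25", "31", "unknown", None]})

# Values that cannot be parsed become missing instead of raising an error
ages = pd.to_numeric(sample["Age"], errors="coerce")

# The nullable "Int64" dtype stores integers alongside missing values
sample["Age"] = ages.astype("Int64")

print(sample.dtypes)
print(sample)
```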

4. Renaming Columns

Standardizing column names improves readability.

Python

df = df.rename(columns={"Full Name": "Name", "DOB": "DateOfBirth"})
print(df.columns)

# Output:
# Updated column names
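When many columns need the same treatment, renaming them one by one becomes tedious. As a sketch of one alternative, the string methods on the column index can normalize every name at once. The column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical column names with stray spaces and inconsistent casing
sample = pd.DataFrame(columns=[" Full Name ", "Date Of Birth", "CITY"])

# Trim whitespace, lower-case, and replace spaces with underscores
sample.columns = (
    sample.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(sample.columns))
```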

5. Removing Unnecessary Columns

Python

df = df.drop("Unnamed: 0", axis=1)
print(df)

# Output:
# Column removed
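drop() raises a KeyError when a named column does not exist, which can happen when the same cleaning script runs on files with slightly different layouts. The sketch below uses the `columns` and `errors` parameters to drop several columns and skip any that are absent. The "Notes" column is a hypothetical name that is deliberately not in the data.

```python
import pandas as pd

# Hypothetical data with a leftover index column from an earlier export
sample = pd.DataFrame({"Unnamed: 0": [0, 1], "Name": ["Asha", "Ravi"]})

# errors="ignore" skips labels that are not present instead of raising KeyError
trimmed = sample.drop(columns=["Unnamed: 0", "Notes"], errors="ignore")

print(list(trimmed.columns))
```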

6. Handling Outliers

Python

print(df["Salary"].describe())

# Output:
# Summary statistics to identify outliers

Filter extreme values:

Python

df = df[df["Salary"] < 100000]
print(df)

# Output:
# Outliers removed
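A fixed cutoff works when the valid range is known in advance. When it is not, one widely used rule of thumb derives the bounds from the data using the interquartile range (IQR). The sketch below applies the common 1.5 × IQR rule to made-up salaries. It is one option among several, not the method the filter above uses.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
sample = pd.DataFrame({"Salary": [30000, 32000, 35000, 36000, 40000, 500000]})

q1 = sample["Salary"].quantile(0.25)
q3 = sample["Salary"].quantile(0.75)
iqr = q3 - q1

# Values outside 1.5 * IQR of the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends; rows with a missing Salary are dropped too
filtered = sample[sample["Salary"].between(lower, upper)]

print(lower, upper)
print(filtered)
```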

7. Standardizing Text Data

Python

df["City"] = df["City"].str.strip().str.lower()
print(df["City"])

# Output:
# Standardized text
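To make the effect concrete, the sketch below counts distinct values before and after standardizing. It also collapses repeated spaces inside a value, which strip() alone does not handle. The city names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three spellings of the same city plus a double space
sample = pd.DataFrame({"City": ["  Pune", "pune ", "PUNE", "New  Delhi"]})

before = sample["City"].nunique()

# Trim the ends, lower-case, and collapse runs of whitespace into one space
sample["City"] = (
    sample["City"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)

after = sample["City"].nunique()

print(before, after)
print(sample["City"].tolist())
```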

8. Replacing Incorrect Values

Python

df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})
print(df["Gender"])

# Output:
# Corrected values
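replace() matches values exactly, so variants such as "f" or " F " pass through unchanged. A possible refinement, sketched below with made-up data, is to normalize the text first and then list whatever the mapping still does not cover.

```python
import pandas as pd

# Hypothetical data with inconsistent codes and one unexpected value
sample = pd.DataFrame({"Gender": ["M", "f", " F ", "Male", "X"]})

# Normalize spacing and case so one mapping covers all variants
normalized = sample["Gender"].str.strip().str.upper()
mapping = {"M": "Male", "F": "Female", "MALE": "Male", "FEMALE": "Female"}
sample["Gender"] = normalized.replace(mapping)

# Values not covered by the mapping are left as-is; list them for review
unexpected = sample.loc[~sample["Gender"].isin(["Male", "Female"]), "Gender"]

print(sample["Gender"].tolist())
print(unexpected.tolist())
```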

9. Resetting Index

After removing rows, the index keeps gaps where the dropped rows were. Reset it to get a continuous index starting at 0.

Python

df = df.reset_index(drop=True)
print(df)

# Output:
# Clean index
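The short sketch below shows the index before and after a reset, and what happens when `drop=True` is left out. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing score
sample = pd.DataFrame({"Score": [10, None, 30, 40]})

# Dropping a row leaves a gap in the index labels
filtered = sample.dropna()
print(list(filtered.index))

# drop=True discards the old labels and renumbers from 0
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))

# Without drop=True, the old labels are kept in a new "index" column
kept = filtered.reset_index()
print(list(kept.columns))
```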

Key Points to Remember

Always analyze data before cleaning
Handle missing values appropriately
Remove duplicates early
Fix incorrect data types
Standardize text data
Clean data improves analysis accuracy
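As a rough illustration of how these points can fit together, the sketch below chains several cleaning steps on a small made-up DataFrame. The data, the order of the steps, and the choice of the median as fill value are assumptions for this example, and real datasets may need a different sequence.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row, missing ages, messy city names
raw = pd.DataFrame({
    "Full Name": ["Asha", "Ravi", "Ravi", "Meena"],
    "Age": [28, np.nan, np.nan, 41],
    "City": [" Pune", "delhi ", "delhi ", "PUNE"],
})

clean = (
    raw.drop_duplicates()                       # remove exact duplicate rows first
    .rename(columns={"Full Name": "Name"})      # standardize the column name
    .assign(
        Age=lambda d: d["Age"].fillna(d["Age"].median()),   # fill missing ages
        City=lambda d: d["City"].str.strip().str.lower(),   # standardize text
    )
    .reset_index(drop=True)                     # renumber rows from 0
)

print(clean)
```

Removing duplicates before filling means the repeated row does not influence the fill value.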
