What Is Data Cleaning in Pandas?

Data cleaning is the process of identifying and fixing errors, inconsistencies, and missing values in a dataset. Clean data is essential for accurate analysis and reliable results.

Pandas provides built-in methods for each of these tasks, such as isnull(), dropna(), fillna(), drop_duplicates(), and astype(), so most real-world cleaning can be done in a few lines.

Why Data Cleaning Is Important

Dirty data can cause:

  • Incorrect analysis results

  • Model errors in machine learning

  • Misleading insights

  • Application failures

Cleaning data ensures:

  • Accuracy

  • Consistency

  • Reliability

Load Sample Data

Python
import pandas as pd

df = pd.read_csv("data.csv")

1. Handling Missing Values

Detect Missing Values

Python
print(df.isnull())
print(df.isnull().sum())
# Output:
# Boolean DataFrame
# Count of missing values per column

Remove Rows with Missing Values

Python
df_clean = df.dropna()
print(df_clean)
# Output:
# DataFrame without missing rows
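Dropping every row that has any missing value can discard more data than intended. The following is a minimal sketch of narrower options, using a small hypothetical DataFrame (the names and ages are made up for illustration) rather than the contents of data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; values are illustrative only.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", None, "Dee"],
    "Age": [28, np.nan, 35, 41],
})

# Count missing values per column before deciding what to drop.
missing_per_column = df.isnull().sum()

# Default: drop a row if ANY column is missing.
any_missing_dropped = df.dropna()

# Drop a row only if "Age" is missing; a missing "Name" is tolerated.
age_present = df.dropna(subset=["Age"])

# Drop a row only if ALL columns are missing (no such row here).
all_missing_dropped = df.dropna(how="all")

print(len(any_missing_dropped), len(age_present), len(all_missing_dropped))
```

Which variant fits depends on which columns the later analysis actually needs.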

Fill Missing Values

Python
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
# Output:
# Missing Age values replaced with mean
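The mean is sensitive to extreme values, so the median is sometimes a safer fill value. A small sketch comparing the two, with made-up ages chosen so the difference is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 100.0 is deliberately extreme.
df = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 100.0]})

# mean() and median() skip NaN by default.
mean_filled = df["Age"].fillna(df["Age"].mean())      # fills with 50.0
median_filled = df["Age"].fillna(df["Age"].median())  # fills with 30.0

print(mean_filled.tolist())
print(median_filled.tolist())
```

Neither choice is universally correct; the point is only that the fill value should be chosen with the column's distribution in mind.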

2. Removing Duplicate Data

Detect Duplicates

Python
print(df.duplicated())
print(df.duplicated().sum())
# Output:
# Boolean series
# Number of duplicate rows

Remove Duplicates

Python
df = df.drop_duplicates()
print(df)
# Output:
# Duplicate rows removed
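By default, a row counts as a duplicate only when every column matches an earlier row. The sketch below, on hypothetical data, shows how the subset and keep parameters of drop_duplicates() change that:

```python
import pandas as pd

# Hypothetical sample data; the "Email" and "Plan" columns are illustrative.
df = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com", "b@example.com"],
    "Plan": ["free", "free", "free", "pro"],
})

# Rows identical to an earlier row in every column.
full_dupes = df.duplicated().sum()

# Remove exact duplicates only.
deduped_rows = df.drop_duplicates()

# Treat rows with the same Email as duplicates and keep the last one seen.
latest_per_email = df.drop_duplicates(subset="Email", keep="last")

print(full_dupes)
print(latest_per_email)
```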

3. Fixing Data Types

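Incorrect data types can break calculations. For example, a numeric column read in as text cannot be averaged until it is converted to a numeric type.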

Python
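# astype(int) raises an error if the column still contains missing values,
# so fill or drop them first. Fractional values are truncated.
df["Age"] = df["Age"].astype(int)
print(df.dtypes)
# Output:
# Data type of each column, with Age now an integer type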
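When a column mixes numbers with unparseable text or missing values, a direct integer cast fails. One possible approach, sketched here on hypothetical data, is to coerce with pd.to_numeric() and then use the nullable "Int64" dtype:

```python
import pandas as pd

# Hypothetical column read in as text, with one bad entry and one missing value.
df = pd.DataFrame({"Age": ["25", "31", "unknown", None]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
age_numeric = pd.to_numeric(df["Age"], errors="coerce")

# "Int64" (capital I) is pandas' nullable integer dtype; it can hold <NA>.
df["Age"] = age_numeric.astype("Int64")

print(df["Age"])
print(df.dtypes)
```

Coercion silently hides bad entries, so it is worth counting the resulting missing values afterwards.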

4. Renaming Columns

Standardizing column names improves readability.

Python
df = df.rename(columns={"Full Name": "Name", "DOB": "DateOfBirth"})
print(df.columns)
# Output:
# Updated column names
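When many columns need the same treatment, renaming them one by one gets tedious. A minimal sketch that normalizes every column name at once, assuming a snake_case convention (that convention and the column names are choices made for this example):

```python
import pandas as pd

# Hypothetical messy headers with stray spaces and mixed case.
df = pd.DataFrame(columns=[" Full Name", "DOB ", "Home City"])

# Trim whitespace, lowercase, and replace inner spaces with underscores.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```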

5. Removing Unnecessary Columns

Python
df = df.drop("Unnamed: 0", axis=1)
print(df)
# Output:
# Column removed
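An "Unnamed: 0" column typically appears when a CSV was saved with its index and then read back. The sketch below reproduces that with an in-memory CSV (the text is made up) and shows two ways to deal with it; "Notes" is a hypothetical column that does not exist, included to show errors="ignore":

```python
import io

import pandas as pd

# In-memory CSV whose first header cell is empty, as written by to_csv() with an index.
csv_text = ",Name,Age\n0,Ann,28\n1,Bob,35\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Option 1: drop by name; errors="ignore" skips names that are not present.
cleaned = raw.drop(columns=["Unnamed: 0", "Notes"], errors="ignore")

# Option 2: tell read_csv that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(cleaned.columns))
print(list(indexed.columns))
```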

6. Handling Outliers

Python
print(df["Salary"].describe())
# Output:
# Summary statistics to identify outliers

Filter extreme values:

Python
df = df[df["Salary"] < 100000]
print(df)
# Output:
# Outliers removed
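A fixed cutoff such as 100000 only works when a sensible limit is known in advance. One common alternative is the 1.5 × IQR rule, sketched here on made-up salaries; it is a different technique from the fixed threshold, not a replacement the data requires:

```python
import pandas as pd

# Hypothetical salaries with one obviously extreme value.
df = pd.DataFrame({"Salary": [40000, 42000, 45000, 47000, 50000, 400000]})

q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

within_bounds = df["Salary"].between(lower, upper)
df_no_outliers = df[within_bounds]

print(lower, upper)
print(df_no_outliers)
```

The 1.5 multiplier is a convention, not a law; whether a flagged value is an error or a genuine extreme still needs a human judgment.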

7. Standardizing Text Data

Python
df["City"] = df["City"].str.strip().str.lower()
print(df["City"])
# Output:
# Standardized text
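Stripping and lowercasing do not fix repeated spaces inside a value. A small sketch, on hypothetical city names, that also collapses internal whitespace so that variants of the same value compare equal:

```python
import pandas as pd

# Hypothetical variants of the same city, plus one missing value.
df = pd.DataFrame({"City": ["  New York", "new york ", "NEW  YORK", None]})

city = df["City"].str.strip().str.lower()

# Collapse any run of whitespace into a single space.
city = city.str.replace(r"\s+", " ", regex=True)

print(city.tolist())
print(city.nunique())
```

The .str methods leave missing values as missing, so no separate null check is needed here.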

8. Replacing Incorrect Values

Python
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})
print(df["Gender"])
# Output:
# Corrected values
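replace() leaves any value that is not in the mapping untouched, which can hide variants the mapping missed. The sketch below, on made-up data, compares it with map() and lists the leftovers:

```python
import pandas as pd

# Hypothetical values, including a lowercase variant the mapping does not cover.
df = pd.DataFrame({"Gender": ["M", "F", "Male", "f"]})
mapping = {"M": "Male", "F": "Female"}

# replace(): unmatched values are kept as they are.
replaced = df["Gender"].replace(mapping)

# map(): unmatched values become NaN.
mapped = df["Gender"].map(mapping)

# Values that still fall outside the expected categories.
leftover = sorted(set(replaced) - {"Male", "Female"})

print(replaced.tolist())
print(mapped.tolist())
print(leftover)
```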

9. Resetting Index

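After rows are removed, the index keeps the labels of the surviving rows and so contains gaps. Resetting it restores consecutive labels starting at 0.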

Python
df = df.reset_index(drop=True)
print(df)
# Output:
# Clean index
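To make the effect of drop=True concrete, here is a short sketch on a hypothetical column, showing the index before and after, and what happens when drop is left at its default:

```python
import pandas as pd

# Hypothetical scores; None becomes NaN in a numeric column.
df = pd.DataFrame({"Score": [10, None, 30, None, 50]})

filtered = df.dropna()
print(list(filtered.index))  # gaps remain where rows were dropped

# drop=True discards the old labels.
reset = filtered.reset_index(drop=True)

# Default (drop=False) keeps the old labels in a new "index" column.
kept = filtered.reset_index()

print(list(reset.index))
print(kept)
```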

Key Points to Remember

  • Always analyze data before cleaning

  • Handle missing values appropriately

  • Remove duplicates early

  • Fix incorrect data types

  • Standardize text data

  • Clean data improves analysis accuracy