What Is Data Cleaning in Pandas?
Data cleaning is the process of identifying and fixing errors, inconsistencies, and missing values in a dataset. Clean data is essential for accurate analysis and reliable results.
Pandas provides powerful and easy-to-use functions to clean real-world data efficiently.
Why Data Cleaning Is Important
Dirty data can cause:
- Incorrect analysis results
- Model errors in machine learning
- Misleading insights
- Application failures
Cleaning data improves:
- Accuracy
- Consistency
- Reliability
Load Sample Data
    import pandas as pd
    df = pd.read_csv("data.csv")
1. Handling Missing Values
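Since the contents of `data.csv` are not shown, the sketch below builds a small hypothetical DataFrame (its column names and values are assumptions made for illustration) and walks through the three operations covered in this section: counting missing values, dropping rows only when a key column is missing, and filling a numeric column with its median rather than its mean.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; not taken from data.csv.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Age": [25.0, np.nan, 40.0, 31.0],
    "City": ["Pune", "Delhi", "Pune", None],
})

# Count of missing values per column.
missing_per_column = df.isna().sum()

# Drop a row only if "Name" is missing, rather than dropping every
# row that has a missing value anywhere.
named = df.dropna(subset=["Name"])

# Fill missing ages with the median, which is less sensitive to
# extreme values than the mean.
filled = named.assign(Age=named["Age"].fillna(named["Age"].median()))

print(missing_per_column)
print(filled)
```

Whether to drop or fill depends on the column: dropping is usually safer for identifiers, while filling keeps more rows for numeric analysis.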
Detect Missing Values
    print(df.isnull())
    print(df.isnull().sum())
    # Output:
    # Boolean DataFrame
    # Count of missing values per column
Remove Rows with Missing Values
    df_clean = df.dropna()
    print(df_clean)
    # Output:
    # DataFrame without rows that contain missing values
Fill Missing Values
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    print(df)
    # Output:
    # Missing Age values replaced with mean
2. Removing Duplicate Data
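Rows can be duplicated across every column or only on a key column. The sketch below, on a hypothetical DataFrame whose column names and values are assumptions, uses the `subset` and `keep` parameters of `duplicated()` and `drop_duplicates()` to handle the second case.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share an Email but differ in Name casing;
# row 3 repeats row 2 exactly.
df = pd.DataFrame({
    "Name": ["Ana", "ana", "Ben", "Ben"],
    "Email": ["ana@example.com", "ana@example.com",
              "ben@example.com", "ben@example.com"],
})

# Full-row duplicates: rows identical to an earlier row in every column.
full_dupes = df.duplicated().sum()

# Key-based duplicates: rows that repeat an earlier Email.
email_dupes = df.duplicated(subset=["Email"]).sum()

# Keep the first row for each Email and discard the rest.
deduped = df.drop_duplicates(subset=["Email"], keep="first")

print(full_dupes, email_dupes)
print(deduped)
```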
Detect Duplicates
    print(df.duplicated())
    print(df.duplicated().sum())
    # Output:
    # Boolean Series
    # Number of duplicate rows
Remove Duplicates
    df = df.drop_duplicates()
    print(df)
    # Output:
    # Duplicate rows removed
3. Fixing Data Types
Incorrect data types can affect calculations.
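To illustrate, the sketch below uses a hypothetical `Age` column that was read as text (the values are assumptions). Sorted as text, the values come out in character order rather than numeric order; `pd.to_numeric` with `errors="coerce"` converts what it can and turns unparseable entries into `NaN`.

```python
import pandas as pd

# Hypothetical column read as text, with one unparseable entry.
ages = pd.Series(["9", "25", "10", "unknown"], name="Age")

# Sorted as text, "10" comes before "9" because strings are compared
# character by character.
sorted_as_text = ages.sort_values().tolist()

# Convert to numbers; entries that cannot be parsed become NaN.
numeric_ages = pd.to_numeric(ages, errors="coerce")

# Sorted as numbers, the order is numeric and NaN is placed last.
sorted_as_numbers = numeric_ages.sort_values().tolist()

print(sorted_as_text)
print(sorted_as_numbers)
```

After coercion, the remaining `NaN` values have to be filled or dropped before converting to a plain integer type, because `NaN` cannot be stored in an `int` column.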
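    # astype(int) raises an error on missing values and truncates decimals,
    # so fill missing ages first and round if needed.
    df["Age"] = df["Age"].astype(int)
    print(df.dtypes)
    # Output:
    # Corrected data types
4. Renaming Columns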
Standardizing column names improves readability.
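When many columns need the same treatment, renaming them one by one becomes tedious. A minimal sketch, assuming the hypothetical column names below, applies one rule to every name through the `.str` accessor on `df.columns`:

```python
import pandas as pd

# Hypothetical column names with inconsistent spacing and casing.
df = pd.DataFrame(columns=[" Full Name", "DOB ", "Home City"])

# Trim whitespace, lowercase, and replace inner spaces with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))
```

For a handful of specific columns, an explicit `rename` mapping is easier to read and review.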
    df = df.rename(columns={"Full Name": "Name", "DOB": "DateOfBirth"})
    print(df.columns)
    # Output:
    # Updated column names
5. Removing Unnecessary Columns
    df = df.drop("Unnamed: 0", axis=1)
    print(df)
    # Output:
    # Column removed
6. Handling Outliers
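The right cutoff for an outlier depends on the data. One common rule of thumb, shown here as a sketch on hypothetical salaries rather than as the method this tutorial prescribes, treats values more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
df = pd.DataFrame({"Salary": [30000, 32000, 35000, 36000, 40000, 500000]})

# Quartiles, interquartile range, and the 1.5 * IQR fences.
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose Salary lies inside the fences (inclusive).
within_fences = df[df["Salary"].between(lower, upper)]

print(lower, upper)
print(within_fences)
```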
    print(df["Salary"].describe())
    # Output:
    # Summary statistics to identify outliers
Filter extreme values:
    df = df[df["Salary"] < 100000]
    print(df)
    # Output:
    # Rows with Salary of 100000 or more removed
    # (rows with a missing Salary are dropped as well)
7. Standardizing Text Data
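Values that differ only in whitespace or casing are treated as separate categories until they are standardized. The sketch below, using hypothetical city values, counts distinct values before and after cleaning.

```python
import pandas as pd

# Hypothetical city values that differ only in whitespace and casing.
cities = pd.Series([" Delhi", "delhi ", "DELHI", "Pune"], name="City")

# Before cleaning, each variant counts as a separate value.
distinct_before = cities.nunique()

# Strip surrounding whitespace and lowercase every value.
cleaned = cities.str.strip().str.lower()
distinct_after = cleaned.nunique()

print(distinct_before, distinct_after)
```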
    df["City"] = df["City"].str.strip().str.lower()
    print(df["City"])
    # Output:
    # Standardized text
8. Replacing Incorrect Values
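A replacement mapping only changes the values it lists; anything else passes through unchanged. The sketch below, on a hypothetical `Gender` column, applies a mapping and then lists the values that still fall outside the expected set so they can be reviewed.

```python
import pandas as pd

# Hypothetical column with abbreviations and two unexpected codes.
gender = pd.Series(["M", "F", "Male", "f", "X"], name="Gender")

# Map known abbreviations to full labels; other values are left as they are.
mapping = {"M": "Male", "F": "Female"}
replaced = gender.replace(mapping)

# Collect values that are still outside the expected set.
expected = {"Male", "Female"}
unexpected = sorted(set(replaced.unique()) - expected)

print(replaced.tolist())
print(unexpected)
```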
    df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})
    print(df["Gender"])
    # Output:
    # Corrected values
9. Resetting Index
After removing rows, reset the index so that row labels are consecutive again.
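As a small illustration on hypothetical data, filtering keeps the original row labels and leaves a gap, which `reset_index(drop=True)` closes:

```python
import pandas as pd

# Hypothetical data; the filter removes the middle row.
df = pd.DataFrame({"Salary": [30000, 250000, 45000]})
filtered = df[df["Salary"] < 100000]

# The original labels survive filtering, so the index now has a gap.
labels_before = filtered.index.tolist()

# drop=True discards the old labels instead of adding them as a column.
reset = filtered.reset_index(drop=True)
labels_after = reset.index.tolist()

print(labels_before, labels_after)
```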
    df = df.reset_index(drop=True)
    print(df)
    # Output:
    # Clean index
Key Points to Remember
- Always analyze data before cleaning
- Handle missing values appropriately
- Remove duplicates early
- Fix incorrect data types
- Standardize text data
- Clean data improves analysis accuracy
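Several of the steps in this tutorial can be combined into a single function. The sketch below is one possible arrangement, not a fixed recipe; the function name `clean`, the sample data, and the choice of a median fill are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply several of the cleaning steps from this tutorial in order."""
    out = df.copy()
    # Standardize text first so near-identical rows become exact duplicates.
    out["City"] = out["City"].str.strip().str.lower()
    out = out.drop_duplicates()
    # Fill missing ages with the median, then convert to integers.
    out["Age"] = out["Age"].fillna(out["Age"].median()).round().astype(int)
    # Renumber the rows after duplicates were removed.
    return out.reset_index(drop=True)

# Hypothetical input data.
raw = pd.DataFrame({
    "Name": ["Ana", "Ana", "Ben", "Dee"],
    "Age": [25.0, 25.0, np.nan, 31.0],
    "City": [" Pune", "pune", "Delhi", "PUNE "],
})

cleaned = clean(raw)
print(cleaned)
```

Working on a copy leaves the raw DataFrame untouched, which makes it easy to compare the data before and after cleaning.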