What Are Duplicate Records in Pandas?
Duplicate records are rows that repeat the same data across all or selected columns. Duplicates commonly arise from:
- Merging datasets
- Data entry errors
- Repeated file imports
- API or logging issues
Removing duplicates is an essential step in data cleaning to ensure accurate analysis.
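As a minimal sketch of the "repeated file imports" case, the snippet below concatenates the same batch twice. The `batch` DataFrame is a hypothetical stand-in for one imported file, not data from this tutorial.

```python
import pandas as pd

# Hypothetical batch standing in for the contents of one imported file.
batch = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Loading the same batch twice and concatenating repeats every row.
combined = pd.concat([batch, batch], ignore_index=True)

print(len(combined))                # 4
print(combined.duplicated().sum())  # 2
```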
Why Removing Duplicates Is Important
Duplicate data can:
- Skew counts and aggregates
- Bias statistical results
- Increase storage and processing cost
- Mislead insights and models
Cleaning duplicates improves data integrity and reliability.
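To make the effect on counts and aggregates concrete, here is a small sketch that compares the row count and mean age before and after deduplication. It uses the same sample data as the rest of this tutorial; the variable name `clean` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# With duplicates: Alice and Bob are each counted twice.
print(len(df), df["Age"].mean())        # 5 29.0

# Without duplicates: each person contributes once.
clean = df.drop_duplicates()
print(len(clean), clean["Age"].mean())  # 3 30.0
```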
Load Sample Data
    import pandas as pd

    df = pd.DataFrame({
        "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
        "Age": [25, 30, 25, 35, 30],
        "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
    })
    print(df)
    # Output:
    #       Name  Age     City
    # 0    Alice   25    Delhi
    # 1      Bob   30   Mumbai
    # 2    Alice   25    Delhi
    # 3  Charlie   35  Chennai
    # 4      Bob   30   Mumbai

1. Detect Duplicate Rows
Check Duplicate Status
    print(df.duplicated())
    # Output:
    # 0    False
    # 1    False
    # 2     True
    # 3    False
    # 4     True
    # dtype: bool

True indicates a duplicate row (based on all columns by default).
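By default, `duplicated()` leaves the first occurrence unmarked. When reviewing data it can help to see every member of each duplicate group; one way to sketch that is `keep=False`, which marks all repeated rows. The variable name `dupes` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# keep=False marks every row that has at least one identical twin.
dupes = df[df.duplicated(keep=False)]

# Sorting groups identical rows next to each other for easier inspection.
print(dupes.sort_values(["Name", "Age", "City"]))
# Shows the two Alice rows (labels 0 and 2) and the two Bob rows (labels 1 and 4)
```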
Count Duplicate Rows
    print(df.duplicated().sum())
    # Output:
    # 2

2. Remove Duplicate Rows
Remove All Duplicates (Keep First)
    df_clean = df.drop_duplicates()
    print(df_clean)
    # Output:
    #       Name  Age     City
    # 0    Alice   25    Delhi
    # 1      Bob   30   Mumbai
    # 3  Charlie   35  Chennai

By default, drop_duplicates() keeps the first occurrence.
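One way to see how detection and removal relate: with default arguments, dropping duplicates gives the same result as filtering out the rows that `duplicated()` marks as `True`. The sketch below checks this on the sample data; `via_mask` and `via_method` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# Boolean mask: True for every repeat after the first occurrence.
mask = df.duplicated()

via_mask = df[~mask]              # keep rows that are not marked
via_method = df.drop_duplicates() # built-in shortcut

print(via_mask.equals(via_method))  # True
```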
Keep Last Occurrence
    df_clean = df.drop_duplicates(keep="last")
    print(df_clean)
    # Output:
    #       Name  Age     City
    # 2    Alice   25    Delhi
    # 3  Charlie   35  Chennai
    # 4      Bob   30   Mumbai

Remove All Duplicates (Keep None)
    df_clean = df.drop_duplicates(keep=False)
    print(df_clean)
    # Output:
    #       Name  Age     City
    # 3  Charlie   35  Chennai

This removes all rows that have duplicates.
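The three `keep` options can be compared side by side by looking at which index labels survive. This is a small sketch on the sample data; the `kept_labels` dictionary is only an illustrative helper.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# Record the surviving index labels for each keep option.
kept_labels = {}
for keep in ("first", "last", False):
    kept_labels[keep] = df.drop_duplicates(keep=keep).index.tolist()
    print(f"keep={keep!r}: {kept_labels[keep]}")
# keep='first': [0, 1, 3]
# keep='last': [2, 3, 4]
# keep=False: [3]
```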
3. Remove Duplicates Based on Specific Columns
Sometimes duplicates should be identified using only selected columns.
    df_clean = df.drop_duplicates(subset=["Name"])
    print(df_clean)
    # Output:
    #       Name  Age     City
    # 0    Alice   25    Delhi
    # 1      Bob   30   Mumbai
    # 3  Charlie   35  Chennai

Here, duplicates are detected based only on the Name column.
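In the sample data, `subset=["Name"]` happens to give the same result as a full-row comparison. The difference shows up when rows share a key but differ elsewhere. The sketch below uses hypothetical records with an assumed `Updated` timestamp column (not part of the tutorial's data) and keeps the most recent row per name.

```python
import pandas as pd

# Hypothetical records: Alice appears twice with different Age values.
records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 26],
    "Updated": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-03-01"]),
})

# Full-row comparison finds nothing: no two rows match in every column.
print(records.duplicated().sum())                 # 0

# Matching on Name alone flags the second Alice row.
print(records.duplicated(subset=["Name"]).sum())  # 1

# Sort by timestamp, then keep the last (most recent) row for each Name.
latest = (
    records.sort_values("Updated")
    .drop_duplicates(subset=["Name"], keep="last")
)
print(latest["Age"].tolist())                     # [30, 26]
```

Which row to keep for a given key is a business decision; sorting before deduplicating is one common way to make that choice explicit.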
4. Remove Duplicates In-Place
    df.drop_duplicates(inplace=True)
    print(df)
    # Output:
    #       Name  Age     City
    # 0    Alice   25    Delhi
    # 1      Bob   30   Mumbai
    # 3  Charlie   35  Chennai

Use inplace=True to modify the original DataFrame directly.
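Reassigning the result is an alternative to `inplace=True` and leaves the original object available for comparison. The sketch below shows that both routes produce the same rows; `original`, `reassigned`, and `modified` are illustrative names, and the sketch deliberately does not rely on the return value of the in-place call.

```python
import pandas as pd

original = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# Route 1: reassignment returns a new DataFrame; `original` is untouched.
reassigned = original.drop_duplicates()

# Route 2: in-place modification of a separate copy.
modified = original.copy()
modified.drop_duplicates(inplace=True)

print(len(original), len(reassigned), len(modified))  # 5 3 3
print(reassigned.equals(modified))                    # True
```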
5. Reset Index After Removing Duplicates
    df = df.reset_index(drop=True)
    print(df)
    # Output:
    #       Name  Age     City
    # 0    Alice   25    Delhi
    # 1      Bob   30   Mumbai
    # 2  Charlie   35  Chennai

Best Practices
- Always detect duplicates before removing them
- Decide whether to keep first, last, or none
- Use subset carefully for business logic
- Reset index after cleaning
- Document duplicate-removal rules
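The practices above can be combined into one small helper. This is a sketch under stated assumptions: `remove_duplicates` is a hypothetical function name, and it relies on the `ignore_index` parameter of `drop_duplicates()` (available in pandas 1.0 and later) to renumber the index in the same call.

```python
import pandas as pd

def remove_duplicates(frame, subset=None, keep="first"):
    """Hypothetical helper: report, drop, and renumber in one step."""
    # Detect first, so the number of affected rows is visible before removal.
    n_dupes = int(frame.duplicated(subset=subset, keep=keep).sum())
    # Print the rule that was applied, as a lightweight form of documentation.
    print(f"Dropping {n_dupes} row(s); subset={subset}, keep={keep!r}")
    # ignore_index=True relabels the result 0..n-1.
    return frame.drop_duplicates(subset=subset, keep=keep, ignore_index=True)

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

clean = remove_duplicates(df)
print(clean.index.tolist())  # [0, 1, 2]
```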