Removing Duplicates in Pandas

Last updated: Jan 26, 2026

Author: Lingeshk K V

What Are Duplicate Records in Pandas?

Duplicate records are rows that repeat the same data across all or selected columns. Duplicates commonly arise from:

Merging datasets
Data entry errors
Repeated file imports
API or logging issues

Removing duplicates is an essential step in data cleaning to ensure accurate analysis.
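As a minimal sketch of how one of these causes plays out, the example below simulates a repeated file import by concatenating the same batch twice. The `batch` data and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical batch of records, standing in for one imported file.
batch = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "City": ["Delhi", "Mumbai"],
})

# Loading the same batch twice (a repeated import) doubles every row.
combined = pd.concat([batch, batch], ignore_index=True)

print(len(batch))                    # 2
print(len(combined))                 # 4
print(combined.duplicated().sum())   # 2
```

Every row of the second copy is flagged, because each one repeats a row that already appeared.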

Why Removing Duplicates Is Important

Duplicate data can:

Skew counts and aggregates
Bias statistical results
Increase storage and processing cost
Mislead insights and models

Cleaning duplicates improves data integrity and reliability.
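To make the effect on aggregates concrete, here is a small sketch with made-up order data (the `orders` frame and its column names are assumptions, not part of the sample data used later). One order is recorded twice, which inflates the total.

```python
import pandas as pd

# Hypothetical order data; order 102 was recorded twice.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [50, 80, 80, 20],
})

# Total with the duplicate row still present.
total_raw = orders["amount"].sum()

# Total after dropping the exact duplicate row.
total_clean = orders.drop_duplicates()["amount"].sum()

print(total_raw)    # 230
print(total_clean)  # 150
```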

Load Sample Data

Python

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

print(df)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 2    Alice   25    Delhi
# 3  Charlie   35  Chennai
# 4      Bob   30   Mumbai

1. Detect Duplicate Rows

Check Duplicate Status

Python

print(df.duplicated())

# Output:
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

True marks a row that repeats an earlier row. By default all columns are compared, and the first occurrence of each row is marked False.

Count Duplicate Rows

Python

print(df.duplicated().sum())

# Output:
# 2
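Before removing anything, it can help to look at the duplicated rows themselves. The sketch below rebuilds the sample DataFrame so it runs on its own, then uses `duplicated(keep=False)` to select every copy in each duplicate group, including the first occurrence. The name `all_copies` is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
all_copies = df[df.duplicated(keep=False)]

# Sorting puts matching rows next to each other for easier review.
print(all_copies.sort_values(["Name", "Age", "City"]))
```

Here the selection contains the two Alice rows and the two Bob rows; Charlie appears only once and is left out.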

2. Remove Duplicate Rows

Remove All Duplicates (Keep First)

Python

df_clean = df.drop_duplicates()
print(df_clean)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 3  Charlie   35  Chennai

By default, drop_duplicates() keeps the first occurrence of each row and returns a new DataFrame, leaving df unchanged.

Keep Last Occurrence

Python

df_clean = df.drop_duplicates(keep="last")
print(df_clean)

# Output:
#       Name  Age     City
# 2    Alice   25    Delhi
# 3  Charlie   35  Chennai
# 4      Bob   30   Mumbai
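When all columns are compared, the rows kept by keep="first" and keep="last" have identical values; only the index labels and the row order differ. The following sketch checks this on the sample data (rebuilt here so the block runs on its own; `first`, `last`, and `same_rows` are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

first = df.drop_duplicates(keep="first")
last = df.drop_duplicates(keep="last")

print(first.index.tolist())  # [0, 1, 3]
print(last.index.tolist())   # [2, 3, 4]

# Ignoring index labels and row order, both results hold the same rows.
same_rows = (
    first.sort_values("Name").reset_index(drop=True)
    .equals(last.sort_values("Name").reset_index(drop=True))
)
print(same_rows)  # True
```

The choice between first and last matters more when only some columns are compared, because the remaining columns may then differ between the copies.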

Remove All Duplicates (Keep None)

Python

df_clean = df.drop_duplicates(keep=False)
print(df_clean)

# Output:
#       Name  Age     City
# 3  Charlie   35  Chennai

This removes every row that has a duplicate, including its first occurrence, so only rows that appear exactly once remain.
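One way to see what keep=False does is to build the same result from a boolean mask. This is only a sketch for illustration (the names `mask`, `removed`, and `unique_only` are not from the original), and it also gives you the removed rows for review.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# True for every row that has at least one identical copy elsewhere.
mask = df.duplicated(keep=False)

removed = df[mask]        # rows that keep=False would drop
unique_only = df[~mask]   # rows that appear exactly once

print(len(removed))       # 4
print(unique_only.equals(df.drop_duplicates(keep=False)))  # True
```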

3. Remove Duplicates Based on Specific Columns

Sometimes duplicates should be identified using only selected columns.

Python

df_clean = df.drop_duplicates(subset=["Name"])
print(df_clean)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 3  Charlie   35  Chennai

Here, duplicates are detected based only on the Name column.
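A narrow subset can drop rows that are not really duplicates. The sketch below uses made-up data (the `people` frame is an assumption for illustration) in which two different people share a name: matching on Name alone discards one of them, while matching on Name and City keeps both.

```python
import pandas as pd

# Hypothetical data: two different people share the name "Alice".
people = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob", "Bob"],
    "Age": [25, 41, 30, 30],
    "City": ["Delhi", "Pune", "Mumbai", "Mumbai"],
})

# Matching on Name only treats both Alices as the same record.
by_name = people.drop_duplicates(subset=["Name"])

# Matching on Name and City keeps the two Alices apart.
by_name_city = people.drop_duplicates(subset=["Name", "City"])

print(len(by_name))       # 2
print(len(by_name_city))  # 3
```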

4. Remove Duplicates In-Place

Python

df.drop_duplicates(inplace=True)
print(df)

# Output:
# DataFrame without duplicates

Use inplace=True to modify the original DataFrame directly instead of getting back a new one.
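For comparison, the sketch below shows the in-place call next to the reassignment style, using a small made-up frame (`original`, `deduped`, and `working` are illustrative names). Reassignment leaves the source DataFrame untouched, which is handy if you still need the raw rows.

```python
import pandas as pd

original = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"]})

# Reassignment style: a new DataFrame is returned, original is unchanged.
deduped = original.drop_duplicates()
print(len(original))  # 3
print(len(deduped))   # 2

# In-place style: the object itself is modified.
working = original.copy()
working.drop_duplicates(inplace=True)
print(len(working))   # 2
```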

5. Reset Index After Removing Duplicates

Python

df = df.reset_index(drop=True)
print(df)

# Output:
# Clean index after duplicate removal
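As an alternative sketch, drop_duplicates() also accepts ignore_index=True, which renumbers the index in the same call. This assumes pandas 1.0 or newer, where that parameter is available.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

# Drop duplicates and renumber the index 0..n-1 in one step.
df_clean = df.drop_duplicates(ignore_index=True)

print(df_clean.index.tolist())  # [0, 1, 2]

# Same result as dropping first and resetting the index afterwards.
print(df_clean.equals(df.drop_duplicates().reset_index(drop=True)))  # True
```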

Best Practices

Always detect duplicates before removing them
Decide whether to keep first, last, or none
Use subset carefully for business logic
Reset index after cleaning
Document duplicate-removal rules
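
The practices above can be combined in a small helper. This is only a sketch: `remove_duplicates` is a hypothetical function written for this example, not part of pandas. It reports how many rows will be dropped and under which rule, then returns a cleaned copy with a fresh index.

```python
import pandas as pd


def remove_duplicates(frame, subset=None, keep="first"):
    """Hypothetical helper: report, then drop, duplicate rows."""
    # Detect first, using the same rule that decides what is dropped.
    mask = frame.duplicated(subset=subset, keep=keep)

    # Document the rule applied and its effect.
    print(f"Dropping {int(mask.sum())} of {len(frame)} rows "
          f"(subset={subset}, keep={keep!r})")

    # Keep the unflagged rows and reset the index.
    return frame[~mask].reset_index(drop=True)


df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

clean = remove_duplicates(df)
print(len(clean))  # 3
```

The input DataFrame is not modified, so the raw data stays available for auditing.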