What Are Duplicate Records in Pandas?

Duplicate records are rows that repeat the same data across all or selected columns. Duplicates commonly arise from:

  • Merging datasets

  • Data entry errors

  • Repeated file imports

  • API or logging issues

Removing duplicates is an essential step in data cleaning to ensure accurate analysis.

Why Removing Duplicates Is Important

Duplicate data can:

  • Skew counts and aggregates

  • Bias statistical results

  • Increase storage and processing cost

  • Mislead insights and models

Cleaning duplicates improves data integrity and reliability.
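As a quick illustration of the first point, a minimal sketch (with hypothetical sales data, not from this article) showing how a single duplicate row inflates an aggregate:

```python
import pandas as pd

# Hypothetical sales data: order "A2" was accidentally imported twice
sales = pd.DataFrame({
    "Order": ["A1", "A2", "A2", "A3"],
    "Amount": [100, 250, 250, 50],
})

print(sales["Amount"].sum())                    # 650 -- inflated by the duplicate
print(sales.drop_duplicates()["Amount"].sum())  # 400 -- the true total
```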

Load Sample Data

Python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})
print(df)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 2    Alice   25    Delhi
# 3  Charlie   35  Chennai
# 4      Bob   30   Mumbai

1. Detect Duplicate Rows

Check Duplicate Status

Python
print(df.duplicated())

# Output:
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

True indicates a duplicate row (based on all columns by default).
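The `keep` parameter controls which occurrence is flagged. With `keep=False`, every member of a duplicate group is marked `True`, not just the later copies; this is useful when you want to inspect all affected rows (sketch using the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

# keep=False flags all copies in each duplicate group, including the first
print(df.duplicated(keep=False))
# 0     True
# 1     True
# 2     True
# 3    False
# 4     True
# dtype: bool
```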

Count Duplicate Rows

Python
print(df.duplicated().sum())

# Output:
# 2
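Beyond counting, the boolean result of `duplicated()` can be used as a mask to review the rows that would be dropped before deleting anything (sketch using the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

# Boolean indexing shows exactly which rows drop_duplicates() would remove
dupes = df[df.duplicated()]
print(dupes)
#     Name  Age    City
# 2  Alice   25   Delhi
# 4    Bob   30  Mumbai
```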

2. Remove Duplicate Rows

Remove All Duplicates (Keep First)

Python
df_clean = df.drop_duplicates()
print(df_clean)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 3  Charlie   35  Chennai

By default, drop_duplicates() keeps the first occurrence.

Keep Last Occurrence

Python
df_clean = df.drop_duplicates(keep="last")
print(df_clean)

# Output:
#       Name  Age     City
# 2    Alice   25    Delhi
# 3  Charlie   35  Chennai
# 4      Bob   30   Mumbai

Remove All Duplicates (Keep None)

Python
df_clean = df.drop_duplicates(keep=False)
print(df_clean)

# Output:
#       Name  Age     City
# 3  Charlie   35  Chennai

With keep=False, every row that appears more than once is removed; no copy is retained.

3. Remove Duplicates Based on Specific Columns

Sometimes duplicates should be identified using only selected columns.

Python
df_clean = df.drop_duplicates(subset=["Name"])
print(df_clean)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 3  Charlie   35  Chennai

Here, duplicates are detected based only on the Name column.
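`subset` also accepts multiple columns. A sketch treating a row as a duplicate only when both `Name` and `City` repeat (with this article's sample data the result happens to match the Name-only version, since each name maps to one city):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

# A row counts as a duplicate only if Name AND City both repeat
df_clean = df.drop_duplicates(subset=["Name", "City"])
print(df_clean)
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 3  Charlie   35  Chennai
```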

4. Remove Duplicates In-Place

Python
df.drop_duplicates(inplace=True)
print(df)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 3  Charlie   35  Chennai

Use inplace=True to modify the original DataFrame directly.

5. Reset Index After Removing Duplicates

Python
df = df.reset_index(drop=True)
print(df)

# Output:
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 2  Charlie   35  Chennai
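On pandas 1.0 and later, `drop_duplicates` can renumber the index in the same call via `ignore_index=True`, combining steps 2 and 5 (sketch using the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

# ignore_index=True relabels the result 0, 1, 2, ... in one step
df_clean = df.drop_duplicates(ignore_index=True)
print(df_clean)
#       Name  Age     City
# 0    Alice   25    Delhi
# 1      Bob   30   Mumbai
# 2  Charlie   35  Chennai
```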

Best Practices

  • Always detect duplicates before removing them

  • Decide whether to keep first, last, or none

  • Choose subset columns that match your business definition of a duplicate

  • Reset index after cleaning

  • Document duplicate-removal rules
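The practices above can be sketched as one small routine; `dedupe` is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def dedupe(df, subset=None, keep="first"):
    """Report how many duplicates are removed, then drop them and reset the index."""
    n_dupes = int(df.duplicated(subset=subset, keep=keep).sum())
    print(f"Removing {n_dupes} duplicate row(s)")  # detect and document before removing
    return df.drop_duplicates(subset=subset, keep=keep).reset_index(drop=True)

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [25, 30, 25, 35, 30],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

clean = dedupe(df)  # prints: Removing 2 duplicate row(s)
```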