Pandas Cleaning Wrong Data

Last updated: Jan 26, 2026

Author: Lingeshk K V

What Is Wrong Data in Pandas?

Wrong data refers to values that are incorrect, invalid, inconsistent, or unrealistic, even though they may not be missing. Examples include:

Negative age values
Salary values beyond a realistic range
Misspelled or inconsistent text categories
Invalid entries such as "Unknown" or "NA" in numeric columns
Data that violates business rules

Cleaning wrong data focuses on correcting or removing invalid values, not just handling missing ones.

Why Cleaning Wrong Data Is Important

Wrong data can:

Produce misleading analysis
Corrupt statistical results
Cause machine learning models to fail
Lead to wrong business decisions

Cleaning ensures accuracy, consistency, and trustworthiness of your dataset.

Load Sample Data

Python

import pandas as pd

df = pd.read_csv("data.csv")
print(df)

1. Identify Wrong or Invalid Data

Use Summary Statistics

Python

print(df.describe())

This helps detect:

Negative values where not expected
Extremely large or small values
Unusual ranges
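
As a small illustration of reading the summary, the sketch below builds a made-up DataFrame (the column names and values are assumptions, not the contents of data.csv) and pulls only the minimum and maximum rows out of describe(), since that is where range problems tend to show up.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
sample = pd.DataFrame({
    "Age": [25, -3, 41, 250],
    "Salary": [52000, 61000, -100, 58000],
})

# describe() returns a DataFrame indexed by statistic name
summary = sample.describe()

# Look only at the extremes of each numeric column
extremes = summary.loc[["min", "max"]]
print(extremes)
```

A negative minimum for Age or Salary, or a maximum Age of 250, is a signal to inspect those rows more closely.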

Check Unique Values

Python

print(df["Gender"].unique())

# Output:
# ['Male' 'F' 'female' 'M' 'Unknown']
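
It can also help to see how often each value occurs, not just which values exist. A minimal sketch using value_counts() on made-up data (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
gender = pd.Series(["Male", "F", "female", "M", "Unknown", "Male", None])

# dropna=False keeps missing values in the tally
counts = gender.value_counts(dropna=False)
print(counts)
```

Labels that appear only once or twice are often typos or alternate spellings of a more common label.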

2. Fix Out-of-Range Values

Example: Age should be between 0 and 120.

Python

df.loc[(df["Age"] < 0) | (df["Age"] > 120), "Age"] = pd.NA
print(df)

# Output:
# Invalid age values replaced with NaN
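
An alternative way to express the same range rule is sketched below with between() and where(). It is not the only option, and the sample values are assumptions. Counting the offending rows first gives a record of how much data the rule affects.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
people = pd.DataFrame({"Age": [25.0, -3.0, 41.0, 250.0, 67.0]})

# between() is inclusive on both ends by default
in_range = people["Age"].between(0, 120)
print("Out-of-range ages:", (~in_range).sum())

# where() keeps values that satisfy the condition and sets the rest to NaN
people["Age"] = people["Age"].where(in_range)
print(people)
```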

3. Replace Wrong Values

Correct known incorrect values using replace().

Python

df["Gender"] = df["Gender"].replace({
    "M": "Male",
    "F": "Female",
    "female": "Female",
    "Unknown": pd.NA
})
print(df["Gender"])
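
When a column contains many spelling and casing variants, listing every variant in replace() gets tedious. One possible approach, sketched here on made-up data, is to normalize whitespace and case first and then map the result to canonical labels. The canonical dictionary is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
gender = pd.Series(["Male", " f", "FEMALE", "m ", "Unknown"])

# Assumed mapping from normalized spellings to canonical labels
canonical = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Normalize whitespace and case, then map; unmapped values become NaN
cleaned = gender.str.strip().str.lower().map(canonical)
print(cleaned)
```

Unlike replace(), map() turns every value that is missing from the dictionary into NaN, so "Unknown" ends up as a missing value without being listed explicitly.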

4. Clean Inconsistent Text Data

Standardize text values by trimming surrounding whitespace and applying consistent capitalization.

Python

df["City"] = df["City"].str.strip().str.title()
print(df["City"])

# Output:
# Consistent city names

5. Remove Invalid Characters

Example: a numeric column stored as text with currency symbols and thousands separators.

Python

df["Salary"] = df["Salary"].str.replace("₹", "").str.replace(",", "")
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
print(df["Salary"])
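
If the unwanted characters vary from row to row, a single regular expression can stand in for a chain of replace() calls. The sketch below is one way to do that on made-up values. The pattern keeps only digits, decimal points, and minus signs, which is an assumption about what a valid salary looks like.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
salary = pd.Series(["₹50,000", "₹ 72,500.50", "65000", "N/A"])

# Remove every character that is not a digit, a decimal point, or a minus sign
stripped = salary.str.replace(r"[^0-9.\-]", "", regex=True)

# Empty strings and other unparseable leftovers become NaN
salary_clean = pd.to_numeric(stripped, errors="coerce")
print(salary_clean)
```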

6. Validate Data Using Conditions

Keep only valid rows.

Python

df = df[df["Salary"] > 0]
print(df)

# Output:
# Rows with valid salary values
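
Filtering discards the failing rows for good. A possible variation, sketched on made-up data, is to build the condition as a named mask and keep the rejected rows in a separate DataFrame for review. The names is_valid, valid_rows, and rejected_rows are illustrative.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
staff = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "John"],
    "Salary": [52000.0, -100.0, 0.0, 61000.0],
})

# Build the rule once as a boolean mask
is_valid = staff["Salary"] > 0

# Split instead of discarding, so rejected rows can be reviewed later
valid_rows = staff[is_valid].reset_index(drop=True)
rejected_rows = staff[~is_valid]

print(valid_rows)
print(rejected_rows)
```

Note that a comparison with NaN evaluates to False, so rows with a missing Salary would also land in rejected_rows.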

7. Correct Data Using Business Rules

Example: Experience should not exceed age.

Python

df.loc[df["Experience"] > df["Age"], "Experience"] = df["Age"]
print(df)
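
The same cap can be written with clip(), which accepts a Series as its upper bound. The sketch below uses made-up data and also records which rows were adjusted before overwriting them. The ExperienceAdjusted column name is hypothetical.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
staff = pd.DataFrame({
    "Age": [30, 45, 28],
    "Experience": [5, 50, 30],
})

# Record which rows broke the rule before changing anything
staff["ExperienceAdjusted"] = staff["Experience"] > staff["Age"]

# clip() with a Series as the upper bound caps each row at its own Age
staff["Experience"] = staff["Experience"].clip(upper=staff["Age"])
print(staff)
```

Keeping the flag column makes it possible to audit later how many values the rule changed.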

8. Handle Wrong Categorical Data

Replace rare or invalid categories.

Python

valid_cities = ["Delhi", "Mumbai", "Chennai", "Pune"]
df.loc[~df["City"].isin(valid_cities), "City"] = "Other"
print(df["City"])
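
When there is no fixed list of valid categories, rare values can be identified by frequency instead. This is a minimal sketch on made-up data, and the cutoff of 2 occurrences is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
city = pd.Series(["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi", "Agra"])

# Assumed cutoff: categories seen fewer than 2 times count as rare
min_count = 2
counts = city.value_counts()
rare = counts[counts < min_count].index

# mask() replaces values where the condition is True
city_grouped = city.mask(city.isin(rare), "Other")
print(city_grouped)
```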

9. Convert Wrong Data to Missing Data

Sometimes the safest option is to mark wrong data as missing.

Python

df["Marks"] = pd.to_numeric(df["Marks"], errors="coerce")
print(df["Marks"])

Values that cannot be parsed as numbers are converted to NaN, which can then be handled using missing-value techniques.
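
Because coercion is silent, it can be worth checking which values were lost. The sketch below, on made-up data, compares the column before and after conversion to list entries that were present originally but became NaN.

```python
import pandas as pd

# Hypothetical sample data; not taken from data.csv
marks = pd.Series(["78", "ninety", "85.5", None, "absent"])

converted = pd.to_numeric(marks, errors="coerce")

# Values that were present before conversion but are NaN afterwards
newly_missing = marks.notna() & converted.isna()
print(marks[newly_missing])
```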

Best Practices for Cleaning Wrong Data

Understand domain rules before cleaning
Use describe() and value_counts() to detect issues
Prefer correcting over deleting when possible
Document all cleaning rules applied
Revalidate data after cleaning
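
Revalidation can be as simple as re-running the cleaning rules as checks. The sketch below is one possible shape for that. The find_violations helper, its rule names, and the sample data are all hypothetical, and the rules mirror the Age, Salary, and Experience examples used earlier.

```python
import pandas as pd

def find_violations(df):
    """Return a dict of rule name -> number of rows breaking that rule."""
    age = df["Age"]
    return {
        # Missing ages are not counted as range violations
        "age_out_of_range": int((~age.between(0, 120) & age.notna()).sum()),
        "salary_not_positive": int((df["Salary"] <= 0).sum()),
        "experience_exceeds_age": int((df["Experience"] > age).sum()),
    }

# Hypothetical cleaned data; not taken from data.csv
cleaned = pd.DataFrame({
    "Age": [30.0, None, 45.0],
    "Salary": [52000.0, 61000.0, 48000.0],
    "Experience": [5, 10, 20],
})

violations = find_violations(cleaned)
print(violations)
```

Any non-zero count after cleaning suggests that a rule was skipped or applied in the wrong order.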
