What Is Wrong Data in Pandas?

Wrong data refers to values that are incorrect, invalid, inconsistent, or unrealistic, even though they may not be missing. Examples include:

  • Negative age values

  • Salary values beyond a realistic range

  • Misspelled or inconsistent text categories

  • Invalid entries such as "Unknown" or "NA" in numeric columns

  • Data that violates business rules

Cleaning wrong data focuses on correcting or removing invalid values, not just handling missing ones.

Why Cleaning Wrong Data Is Important

Wrong data can:

  • Produce misleading analysis

  • Corrupt statistical results

  • Cause machine learning models to fail

  • Lead to wrong business decisions

Cleaning improves the accuracy, consistency, and trustworthiness of your dataset.

Load Sample Data

Python
import pandas as pd

df = pd.read_csv("data.csv")
print(df)
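
The file data.csv is not included here. For experimenting, a small in-memory DataFrame can stand in for it. The sketch below is only an assumption about what such data might look like: the column names match the ones used in this tutorial, but every value is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; all values are made up to show typical problems.
sample = pd.DataFrame({
    "Age": [25, -3, 41, 150],                       # negative and unrealistic ages
    "Gender": ["Male", "F", "female", "Unknown"],   # inconsistent labels
    "City": [" delhi", "MUMBAI ", "Chennai", "pune"],  # stray spaces, mixed case
    "Salary": ["₹50,000", "₹72,500", "abc", "₹0"],  # symbols and non-numeric text
    "Experience": [3, 1, 45, 10],                   # one value exceeds Age
    "Marks": ["88", "NA", "92", "seventy"],         # text in a numeric column
})

print(sample)
```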

1. Identify Wrong or Invalid Data

Use Summary Statistics

Python
print(df.describe())

This helps detect:

  • Negative values where not expected

  • Extremely large or small values

  • Unusual ranges
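
As a small illustration of the points above, the "min" and "max" rows of describe() are often enough to spot impossible values. The ages below are hypothetical.

```python
import pandas as pd

# Hypothetical ages, including two impossible values.
ages = pd.DataFrame({"Age": [25, -3, 41, 150]})

summary = ages.describe()

# The "min" and "max" rows reveal values outside a plausible range.
print(summary.loc[["min", "max"]])
```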

Check Unique Values

Python
print(df["Gender"].unique())
# Output:
# ['Male' 'F' 'female' 'M' 'Unknown']
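
unique() lists the distinct values but not how often each occurs. A possible complement is value_counts(), which shows the frequencies, so rare spellings stand out. The series below is a hypothetical stand-in for the Gender column.

```python
import pandas as pd

# Hypothetical Gender values, including one missing entry.
gender = pd.Series(["Male", "F", "female", "M", "Unknown", "Male", None])

# dropna=False keeps missing values in the tally as well.
counts = gender.value_counts(dropna=False)
print(counts)
```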

2. Fix Out-of-Range Values

Example: Age should be between 0 and 120.

Python
df.loc[(df["Age"] < 0) | (df["Age"] > 120), "Age"] = pd.NA
print(df)
# Output:
# Invalid age values replaced with NaN
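
An alternative way to express the same range check is between() combined with where(). This is only a sketch on hypothetical ages; it also counts the offending values before they are replaced, which can be useful for reporting.

```python
import pandas as pd

# Hypothetical ages; two of them fall outside 0 to 120.
ages = pd.Series([25, -3, 41, 150], name="Age")

# between() is inclusive on both ends by default.
valid = ages.between(0, 120)
print((~valid).sum(), "out-of-range values")

# where() keeps values that pass the check and sets the rest to NaN.
cleaned = ages.where(valid)
print(cleaned)
```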

3. Replace Wrong Values

Correct known incorrect values using replace().

Python
df["Gender"] = df["Gender"].replace({
    "M": "Male",
    "F": "Female",
    "female": "Female",
    "Unknown": pd.NA,
})
print(df["Gender"])
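
replace() needs one entry per spelling. When the variants differ only in case or surrounding spaces, one option is to normalize the text first and then map it. The mapping below is a hypothetical example, and note that map() turns every value missing from the mapping into NaN.

```python
import pandas as pd

# Hypothetical Gender values with mixed case and stray whitespace.
gender = pd.Series(["Male", "F", " female", "M", "Unknown", "MALE"])

# Assumed mapping from normalized (lowercase) labels to the final labels.
mapping = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Strip spaces and lowercase first, so a single mapping covers all variants.
# Values not found in the mapping (here "unknown") become NaN.
normalized = gender.str.strip().str.lower().map(mapping)
print(normalized)
```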

4. Clean Inconsistent Text Data

Standardize text values.

Python
df["City"] = df["City"].str.strip().str.title()
print(df["City"])
# Output:
# Consistent city names
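
strip() only removes leading and trailing whitespace. If names may also contain repeated spaces in the middle, a regular expression can collapse them. A minimal sketch on hypothetical city names:

```python
import pandas as pd

# Hypothetical city names with stray and repeated spaces and mixed case.
city = pd.Series(["  delhi", "MUMBAI ", "new   delhi", "Pune"])

cleaned = (
    city.str.strip()                            # trim both ends
        .str.replace(r"\s+", " ", regex=True)   # collapse inner runs of spaces
        .str.title()                            # capitalize each word
)
print(cleaned.tolist())
```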

5. Remove Invalid Characters

Example: Numeric column with symbols.

Python
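# Remove the currency symbol and thousands separators as literal text.
df["Salary"] = df["Salary"].str.replace("₹", "", regex=False).str.replace(",", "", regex=False)
# Convert to numbers; anything that still cannot be parsed becomes NaN.
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
print(df["Salary"])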
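
When the unwanted symbols vary, a single regular expression can drop everything except digits, the decimal point, and the minus sign. This is a sketch on hypothetical values, and it assumes the stripped characters carry no meaning; a suffix like "k" in "50k" would be silently lost.

```python
import pandas as pd

# Hypothetical salary strings with mixed currency symbols and separators.
salary = pd.Series(["₹50,000.50", "$1,200", "abc"])

# Drop every character that is not a digit, a decimal point, or a minus sign.
digits_only = salary.str.replace(r"[^0-9.\-]", "", regex=True)

# Entries left empty or otherwise unparsable become NaN.
numeric = pd.to_numeric(digits_only, errors="coerce")
print(numeric)
```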

6. Validate Data Using Conditions

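Keep only the rows that satisfy a condition, here a positive salary. Rows with a missing salary are dropped as well, because a comparison with NaN evaluates to False.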

Python
df = df[df["Salary"] > 0]
print(df)
# Output:
# Rows with valid salary values
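
Filtering discards the failing rows for good. If they may need review later, one option is to keep them in a separate DataFrame. The data and the Name column below are hypothetical.

```python
import pandas as pd

# Hypothetical data; "Name" is an illustrative column, not from the tutorial.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Salary": [50000.0, -100.0, None, 72500.0],
})

# NaN > 0 evaluates to False, so missing salaries count as invalid here.
is_valid = df["Salary"] > 0

valid_rows = df[is_valid]
rejected_rows = df[~is_valid]   # kept aside for inspection instead of lost

print(rejected_rows)
```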

7. Correct Data Using Business Rules

Example: Experience should not exceed age.

Python
df.loc[df["Experience"] > df["Age"], "Experience"] = df["Age"]
print(df)
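
It can help to look at the violating rows before overwriting them. The sketch below, on hypothetical data, first lists the violations and then applies the same cap with clip(), which is one possible alternative to the .loc assignment.

```python
import pandas as pd

# Hypothetical data: the second row claims more experience than age.
df = pd.DataFrame({"Age": [30, 25, 40], "Experience": [5, 32, 40]})

# Inspect the rows that break the rule before changing anything.
violations = df["Experience"] > df["Age"]
print(df[violations])

# Cap Experience at Age, row by row.
df["Experience"] = df["Experience"].clip(upper=df["Age"])
print(df)
```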

8. Handle Wrong Categorical Data

Replace invalid categories, here any city outside an approved list, with a catch-all value.

Python
valid_cities = ["Delhi", "Mumbai", "Chennai", "Pune"]
df.loc[~df["City"].isin(valid_cities), "City"] = "Other"
print(df["City"])
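
Rare categories can be handled in a similar way when no approved list exists, by deriving the list from frequencies. The threshold min_count below is a hypothetical choice and would depend on the dataset.

```python
import pandas as pd

# Hypothetical city values; "Delhii" is a one-off typo.
city = pd.Series(["Delhi"] * 3 + ["Mumbai"] * 2 + ["Pune", "Delhii"])

counts = city.value_counts()
min_count = 2  # assumed threshold: categories seen fewer times are "rare"

rare = counts[counts < min_count].index

# Keep frequent categories and replace the rare ones with "Other".
grouped = city.where(~city.isin(rare), "Other")
print(grouped.value_counts())
```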

9. Convert Wrong Data to Missing Data

Sometimes the safest option is to mark wrong data as missing.

Python
df["Marks"] = pd.to_numeric(df["Marks"], errors="coerce")
print(df["Marks"])

Values that cannot be parsed as numbers are converted to NaN, which can then be handled using missing-value techniques.
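
Coercion hides what the original entries were. To see which values failed to parse, compare the result with the original column before overwriting it. A minimal sketch on hypothetical marks:

```python
import pandas as pd

# Hypothetical marks stored as text, with one genuinely missing entry.
marks = pd.Series(["88", "NA", "92", "seventy", None])

numeric = pd.to_numeric(marks, errors="coerce")

# Entries that were present originally but became NaN failed to parse.
failed = marks[numeric.isna() & marks.notna()]
print(failed.tolist())
```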

Best Practices for Cleaning Wrong Data

  • Understand domain rules before cleaning

  • Use describe() and value_counts() to detect issues

  • Prefer correcting over deleting when possible

  • Document all cleaning rules applied

  • Revalidate data after cleaning
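
Revalidation can be as simple as a function that re-checks the rules used in this tutorial. The sketch below is one possible shape for it; the function name validate and the sample data are hypothetical.

```python
import pandas as pd

def validate(df):
    """Return a list of rule violations found in df (empty if none)."""
    problems = []
    # Missing values are ignored here; they are handled separately.
    if not df["Age"].dropna().between(0, 120).all():
        problems.append("Age outside 0-120")
    if not (df["Salary"].dropna() > 0).all():
        problems.append("Non-positive Salary")
    if (df["Experience"] > df["Age"]).any():
        problems.append("Experience exceeds Age")
    return problems

# Hypothetical cleaned and uncleaned data for comparison.
clean = pd.DataFrame({"Age": [30, 45], "Salary": [50000.0, 72500.0], "Experience": [5, 20]})
dirty = pd.DataFrame({"Age": [30, 150], "Salary": [50000.0, -1.0], "Experience": [35, 20]})

print(validate(clean))
print(validate(dirty))
```

Running such a check after every cleaning step, and recording its result, also covers the advice to document the rules applied.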