What Is Wrong Data in Pandas?
Wrong data refers to values that are incorrect, invalid, inconsistent, or unrealistic, even though they may not be missing. Examples include:
- Negative age values
- Salary values beyond a realistic range
- Misspelled or inconsistent text categories
- Invalid entries like "Unknown" or "NA" in numeric columns
- Data that violates business rules
Cleaning wrong data focuses on correcting or removing invalid values, not just handling missing ones.
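To make these categories concrete, here is a small, made-up DataFrame in which every cell is present but several are clearly wrong. The name demo, the column names, and the values are illustrative assumptions, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical records: nothing is missing, yet several values are wrong.
demo = pd.DataFrame({
    "Age": [34, -5, 29, 210],                      # -5 and 210 are unrealistic
    "Gender": ["Male", "F", "female", "Unknown"],  # inconsistent labels
    "Marks": ["88", "NA", "92", "Unknown"],        # text in a numeric column
})

# No cell is missing, so a missing-value check alone finds nothing.
missing_cells = int(demo.isna().sum().sum())
print(missing_cells)

# A simple rule (ages must lie between 0 and 120) does flag the bad values.
bad_age = ~demo["Age"].between(0, 120)
print(demo.loc[bad_age, "Age"].tolist())
```

The point of the sketch is that wrong data passes a missing-value check untouched, so it needs rule-based checks of its own.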
Why Cleaning Wrong Data Is Important
Wrong data can:
- Produce misleading analysis
- Corrupt statistical results
- Cause machine learning models to fail
- Lead to wrong business decisions
Cleaning improves the accuracy, consistency, and trustworthiness of your dataset.
Load Sample Data
    import pandas as pd

    df = pd.read_csv("data.csv")
    print(df)

1. Identify Wrong or Invalid Data
Use Summary Statistics
    print(df.describe())

This helps detect:
- Negative values where not expected
- Extremely large or small values
- Unusual ranges
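The checks above can be sketched in code. The snippet below is a minimal illustration on made-up data: it reads the min, max, and median rows of describe() and flags columns whose range looks suspicious. The column names and the ratio threshold of 10 are assumptions chosen for the example, not general rules.

```python
import pandas as pd

# Made-up numeric data for illustration only.
demo = pd.DataFrame({
    "Age": [25, 41, -3, 37],
    "Salary": [52000, 61000, 58000, 9_900_000],
})

summary = demo.describe()

# Columns whose minimum is negative, where negatives are not expected.
mins = summary.loc["min"]
negative_cols = mins[mins < 0].index.tolist()
print(negative_cols)

# Crude hint of an extreme value: the maximum is far above the median.
# The factor 10 is an arbitrary, illustrative threshold.
ratio = summary.loc["max"] / summary.loc["50%"]
suspicious_cols = ratio[ratio > 10].index.tolist()
print(suspicious_cols)
```

A flagged column is only a candidate for inspection; whether the value is actually wrong depends on the domain.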
Check Unique Values
    print(df["Gender"].unique())
    # Output:
    # ['Male' 'F' 'female' 'M' 'Unknown']

2. Fix Out-of-Range Values
Example: Age should be between 0 and 120.
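    # Mark ages outside the allowed range as missing
    df.loc[(df["Age"] < 0) | (df["Age"] > 120), "Age"] = pd.NA
    print(df)
    # Output:
    # Invalid age values replaced with a missing value
    # (shown as NaN or <NA>, depending on the column dtype)

3. Replace Wrong Values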
Correct known incorrect values using replace().
    df["Gender"] = df["Gender"].replace({
        "M": "Male",
        "F": "Female",
        "female": "Female",
        "Unknown": pd.NA,
    })
    print(df["Gender"])

4. Clean Inconsistent Text Data
Standardize text values.
    df["City"] = df["City"].str.strip().str.title()
    print(df["City"])
    # Output:
    # Consistent city names

5. Remove Invalid Characters
Example: Numeric column with symbols.
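    # Remove the currency symbol and thousands separators as literal text
    df["Salary"] = df["Salary"].str.replace("₹", "", regex=False).str.replace(",", "", regex=False)
    # Anything that still is not a number becomes NaN
    df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
    print(df["Salary"])

6. Validate Data Using Conditions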
Keep only valid rows.
    df = df[df["Salary"] > 0]
    print(df)
    # Output:
    # Rows with valid salary values

7. Correct Data Using Business Rules
Example: Experience should not exceed age.
    df.loc[df["Experience"] > df["Age"], "Experience"] = df["Age"]
    print(df)

8. Handle Wrong Categorical Data
Replace rare or invalid categories.
    valid_cities = ["Delhi", "Mumbai", "Chennai", "Pune"]
    df.loc[~df["City"].isin(valid_cities), "City"] = "Other"
    print(df["City"])

9. Convert Wrong Data to Missing Data
Sometimes the safest option is to mark wrong data as missing.
    df["Marks"] = pd.to_numeric(df["Marks"], errors="coerce")
    print(df["Marks"])

Wrong values are converted to NaN, which can then be handled using missing-value techniques.
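As a small illustration of this conversion, the sketch below counts how many entries became NaN only because of the coercion, which is a useful number to report before choosing a missing-value strategy. The values and the variable names are made up for the example.

```python
import pandas as pd

# Made-up column mixing valid numbers with invalid text entries.
marks = pd.Series(["78", "91", "absent", "NA", "65.5"], name="Marks")

# Invalid strings become NaN instead of raising an error.
numeric_marks = pd.to_numeric(marks, errors="coerce")

# Entries that were present before but are NaN after coercion.
newly_missing = marks.notna() & numeric_marks.isna()
coerced_values = marks[newly_missing].tolist()
print(coerced_values)
print(int(newly_missing.sum()))
```

Keeping the list of coerced originals makes it easier to spot entries that could be corrected rather than discarded.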
Best Practices for Cleaning Wrong Data
- Understand domain rules before cleaning
- Use describe() and value_counts() to detect issues
- Prefer correcting over deleting when possible
- Document all cleaning rules applied
- Revalidate data after cleaning
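The last two practices can be sketched as a small, reusable check. The helper below is hypothetical: the names RULES and validate, the column names, and the rules themselves are assumptions chosen to match the examples in this tutorial. It keeps the cleaning rules in one documented place and reports how many rows still violate each rule.

```python
import pandas as pd

# Each rule maps a readable description to a function that returns
# a boolean Series marking the rows that VIOLATE the rule.
RULES = {
    "Age outside 0-120": lambda d: ~d["Age"].between(0, 120),
    "Salary not positive": lambda d: ~(d["Salary"] > 0),
    "Experience exceeds Age": lambda d: d["Experience"] > d["Age"],
}

def validate(frame: pd.DataFrame) -> dict:
    """Return the number of violating rows for each rule."""
    return {name: int(check(frame).sum()) for name, check in RULES.items()}

# Made-up data to exercise the checks.
demo = pd.DataFrame({
    "Age": [30, 150, 45],
    "Salary": [50000, 42000, -10],
    "Experience": [5, 10, 60],
})
report = validate(demo)
print(report)
```

Running the same function before and after cleaning shows whether each rule's violation count dropped to zero. Note that, as written, missing values in Age or Salary count as violations of the first two rules; whether that is desirable depends on how missing data is handled elsewhere.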