Encoding Categorical Data in Machine Learning

Last updated: Jun 11, 2026

Author: Christy Harshitha Dakarapu

Most Machine Learning algorithms work only with numerical data. However, real-world datasets often contain categorical features such as:

Gender
Country
Education Level
Product Category
Department
Marital Status

These features contain text labels rather than numbers, making them unsuitable for direct use in most Machine Learning algorithms.

Encoding is the process of converting categorical data into numerical representations that Machine Learning models can understand and process effectively.

Encoding is one of the most important steps in data preprocessing because improper encoding can lead to:

Reduced model performance
Incorrect relationships between categories
Data leakage
Biased predictions

In this article, we will explore different encoding techniques, understand when to use each method, and implement practical examples using Python and Scikit-learn.

What is Categorical Data?

Categorical data represents discrete categories or groups rather than numerical measurements.

Examples:

Gender
Male
Female
Male
Female

Another example:

Country
India
USA
Germany
Japan

These values represent categories rather than quantities.

Types of Categorical Variables

Categorical variables are generally divided into two types.

Type	Description
Nominal	Categories without order
Ordinal	Categories with order

Nominal Variables

Nominal categories have no meaningful order.

Examples:

Color
Country
City
Department

Example:

Color
Red
Blue
Green

There is no ranking among these values.

Ordinal Variables

Ordinal categories possess a natural order.

Examples:

Education Level
Customer Satisfaction
Product Rating

Example:

Education
High School
Bachelor's
Master's
PhD

The categories follow a logical sequence.

Why Encoding is Necessary

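Machine Learning algorithms perform mathematical operations on their inputs.

For example, distance-based algorithms such as KNN compute the Euclidean distance between two samples x and y with n numerical features: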

$d=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$

Algorithms cannot calculate distances using text values such as:


Male
Female
India
USA

These categories must first be converted into numerical representations.
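As a rough illustration, the sketch below applies the distance formula above to two rows once the text column has been mapped to numbers. The 0/1 mapping and the two rows are assumptions made for this example only.

```python
import numpy as np

# Hypothetical mapping from text labels to numbers (an assumption for this sketch)
gender_codes = {"Female": 0, "Male": 1}

# Two samples described as [gender_code, age]
x = np.array([gender_codes["Male"], 25], dtype=float)
y = np.array([gender_codes["Female"], 30], dtype=float)

# Euclidean distance: square root of the summed squared differences
distance = np.sqrt(np.sum((x - y) ** 2))
print(distance)  # sqrt(1 + 25), roughly 5.099
```

Without the mapping step, the subtraction x - y would not be defined for the strings "Male" and "Female".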

Example Dataset

Gender	Age	Purchased
Male	25	Yes
Female	30	No
Male	28	Yes

The Gender column must be encoded before training.

Label Encoding

Label Encoding assigns a unique numerical value to each category.

Example:

Gender	Encoded
Male	1
Female	0

Label Encoding Example


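import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data from the table above
df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

encoder = LabelEncoder()

# Classes are sorted alphabetically: Female -> 0, Male -> 1
df["Gender"] = encoder.fit_transform(df["Gender"])

print(df)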

Output:

Gender
1
0
1

Advantages of Label Encoding

Simple
Fast
Memory efficient

Disadvantages of Label Encoding

It introduces an artificial order.

Example:

Category	Encoded
Red	0
Blue	1
Green	2

The model may incorrectly assume:

Green > Blue > Red

which is not true.
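A small sketch of what this artificial order looks like numerically, using the integer codes from the table above. The "difference" computed here stands in for what a distance-based or linear model would see; it is an illustration, not a specific model's behavior.

```python
# Integer codes taken from the table above
codes = {"Red": 0, "Blue": 1, "Green": 2}

# Absolute differences between the encoded values
red_blue = abs(codes["Red"] - codes["Blue"])
red_green = abs(codes["Red"] - codes["Green"])

print(red_blue, red_green)  # 1 2
```

Red appears twice as far from Green as from Blue, even though the colors have no such relationship.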

When to Use Label Encoding

Suitable for:

Binary variables
Target labels
Tree-based algorithms

Examples:

Yes/No
Pass/Fail
Spam/Not Spam
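For target labels, a minimal sketch might look like the following. The label list is made up for illustration; LabelEncoder, classes_, and inverse_transform are part of Scikit-learn.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels for a spam classifier
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# classes_ holds the sorted original labels; the position is the code
print(list(encoder.classes_))   # ['Not Spam', 'Spam']
print(y.tolist())               # [1, 0, 1, 0]

# inverse_transform recovers the text labels from the codes
restored = encoder.inverse_transform(y).tolist()
print(restored)
```

Keeping the fitted encoder around makes it possible to turn model predictions back into readable labels.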

One-Hot Encoding

One-Hot Encoding is the most commonly used encoding technique for nominal data.

Instead of assigning a single number to each category, it creates one binary column per category.

Example

Original:

Color
Red
Blue
Green

After One-Hot Encoding:

Red	Blue	Green
1	0	0
0	1	0
0	0	1

One-Hot Encoding in Python


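import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# dtype=int yields 0/1 columns; recent pandas versions default to True/False
encoded_df = pd.get_dummies(
    df,
    columns=["Color"],
    dtype=int
)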

Using Scikit-Learn


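from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default;
# call .toarray() on the result to inspect it as a dense array
encoded = encoder.fit_transform(
    df[["Color"]]
)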

Why One-Hot Encoding Works

Each category gets its own column, so no category is numerically larger than, or closer to, any other.

No false ordering is introduced.
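One way to check this claim is to compute the distance between every pair of one-hot vectors. The sketch below writes the vectors by hand for the three colors used earlier.

```python
from itertools import combinations

import numpy as np

# One-hot vectors for the three colors, written out by hand
one_hot = {
    "Red": np.array([1, 0, 0]),
    "Blue": np.array([0, 1, 0]),
    "Green": np.array([0, 0, 1]),
}

# Euclidean distance between every pair of categories
distances = {
    (a, b): float(np.linalg.norm(one_hot[a] - one_hot[b]))
    for a, b in combinations(one_hot, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 3))  # every pair is sqrt(2), about 1.414
```

All pairs are equally far apart, so the encoding does not suggest that any two colors are more similar than the others.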

Advantages of One-Hot Encoding

No ordinal relationship
Widely accepted
Works well with most algorithms

Disadvantages of One-Hot Encoding

Creates many columns.

Example:

100 countries →

100 new columns.

This problem is known as:

High Dimensionality
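The growth in columns can be seen with a synthetic column of 100 distinct values. The country names below are placeholders generated for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# 100 placeholder country names (synthetic data for illustration)
countries = pd.DataFrame({"Country": [f"Country_{i}" for i in range(100)]})

# The default output is a sparse matrix with one column per category
encoded = OneHotEncoder().fit_transform(countries)

print(encoded.shape)  # (100, 100)
print(encoded.nnz)    # 100 non-zero entries: a single 1 per row
```

Most entries are zero, which is why Scikit-learn returns a sparse matrix by default.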

Dummy Variable Trap

One-Hot Encoding can create redundant columns.

Example:

Male	Female
1	0
0	1

Knowing Male automatically determines Female.

This creates perfect multicollinearity, which is a problem mainly for linear models fitted with an intercept.
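The redundancy can be made visible by checking the rank of a small design matrix. The four rows below are a made-up sample, and the intercept column of ones is an assumption about how a linear model would be set up.

```python
import numpy as np

# Columns: intercept, Male, Female (hypothetical four-row sample)
X_full = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
])

# Male + Female always equals the intercept column, so one column is redundant
rank_full = np.linalg.matrix_rank(X_full)

# Keeping only intercept and Male removes the redundancy
X_dropped = X_full[:, :2]
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(rank_full, X_full.shape[1])        # 2 3
print(rank_dropped, X_dropped.shape[1])  # 2 2
```

A rank lower than the number of columns means the coefficients of a linear model are not uniquely determined.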

Avoiding Dummy Variable Trap

Drop one column:


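# drop_first=True drops the first category in sorted order (Female),
# leaving a single Gender_Male column
pd.get_dummies(
    df,
    columns=["Gender"],
    drop_first=True,
    dtype=int
)

Result:

Gender_Male
1
0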

Ordinal Encoding

Ordinal Encoding is designed for ordered categories.

Example:

Education
High School
Bachelor's
Master's
PhD

Encoded as:

Education	Value
High School	1
Bachelor's	2
Master's	3
PhD	4

Ordinal Encoding in Python


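from sklearn.preprocessing import OrdinalEncoder

# The category names must match the values in the data exactly,
# listed from lowest to highest
encoder = OrdinalEncoder(
    categories=[
        [
            "High School",
            "Bachelor's",
            "Master's",
            "PhD"
        ]
    ]
)

# OrdinalEncoder assigns codes starting at 0 in the listed order,
# so High School becomes 0.0 and PhD becomes 3.0
df["Education"] = encoder.fit_transform(
    df[["Education"]]
)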

When to Use Ordinal Encoding

Suitable when categories possess natural ordering.

Examples:

Ratings
Education Levels
Satisfaction Scores

Frequency Encoding

Frequency Encoding replaces each category with the number of times it appears in the data.

Example:

City
Delhi
Delhi
Mumbai
Delhi
Mumbai

Frequency table:

City	Frequency
Delhi	3
Mumbai	2

Encoded:

City
3
3
2
3
2

Frequency Encoding Example


freq = df["City"].value_counts()

df["City"] = df["City"].map(freq)

Advantages

Memory efficient
Handles large categories

Disadvantages

Different categories may share the same frequency, in which case they receive the same encoded value and the model cannot tell them apart.
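A short sketch of this collision, using made-up data in which two cities happen to occur equally often (Pune is added here purely for illustration):

```python
import pandas as pd

# Hypothetical data: Mumbai and Pune both appear twice
df = pd.DataFrame(
    {"City": ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Pune"]}
)

freq = df["City"].value_counts()
df["City_freq"] = df["City"].map(freq)

# One row per city, showing the value each one was encoded to
print(df.drop_duplicates())
```

Mumbai and Pune both end up encoded as 2, so the encoded column alone can no longer separate them.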

Target Encoding

Target Encoding replaces each category with a statistic of the target variable, typically the mean target value for that category.

Example:

Suppose target is Salary.

City	Average Salary
Delhi	50000
Mumbai	45000
Chennai	60000

Encoding:

City	Encoded
Delhi	50000
Mumbai	45000
Chennai	60000

Why Target Encoding is Powerful

It captures category-target relationships.
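A minimal sketch of mean target encoding with pandas follows. The individual salary values are invented so that the per-city averages match the table above; the column name City_encoded is also an assumption.

```python
import pandas as pd

# Hypothetical rows chosen so the city averages match the table above
df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
    "Salary": [48000, 52000, 44000, 46000, 60000],
})

# Mean target value per category
city_means = df.groupby("City")["Salary"].mean()

# Replace each city with its mean salary
df["City_encoded"] = df["City"].map(city_means)

print(df)
```

This version computes the means on the full dataset for brevity; in a real project the means would be computed on training data only.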

Risks of Target Encoding

Because the encoded values are derived from the target itself, Target Encoding can cause:

Overfitting
Data leakage

It must be applied carefully: the category statistics should be computed from the training data only.

Binary Encoding

Binary Encoding first assigns each category an integer label, then writes that label in binary, with each binary digit stored in its own column.

Example:

Categories:

City
Delhi
Mumbai
Chennai

Step 1:

City	Label
Delhi	1
Mumbai	2
Chennai	3

Step 2:

Convert labels to binary:

City	Binary
Delhi	001
Mumbai	010
Chennai	011

Advantages of Binary Encoding

Fewer columns than One-Hot Encoding
Handles high-cardinality data
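The two steps can be sketched by hand with pandas. This is one possible manual implementation, not a library API; the column names City_bit0 and City_bit1 are made up, and the sketch uses only as many digits as the largest label needs (two for three categories).

```python
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# Step 1: integer labels starting at 1, in order of first appearance
labels = {city: i for i, city in enumerate(df["City"].unique(), start=1)}
codes = df["City"].map(labels)

# Step 2: one column per binary digit, most significant digit first
n_bits = max(labels.values()).bit_length()
for bit in range(n_bits):
    shift = n_bits - 1 - bit
    df[f"City_bit{bit}"] = (codes // 2 ** shift) % 2

print(df)
```

Three categories fit into two columns here, whereas One-Hot Encoding would need three; the saving grows as the number of categories increases.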

High Cardinality Features

High cardinality means:

Many unique categories.

Examples:

Product IDs
User IDs
ZIP Codes

Example:

10000 products

One-Hot Encoding becomes impractical.

Alternative approaches:

Frequency Encoding
Target Encoding
Binary Encoding

Feature Hashing

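Feature Hashing applies a hash function to each category to map it to a position in a fixed-size vector, so the number of output columns does not grow with the number of categories.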

Useful for:

Large-scale NLP
Recommendation Systems

Advantages:

Memory efficient
Scalable

Disadvantages:

Hash collisions (different categories can be mapped to the same position)
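A minimal sketch using Scikit-learn's FeatureHasher is shown below. The choice of n_features=8 is arbitrary and only meant to keep the output small; real applications usually use a much larger size to reduce collisions.

```python
from sklearn.feature_extraction import FeatureHasher

cities = ["Delhi", "Mumbai", "Chennai", "Delhi"]

# n_features fixes the output width regardless of how many categories exist
# (8 is an arbitrary value chosen for this sketch)
hasher = FeatureHasher(n_features=8, input_type="string")

# Each sample must be an iterable of strings, hence the one-element lists
hashed = hasher.transform([[city] for city in cities]).toarray()

print(hashed.shape)  # (4, 8)
print(hashed)
```

The same category always hashes to the same vector, and no fitting step is needed, which is what makes the approach scalable.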

Comparing Encoding Techniques

Method	Suitable For	Creates Extra Columns
Label Encoding	Binary Categories	No
One-Hot Encoding	Nominal Data	Yes
Ordinal Encoding	Ordered Categories	No
Frequency Encoding	High Cardinality	No
Target Encoding	Predictive Categories	No
Binary Encoding	Large Categories	Moderate

Choosing the Right Encoding Method

Scenario	Recommended Encoding
Gender	Label Encoding
Country	One-Hot Encoding
Education Level	Ordinal Encoding
Product Category (1000+)	Frequency Encoding
Large Customer IDs	Binary Encoding
Strong Category-Target Relationship	Target Encoding

Encoding and Machine Learning Algorithms

Different algorithms react differently.

Algorithm	Recommended Encoding
Linear Regression	One-Hot
Logistic Regression	One-Hot
SVM	One-Hot
KNN	One-Hot
Neural Networks	One-Hot
Decision Trees	Label or One-Hot
Random Forest	Label or One-Hot
XGBoost	Label or Target Encoding

Practical Example

Dataset:

Gender	City
Male	Delhi
Female	Mumbai
Male	Chennai

Encoding:


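import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "City": ["Delhi", "Mumbai", "Chennai"]
})

# One-hot encode City; dtype=int gives 0/1 instead of True/False
df = pd.get_dummies(
    df,
    columns=["City"],
    dtype=int
)

print(df)

Output:

Gender	City_Chennai	City_Delhi	City_Mumbai
Male	0	1	0
Female	0	0	1
Male	1	0	0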

Common Encoding Mistakes

Using Label Encoding for Nominal Variables

Incorrect:

Country
India = 1
USA = 2
Germany = 3

This creates false ordering.

Applying Target Encoding Before Train-Test Split

This causes data leakage, because the encoded values are then computed using target values from the test rows.

Correct workflow:

Split data
Fit encoding on training data
Apply to test data
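The three steps above can be sketched as follows. The data is invented, and falling back to the overall training mean for cities that never appear in the training set is one common choice, stated here as an assumption rather than a rule.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration
df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai",
             "Chennai", "Chennai", "Delhi", "Mumbai"],
    "Salary": [48000, 52000, 44000, 46000, 60000, 62000, 50000, 45000],
})

# 1. Split data first
train, test = train_test_split(df, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# 2. Fit the encoding on training rows only
global_mean = train["Salary"].mean()
city_means = train.groupby("City")["Salary"].mean()

# 3. Apply to both sets; cities unseen in training fall back to the
#    overall training mean (an assumption of this sketch)
train["City_encoded"] = train["City"].map(city_means)
test["City_encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test)
```

No salary from the test rows contributes to city_means, so the test set stays a fair estimate of performance on new data.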

Creating Too Many One-Hot Columns

Thousands of categories may cause:

Memory issues
Overfitting
Slow training

Consider Binary or Frequency Encoding instead.

Best Practices

Identify categorical variable type first
Use One-Hot Encoding for nominal features
Use Ordinal Encoding only for ordered categories
Avoid Label Encoding for non-ordered features with more than two categories, unless the model is tree-based
Be careful with Target Encoding
Handle high-cardinality features separately

Encoding Workflow

A typical workflow is:

Identify categorical columns
Determine nominal vs ordinal
Choose encoding technique
Fit encoder on training data
Transform train and test sets
Train Machine Learning model
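One way to put this workflow together in Scikit-learn is a ColumnTransformer inside a Pipeline, sketched below. The dataset, the column names, and the choice of LogisticRegression are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical dataset with one nominal and one ordinal feature
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi",
             "Mumbai", "Chennai", "Delhi", "Mumbai"],
    "Education": ["High School", "Bachelor's", "Master's", "PhD",
                  "Bachelor's", "High School", "Master's", "PhD"],
    "Purchased": [0, 1, 1, 1, 0, 0, 1, 1],
})
X = df[["City", "Education"]]
y = df["Purchased"]

# Choose an encoder per column type
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
preprocess = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["City"]),
    ("ordinal", OrdinalEncoder(categories=[education_order]), ["Education"]),
])

model = Pipeline([
    ("encode", preprocess),
    ("clf", LogisticRegression()),
])

# Split, then fit: the encoders only ever see the training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(predictions)
```

Because the encoders live inside the pipeline, calling fit on the training set and predict on the test set applies the fit-on-train, transform-on-test rule automatically.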

Encoding in Modern Machine Learning

Encoding remains a critical preprocessing step in Machine Learning pipelines. While modern Deep Learning models increasingly use embeddings and learned representations, traditional Machine Learning algorithms still rely heavily on proper encoding techniques.

Choosing the right encoding strategy can significantly improve model performance, reduce complexity, and ensure meaningful learning from categorical features.