Most Machine Learning algorithms expect numerical input. However, real-world datasets often contain categorical features such as:
- Gender
- Country
- Education Level
- Product Category
- Department
- Marital Status
These features contain text labels rather than numbers, making them unsuitable for direct use in most Machine Learning algorithms.
Encoding is the process of converting categorical data into numerical representations that Machine Learning models can understand and process effectively.
Encoding is one of the most important steps in data preprocessing because improper encoding can lead to:
- Reduced model performance
- Incorrect relationships between categories
- Data leakage
- Biased predictions
In this article, we will explore different encoding techniques, understand when to use each method, and implement practical examples using Python and Scikit-learn.
What is Categorical Data?
Categorical data represents discrete categories or groups rather than numerical measurements.
Examples:
| Gender |
|---|
| Male |
| Female |
| Male |
| Female |
Another example:
| Country |
|---|
| India |
| USA |
| Germany |
| Japan |
These values represent categories rather than quantities.
Types of Categorical Variables
Categorical variables are generally divided into two types.
| Type | Description |
|---|---|
| Nominal | Categories without order |
| Ordinal | Categories with order |
Nominal Variables
Nominal categories have no meaningful order.
Examples:
- Color
- Country
- City
- Department
Example:
| Color |
|---|
| Red |
| Blue |
| Green |
There is no ranking among these values.
Ordinal Variables
Ordinal categories possess a natural order.
Examples:
- Education Level
- Customer Satisfaction
- Product Rating
Example:
| Education |
|---|
| High School |
| Bachelor's |
| Master's |
| PhD |
The categories follow a logical sequence.
Why Encoding is Necessary
Machine Learning algorithms perform mathematical operations on their inputs.
For example, distance-based algorithms such as KNN compute the distance between data points, and a distance cannot be computed from text values such as:
Male
Female
India
USA
These categories must first be converted into numerical representations.
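As a minimal sketch of the point above, the snippet below uses two made-up rows and a hand-picked Female/Male mapping (both are assumptions for illustration). Arithmetic on the raw labels fails, while a Euclidean distance becomes computable once the labels are mapped to numbers.

```python
import math

# Hypothetical rows: (gender, age). The values are made up for illustration.
row_a = ("Male", 25)
row_b = ("Female", 30)

# Arithmetic on raw text labels is not defined, so this raises TypeError.
try:
    row_a[0] - row_b[0]
    raw_distance_possible = True
except TypeError:
    raw_distance_possible = False

# An illustrative mapping; any consistent numeric encoding would do here.
gender_codes = {"Female": 0, "Male": 1}
vec_a = (gender_codes[row_a[0]], row_a[1])
vec_b = (gender_codes[row_b[0]], row_b[1])

# Euclidean distance between the two encoded rows.
distance = math.dist(vec_a, vec_b)
print(raw_distance_possible, round(distance, 3))
```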
Example Dataset
| Gender | Age | Purchased |
|---|---|---|
| Male | 25 | Yes |
| Female | 30 | No |
| Male | 28 | Yes |
The Gender column must be encoded before training.
Label Encoding
Label Encoding assigns a unique numerical value to each category.
Example:
| Gender | Encoded |
|---|---|
| Male | 1 |
| Female | 0 |
Label Encoding Example
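import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample data matching the Gender column of the example dataset
df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})
encoder = LabelEncoder()
# Categories are numbered in sorted order: Female -> 0, Male -> 1
df["Gender"] = encoder.fit_transform(df["Gender"])
print(df)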
Output:
| Gender |
|---|
| 1 |
| 0 |
| 1 |
Advantages of Label Encoding
- Simple
- Fast
- Memory efficient
Disadvantages of Label Encoding
It introduces an artificial order.
Example:
| Category | Encoded |
|---|---|
| Red | 0 |
| Blue | 1 |
| Green | 2 |
The model may incorrectly assume:
Green > Blue > Red
which is not true.
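To see which integer each category actually receives, the fitted encoder can be inspected. The sketch below assumes a small made-up list of colors. scikit-learn's LabelEncoder numbers categories in sorted order, so the concrete codes differ from the illustrative table above, but the artificial ordering problem is the same.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative data, not taken from a real dataset.
colors = ["Red", "Blue", "Green", "Blue"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

The integers carry no meaning beyond identity, yet a linear or distance-based model would still treat 2 as larger than 0.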
When to Use Label Encoding
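Note that scikit-learn's LabelEncoder is designed for target labels; for input features, OrdinalEncoder produces equivalent integer codes.
Suitable for: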
- Binary variables
- Target labels
- Tree-based algorithms
Examples:
- Yes/No
- Pass/Fail
- Spam/Not Spam
One-Hot Encoding
One-Hot Encoding is the most commonly used encoding technique for nominal data.
Instead of assigning a single number to each category, it creates one binary column per category.
Example
Original:
| Color |
|---|
| Red |
| Blue |
| Green |
After One-Hot Encoding:
| Red | Blue | Green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
One-Hot Encoding in Python
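import pandas as pd
df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
# dtype=int yields 0/1 columns; recent pandas versions default to True/False
encoded_df = pd.get_dummies(
    df,
    columns=["Color"],
    dtype=int
)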
Using Scikit-Learn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default
encoded = encoder.fit_transform(
    df[["Color"]]
)
Why One-Hot Encoding Works
Each category becomes independent.
No false ordering is introduced.
Advantages of One-Hot Encoding
- No ordinal relationship
- Widely accepted
- Works well with most algorithms
Disadvantages of One-Hot Encoding
It creates one new column per category.
Example:
A Country feature with 100 distinct values produces 100 new columns.
This problem is known as high dimensionality.
Dummy Variable Trap
One-Hot Encoding can create redundant columns.
Example:
| Male | Female |
|---|---|
| 1 | 0 |
| 0 | 1 |
Knowing the value of Male automatically determines the value of Female.
This creates multicollinearity, which can make coefficient estimates unstable in linear models.
Avoiding Dummy Variable Trap
Drop one column:
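# Assumes df has a "Gender" column; dtype=int gives 0/1 instead of True/False
pd.get_dummies(
    df,
    columns=["Gender"],
    drop_first=True,
    dtype=int
)
Result (the first category in sorted order, Female, is dropped):
| Gender_Male |
|---|
| 1 |
| 0 |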
Ordinal Encoding
Ordinal Encoding is designed for ordered categories.
Example:
| Education |
|---|
| High School |
| Bachelor's |
| Master's |
| PhD |
Encoded as:
| Education | Value |
|---|---|
| High School | 1 |
| Bachelor's | 2 |
| Master's | 3 |
| PhD | 4 |
Ordinal Encoding in Python
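import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({"Education": ["High School", "Bachelor's", "Master's", "PhD"]})
# List the categories in rank order; spellings must match the data exactly
encoder = OrdinalEncoder(
    categories=[
        [
            "High School",
            "Bachelor's",
            "Master's",
            "PhD"
        ]
    ]
)
# Codes start at 0 (High School -> 0.0, ..., PhD -> 3.0), one less than the table above
df["Education"] = encoder.fit_transform(
    df[["Education"]]
).ravel()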
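scikit-learn's OrdinalEncoder numbers categories from 0. If the values 1 to 4 from the table are wanted instead, a plain pandas mapping is one option. The sketch below assumes a small made-up Education column, and the name education_rank is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Education": ["Master's", "High School", "PhD", "Bachelor's"]}
)

# Categories listed from lowest to highest rank.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

# Ranks start at 1 to match the table.
education_rank = {
    level: rank for rank, level in enumerate(education_order, start=1)
}

# Values missing from the mapping would become NaN rather than raise an error.
df["Education_encoded"] = df["Education"].map(education_rank)
print(df)
```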
When to Use Ordinal Encoding
Suitable when the categories possess a natural ordering.
Examples:
- Ratings
- Education Levels
- Satisfaction Scores
Frequency Encoding
Frequency Encoding replaces each category with the number of times it appears in the dataset.
Example:
| City |
|---|
| Delhi |
| Delhi |
| Mumbai |
| Delhi |
| Mumbai |
Frequency table:
| City | Frequency |
|---|---|
| Delhi | 3 |
| Mumbai | 2 |
Encoded:
| City |
|---|
| 3 |
| 3 |
| 2 |
| 3 |
| 2 |
Frequency Encoding Example
# Count how often each city appears, then map each row to its city's count
freq = df["City"].value_counts()
df["City"] = df["City"].map(freq)
Advantages
- Memory efficient
- Handles large categories
Disadvantages
Different categories may share the same frequency, in which case they become indistinguishable after encoding.
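A variation worth sketching: learn relative frequencies from the training data only and reuse them on new data. The train/test frames below are made up, and filling unseen categories with 0.0 is an assumption rather than a standard rule.

```python
import pandas as pd

train = pd.DataFrame({"City": ["Delhi", "Delhi", "Mumbai", "Delhi", "Mumbai"]})
test = pd.DataFrame({"City": ["Mumbai", "Chennai"]})  # Chennai is unseen

# Relative frequencies learned from the training data only.
freq = train["City"].value_counts(normalize=True)

train["City_freq"] = train["City"].map(freq)

# Categories absent from the training data map to NaN; 0.0 is one
# possible fallback (an assumption made for this sketch).
test["City_freq"] = test["City"].map(freq).fillna(0.0)

print(train)
print(test)
```

Using proportions instead of raw counts keeps the encoded values comparable when the training and test sets differ in size.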
Target Encoding
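Target Encoding replaces each category with a statistic of the target variable, typically the mean, computed over the rows belonging to that category.
Example:
Suppose the target is Salary.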
| City | Average Salary |
|---|---|
| Delhi | 50000 |
| Mumbai | 45000 |
| Chennai | 60000 |
Encoding:
| City | Encoded |
|---|---|
| Delhi | 50000 |
| Mumbai | 45000 |
| Chennai | 60000 |
Why Target Encoding is Powerful
It captures category-target relationships.
Risks of Target Encoding
Can cause:
- Overfitting
- Data leakage
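It must be applied carefully: compute the encoding from the training data only, and consider smoothing the estimates for rare categories.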
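One common way to reduce these risks is to learn the encoding from the training data only and shrink each category mean toward the global mean. The following is a minimal sketch under stated assumptions: the salary figures are made up, and the smoothing strength of 2 is a hypothetical value chosen only for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
    "Salary": [48000, 52000, 44000, 46000, 60000],
})
test = pd.DataFrame({"City": ["Delhi", "Chennai", "Pune"]})  # Pune is unseen

smoothing = 2  # hypothetical strength; larger values shrink harder
global_mean = train["Salary"].mean()

# Per-category mean and row count, from the training data only.
stats = train.groupby("City")["Salary"].agg(["mean", "count"])

# Weighted average of the category mean and the global mean:
# categories with few rows are pulled closer to the global mean.
encoding = (
    stats["count"] * stats["mean"] + smoothing * global_mean
) / (stats["count"] + smoothing)

# Unseen categories fall back to the global mean.
test["City_encoded"] = test["City"].map(encoding).fillna(global_mean)
print(test)
```

Encoding the training rows with this same table still lets each row see its own target value; out-of-fold encoding is a common refinement that this sketch omits.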
Binary Encoding
Binary Encoding combines Label Encoding and Binary Representation.
Example:
Categories:
| City |
|---|
| Delhi |
| Mumbai |
| Chennai |
Step 1:
| City | Label |
|---|---|
| Delhi | 1 |
| Mumbai | 2 |
| Chennai | 3 |
Step 2:
Convert the labels to binary. Each binary digit becomes its own column, so three categories need only two columns:
| City | Binary |
|---|---|
| Delhi | 01 |
| Mumbai | 10 |
| Chennai | 11 |
Advantages of Binary Encoding
- Fewer columns than One-Hot Encoding
- Handles high-cardinality data
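The two steps can be sketched by hand with pandas. This is an illustrative implementation, not a library API: the column names City_bit0 and City_bit1 are made up, and labels are assigned in order of first appearance. Third-party packages such as category_encoders provide a ready-made encoder.

```python
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# Step 1: integer labels starting at 1, in order of first appearance.
labels = pd.factorize(df["City"])[0] + 1

# Step 2: one column per binary digit, most significant bit first.
n_bits = int(labels.max()).bit_length()
for bit in range(n_bits):
    shift = n_bits - 1 - bit
    df[f"City_bit{bit}"] = (labels >> shift) & 1

print(df)
```

With n categories this produces roughly log2(n) columns, compared with n columns for One-Hot Encoding.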
High Cardinality Features
High cardinality means that a feature has many unique categories.
Examples:
- Product IDs
- User IDs
- ZIP Codes
Example:
With 10,000 products, One-Hot Encoding would create 10,000 new columns, which becomes impractical.
Alternative approaches:
- Frequency Encoding
- Target Encoding
- Binary Encoding
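Before choosing among these alternatives, it can help to measure the cardinality of each column. The sketch below uses a made-up DataFrame, and the threshold of 3 is a hypothetical cut-off chosen only so the tiny example produces a result; a realistic threshold depends on the dataset and model.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "ProductID": ["P1", "P2", "P3", "P4", "P5", "P6"],
})

# Hypothetical cut-off above which One-Hot Encoding is considered too wide.
max_one_hot_categories = 3

# Number of unique values per column.
cardinality = df.nunique()
high_card_cols = cardinality[cardinality > max_one_hot_categories].index.tolist()

print(cardinality.to_dict())
print(high_card_cols)
```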
Feature Hashing
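Feature Hashing applies a hash function to each category to map it to an index in a fixed-size vector, so the number of output columns does not grow with the number of categories.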
Useful for:
- Large-scale NLP
- Recommendation Systems
Advantages:
- Memory efficient
- Scalable
Disadvantages:
- Hash collisions (different categories can map to the same index)
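A minimal sketch with scikit-learn's FeatureHasher, assuming made-up product IDs and an arbitrary output width of 8 columns. Real workloads typically use a much larger width to keep collisions rare.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical product IDs, made up for illustration.
product_ids = ["P-1001", "P-2002", "P-1001", "P-3003"]

# n_features fixes the output width regardless of how many IDs exist.
hasher = FeatureHasher(n_features=8, input_type="string")

# Each sample must be an iterable of strings, hence the one-element lists.
hashed = hasher.transform([[pid] for pid in product_ids]).toarray()

print(hashed.shape)
print(hashed)
```

The hasher is stateless, so no fitting step is required and identical IDs always land in the same column. The stored value may be +1 or -1 because the hasher alternates signs by default.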
Comparing Encoding Techniques
| Method | Suitable For | Creates Extra Columns |
|---|---|---|
| Label Encoding | Binary Categories | No |
| One-Hot Encoding | Nominal Data | Yes |
| Ordinal Encoding | Ordered Categories | No |
| Frequency Encoding | High Cardinality | No |
| Target Encoding | Predictive Categories | No |
| Binary Encoding | Large Categories | Yes (about log2 of the category count) |
Choosing the Right Encoding Method
| Scenario | Recommended Encoding |
|---|---|
| Gender | Label Encoding |
| Country | One-Hot Encoding |
| Education Level | Ordinal Encoding |
| Product Category (1000+) | Frequency Encoding |
| Large Customer IDs | Binary Encoding |
| Strong Category-Target Relationship | Target Encoding |
Encoding and Machine Learning Algorithms
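Different algorithms respond differently to the same encoding. Linear and distance-based models treat encoded integers as magnitudes, while tree-based models only split on thresholds and are less sensitive to arbitrary integer codes.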
| Algorithm | Recommended Encoding |
|---|---|
| Linear Regression | One-Hot |
| Logistic Regression | One-Hot |
| SVM | One-Hot |
| KNN | One-Hot |
| Neural Networks | One-Hot |
| Decision Trees | Label or One-Hot |
| Random Forest | Label or One-Hot |
| XGBoost | Label or Target Encoding |
Practical Example
Dataset:
| Gender | City |
|---|---|
| Male | Delhi |
| Female | Mumbai |
| Male | Chennai |
Encoding:
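import pandas as pd
df = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "City": ["Delhi", "Mumbai", "Chennai"]})
# dtype=int gives 0/1 columns; the new columns are ordered alphabetically by category
df = pd.get_dummies(
    df,
    columns=["City"],
    dtype=int
)
print(df)
Output:
| Gender | City_Chennai | City_Delhi | City_Mumbai |
|---|---|---|---|
| Male | 0 | 1 | 0 |
| Female | 0 | 0 | 1 |
| Male | 1 | 0 | 0 |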
Common Encoding Mistakes
Using Label Encoding for Nominal Variables
Incorrect:
| Country | Encoded |
|---|---|
| India | 1 |
| USA | 2 |
| Germany | 3 |
This creates a false ordering (Germany > USA > India) that the model may treat as meaningful.
Applying Target Encoding Before Train-Test Split
This causes data leakage, because the encoded values of the test rows are then computed using their own target values.
Correct workflow:
- Split data
- Fit encoding on training data
- Apply to test data
Creating Too Many One-Hot Columns
Thousands of categories may cause:
- Memory issues
- Overfitting
- Slow training
Consider Binary or Frequency Encoding instead.
Best Practices
- Identify categorical variable type first
- Use One-Hot Encoding for nominal features
- Use Ordinal Encoding only for ordered categories
- Avoid Label Encoding for non-ordered features with more than two categories, especially with linear or distance-based models
- Be careful with Target Encoding
- Handle high-cardinality features separately
Encoding Workflow
A typical workflow is:
- Identify categorical columns
- Determine nominal vs ordinal
- Choose encoding technique
- Fit encoder on training data
- Transform train and test sets
- Train Machine Learning model
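The workflow above can be sketched end to end with a scikit-learn Pipeline. Everything in the dataset below is made up for illustration, and LogisticRegression is only a stand-in model. Fitting the pipeline on the training split ensures the encoders never see the test rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up data: one nominal, one ordinal, one numeric feature.
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi",
             "Mumbai", "Chennai", "Delhi", "Mumbai"],
    "Education": ["High School", "PhD", "Master's", "Bachelor's",
                  "Master's", "High School", "PhD", "Bachelor's"],
    "Age": [22, 45, 38, 27, 41, 24, 50, 29],
    "Purchased": [0, 1, 1, 0, 1, 0, 1, 0],
})
X = df.drop(columns="Purchased")
y = df["Purchased"]

# Split first, so that encoders are fitted on training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

education_order = ["High School", "Bachelor's", "Master's", "PhD"]
preprocess = ColumnTransformer(
    [
        # Nominal feature: one-hot, unseen categories become all zeros.
        ("city", OneHotEncoder(handle_unknown="ignore"), ["City"]),
        # Ordinal feature: explicit rank order.
        ("education", OrdinalEncoder(categories=[education_order]), ["Education"]),
    ],
    remainder="passthrough",  # keep Age unchanged
)

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)           # fits encoders and model on train only
predictions = model.predict(X_test)   # reuses the fitted encoders
print(predictions)
```

Bundling the encoders with the model also means the same transformations are applied automatically at prediction time.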
Encoding in Modern Machine Learning
Encoding remains a critical preprocessing step in Machine Learning pipelines. While modern Deep Learning models increasingly use embeddings and learned representations, traditional Machine Learning algorithms still rely heavily on proper encoding techniques.
Choosing the right encoding strategy can significantly improve model performance, reduce complexity, and ensure meaningful learning from categorical features.