Most Machine Learning algorithms work with numerical data. However, real-world datasets often contain categorical features such as:

  • Gender
  • Country
  • Education Level
  • Product Category
  • Department
  • Marital Status

These features contain text labels rather than numbers, making them unsuitable for direct use in most Machine Learning algorithms.

Encoding is the process of converting categorical data into numerical representations that Machine Learning models can understand and process effectively.

Encoding is one of the most important steps in data preprocessing because improper encoding can lead to:

  • Reduced model performance
  • Incorrect relationships between categories
  • Data leakage
  • Biased predictions

In this article, we will explore different encoding techniques, understand when to use each method, and implement practical examples using Python and Scikit-learn.

What is Categorical Data?

Categorical data represents discrete categories or groups rather than numerical measurements.

Examples:

Gender
Male
Female
Male
Female

Another example:

Country
India
USA
Germany
Japan

These values represent categories rather than quantities.

Types of Categorical Variables

Categorical variables are generally divided into two types.

Type    | Description
Nominal | Categories without order
Ordinal | Categories with order

Nominal Variables

Nominal categories have no meaningful order.

Examples:

  • Color
  • Country
  • City
  • Department

Example:

Color
Red
Blue
Green

There is no ranking among these values.

Ordinal Variables

Ordinal categories possess a natural order.

Examples:

  • Education Level
  • Customer Satisfaction
  • Product Rating

Example:

Education
High School
Bachelor's
Master's
PhD

The categories follow a logical sequence.

Why Encoding is Necessary

Machine Learning algorithms perform mathematical operations.

For example:

Distance calculation:

d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}

Algorithms cannot calculate distances using text values such as:

Male
Female
India
USA

These categories must first be converted into numerical representations.
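As a small illustration of the distance formula above, the sketch below computes the Euclidean distance between two rows once Gender has been mapped to numbers. The 0/1 mapping and the two sample rows are assumptions made for this example only.

```python
import numpy as np

# Hypothetical mapping, chosen only for illustration
gender_to_number = {"Female": 0, "Male": 1}

# Two hypothetical rows in the form (Gender, Age)
row_a = ("Male", 25)
row_b = ("Female", 30)


def to_vector(row):
    """Convert a (gender, age) row into a numeric vector."""
    gender, age = row
    return np.array([gender_to_number[gender], age], dtype=float)


x = to_vector(row_a)
y = to_vector(row_b)

# Euclidean distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((x - y) ** 2))
print(distance)
```

The same subtraction on the raw strings "Male" and "Female" would raise an error, which is why the conversion has to happen first.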

Example Dataset

Gender | Age | Purchased
Male   | 25  | Yes
Female | 30  | No
Male   | 28  | Yes

The Gender column must be encoded before training.

Label Encoding

Label Encoding assigns a unique numerical value to each category.

Example:

Gender | Encoded
Male   | 1
Female | 0

Label Encoding Example

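import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data matching the Gender column of the example dataset
df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

encoder = LabelEncoder()

# Categories are numbered in sorted order: Female -> 0, Male -> 1
df["Gender"] = encoder.fit_transform(df["Gender"])

print(df)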

Output:

Gender
1
0
1

Advantages of Label Encoding

  • Simple
  • Fast
  • Memory efficient

Disadvantages of Label Encoding

It introduces an artificial order.

Example:

Category | Encoded
Red      | 0
Blue     | 1
Green    | 2

The model may incorrectly assume:

Green > Blue > Red

which is not true.
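The sketch below shows how this looks with scikit-learn's LabelEncoder. Note that LabelEncoder assigns codes in sorted (alphabetical) order, so the concrete numbers differ from the illustrative table above, but the artificial ordering is the same kind of problem. The sample list of colors is an assumption.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal feature values
colors = ["Red", "Blue", "Green", "Blue"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ lists the categories in the order of their integer codes
print(list(encoder.classes_))
print(list(codes))
```

Here Blue becomes 0, Green becomes 1, and Red becomes 2, so a model that treats the column as a number would see Red as "larger" than Blue purely because of spelling.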

When to Use Label Encoding

Suitable for:

  • Binary variables
  • Target labels
  • Tree-based algorithms

Examples:

  • Yes/No
  • Pass/Fail
  • Spam/Not Spam

One-Hot Encoding

One-Hot Encoding is the most commonly used encoding technique for nominal data.

Instead of assigning numbers, it creates separate binary columns.

Example

Original:

Color
Red
Blue
Green

After One-Hot Encoding:

Red | Blue | Green
1   | 0    | 0
0   | 1    | 0
0   | 0    | 1

One-Hot Encoding in Python

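import pandas as pd

# Sample data matching the Color example above
df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# dtype=int yields 0/1 columns; recent pandas versions default to True/False
encoded_df = pd.get_dummies(
    df,
    columns=["Color"],
    dtype=int
)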

Using Scikit-Learn

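from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default;
# call encoded.toarray() to view it as a dense array
encoded = encoder.fit_transform(
    df[["Color"]]
)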

Why One-Hot Encoding Works

Each category becomes independent.

No false ordering is introduced.
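One way to see this is to compare distances. In the sketch below, which uses the one-hot vectors from the Red/Blue/Green table above, every pair of categories is the same distance apart, whereas integer codes such as 0, 1, 2 would place some categories closer together than others. The helper name `euclidean` is introduced only for this example.

```python
import numpy as np

# One-hot vectors for Red, Blue, and Green
red = np.array([1, 0, 0])
blue = np.array([0, 1, 0])
green = np.array([0, 0, 1])


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Every pair of categories is equally far apart
d_rb = euclidean(red, blue)
d_rg = euclidean(red, green)
d_bg = euclidean(blue, green)
print(d_rb, d_rg, d_bg)

# With integer codes 0, 1, 2 the gaps are unequal
print(abs(0 - 1), abs(0 - 2))
```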

Advantages of One-Hot Encoding

  • No ordinal relationship
  • Widely accepted
  • Works well with most algorithms

Disadvantages of One-Hot Encoding

It creates one new column per category, so the number of features grows with the number of categories.

Example:

100 countries →

100 new columns.

This problem is known as:

High Dimensionality

Dummy Variable Trap

One-Hot Encoding can create redundant columns.

Example:

Male | Female
1    | 0
0    | 1

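Knowing the Male value automatically determines the Female value.

This perfect correlation between columns is called multicollinearity, and it can make the coefficients of linear models unstable.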

Avoiding Dummy Variable Trap

Drop one column:

pd.get_dummies(
    df,
    drop_first=True
)

Result:

Male
1
0
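The redundancy can be checked numerically. The sketch below, which assumes a small hypothetical Gender column, adds an intercept column of ones (as a linear model would) and compares the matrix rank with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Female"]})

full = pd.get_dummies(df, columns=["Gender"], dtype=int)
reduced = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)


def with_intercept(frame):
    """Prepend a column of ones, as a linear model with an intercept would."""
    ones = np.ones((len(frame), 1))
    return np.hstack([ones, frame.to_numpy(dtype=float)])


# Full encoding plus intercept: 3 columns but only rank 2 (one is redundant)
full_rank = np.linalg.matrix_rank(with_intercept(full))

# After dropping one dummy: 2 columns and rank 2 (no redundancy)
reduced_rank = np.linalg.matrix_rank(with_intercept(reduced))

print(full.shape[1] + 1, full_rank)
print(reduced.shape[1] + 1, reduced_rank)
```

Tree-based models are generally not affected by this redundancy, so dropping a column matters mainly for linear models.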

Ordinal Encoding

Ordinal Encoding is designed for ordered categories.

Example:

Education
High School
Bachelor's
Master's
PhD

Encoded as:

Education   | Value
High School | 1
Bachelor's  | 2
Master's    | 3
PhD         | 4

Ordinal Encoding in Python

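from sklearn.preprocessing import OrdinalEncoder

# List the categories in their natural order, spelled exactly as in the data
encoder = OrdinalEncoder(
    categories=[
        [
            "High School",
            "Bachelor's",
            "Master's",
            "PhD"
        ]
    ]
)

# OrdinalEncoder numbers categories from 0: High School -> 0.0 ... PhD -> 3.0
df["Education"] = encoder.fit_transform(
    df[["Education"]]
).ravel()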
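Where the 1 to 4 values from the table above are wanted exactly, a plain dictionary with pandas `map` is one option, since it makes both the order and the starting value explicit. This is a minimal sketch; the sample DataFrame and the column name `Education_encoded` are assumptions.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame(
    {"Education": ["Master's", "High School", "PhD", "Bachelor's"]}
)

# Explicit order, using the values from the table
education_order = {
    "High School": 1,
    "Bachelor's": 2,
    "Master's": 3,
    "PhD": 4,
}

# Any label missing from the dictionary would become NaN
df["Education_encoded"] = df["Education"].map(education_order)

print(df)
```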

When to Use Ordinal Encoding

Suitable when categories possess natural ordering.

Examples:

  • Ratings
  • Education Levels
  • Satisfaction Scores

Frequency Encoding

Each category is replaced by the number of times it appears in the data.

Example:

City
Delhi
Delhi
Mumbai
Delhi
Mumbai

Frequency table:

City   | Frequency
Delhi  | 3
Mumbai | 2

Encoded:

City
3
3
2
3
2

Frequency Encoding Example

freq = df["City"].value_counts()

df["City"] = df["City"].map(freq)

Advantages

  • Memory efficient
  • Handles large categories

Disadvantages

Different categories may share the same frequency, which makes them indistinguishable to the model.
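When a separate test set is involved, the counts are usually learned from the training data and then reused. The sketch below assumes hypothetical `train` and `test` DataFrames and treats a city never seen in training as frequency 0, which is one possible convention rather than the only one.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({"City": ["Delhi", "Delhi", "Mumbai", "Delhi", "Mumbai"]})
test = pd.DataFrame({"City": ["Mumbai", "Chennai"]})  # Chennai is unseen

# Learn the counts from the training data only
freq = train["City"].value_counts()

train["City_freq"] = train["City"].map(freq)

# Unseen categories map to NaN; here they are replaced with 0
test["City_freq"] = test["City"].map(freq).fillna(0).astype(int)

print(train)
print(test)
```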

Target Encoding

Target Encoding replaces each category with a statistic of the target variable, typically the average target value for that category.

Example:

Suppose target is Salary.

City    | Average Salary
Delhi   | 50000
Mumbai  | 45000
Chennai | 60000

Encoding:

City    | Encoded
Delhi   | 50000
Mumbai  | 45000
Chennai | 60000

Why Target Encoding is Powerful

It captures category-target relationships.

Risks of Target Encoding

Can cause:

  • Overfitting
  • Data leakage

It must be applied carefully: the category averages should be computed from the training data only.
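One common way to reduce these risks is to fit the encoding on training data only and to shrink each category average toward the overall average, so that rare categories are not trusted too much. The sketch below is a simplified illustration: the data, the `smoothing` weight of 2, and the fallback for unseen cities are all assumptions, and it does not use out-of-fold encoding for the training rows.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
    "Salary": [48000, 52000, 44000, 46000, 60000],
})
test = pd.DataFrame({"City": ["Delhi", "Chennai", "Pune"]})  # Pune is unseen

smoothing = 2  # hypothetical weight given to the overall average
global_mean = train["Salary"].mean()

# Per-category mean and row count, computed on training data only
stats = train.groupby("City")["Salary"].agg(["mean", "count"])

# Shrink each category mean toward the global mean
encoding = (
    stats["count"] * stats["mean"] + smoothing * global_mean
) / (stats["count"] + smoothing)

train["City_encoded"] = train["City"].map(encoding)

# Unseen categories fall back to the global mean
test["City_encoded"] = test["City"].map(encoding).fillna(global_mean)

print(encoding)
print(test)
```

Chennai appears only once in the training data, so its encoded value is pulled noticeably toward the overall average instead of staying at 60000.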

Binary Encoding

Binary Encoding combines Label Encoding and Binary Representation.

Example:

Categories:

City
Delhi
Mumbai
Chennai

Step 1:

City    | Label
Delhi   | 1
Mumbai  | 2
Chennai | 3

Step 2:

Convert labels to binary:

City    | Binary
Delhi   | 001
Mumbai  | 010
Chennai | 011

Advantages of Binary Encoding

  • Fewer columns than One-Hot Encoding
  • Handles high-cardinality data
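The two steps above can be sketched with plain pandas. This is an illustrative implementation, not a specific library's method: labels are assigned in order of first appearance, each binary digit becomes its own column, and only as many digits are used as the largest label needs (two for three cities, rather than the three digits shown in the table). The column names `City_bit1` and `City_bit0` are assumptions.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# Step 1: integer labels starting at 1, in order of first appearance
labels = {city: i for i, city in enumerate(df["City"].unique(), start=1)}
codes = df["City"].map(labels)

# Step 2: number of binary digits needed for the largest label
n_bits = int(codes.max()).bit_length()

# One column per binary digit, most significant digit first
for bit in range(n_bits - 1, -1, -1):
    df[f"City_bit{bit}"] = (codes // 2**bit) % 2

print(df)
```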

High Cardinality Features

High cardinality means:

Many unique categories.

Examples:

  • Product IDs
  • User IDs
  • ZIP Codes

Example:

10000 products

One-Hot Encoding becomes impractical.

Alternative approaches:

  • Frequency Encoding
  • Target Encoding
  • Binary Encoding

Feature Hashing

Feature Hashing maps categories into fixed-size vectors by applying a hash function to each category and using the result as a column index.

Useful for:

  • Large-scale NLP
  • Recommendation Systems

Advantages:

  • Memory efficient
  • Scalable

Disadvantages:

  • Hash collisions
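A minimal sketch with scikit-learn's FeatureHasher is shown below. The product IDs are made up, and `n_features=8` is deliberately tiny for readability; a real application would normally use a much larger value to keep collisions rare.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality feature
product_ids = ["P-1001", "P-2040", "P-9999", "P-1001"]

# n_features fixes the output width regardless of how many IDs exist
hasher = FeatureHasher(n_features=8, input_type="string")

# Each sample must be an iterable of strings, so wrap every ID in a list
hashed = hasher.transform([[pid] for pid in product_ids])

# The result is a sparse matrix; convert to dense only for display
dense = hashed.toarray()
print(dense.shape)
print(dense)
```

Identical IDs always land in the same column, and no fitting step is needed, which is why the approach scales to categories that were never seen before.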

Comparing Encoding Techniques

Method             | Suitable For          | Creates Extra Columns
Label Encoding     | Binary Categories     | No
One-Hot Encoding   | Nominal Data          | Yes
Ordinal Encoding   | Ordered Categories    | No
Frequency Encoding | High Cardinality      | No
Target Encoding    | Predictive Categories | No
Binary Encoding    | Large Categories      | Moderate

Choosing the Right Encoding Method

Scenario                            | Recommended Encoding
Gender                              | Label Encoding
Country                             | One-Hot Encoding
Education Level                     | Ordinal Encoding
Product Category (1000+)            | Frequency Encoding
Large Customer IDs                  | Binary Encoding
Strong Category-Target Relationship | Target Encoding

Encoding and Machine Learning Algorithms

Different algorithms react differently to the same encoding.

Algorithm           | Recommended Encoding
Linear Regression   | One-Hot
Logistic Regression | One-Hot
SVM                 | One-Hot
KNN                 | One-Hot
Neural Networks     | One-Hot
Decision Trees      | Label or One-Hot
Random Forest       | Label or One-Hot
XGBoost             | Label or Target Encoding

Practical Example

Dataset:

Gender | City
Male   | Delhi
Female | Mumbai
Male   | Chennai

Encoding:

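import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "City": ["Delhi", "Mumbai", "Chennai"]
})

# dtype=int yields 0/1 columns instead of True/False
df = pd.get_dummies(
    df,
    columns=["City"],
    dtype=int
)

print(df)

Output:

Gender | City_Chennai | City_Delhi | City_Mumbai
Male   | 0            | 1          | 0
Female | 0            | 0          | 1
Male   | 1            | 0          | 0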

Common Encoding Mistakes

Using Label Encoding for Nominal Variables

Incorrect:

Country
India = 1
USA = 2
Germany = 3

This creates false ordering.

Applying Target Encoding Before Train-Test Split

This causes data leakage.

Correct workflow:

  1. Split data
  2. Fit encoding on training data
  3. Apply to test data
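The three steps above can be sketched with scikit-learn as follows. The dataset is hypothetical, and One-Hot Encoding is used here only to show the fit/transform pattern; the same pattern applies to any encoder that learns something from the data. `handle_unknown="ignore"` is chosen so that a city appearing only in the test set does not raise an error.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataset
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai", "Delhi"],
    "Purchased": [1, 0, 1, 1, 0, 0],
})

# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(
    df[["City"]], df["Purchased"], test_size=0.33, random_state=0
)

# 2. Fit the encoder on training data only
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train)

# 3. Apply the already-fitted encoder to the test data
X_test_encoded = encoder.transform(X_test)

print(X_train_encoded.shape, X_test_encoded.shape)
```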

Creating Too Many One-Hot Columns

Thousands of categories may cause:

  • Memory issues
  • Overfitting
  • Slow training

Consider Binary or Frequency Encoding instead.

Best Practices

  • Identify categorical variable type first
  • Use One-Hot Encoding for nominal features
  • Use Ordinal Encoding only for ordered categories
  • Avoid Label Encoding for non-ordered features with more than two categories
  • Be careful with Target Encoding
  • Handle high-cardinality features separately

Encoding Workflow

A typical workflow is:

  1. Identify categorical columns
  2. Determine nominal vs ordinal
  3. Choose encoding technique
  4. Fit encoder on training data
  5. Transform train and test sets
  6. Train Machine Learning model
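One way to wire these steps together is a scikit-learn Pipeline with a ColumnTransformer, sketched below under several assumptions: the dataset is made up, City is treated as nominal and Education as ordinal, and LogisticRegression stands in for whatever model is actually being trained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical dataset
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi",
             "Mumbai", "Chennai", "Delhi", "Mumbai"],
    "Education": ["High School", "Bachelor's", "Master's", "PhD",
                  "Bachelor's", "High School", "Master's", "PhD"],
    "Age": [25, 30, 28, 40, 35, 22, 31, 45],
    "Purchased": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Steps 1-3: identify columns, decide nominal vs ordinal, choose encoders
nominal_cols = ["City"]
ordinal_cols = ["Education"]
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]

preprocess = ColumnTransformer(
    transformers=[
        ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
        ("ordinal", OrdinalEncoder(categories=education_order), ordinal_cols),
    ],
    remainder="passthrough",  # numeric columns such as Age pass through as-is
)

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = df.drop(columns="Purchased")
y = df["Purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Steps 4-6: fit() learns the encoders from the training data only,
# and predict() reuses those fitted encoders on the test data
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Keeping the encoders inside the pipeline means they are refitted on each training fold during cross-validation, which helps avoid the leakage described earlier.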

Encoding in Modern Machine Learning

Encoding remains a critical preprocessing step in Machine Learning pipelines. While modern Deep Learning models increasingly use embeddings and learned representations, traditional Machine Learning algorithms still rely heavily on proper encoding techniques.

Choosing the right encoding strategy can significantly improve model performance, reduce complexity, and ensure meaningful learning from categorical features.