Entropy in Machine Learning

Last updated: Jun 13, 2026

Author: Christy Harshitha Dakarapu

Before we understand how Decision Trees choose the best feature to split on, we need to understand a fundamental concept:

Entropy

Entropy is one of the most important concepts in Machine Learning because it measures the amount of uncertainty, disorder, or impurity present in a dataset.

Decision Trees use Entropy to determine which questions provide the most useful information.

For example, consider a dataset for predicting whether a student passes an exam.

Dataset A:

Student	Result
1	Pass
2	Pass
3	Pass
4	Pass

Dataset B:

Student	Result
1	Pass
2	Fail
3	Pass
4	Fail

Which dataset is more uncertain?

Clearly:

Dataset A is completely predictable.

Dataset B contains uncertainty.

Entropy provides a numerical measure of this uncertainty.
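As a preview, the two datasets above can be scored with a few lines of Python. This is a minimal sketch that applies the entropy formula given later in this article; the helper name `label_entropy` is an assumption made for illustration and does not come from any library.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Entropy (in bits) of a list of class labels. Hypothetical helper."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur in the data.
    return sum(-(n / total) * log2(n / total) for n in counts.values())


dataset_a = ["Pass", "Pass", "Pass", "Pass"]
dataset_b = ["Pass", "Fail", "Pass", "Fail"]

print(label_entropy(dataset_a))  # 0.0 -> completely predictable
print(label_entropy(dataset_b))  # 1.0 -> as uncertain as two classes can be
```

Dataset A scores 0 and Dataset B scores 1, which matches the intuition that B is the more uncertain of the two.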

What is Entropy?

Entropy measures the amount of randomness or impurity in a dataset.

Intuition:


High Uncertainty
      ↓
High Entropy

Low Uncertainty
      ↓
Low Entropy

A pure dataset, in which every sample belongs to the same class, has the lowest possible entropy: 0.

A mixed dataset has high entropy.

Understanding Entropy Through Coin Tosses

Suppose we have a coin.

Case 1: Double-Headed Coin

Results:


Head
Head
Head
Head

The outcome is always predictable.

Entropy is 0, the lowest possible value.

Case 2: Fair Coin

Results:


Head
Tail
Head
Tail

The outcome is uncertain.

Entropy is at its maximum for a coin.

The more uncertain the outcome, the higher the entropy.
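The coin comparison can be checked numerically. The sketch below assumes a coin described by a single number, its probability of landing heads; the function name `coin_entropy` and the 90% biased coin are illustrative choices, not part of the examples above.

```python
from math import log2


def coin_entropy(p_heads):
    """Entropy (in bits) of a coin that lands heads with probability p_heads."""
    entropy = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # by convention, an outcome with probability 0 contributes 0
            entropy -= p * log2(p)
    return entropy


print(coin_entropy(1.0))            # double-headed coin: 0.0
print(coin_entropy(0.5))            # fair coin: 1.0
print(round(coin_entropy(0.9), 3))  # hypothetical biased coin: 0.469
```

The biased coin lands between the two cases: it is more predictable than a fair coin, but not fully predictable.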

Entropy in Classification

Consider a binary classification dataset.

Pure Dataset


Pass
Pass
Pass
Pass
Pass

Prediction is easy.

Entropy:

Low

Mixed Dataset


Pass
Fail
Pass
Fail
Pass

Prediction is harder.

Entropy:


High

Visualizing Purity

Pure Dataset:


Pass Pass Pass Pass

Entropy:


Low

Completely predictable.

Mixed Dataset:


Pass Fail Pass Fail

Entropy:


High

Uncertain.

Why Decision Trees Need Entropy

Decision Trees try to create groups that are as pure as possible.

Example:

Before Split:


Pass
Pass
Fail
Pass
Fail

Mixed classes.

After Split:


Group 1:
Pass
Pass
Pass

Group 2:
Fail
Fail

Much cleaner.

Entropy helps the tree identify such useful splits.
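One way to see why this split is useful is to compare the entropy before the split with the size-weighted average entropy of the groups after it. The sketch below is an illustration under that assumption, using the article's standard entropy formula; the helper `entropy` is defined here and is not a library function.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


before = ["Pass", "Pass", "Fail", "Pass", "Fail"]
group_1 = ["Pass", "Pass", "Pass"]
group_2 = ["Fail", "Fail"]

# Weight each group's entropy by its share of the original samples.
after = sum(len(g) / len(before) * entropy(g) for g in (group_1, group_2))

print(round(entropy(before), 3))  # 0.971
print(after)                      # 0.0
```

Both groups are pure, so the weighted entropy after the split drops from about 0.971 to 0.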

Entropy Formula

The mathematical definition of entropy is:

$\text{Entropy} = -\sum_{i=1}^{k} p_i \log_2(p_i)$

Where:

$p_i$ = Proportion of samples in the dataset that belong to class $i$
$k$ = Number of classes
$\log_2$ = Logarithm base 2

This formula measures uncertainty.
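The formula translates almost directly into code. The following is a minimal sketch, assuming the class probabilities are already known and sum to 1; the function name `entropy` and the validation step are illustrative choices rather than a reference implementation.

```python
from math import isclose, log2


def entropy(probabilities):
    """Entropy in bits of a discrete distribution given as class probabilities."""
    if not isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 are skipped: p * log2(p) is treated as 0 in that case.
    return sum(-p * log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))            # 1.0
print(round(entropy([0.8, 0.2]), 3))  # 0.722
print(entropy([1.0, 0.0]))            # 0.0
```

Skipping zero probabilities avoids calling `log2(0)`, which would raise an error in Python.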

Understanding the Formula Intuitively

Don't worry about memorizing it immediately.

Focus on these key ideas:

When a class dominates:


One Class ≫ Others

Entropy decreases.

When classes are evenly distributed:


50%
50%

Entropy increases.

Example 1: Pure Dataset

Dataset:

Class
Pass
Pass
Pass
Pass

Probabilities:

$P(\text{Pass}) = 1$

Entropy:

$\text{Entropy} = -1 \times \log_2(1) = 0$

Interpretation

Entropy:

0

means:

Perfect certainty.

No randomness.

Example 2: Balanced Dataset

Dataset:

Class
Pass
Fail
Pass
Fail

Probabilities:

$P(\text{Pass}) = 0.5$

$P(\text{Fail}) = 0.5$

Entropy:

$\text{Entropy} = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = -(0.5 \times (-1) + 0.5 \times (-1)) = 1$

Interpretation

Entropy:

1

Maximum uncertainty for binary classification.

Entropy Values

For binary classification:

Entropy	Meaning
0	Pure Dataset
0.3	Low Uncertainty
0.6	Moderate Uncertainty
1	Maximum Uncertainty

Example 3: Mostly One Class

Dataset:


Pass
Pass
Pass
Pass
Fail

Probabilities:

$P(\text{Pass}) = 0.8$

$P(\text{Fail}) = 0.2$

Entropy:

$\text{Entropy} = -(0.8\log_2 0.8 + 0.2\log_2 0.2) \approx 0.72$

This is lower than 1 because the dataset is more predictable.

Entropy Visualization


Class Distribution

100%-0%  → Entropy = 0

80%-20%  → Entropy ≈ 0.72

50%-50%  → Entropy = 1

Entropy increases as class balance approaches equality.
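To fill in the values between these three points, a short loop can tabulate entropy for a range of two-class distributions. This is a sketch only; `binary_entropy` is a hypothetical helper, and the chosen percentages are arbitrary.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-class dataset where one class has share p."""
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)


# Walk from a pure dataset (100%-0%) to a perfectly balanced one (50%-50%).
for p in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    print(f"{p:.0%}-{1 - p:.0%}  ->  {binary_entropy(p):.2f}")
```

The printed entropies rise steadily (0.00, 0.47, 0.72, 0.88, 0.97, 1.00), and most of the increase happens early: moving from 100%-0% to 80%-20% already adds about 0.72.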

Multi-Class Entropy

Entropy also works for multiple classes.

Example:


Cat
Dog
Horse

The formula remains the same.

We simply include all class probabilities.
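A small sketch for the three-class case follows. The label lists are made up for illustration, and `entropy` is a helper defined here. One detail worth noticing is that with more than two classes the entropy can exceed 1: for k equally likely classes it equals log2(k).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# Hypothetical label lists: one balanced, one dominated by a single class.
balanced = ["Cat", "Dog", "Horse"]
skewed = ["Cat", "Cat", "Cat", "Cat", "Dog", "Horse"]

print(round(entropy(balanced), 3))  # 1.585, which is log2(3)
print(round(entropy(skewed), 3))    # 1.252
```

As in the binary case, the dataset dominated by one class has lower entropy than the balanced one.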

Why Entropy is Important

Decision Trees aim to:


Reduce Entropy

Every split should make the data more organized.

Example:

Before Split:


Pass
Fail
Pass
Fail

Entropy:

High.

After Split:


Pass Pass

Fail Fail

Entropy:

Low.

Good split.

Relationship Between Entropy and Purity


High Purity
      ↓
Low Entropy

Low Purity
      ↓
High Entropy

This relationship is fundamental to Decision Trees.

Real-World Example: Loan Approval

Dataset:

Applicant	Result
A	Approved
B	Rejected
C	Approved
D	Rejected

Entropy:

High.

If we split using:


Credit Score

and create cleaner groups:

Entropy decreases.

The split is useful.
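The loan example can be sketched in code. The table above gives only the outcomes, so the credit scores and the threshold of 650 below are hypothetical values chosen purely to illustrate a clean split; `entropy` is a helper defined here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# Credit scores are hypothetical; only the Approved/Rejected labels come from the table.
applicants = [
    ("A", 720, "Approved"),
    ("B", 580, "Rejected"),
    ("C", 690, "Approved"),
    ("D", 610, "Rejected"),
]
THRESHOLD = 650  # hypothetical split point

high = [result for _, score, result in applicants if score >= THRESHOLD]
low = [result for _, score, result in applicants if score < THRESHOLD]

before = entropy([result for _, _, result in applicants])
# Size-weighted average entropy of the non-empty groups after the split.
after = sum(len(g) / len(applicants) * entropy(g) for g in (high, low) if g)

print(before)  # 1.0
print(after)   # 0.0
```

With these assumed scores the split separates the two outcomes perfectly, so entropy falls from 1 to 0. Real data rarely splits this cleanly.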

Real-World Example: Spam Detection

Dataset:


Spam
Spam
Spam
Not Spam
Spam

Mostly one class.

Entropy is relatively low.

The model already has substantial certainty.

Entropy and Information Theory

Entropy was originally introduced by Claude Shannon in Information Theory.

The same concept later became fundamental in Machine Learning.

Python Example

Calculating entropy manually:


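import numpy as np

# Class probabilities for a balanced binary dataset
p1 = 0.5
p2 = 0.5

# Entropy = -(p1*log2(p1) + p2*log2(p2)); assumes both probabilities are > 0
entropy = -(p1 * np.log2(p1) + p2 * np.log2(p2))

print(entropy)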

Output:

1.0

Using SciPy


from scipy.stats import entropy

# base=2 reports the result in bits, matching the log2 formula
result = entropy([0.5, 0.5], base=2)

print(result)

Common Misconceptions

High Entropy Means More Data

Incorrect.

Entropy measures uncertainty, not dataset size.
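A quick check makes this concrete. The sketch below compares two made-up datasets with the same 75%-25% class mix but very different sizes; `entropy` is a helper defined here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# Same class proportions, 1000 times more rows in the second dataset.
small = ["Pass"] * 3 + ["Fail"] * 1
large = ["Pass"] * 3000 + ["Fail"] * 1000

print(round(entropy(small), 3))  # 0.811
print(round(entropy(large), 3))  # 0.811
```

Because the formula depends only on class proportions, both datasets have the same entropy.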

Entropy Only Works for Decision Trees

Incorrect.

Entropy appears throughout:

Information Theory
Deep Learning
Compression Algorithms
Communication Systems

Entropy is Difficult Mathematics

The formula may look complex, but the intuition is simple:


More Uncertainty
      ↓
More Entropy

Best Practices

Focus on intuition before formulas
Think in terms of uncertainty
Understand purity and impurity
Relate entropy to class distributions
Use visual examples whenever possible

Entropy Summary

Dataset Type	Entropy
Completely Pure	0
Mostly One Class	Low
Nearly Balanced Classes	High
Perfectly Balanced Classes	Maximum (1 for two classes)

Conclusion

Entropy is central to how Decision Trees choose their splits because it provides a mathematical way to measure uncertainty and impurity in data. By quantifying how mixed a dataset is, entropy helps algorithms determine which splits create the purest and most informative groups.

Understanding entropy is essential because it leads directly to the next concept: Information Gain, the mechanism Decision Trees use to decide which feature should be selected at each split.

In the next article, we will explore Information Gain, where we will learn how Decision Trees use entropy reduction to choose the best possible split in a dataset.