Before we understand how Decision Trees choose the best feature to split on, we need to understand a fundamental concept:

Entropy

Entropy is one of the most important concepts in Machine Learning because it measures the amount of uncertainty, disorder, or impurity present in a dataset.

Decision Trees use Entropy to determine which questions provide the most useful information.

For example, consider a dataset for predicting whether a student passes an exam.

Dataset A:

Student | Result
1 | Pass
2 | Pass
3 | Pass
4 | Pass

Dataset B:

Student | Result
1 | Pass
2 | Fail
3 | Pass
4 | Fail

Which dataset is more uncertain?

Clearly:

Dataset A is completely predictable.

Dataset B contains uncertainty.

Entropy provides a numerical measure of this uncertainty.
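To make this concrete, the sketch below scores both datasets with the entropy formula that is introduced later in this article. The helper name `label_entropy` is an assumption made for illustration and does not come from any library.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes -p * log2(p), where p is its share of the labels.
    return sum(-(n / total) * log2(n / total) for n in counts.values())


dataset_a = ["Pass", "Pass", "Pass", "Pass"]
dataset_b = ["Pass", "Fail", "Pass", "Fail"]

print(label_entropy(dataset_a))  # completely predictable dataset
print(label_entropy(dataset_b))  # evenly mixed dataset
```

Dataset A scores 0 and Dataset B scores 1, which matches the intuition that only Dataset B contains uncertainty.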

What is Entropy?

Entropy measures the amount of randomness or impurity in a dataset.

Intuition:

High Uncertainty

High Entropy

Low Uncertainty

Low Entropy

A pure dataset has the lowest possible entropy: 0.

A mixed dataset has high entropy.

Understanding Entropy Through Coin Tosses

Suppose we have a coin.

Case 1: Double-Headed Coin

Results:

Head
Head
Head
Head

Outcome is always predictable.

Entropy is 0, the lowest possible value.

Case 2: Fair Coin

Results:

Head
Tail
Head
Tail

Outcome is uncertain.

Entropy is 1, the maximum for two outcomes.

The more uncertain the outcome, the higher the entropy.

Entropy in Classification

Consider a binary classification dataset.

Pure Dataset

Pass
Pass
Pass
Pass
Pass

Prediction is easy.

Entropy:

Low

Mixed Dataset

Pass
Fail
Pass
Fail
Pass

Prediction is harder.

Entropy:

High

Visualizing Purity

Pure Dataset:

Pass Pass Pass Pass

Entropy:

0

Completely predictable.

Mixed Dataset:

Pass Fail Pass Fail

Entropy:

High

Uncertain.

Why Decision Trees Need Entropy

Decision Trees try to create groups that are as pure as possible.

Example:

Before Split:

Pass
Pass
Fail
Pass
Fail

Mixed classes.

After Split:

Group 1:
Pass
Pass
Pass

Group 2:
Fail
Fail

Much cleaner.

Entropy helps the tree identify such useful splits.
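As a rough check of the example above, the following sketch computes the entropy of the group before the split and of each group after it, using the formula given in the next section. The function name `label_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


before_split = ["Pass", "Pass", "Fail", "Pass", "Fail"]
group_1 = ["Pass", "Pass", "Pass"]
group_2 = ["Fail", "Fail"]

print(round(label_entropy(before_split), 3))  # mixed parent group
print(label_entropy(group_1))                 # pure group
print(label_entropy(group_2))                 # pure group
```

The mixed group scores about 0.971, while both groups after the split score 0. That drop is what makes the split useful.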

Entropy Formula

The mathematical definition of entropy is:

$$\text{Entropy} = -\sum_{i} p_i \log_2(p_i)$$

Where:

  • $p_i$ = Probability of each class
  • $\log_2$ = Logarithm base 2

This formula measures uncertainty.
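The formula translates almost directly into code. The sketch below is one possible implementation, not a library function: the name `entropy_from_probs` and the choice to skip zero probabilities (the usual convention that a class which never occurs contributes nothing) are assumptions.

```python
from math import log2


def entropy_from_probs(probabilities):
    """Return the sum of -p_i * log2(p_i) over a list of class probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    # Skip p = 0: log2(0) is undefined, and a class that never occurs
    # is conventionally treated as adding nothing to the entropy.
    return sum(-p * log2(p) for p in probabilities if p > 0)


print(entropy_from_probs([0.5, 0.5]))  # two equally likely classes
print(entropy_from_probs([1.0, 0.0]))  # a single certain class
```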

Understanding the Formula Intuitively

Don't worry about memorizing it immediately.

Focus on these key ideas:

When a class dominates:

One Class ≫ Others

Entropy decreases.

When classes are evenly distributed:

50%
50%

Entropy increases.

Example 1: Pure Dataset

Dataset:

Class
Pass
Pass
Pass
Pass

Probabilities:

$$P(\text{Pass}) = 1$$

Entropy:

$$-1 \times \log_2(1) = 0$$

Interpretation

Entropy:

0

means:

Perfect certainty.

No randomness.

Example 2: Balanced Dataset

Dataset:

Class
Pass
Fail
Pass
Fail

Probabilities:

$$P(\text{Pass}) = 0.5 \qquad P(\text{Fail}) = 0.5$$

Entropy:

$$-(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$$

Interpretation

Entropy:

1

Maximum uncertainty for binary classification.
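The intermediate steps of Example 2 can be written out as a worked equation. The only fact needed is that $\log_2 0.5 = -1$:

```latex
\begin{aligned}
\text{Entropy} &= -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) \\
               &= -\left(0.5 \times (-1) + 0.5 \times (-1)\right) \\
               &= -(-1) \\
               &= 1
\end{aligned}
```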

Entropy Values

For binary classification:

Entropy | Meaning
0 | Pure Dataset
0.3 | Low Uncertainty
0.6 | Moderate Uncertainty
1 | Maximum Uncertainty

Example 3: Mostly One Class

Dataset:

Pass
Pass
Pass
Pass
Fail

Probabilities:

$$P(\text{Pass}) = 0.8$$

and

$$P(\text{Fail}) = 0.2$$

Entropy:

$$-(0.8\log_2 0.8 + 0.2\log_2 0.2) \approx 0.72$$

This is lower than 1 because the dataset is more predictable.

Entropy Visualization

Class Distribution

100%-0% → Entropy = 0

80%-20% → Entropy ≈ 0.72

50%-50% → Entropy = 1

Entropy increases as the class proportions approach an even split.
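The same trend can be seen by sweeping the majority-class share from 100% down to 50%. This is a minimal sketch: the function name `binary_entropy` and the crude text bars are illustrative choices, not part of any library.

```python
from math import log2


def binary_entropy(p):
    """Entropy (in bits) of a two-class dataset with proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure dataset has no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))


for majority_share in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    h = binary_entropy(majority_share)
    bar = "#" * round(h * 20)  # text bar: 20 characters correspond to entropy 1
    print(f"{majority_share:.0%}-{1 - majority_share:.0%}  {h:.2f}  {bar}")
```

Each row prints a longer bar than the one before it, with the 50%-50% row reaching the full length.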

Multi-Class Entropy

Entropy also works for multiple classes.

Example:

Cat
Dog
Horse

The formula remains the same.

We simply sum over all class probabilities.
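A short sketch of the multi-class case follows. With k equally likely classes the entropy equals log2(k), so with three classes the maximum is about 1.585 rather than 1. The skewed list is hypothetical data added only for comparison, and `label_entropy` is an illustrative helper name.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


animals_balanced = ["Cat", "Dog", "Horse"]
# Hypothetical skewed sample: mostly cats.
animals_skewed = ["Cat", "Cat", "Cat", "Cat", "Dog", "Horse"]

print(round(label_entropy(animals_balanced), 3))  # equals log2(3)
print(round(label_entropy(animals_skewed), 3))    # lower, because one class dominates
```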

Why Entropy is Important

Decision Trees aim to:

Reduce Entropy

Every split should make the data more organized.

Example:

Before Split:

Pass
Fail
Pass
Fail

Entropy:

High.

After Split:

Group 1: Pass Pass

Group 2: Fail Fail

Entropy:

0 in each group.

Good split.

Relationship Between Entropy and Purity

High Purity

Low Entropy

Low Purity

High Entropy

This relationship is fundamental to Decision Trees.

Real-World Example: Loan Approval

Dataset:

Applicant | Result
A | Approved
B | Rejected
C | Approved
D | Rejected

Entropy:

High.

If we split using:

Credit Score

and create cleaner groups:

Entropy decreases.

The split is useful.
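The sketch below illustrates the idea with made-up numbers. The article's table lists only the outcomes, so the credit scores and the threshold of 650 are hypothetical values chosen purely for illustration.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# (applicant, hypothetical credit score, result)
applicants = [
    ("A", 720, "Approved"),
    ("B", 580, "Rejected"),
    ("C", 690, "Approved"),
    ("D", 610, "Rejected"),
]
THRESHOLD = 650  # assumed cut-off, not taken from the article

all_results = [result for _, _, result in applicants]
high_scores = [result for _, score, result in applicants if score >= THRESHOLD]
low_scores = [result for _, score, result in applicants if score < THRESHOLD]

print(label_entropy(all_results))                             # before the split
print(label_entropy(high_scores), label_entropy(low_scores))  # after the split
```

With these assumed scores, the entropy falls from 1 before the split to 0 in both groups after it.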

Real-World Example: Spam Detection

Dataset:

Spam
Spam
Spam
Not Spam
Spam

Mostly one class.

Entropy is relatively low (about 0.72, compared with a maximum of 1).

The outcome is already fairly predictable.

Entropy and Information Theory

Entropy, in the sense used here, was introduced by:

Claude Shannon

in Information Theory, in 1948.

The same concept later became fundamental in Machine Learning.

Python Example

Calculating entropy manually:

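import numpy as np

# Class probabilities for a balanced binary dataset
p1 = 0.5
p2 = 0.5

# Entropy formula: -(p1 * log2(p1) + p2 * log2(p2))
entropy = -(p1 * np.log2(p1) + p2 * np.log2(p2))

print(entropy)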

Output:

1.0

Using SciPy

from scipy.stats import entropy

# base=2 matches the log2 formula; without it SciPy uses the natural logarithm
print(entropy([0.5, 0.5], base=2))

Common Misconceptions

High Entropy Means More Data

Incorrect.

Entropy measures uncertainty, not dataset size.
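A quick way to see this is to compare a small and a large dataset with the same class mix. In this sketch the two lists and the helper name `label_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


small = ["Pass", "Fail"] * 2     # 4 samples, 50/50 mix
large = ["Pass", "Fail"] * 2000  # 4000 samples, same 50/50 mix

print(len(small), label_entropy(small))
print(len(large), label_entropy(large))
```

Both datasets have the same entropy because only the class proportions enter the formula, not the number of rows.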

Entropy Only Works for Decision Trees

Incorrect.

Entropy appears throughout:

  • Information Theory
  • Deep Learning
  • Compression Algorithms
  • Communication Systems

Entropy is Difficult Mathematics

The formula may look complex, but the intuition is simple:

More Uncertainty

More Entropy

Best Practices

  • Focus on intuition before formulas
  • Think in terms of uncertainty
  • Understand purity and impurity
  • Relate entropy to class distributions
  • Use visual examples whenever possible

Entropy Summary

Dataset Type | Entropy
Completely Pure | 0
Mostly One Class | Low
Balanced Classes | High
Maximum Uncertainty | Highest

Conclusion

Entropy is the foundation of Decision Trees because it provides a mathematical way to measure uncertainty and impurity in data. By quantifying how mixed a dataset is, entropy helps algorithms determine which splits create the most organized and informative groups.

Understanding entropy is essential because it leads directly to the next concept: Information Gain, the mechanism Decision Trees use to decide which feature should be selected at each split.

In the next article, we will explore Information Gain, where we will learn how Decision Trees use entropy reduction to choose the best possible split in a dataset.