Before we understand how Decision Trees choose the best feature to split on, we need to understand a fundamental concept:
Entropy
Entropy is one of the most important concepts in Machine Learning because it measures the amount of uncertainty, disorder, or impurity present in a dataset.
Decision Trees use Entropy to determine which questions provide the most useful information.
For example, consider a dataset for predicting whether a student passes an exam.
Dataset A:
| Student | Result |
|---|---|
| 1 | Pass |
| 2 | Pass |
| 3 | Pass |
| 4 | Pass |
Dataset B:
| Student | Result |
|---|---|
| 1 | Pass |
| 2 | Fail |
| 3 | Pass |
| 4 | Fail |
Which dataset is more uncertain?
Clearly:
Dataset A is completely predictable.
Dataset B contains uncertainty.
Entropy provides a numerical measure of this uncertainty.
What is Entropy?
Entropy measures the amount of randomness or impurity in a dataset.
Intuition:
High Uncertainty
↓
High Entropy
Low Uncertainty
↓
Low Entropy
A pure dataset has zero entropy, the lowest possible value.
A mixed dataset has higher entropy, reaching its maximum when the classes are evenly balanced.
Understanding Entropy Through Coin Tosses
Suppose we have a coin.
Case 1: Double-Headed Coin
Results:
Head
Head
Head
Head
Outcome is always predictable.
Entropy is 0, the lowest possible value.
Case 2: Fair Coin
Results:
Head
Tail
Head
Tail
Outcome is uncertain.
Entropy is at its maximum for two outcomes (1 bit).
The more uncertain the outcome, the higher the entropy.
Entropy in Classification
Consider a binary classification dataset.
Pure Dataset
Pass
Pass
Pass
Pass
Pass
Prediction is easy.
Entropy:
0 (the minimum)
Mixed Dataset
Pass
Fail
Pass
Fail
Pass
Prediction is harder.
Entropy:
High
Visualizing Purity
Pure Dataset:
Pass Pass Pass Pass
Entropy:
0
Completely predictable.
Mixed Dataset:
Pass Fail Pass Fail
Entropy:
High
Uncertain.
Why Decision Trees Need Entropy
Decision Trees try to create groups that are as pure as possible.
Example:
Before Split:
Pass
Pass
Fail
Pass
Fail
Mixed classes.
After Split:
Group 1:
Pass
Pass
Pass
Group 2:
Fail
Fail
Much cleaner.
Entropy helps the tree identify such useful splits.
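To make the split above concrete, here is a minimal sketch that scores the data before and after the split. It uses Shannon entropy in bits (the formula introduced in the next section); the helper names `entropy` and `weighted_entropy` are illustrative choices, not part of any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def weighted_entropy(groups):
    """Average entropy of the groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

before = ["Pass", "Pass", "Fail", "Pass", "Fail"]
group_1 = ["Pass", "Pass", "Pass"]
group_2 = ["Fail", "Fail"]

print(round(entropy(before), 3))                       # 0.971
print(round(weighted_entropy([group_1, group_2]), 3))  # 0.0
```

The mixed parent scores close to 1, while the two pure groups score 0, which is what "much cleaner" means numerically.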
Entropy Formula
The mathematical definition of entropy is:
$$H(S) = -\sum_{i} p_i \log_2 p_i$$
Where:
- $p_i$ = Probability of each class $i$ in the dataset $S$
- $\log_2$ = Logarithm base 2
This formula measures uncertainty.
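As a rough translation of the formula into code, the sketch below sums `-p * log2(p)` over the class probabilities. The function name `entropy_from_probs` is an assumption made for illustration, and the sketch assumes every probability is greater than zero. The three inputs are the class proportions of the datasets used in the examples that follow.

```python
from math import log2

def entropy_from_probs(probs):
    """Shannon entropy in bits; assumes every probability is > 0."""
    return sum(-p * log2(p) for p in probs)

print(entropy_from_probs([1.0]))                  # 0.0   (4 Pass, 0 Fail)
print(entropy_from_probs([0.5, 0.5]))             # 1.0   (2 Pass, 2 Fail)
print(round(entropy_from_probs([0.8, 0.2]), 3))   # 0.722 (4 Pass, 1 Fail)
```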
Understanding the Formula Intuitively
Don't worry about memorizing it immediately.
Focus on these key ideas:
When a class dominates:
One Class ≫ Others
Entropy decreases.
When classes are evenly distributed:
50%
50%
Entropy increases.
Example 1: Pure Dataset
Dataset:
| Class |
|---|
| Pass |
| Pass |
| Pass |
| Pass |
Probabilities:
$$P(\text{Pass}) = \frac{4}{4} = 1$$
Entropy:
$$H(S) = -(1 \cdot \log_2 1) = 0$$
Interpretation
Entropy:
$$H(S) = 0$$
means:
Perfect certainty.
No randomness.
Example 2: Balanced Dataset
Dataset:
| Class |
|---|
| Pass |
| Fail |
| Pass |
| Fail |
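Probabilities:
$$P(\text{Pass}) = \frac{2}{4} = 0.5, \quad P(\text{Fail}) = \frac{2}{4} = 0.5$$
Entropy:
$$H(S) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$$
Interpretation
Entropy:
$$H(S) = 1$$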
Maximum uncertainty for binary classification.
Entropy Values
For binary classification:
| Entropy | Meaning |
|---|---|
| 0 | Pure Dataset |
| 0.3 | Low Uncertainty |
| 0.6 | Moderate Uncertainty |
| 1 | Maximum Uncertainty |
Example 3: Mostly One Class
Dataset:
Pass
Pass
Pass
Pass
Fail
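Probabilities:
$$P(\text{Pass}) = \frac{4}{5} = 0.8$$
and
$$P(\text{Fail}) = \frac{1}{5} = 0.2$$
Entropy:
$$H(S) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) \approx 0.722$$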
This is lower than 1 because the dataset is more predictable.
Entropy Visualization
Class Distribution
100%-0% → Entropy = 0
80%-20% → Entropy ≈ 0.72
50%-50% → Entropy = 1
Entropy increases as class balance approaches equality.
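The trend can be checked with a short sketch that evaluates binary entropy at several class proportions. The function name `binary_entropy` and the chosen proportions are illustrative assumptions.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a two-class dataset with proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure dataset has no uncertainty
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Walk from a pure dataset toward a perfectly balanced one.
for p in [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]:
    print(f"{p:.0%}-{1 - p:.0%} -> {binary_entropy(p):.2f}")
```

Each step toward 50%-50% prints a larger value, ending at 1.00.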
Multi-Class Entropy
Entropy also works for multiple classes.
Example:
Cat
Dog
Horse
Formula remains the same.
We simply include all class probabilities.
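A minimal sketch of the multi-class case, using the three classes above. The `skewed` list is a made-up dataset added only for comparison. One useful fact: with k equally likely classes the entropy is log2(k), so the maximum for three classes is about 1.585 rather than 1.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

balanced = ["Cat", "Dog", "Horse"]
skewed = ["Cat", "Cat", "Cat", "Cat", "Dog", "Horse"]  # hypothetical data

print(round(entropy(balanced), 3))  # 1.585
print(round(log2(3), 3))            # 1.585
print(round(entropy(skewed), 3))    # 1.252
```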
Why Entropy is Important
Decision Trees aim to:
Reduce Entropy
Every split should make the data more organized.
Example:
Before Split:
Pass
Fail
Pass
Fail
Entropy:
High.
After Split:
Pass Pass
Fail Fail
Entropy:
Low.
Good split.
Relationship Between Entropy and Purity
High Purity
↓
Low Entropy
Low Purity
↓
High Entropy
This relationship is fundamental to Decision Trees.
Real-World Example: Loan Approval
Dataset:
| Applicant | Result |
|---|---|
| A | Approved |
| B | Rejected |
| C | Approved |
| D | Rejected |
Entropy:
High.
If we split using:
Credit Score
and create cleaner groups:
Entropy decreases.
The split is useful.
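The sketch below illustrates the idea. The results match the table above, but the credit scores and the threshold of 650 are invented purely for illustration; they are not part of the original dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

# (applicant, credit score, result); the scores are hypothetical.
applicants = [
    ("A", 720, "Approved"),
    ("B", 580, "Rejected"),
    ("C", 690, "Approved"),
    ("D", 610, "Rejected"),
]
threshold = 650  # hypothetical cut-off

high = [result for _, score, result in applicants if score >= threshold]
low = [result for _, score, result in applicants if score < threshold]

parent = entropy([result for _, _, result in applicants])
# Size-weighted average entropy of the two groups after the split.
children = (len(high) * entropy(high) + len(low) * entropy(low)) / len(applicants)

print(parent)    # 1.0
print(children)  # 0.0
```

With these made-up scores the split separates the classes perfectly, so entropy drops from 1 to 0. Real data rarely splits this cleanly.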
Real-World Example: Spam Detection
Dataset:
Spam
Spam
Spam
Not Spam
Spam
Mostly one class.
Entropy is about 0.72, below the maximum of 1.
The model already has substantial certainty.
Entropy and Information Theory
Entropy was originally introduced by:
Claude Shannon
in Information Theory.
The same concept later became fundamental in Machine Learning.
Python Example
Calculating entropy manually:
import numpy as np
p1 = 0.5  # P(Pass)
p2 = 0.5  # P(Fail)
entropy = -(p1 * np.log2(p1) + p2 * np.log2(p2))  # entropy in bits
print(entropy)
Output:
1.0
Using SciPy
from scipy.stats import entropy
print(entropy([0.5, 0.5], base=2))  # 1.0
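One caveat with the manual calculation: it breaks when a class has probability 0, because the logarithm of 0 is undefined. With NumPy, `np.log2(0)` evaluates to `-inf` and the product `0 * -inf` becomes `nan`. The usual convention is to treat `0 * log2(0)` as 0. A minimal standard-library sketch of that convention, with `entropy_bits` as an illustrative name:

```python
from math import log2

def entropy_bits(probs):
    """Entropy in bits, skipping zero probabilities (0 * log2(0) is taken as 0)."""
    return sum(-p * log2(p) for p in probs if p > 0)

try:
    log2(0)  # what a naive formula would evaluate for an absent class
except ValueError as err:
    print("log2(0) fails:", err)

print(entropy_bits([1.0, 0.0]))  # 0.0
print(entropy_bits([0.5, 0.5]))  # 1.0
```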
Common Misconceptions
High Entropy Means More Data
Incorrect.
Entropy measures uncertainty, not dataset size.
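A quick sketch to check this: two datasets with the same class proportions but very different sizes get the same entropy. The dataset sizes here are arbitrary choices for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

small = ["Pass", "Fail"] * 2      # 4 samples, 50%-50%
large = ["Pass", "Fail"] * 2000   # 4000 samples, 50%-50%
pure_large = ["Pass"] * 4000      # 4000 samples, all one class

print(entropy(small), entropy(large), entropy(pure_large))  # 1.0 1.0 0.0
```

Only the class proportions matter: the large pure dataset has lower entropy than the tiny balanced one.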
Entropy Only Works for Decision Trees
Incorrect.
Entropy appears throughout:
- Information Theory
- Deep Learning
- Compression Algorithms
- Communication Systems
Entropy is Difficult Mathematics
The formula may look complex, but the intuition is simple:
More Uncertainty
↓
More Entropy
Best Practices
- Focus on intuition before formulas
- Think in terms of uncertainty
- Understand purity and impurity
- Relate entropy to class distributions
- Use visual examples whenever possible
Entropy Summary
| Dataset Type | Entropy |
|---|---|
| Completely Pure | 0 |
| Mostly One Class | Low |
| Balanced Classes (Maximum Uncertainty) | Highest (1 for binary classification) |
Why Entropy is Important
Entropy is central to Decision Trees because it provides a mathematical way to measure uncertainty and impurity in data. By quantifying how mixed a dataset is, entropy helps algorithms determine which splits create the most organized and informative groups.
Understanding entropy is essential because it leads directly to the next concept: Information Gain, the mechanism Decision Trees use to decide which feature should be selected at each split.
In the next article, we will explore Information Gain, where we will learn how Decision Trees use entropy reduction to choose the best possible split in a dataset.