In the previous article, we learned about Entropy, which measures the uncertainty or impurity in a dataset.
We discovered:
High Entropy
↓
More Uncertainty
Low Entropy
↓
More Certainty
But a Decision Tree faces an important challenge.
Suppose we have multiple features:
- Age
- Income
- Credit Score
- Education
Which feature should the tree choose first?
How does the algorithm decide which split is best?
This is where Information Gain comes in.
Information Gain measures how much uncertainty is reduced after splitting the data using a particular feature.
The feature that reduces uncertainty the most is chosen by the Decision Tree.
What is Information Gain?
Information Gain (IG) measures the reduction in entropy achieved by splitting a dataset using a specific feature.
Intuition:
Entropy Before Split
-
Entropy After Split
=
Information Gain
A larger Information Gain means:
Better Split
because the split creates more organized groups.
Why Do We Need Information Gain?
Consider a dataset:
| Student | Result |
|---|---|
| 1 | Pass |
| 2 | Fail |
| 3 | Pass |
| 4 | Fail |
Classes are mixed.
Entropy is high.
Now suppose we split using:
Study Hours
Result:
Group 1:
Pass
Pass
Group 2:
Fail
Fail
The uncertainty disappears.
This is an excellent split.
Information Gain quantifies how good that split is.
Intuition Behind Information Gain
Imagine a detective solving a mystery.
Initially:
Many Suspects
High uncertainty.
A new clue appears.
Now:
Only Two Suspects
Uncertainty decreases.
The clue provided information.
Similarly:
A good feature provides information about the target variable.
Information Gain Formula
Formula:
$$IG = \text{Entropy}(\text{Parent}) - \text{Weighted Entropy}(\text{Children})$$
Where:
- Parent = Dataset before split
- Children = Datasets after split
Information Gain measures how much entropy is removed.
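The formula can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the helper names `entropy` and `information_gain` and the six-student dataset are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical data: a split that only partly separates the classes.
parent = ["Pass", "Pass", "Pass", "Fail", "Fail", "Fail"]
children = [["Pass", "Pass", "Fail"], ["Pass", "Fail", "Fail"]]
gain = information_gain(parent, children)
print(round(gain, 3))  # 0.082
```

The gain is small because both child groups are still mixed. A split that separated the classes completely would return the full parent entropy.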
Understanding the Formula
Before splitting:
Mixed Data
After splitting:
Cleaner Groups
If entropy decreases significantly:
High Information Gain
If entropy barely changes:
Low Information Gain
Step-by-Step Example
Dataset:
| Student | Result |
|---|---|
| A | Pass |
| B | Pass |
| C | Fail |
| D | Fail |
Step 1: Calculate Parent Entropy
Class Distribution:
Pass = 2
Fail = 2
Probabilities:
$$P(\text{Pass}) = 0.5, \quad P(\text{Fail}) = 0.5$$
Entropy:
$$-(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$$
Step 2: Split Using Study Hours
Group 1:
Pass
Pass
Group 2:
Fail
Fail
Entropy:
0
for both groups.
Step 3: Calculate Information Gain
$$IG = 1 - 0 = 1$$
Perfect split.
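The three steps can be checked with a short script. This is a sketch under the assumption that the Study Hours split puts students A and B in one group and C and D in the other; the variable names are chosen for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# Results of students A-D from the table in this example.
parent = ["Pass", "Pass", "Fail", "Fail"]
# Assumed outcome of the Study Hours split: one all-Pass group, one all-Fail group.
group_1 = ["Pass", "Pass"]
group_2 = ["Fail", "Fail"]

# Step 1: entropy before the split.
parent_entropy = entropy(parent)
# Step 2: entropy after the split, weighted by group size.
weighted_child_entropy = (
    len(group_1) / len(parent) * entropy(group_1)
    + len(group_2) / len(parent) * entropy(group_2)
)
# Step 3: Information Gain.
info_gain = parent_entropy - weighted_child_entropy

print(parent_entropy, weighted_child_entropy, info_gain)  # 1.0 0.0 1.0
```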
What Does High Information Gain Mean?
High Information Gain means:
Feature provides useful information.
The feature successfully separates classes.
Decision Trees prefer such features.
What Does Low Information Gain Mean?
Suppose after splitting:
Group 1:
Pass
Fail
Group 2:
Pass
Fail
Both groups remain mixed.
Entropy remains high.
Information Gain becomes very small. Here it is exactly 0, because each group is just as mixed as the parent.
This feature is not useful.
Visual Example
Before Split:
Pass
Fail
Pass
Fail
Entropy:
High.
After Good Split:
Pass
Pass
Fail
Fail
Entropy:
Low.
Information Gain:
High.
Another Example
Suppose a bank wants to predict loan approval.
Features:
- Income
- Age
- Credit Score
Target:
Approved
Rejected
The tree evaluates:
Income
Age
Credit Score
for Information Gain.
The feature producing the largest entropy reduction is selected.
Why Weighted Entropy is Used
Suppose:
Dataset contains:
100 Samples
After splitting:
90 Samples
10 Samples
The larger group should influence the calculation more.
Therefore, child entropies are weighted by their sizes.
Weighted Entropy Formula
$$\text{Weighted Entropy} = \sum \frac{|\text{Child}|}{|\text{Parent}|} \times \text{Entropy}(\text{Child})$$
This ensures fair evaluation of splits.
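As a minimal sketch, the weighting can be written as a small function. The name `weighted_entropy` and its `(sample_count, entropy)` input format are assumptions for this illustration; the numbers are the same ones used in the worked example in this section.

```python
def weighted_entropy(children, parent_size):
    """children: list of (sample_count, entropy) pairs, one per child node."""
    return sum(count / parent_size * child_entropy for count, child_entropy in children)


# 100 parent samples split into children of 80 and 20 samples.
result = weighted_entropy([(80, 0.5), (20, 0.2)], parent_size=100)
print(round(result, 2))  # 0.44
```

The larger child contributes 0.40 of the total and the smaller child only 0.04, which is the size weighting at work.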
Example
Parent:
100 samples.
Child A:
80 samples.
Entropy:
0.5
Child B:
20 samples.
Entropy:
0.2
Weighted Entropy:
$$\frac{80}{100}(0.5) + \frac{20}{100}(0.2) = 0.44$$
Information Gain:
$$\text{Entropy}(\text{Parent}) - 0.44$$
Decision Tree Building Process
Decision Trees repeatedly:
Calculate Entropy
↓
Calculate Information Gain
↓
Choose Best Feature
↓
Split Data
↓
Repeat
This process continues recursively.
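The loop above can be sketched as a small recursive function in the style of ID3. This is a simplified illustration for categorical features only, not how any particular library builds trees. The function names and the toy loan data are invented for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def split_by(rows, labels, feature):
    """Group rows and labels by the value each row has for `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[feature], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def information_gain(rows, labels, feature):
    groups = split_by(rows, labels, feature)
    weighted = sum(len(ls) / len(labels) * entropy(ls) for _, ls in groups.values())
    return entropy(labels) - weighted


def build_tree(rows, labels, features):
    # Stop when the node is pure or no features remain; predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Choose the feature with the highest Information Gain, then recurse.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    return {
        best: {
            value: build_tree(sub_rows, sub_labels, remaining)
            for value, (sub_rows, sub_labels) in split_by(rows, labels, best).items()
        }
    }


# Hypothetical toy data, invented for illustration only.
rows = [
    {"income": "high", "credit": "good"},
    {"income": "high", "credit": "bad"},
    {"income": "low", "credit": "good"},
    {"income": "low", "credit": "bad"},
]
labels = ["Approved", "Rejected", "Approved", "Rejected"]
tree = build_tree(rows, labels, ["income", "credit"])
print(tree)  # {'credit': {'good': 'Approved', 'bad': 'Rejected'}}
```

In this toy data, `income` has zero gain and `credit` separates the classes completely, so the tree needs only one split.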
Information Gain and Root Node Selection
The first split in a Decision Tree is called the:
Root Node
The root feature is chosen using:
Highest Information Gain
because it provides the largest reduction in uncertainty.
Example
Feature Evaluation:
| Feature | Information Gain |
|---|---|
| Age | 0.15 |
| Income | 0.30 |
| Credit Score | 0.55 |
Decision Tree selects:
Credit Score
because it has the highest Information Gain.
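In code, the selection is simply a maximum over the computed gains. The dictionary below copies the values from the table; the variable names are assumptions for this sketch.

```python
# Information Gain per feature, taken from the table above.
gains = {"Age": 0.15, "Income": 0.30, "Credit Score": 0.55}

# The root feature is the one with the largest gain.
root_feature = max(gains, key=gains.get)
print(root_feature)  # Credit Score
```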
Information Gain and Pure Nodes
The goal of a Decision Tree is to create:
Pure Nodes
Pure nodes contain only one class.
Example:
Pass
Pass
Pass
Entropy:
0
No further splitting is needed.
Information Gain in Multi-Class Problems
Information Gain works for:
- Binary Classification
- Multi-Class Classification
Example:
Cat
Dog
Horse
Bird
Entropy and Information Gain calculations remain the same.
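A quick sketch shows that the same entropy function handles four classes without any change. The helper name `entropy` and the two small label lists are assumptions for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits); works for any number of classes."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


balanced = ["Cat", "Dog", "Horse", "Bird"]
skewed = ["Cat", "Cat", "Dog", "Horse"]

print(entropy(balanced))  # 2.0
print(entropy(skewed))    # 1.5
```

With more than two classes, entropy can exceed 1. For k equally likely classes the maximum is log2(k), which is 2 bits for four classes.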
Advantages of Information Gain
- Easy to interpret
- Works well for classification
- Directly measures uncertainty reduction
- Creates meaningful decision tree splits
Limitations of Information Gain
Bias Toward Many Categories
Suppose a feature:
Student ID
Every student has a unique ID.
Splitting by ID creates perfectly pure groups.
Information Gain becomes artificially high.
However:
The feature has no predictive value.
Example
| Student ID | Result |
|---|---|
| 101 | Pass |
| 102 | Fail |
| 103 | Pass |
Each ID creates a separate node.
This leads to overfitting.
Solution
Algorithms such as:
- C4.5
use:
Gain Ratio
to correct this bias.
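Gain Ratio divides Information Gain by the split information, which is the entropy of the feature's own value distribution. The sketch below assumes that standard C4.5 definition. The function name `gain_and_ratio` and the four-student dataset are invented for this example.

```python
from collections import Counter
from math import log2


def entropy(values):
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def gain_and_ratio(feature_values, labels):
    """Return (information gain, gain ratio) for a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - weighted
    # Split information grows with the number of distinct feature values.
    split_info = entropy(feature_values)
    return gain, (gain / split_info if split_info > 0 else 0.0)


# Hypothetical data: a unique ID per student and a two-valued study-hours feature.
results = ["Pass", "Fail", "Pass", "Fail"]
student_id = [101, 102, 103, 104]
study_hours = ["high", "low", "high", "low"]

print(gain_and_ratio(student_id, results))   # (1.0, 0.5)
print(gain_and_ratio(study_hours, results))  # (1.0, 1.0)
```

Both features reach the same Information Gain, but the ID feature is penalized because it splits the data into many tiny groups.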
Real-World Example: Spam Detection
Features:
- Number of Links
- Sender Reputation
- Subject Keywords
The feature reducing entropy the most becomes the splitting criterion.
Real-World Example: Disease Diagnosis
Features:
- Age
- Blood Pressure
- Cholesterol
Decision Trees evaluate Information Gain for each feature and choose the most informative one.
Python Example
Using Scikit-Learn:
```python
from sklearn.tree import DecisionTreeClassifier

# Use entropy, and therefore Information Gain, to evaluate splits.
model = DecisionTreeClassifier(
    criterion="entropy"
)
```
Here:
criterion = "entropy"
means Information Gain is used for split selection.
Information Gain vs Entropy
| Concept | Purpose |
|---|---|
| Entropy | Measures Uncertainty |
| Information Gain | Measures Reduction in Uncertainty |
Relationship:
Entropy
↓
Information Gain
↓
Best Split Selection
Common Mistakes
Confusing Entropy and Information Gain
Entropy measures impurity.
Information Gain measures impurity reduction.
Assuming High Gain Always Means Good Generalization
Some features may create artificial splits.
Ignoring Overfitting
Decision Trees can continue splitting even when gains become insignificant.
Best Practices
- Use Information Gain for classification trees
- Monitor tree depth
- Avoid overly complex trees
- Validate model performance
- Consider pruning when necessary
Information Gain Summary
| Information Gain Value | Meaning |
|---|---|
| 0 | No Improvement |
| Low | Weak Split |
| Moderate | Useful Split |
| High | Strong Split |
Decision Tree Split Selection Workflow
- Calculate parent entropy
- Try each feature
- Compute child entropies
- Calculate weighted entropy
- Calculate Information Gain
- Select feature with highest gain
- Repeat recursively
Why Information Gain is Important
Information Gain is the engine that drives Decision Trees. It transforms the abstract idea of uncertainty into a measurable quantity and allows the algorithm to systematically choose the most informative features.
By selecting splits that maximize entropy reduction, Information Gain helps Decision Trees build meaningful structures that separate classes effectively and make accurate predictions.
In the next article, we will study the Gini Index, another impurity measure used by Decision Trees. It is the splitting criterion of CART (Classification and Regression Trees) and the default in many modern implementations.
Gini Index in Machine Learning: Complete Beginner’s Guide
In the previous article, we learned how Information Gain uses Entropy to help Decision Trees select the best feature for splitting data.
However, Entropy is not the only way to measure impurity.
Many Decision Tree algorithms use another metric called:
Gini Index
In fact, one of the most widely used Decision Tree algorithms, CART (Classification and Regression Trees), uses the Gini Index by default.
The purpose of the Gini Index is similar to Entropy:
Measure Impurity
Measure Uncertainty
Measure Class Mixing
The lower the impurity, the better the split.
In this article, we will understand the intuition behind the Gini Index, learn how it is calculated, compare it with Entropy, and see how Decision Trees use it to build classification models.
What is the Gini Index?
The Gini Index is a measure of impurity in a dataset.
It tells us:
How Mixed
The Classes Are
A pure dataset has a low Gini Index.
A highly mixed dataset has a high Gini Index.
Intuition Behind the Gini Index
Imagine a bag containing colored balls.
Case 1: Pure Bag
Red
Red
Red
Red
Red
If you randomly pick a ball:
You are always correct in guessing:
Red
Impurity:
Zero
Case 2: Mixed Bag
Red
Blue
Red
Blue
Red
Now there is uncertainty.
Impurity increases.
The Gini Index measures this uncertainty.
Understanding Purity
Pure Dataset:
Pass
Pass
Pass
Pass
Impurity:
Zero
Mixed Dataset:
Pass
Fail
Pass
Fail
Impurity:
High
Why Decision Trees Need the Gini Index
Decision Trees try to create groups that are as pure as possible.
Example:
Before Split:
Pass
Fail
Pass
Fail
After Split:
Pass
Pass
Fail
Fail
The classes become cleaner.
Impurity decreases.
The Gini Index helps identify such splits.
Gini Index Formula
For a classification problem:
$$\text{Gini} = 1 - \sum_i p_i^2$$
Where:
- $p_i$ = Probability of class $i$
The formula measures class impurity.
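The formula translates directly into Python. This is a minimal sketch; the function name `gini` and the sample label lists are assumptions for this illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini(["Pass", "Pass", "Pass", "Pass"]))                   # 0.0
print(gini(["Pass", "Fail", "Pass", "Fail"]))                   # 0.5
print(round(gini(["Pass", "Pass", "Pass", "Pass", "Fail"]), 2))  # 0.32
```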
Understanding the Formula
Focus on intuition first.
When one class dominates:
90%
10%
Impurity decreases.
When classes are balanced:
50%
50%
Impurity increases.
Example 1: Pure Dataset
Dataset:
Pass
Pass
Pass
Pass
Probabilities:
$$P(\text{Pass}) = 1$$
Gini:
$$1 - (1)^2 = 0$$
Interpretation
Gini = 0 means:
Perfect purity.
Example 2: Balanced Dataset
Dataset:
Pass
Fail
Pass
Fail
Probabilities:
$$P(\text{Pass}) = 0.5, \quad P(\text{Fail}) = 0.5$$
Gini:
$$1 - (0.5^2 + 0.5^2) = 1 - (0.25 + 0.25) = 0.5$$
Interpretation
Gini:
0.5
Maximum impurity for binary classification.
Example 3: Mostly One Class
Dataset:
Pass
Pass
Pass
Pass
Fail
Probabilities:
$$P(\text{Pass}) = 0.8, \quad P(\text{Fail}) = 0.2$$
Gini:
$$1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32$$
Lower than 0.5 because the dataset is more pure.
Gini Index Range
For binary classification:
| Gini Value | Meaning |
|---|---|
| 0 | Completely Pure |
| 0.1 | Very Low Impurity |
| 0.3 | Moderate Impurity |
| 0.5 | Maximum Impurity |
Visualizing Impurity
100%-0% → Gini = 0
90%-10% → Gini ≈ 0.18
80%-20% → Gini ≈ 0.32
50%-50% → Gini = 0.5
As classes become balanced, impurity increases.
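These values can be reproduced with a one-line function for the two-class case. The name `binary_gini` is an assumption for this sketch.

```python
def binary_gini(p):
    """Gini impurity of a two-class node where one class has proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)


for p in (1.0, 0.9, 0.8, 0.5):
    print(f"{p:.0%}-{1 - p:.0%} -> Gini = {binary_gini(p):.2f}")
```

The printed values match the list above and peak at the 50%-50% mix.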
How Decision Trees Use Gini Index
Suppose we have:
Age
Income
Credit Score
as candidate features.
For each feature:
- Perform a split.
- Calculate Gini impurity.
- Compute weighted impurity.
- Select the feature with the lowest impurity.
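For a numeric feature, "perform a split" usually means choosing a threshold. The sketch below tries the midpoint between each pair of adjacent values and keeps the one with the lowest weighted Gini. The midpoint search, the function names, and the credit-score data are assumptions for this illustration, not the exact procedure of any library.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


def best_threshold(values, labels):
    """Try a threshold between each pair of adjacent distinct values."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical credit scores and loan outcomes.
credit_score = [580, 610, 640, 700, 720, 760]
outcome = ["Rejected", "Rejected", "Rejected", "Approved", "Approved", "Approved"]
print(best_threshold(credit_score, outcome))  # (670.0, 0.0)
```

Running the same search on every candidate feature and keeping the lowest score gives the feature and threshold for the node.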
Weighted Gini Formula
After splitting:
$$\text{Weighted Gini} = \sum \frac{|\text{Child}|}{|\text{Parent}|} \times \text{Gini}(\text{Child})$$
The split producing the lowest weighted Gini is selected.
Example
Parent Dataset:
100 samples.
Split:
Child A:
80 samples
Gini:
0.2
Child B:
20 samples
Gini:
0.1
Weighted Gini:
$$\frac{80}{100}(0.2) + \frac{20}{100}(0.1) = 0.18$$
The Decision Tree compares this value with other possible splits.
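The comparison between candidate splits can be sketched as follows. The first split uses the numbers from this example; the second split is a hypothetical alternative invented only to have something to compare against.

```python
def weighted_gini(children, parent_size):
    """children: list of (sample_count, gini) pairs, one per child node."""
    return sum(count / parent_size * g for count, g in children)


candidate_splits = {
    "split from the example": [(80, 0.2), (20, 0.1)],
    "hypothetical alternative": [(50, 0.4), (50, 0.3)],
}
scores = {
    name: weighted_gini(children, parent_size=100)
    for name, children in candidate_splits.items()
}
best_split = min(scores, key=scores.get)

print({name: round(score, 2) for name, score in scores.items()})
print(best_split)  # split from the example
```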
Gini Reduction
Just as Entropy uses Information Gain,
Gini uses:
Impurity Reduction
Good splits significantly reduce impurity.
Example
Before Split:
Pass
Fail
Pass
Fail
Gini:
High.
After Split:
Pass
Pass
Fail
Fail
Gini:
0
Excellent split.
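The reduction for this example can be computed as parent Gini minus the weighted Gini of the children. The function names below are assumptions for this sketch.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_reduction(parent, children):
    """Parent Gini minus the size-weighted Gini of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * gini(child) for child in children)
    return gini(parent) - weighted


parent = ["Pass", "Fail", "Pass", "Fail"]
children = [["Pass", "Pass"], ["Fail", "Fail"]]
reduction = gini_reduction(parent, children)
print(reduction)  # 0.5
```

The parent starts at the binary maximum of 0.5 and both children are pure, so the split removes all of the impurity.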
Gini Index vs Entropy
Both measure impurity.
However, they use different formulas.
Entropy Formula
$$\text{Entropy} = -\sum_i p_i \log_2(p_i)$$
Gini Formula
$$\text{Gini} = 1 - \sum_i p_i^2$$
Comparison
| Property | Entropy | Gini |
|---|---|---|
| Measures Impurity | Yes | Yes |
| Uses Logarithms | Yes | No |
| Computational Cost | Higher | Lower |
| Common in ID3/C4.5 | Yes | No |
| Common in CART | No | Yes |
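To get a feel for how the two measures relate, they can be evaluated side by side for a two-class node. The function names and the chosen proportions are assumptions for this sketch.

```python
from math import log2


def binary_entropy(p):
    """Entropy (in bits) of a two-class node with class proportion p."""
    if p in (0, 1):
        return 0.0  # a pure node has no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def binary_gini(p):
    """Gini impurity of a two-class node with class proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)


for p in (0.5, 0.7, 0.9, 1.0):
    print(f"p={p:.1f}  entropy={binary_entropy(p):.3f}  gini={binary_gini(p):.3f}")
```

Both measures are highest at p = 0.5 and fall to 0 for a pure node. They differ in scale, since binary entropy peaks at 1 and binary Gini at 0.5.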
Why CART Uses Gini
Gini is computationally simpler.
No logarithms are required.
This makes training slightly faster.
As a result:
Many modern implementations use:
criterion = "gini"
as the default.
Does Gini Perform Better Than Entropy?
In practice:
Performance differences are usually small.
Most datasets produce very similar trees.
Therefore:
There is rarely a universally better choice.
Example: Loan Approval
Features:
- Income
- Credit Score
- Age
Target:
Approved
Rejected
The Decision Tree evaluates Gini impurity for each split and chooses the feature producing the purest groups.
Example: Spam Detection
Features:
- Number of Links
- Sender Reputation
- Email Length
The split reducing impurity the most becomes the next node.
Multi-Class Gini Index
The Gini Index naturally extends to multiple classes.
Example:
Cat
Dog
Horse
Bird
Formula remains identical.
Simply include all class probabilities.
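As a small sketch, the function below takes the class probabilities directly. The name `gini_from_probabilities` and the two probability lists are assumptions for this example.

```python
def gini_from_probabilities(probabilities):
    """Gini impurity from a list of class probabilities that sum to 1."""
    return 1 - sum(p ** 2 for p in probabilities)


# Cat, Dog, Horse, Bird in equal proportions.
print(gini_from_probabilities([0.25, 0.25, 0.25, 0.25]))  # 0.75
# One class fills the whole node.
print(gini_from_probabilities([1.0, 0.0, 0.0, 0.0]))      # 0.0
```

With k equally likely classes the maximum is 1 - 1/k, so the 0.5 ceiling applies only to binary problems.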
Python Example
Using Scikit-Learn:
```python
from sklearn.tree import DecisionTreeClassifier

# Gini impurity is also the default when no criterion is given.
model = DecisionTreeClassifier(
    criterion="gini"
)
```
Train:
```python
model.fit(X_train, y_train)
```
Using Entropy Instead
```python
model = DecisionTreeClassifier(
    criterion="entropy"
)
```
Both approaches are supported.
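A small end-to-end sketch makes the comparison concrete. The toy dataset (hours studied mapped to Pass or Fail) is invented for this example, and `random_state=0` is set only to keep the run reproducible.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: hours studied -> result (0 = Fail, 1 = Pass).
X_train = [[1], [2], [8], [9]]
y_train = [0, 0, 1, 1]

models = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, random_state=0)
    model.fit(X_train, y_train)
    models[criterion] = model
    # tree_.impurity[0] is the impurity of the root node under the chosen criterion.
    root_impurity = model.tree_.impurity[0]
    predictions = model.predict([[1.5], [8.5]])
    print(criterion, root_impurity, predictions)
```

On data this simple, both criteria should produce the same predictions. Only the impurity values at the nodes differ, because each criterion measures the same mix on its own scale.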
Real-World Applications
Medical Diagnosis
Creating disease prediction trees.
Credit Scoring
Loan approval systems.
Customer Churn Prediction
Identifying customers likely to leave.
Fraud Detection
Classifying suspicious transactions.
Marketing
Customer segmentation and targeting.
Common Mistakes
Assuming Gini Measures Accuracy
Gini measures impurity, not prediction accuracy.
Thinking Lower Gini Means Better Model
Lower Gini at a node is good.
Overall model quality still requires validation.
Assuming Entropy and Gini Produce Completely Different Trees
In practice, differences are often minor.
Best Practices
- Understand impurity before calculations
- Compare Gini and Entropy experimentally
- Monitor overfitting
- Use pruning when necessary
- Validate on unseen data
Gini Index Summary
| Dataset Type | Gini |
|---|---|
| Completely Pure | 0 |
| Mostly One Class | Low |
| Balanced Classes | High |
| Maximum Binary Impurity | 0.5 |
Decision Tree Splitting Workflow
- Calculate Gini impurity
- Evaluate possible splits
- Compute weighted impurity
- Select lowest impurity split
- Create child nodes
- Repeat recursively
Why the Gini Index is Important
The Gini Index is one of the most widely used impurity measures in Machine Learning and serves as the default splitting criterion in many Decision Tree implementations. It provides a simple yet effective way to measure how mixed a dataset is and helps Decision Trees create increasingly pure groups through recursive splitting.
Understanding the Gini Index is essential because it forms the foundation of CART Decision Trees, Random Forests, and many ensemble learning methods. By measuring impurity efficiently, it enables Decision Trees to build accurate and interpretable classification models.
In the next article, we will study the Decision Tree Algorithm, where we will combine Entropy, Information Gain, and Gini Index to understand how Decision Trees are actually constructed from data and used for prediction.