Information Gain in Machine Learning

Last updated: Jun 13, 2026

Author: Christy Harshitha Dakarapu

In the previous article, we learned about Entropy, which measures the uncertainty or impurity in a dataset.

We discovered:


High Entropy
      ↓
More Uncertainty

Low Entropy
      ↓
More Certainty

But a Decision Tree faces an important challenge.

Suppose we have multiple features:

Age
Income
Credit Score
Education

Which feature should the tree choose first?

How does the algorithm decide which split is best?

This is where Information Gain comes in.

Information Gain measures how much uncertainty is reduced after splitting the data using a particular feature.

The feature that reduces uncertainty the most is chosen by the Decision Tree.

What is Information Gain?

Information Gain (IG) measures the reduction in entropy achieved by splitting a dataset using a specific feature.

Intuition:


Entropy Before Split
         -
Entropy After Split
         =
Information Gain

A larger Information Gain means:


Better Split

because the split creates more organized groups.

Why Do We Need Information Gain?

Consider a dataset:

Student	Result
1	Pass
2	Fail
3	Pass
4	Fail

Classes are mixed.

Entropy is high.

Now suppose we split using:


Study Hours

Result:

Group 1:


Pass
Pass

Group 2:


Fail
Fail

The uncertainty disappears.

This is an excellent split.

Information Gain quantifies how good that split is.

Intuition Behind Information Gain

Imagine a detective solving a mystery.

Initially:


Many Suspects

High uncertainty.

A new clue appears.

Now:


Only Two Suspects

Uncertainty decreases.

The clue provided information.

Similarly:

A good feature provides information about the target variable.

Information Gain Formula

Formula:

$IG=Entropy(Parent)-Weighted\ Entropy(Children)$

Where:

Parent = Dataset before split
Children = Datasets after split

Information Gain measures how much entropy is removed.
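The formula can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function names `entropy` and `information_gain` are chosen here for the example, and entropy is measured in bits (log base 2), as in the previous article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# The Pass/Fail dataset from earlier, split into two pure groups.
parent = ["Pass", "Fail", "Pass", "Fail"]
children = [["Pass", "Pass"], ["Fail", "Fail"]]
print(information_gain(parent, children))  # 1.0
```

Each child is weighted by its share of the parent's samples, which is what "Weighted Entropy(Children)" refers to in the formula.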

Understanding the Formula

Before splitting:


Mixed Data

After splitting:


Cleaner Groups

If entropy decreases significantly:


High Information Gain

If entropy barely changes:


Low Information Gain

Step-by-Step Example

Dataset:

Student	Result
A	Pass
B	Pass
C	Fail
D	Fail

Step 1: Calculate Parent Entropy

Class Distribution:


Pass = 2

Fail = 2

Probabilities:

P(Pass)=0.5

P(Fail)=0.5

Entropy:

$Entropy=-(0.5\log_2 0.5+0.5\log_2 0.5)=1$

Step 2: Split Using Study Hours

Group 1:


Pass
Pass

Group 2:


Fail
Fail

Entropy:

0

for both groups.

Step 3: Calculate Information Gain

$IG=1-\left(\frac{2}{4}\times 0+\frac{2}{4}\times 0\right)=1-0$

$IG=1$

Perfect split.

What Does High Information Gain Mean?

High Information Gain means:


Feature provides useful information.

The feature successfully separates classes.

Decision Trees prefer such features.

What Does Low Information Gain Mean?

Suppose after splitting:

Group 1:


Pass
Fail

Group 2:


Pass
Fail

Both groups remain mixed.

Entropy remains high: each group is still half Pass and half Fail, so its entropy is 1.

Information Gain is 0, because the split removes no uncertainty.

This feature is not useful.
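The arithmetic for this split can be written out as follows, assuming the parent is the four-student dataset with two Pass and two Fail (which is what the two groups above add up to):

```latex
\begin{aligned}
Entropy(Parent) &= 1 \\
Weighted\ Entropy(Children) &= \frac{2}{4}(1) + \frac{2}{4}(1) = 1 \\
IG &= 1 - 1 = 0
\end{aligned}
```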

Visual Example

Before Split:


Pass
Fail
Pass
Fail

Entropy:

High.

After Good Split:


Pass
Pass

Fail
Fail

Entropy:

Low.

Information Gain:

High.

Another Example

Suppose a bank wants to predict loan approval.

Features:

Income
Age
Credit Score

Target:


Approved

Rejected

The tree evaluates:


Income

Age

Credit Score

for Information Gain.

The feature producing the largest entropy reduction is selected.

Why Weighted Entropy is Used

Suppose:

Dataset contains:


100 Samples

After splitting:


90 Samples

10 Samples

The larger group should influence the calculation more.

Therefore, child entropies are weighted by their sizes.

Weighted Entropy Formula

$Weighted\ Entropy=\sum\frac{|Child|}{|Parent|}\times Entropy(Child)$

This ensures fair evaluation of splits.

Example

Parent:

100 samples.

Child A:

80 samples.

Entropy:

0.5

Child B:

20 samples.

Entropy:

0.2

Weighted Entropy:

$\frac{80}{100}(0.5)+\frac{20}{100}(0.2)=0.40+0.04=0.44$

Information Gain:

$IG=Entropy(Parent)-0.44$
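The same calculation as a short Python sketch. The helper name `weighted_entropy` and the `(sample_count, entropy)` pair format are assumptions made for this illustration.

```python
def weighted_entropy(children):
    """children: list of (sample_count, entropy) pairs, one per child node."""
    total = sum(count for count, _ in children)
    return sum(count / total * h for count, h in children)


# Child A: 80 samples, entropy 0.5. Child B: 20 samples, entropy 0.2.
result = weighted_entropy([(80, 0.5), (20, 0.2)])
print(round(result, 2))  # 0.44
```

Child A contributes 0.40 of the total and Child B only 0.04, which shows how the larger group dominates the result.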

Decision Tree Building Process

Decision Trees repeatedly:


Calculate Entropy
        ↓
Calculate Information Gain
        ↓
Choose Best Feature
        ↓
Split Data
        ↓
Repeat

This process continues recursively.

Information Gain and Root Node Selection

The first split in a Decision Tree is called the:


Root Node

The root feature is chosen using:


Highest Information Gain

because it provides the largest reduction in uncertainty.

Example

Feature Evaluation:

Feature	Information Gain
Age	0.15
Income	0.30
Credit Score	0.55

Decision Tree selects:


Credit Score

because it has the highest Information Gain.

Information Gain and Pure Nodes

The goal of a Decision Tree is to create:


Pure Nodes

Pure nodes contain only one class.

Example:


Pass
Pass
Pass

Entropy:

0

No further splitting is needed.

Information Gain in Multi-Class Problems

Information Gain works for:

Binary Classification
Multi-Class Classification

Example:


Cat
Dog
Horse
Bird

Entropy and Information Gain calculations remain the same.
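As a worked illustration, assume the four classes above are equally likely (an assumption made here only for the arithmetic). The same entropy formula then gives:

```latex
Entropy = -\sum_{i=1}^{4} \frac{1}{4}\log_2\frac{1}{4} = \log_2 4 = 2
```

With more than two classes, entropy can exceed 1. Information Gain is still the parent entropy minus the weighted entropy of the children.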

Advantages of Information Gain

Easy to interpret
Works well for classification
Directly measures uncertainty reduction
Creates meaningful decision tree splits

Limitations of Information Gain

Bias Toward Many Categories

Suppose a feature:


Student ID

Every student has a unique ID.

Splitting by ID creates perfectly pure groups.

Information Gain becomes artificially high.

However:

The feature has no predictive value.

Example

Student ID	Result
101	Pass
102	Fail
103	Pass

Each ID creates a separate node.

This leads to overfitting.

Solution

Algorithms such as:

C4.5

use:


Gain Ratio

to correct this bias.
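A minimal sketch of the idea, using the three-student table above. Gain Ratio in C4.5 divides Information Gain by the split information, which is the entropy of the group sizes themselves. The function names here are chosen for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(parent, children):
    """Information Gain divided by split information."""
    total = len(parent)
    weights = [len(child) / total for child in children]
    gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    # Split information grows with the number of branches.
    split_info = sum(-w * log2(w) for w in weights)
    return gain / split_info if split_info > 0 else 0.0


parent = ["Pass", "Fail", "Pass"]
by_student_id = [["Pass"], ["Fail"], ["Pass"]]  # one group per ID

print(round(entropy(parent), 3))                    # 0.918 (this is also the Information Gain)
print(round(gain_ratio(parent, by_student_id), 3))  # 0.579
```

Splitting by ID removes all of the parent entropy, so the raw gain looks perfect. Dividing by the split information (here log2 of 3) penalizes the feature for creating one branch per student.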

Real-World Example: Spam Detection

Features:

Number of Links
Sender Reputation
Subject Keywords

The feature reducing entropy the most becomes the splitting criterion.

Real-World Example: Disease Diagnosis

Features:

Age
Blood Pressure
Cholesterol

Decision Trees evaluate Information Gain for each feature and choose the most informative one.

Python Example

Using Scikit-Learn:


from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion="entropy"
)

Here:


criterion = "entropy"

means Information Gain is used for split selection.

Information Gain vs Entropy

Concept	Purpose
Entropy	Measures Uncertainty
Information Gain	Measures Reduction in Uncertainty

Relationship:


Entropy
      ↓
Information Gain
      ↓
Best Split Selection

Common Mistakes

Confusing Entropy and Information Gain

Entropy measures impurity.

Information Gain measures impurity reduction.

Assuming High Gain Always Means Good Generalization

Some features may create artificial splits.

Ignoring Overfitting

Decision Trees can continue splitting even when gains become insignificant.

Best Practices

Use Information Gain for classification trees
Monitor tree depth
Avoid overly complex trees
Validate model performance
Consider pruning when necessary

Information Gain Summary

Information Gain Value	Meaning
0	No Improvement
Low	Weak Split
Moderate	Useful Split
High	Strong Split

Decision Tree Split Selection Workflow

Calculate parent entropy
Try each feature
Compute child entropies
Calculate weighted entropy
Calculate Information Gain
Select feature with highest gain
Repeat recursively
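The workflow above can be sketched as a small recursive function. This is an ID3-style illustration for categorical features only, not how any particular library builds trees. The loan rows, feature names, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def split(rows, feature):
    """Group rows by the value they take for the given feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return groups


def information_gain(rows, feature, target):
    parent = entropy([r[target] for r in rows])
    weighted = sum(
        len(group) / len(rows) * entropy([r[target] for r in group])
        for group in split(rows, feature).values()
    )
    return parent - weighted


def build_tree(rows, features, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop at a pure node or when no features are left.
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: information_gain(rows, f, target))
    # Stop when even the best feature does not reduce entropy.
    if information_gain(rows, best, target) <= 0:
        return majority
    remaining = [f for f in features if f != best]
    return {
        best: {
            value: build_tree(group, remaining, target)
            for value, group in split(rows, best).items()
        }
    }


# Hypothetical loan data, invented for this example.
rows = [
    {"income": "high", "credit_score": "good", "approved": "yes"},
    {"income": "low", "credit_score": "good", "approved": "yes"},
    {"income": "high", "credit_score": "bad", "approved": "no"},
    {"income": "low", "credit_score": "bad", "approved": "no"},
]
tree = build_tree(rows, ["income", "credit_score"], "approved")
print(tree)  # {'credit_score': {'good': 'yes', 'bad': 'no'}}
```

In this toy data, credit score separates the classes perfectly (gain 1) while income tells us nothing (gain 0), so credit score becomes the root and both children are pure leaves.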

Why Information Gain is Important

Information Gain is the engine that drives Decision Trees. It transforms the abstract idea of uncertainty into a measurable quantity and allows the algorithm to systematically choose the most informative features.

By selecting splits that maximize entropy reduction, Information Gain helps Decision Trees build meaningful structures that separate classes effectively and make accurate predictions.

In the next article, we will study the Gini Index, another impurity measure used by Decision Trees. It is the splitting criterion used by the CART (Classification and Regression Trees) algorithm and the default in many modern implementations.

Gini Index in Machine Learning: Complete Beginner’s Guide

In the previous article, we learned how Information Gain uses Entropy to help Decision Trees select the best feature for splitting data.

However, Entropy is not the only way to measure impurity.

Many Decision Tree algorithms use another metric called:

Gini Index

In fact, one of the most widely used Decision Tree algorithms, CART (Classification and Regression Trees), uses the Gini Index by default.

The purpose of the Gini Index is similar to Entropy:


Measure Impurity
Measure Uncertainty
Measure Class Mixing

The lower the impurity, the better the split.

In this article, we will understand the intuition behind the Gini Index, learn how it is calculated, compare it with Entropy, and see how Decision Trees use it to build classification models.

What is the Gini Index?

The Gini Index is a measure of impurity in a dataset.

It tells us:


How Mixed
The Classes Are

A pure dataset has a Gini Index of 0.

A highly mixed dataset has a high Gini Index.

Intuition Behind the Gini Index

Imagine a bag containing colored balls.

Case 1: Pure Bag


Red
Red
Red
Red
Red

If you randomly pick a ball:

You are always correct in guessing:

Red

Impurity:


Zero

Case 2: Mixed Bag


Red
Blue
Red
Blue
Red

Now there is uncertainty.

Impurity increases.

The Gini Index measures this uncertainty.

Understanding Purity

Pure Dataset:


Pass
Pass
Pass
Pass

Impurity:


Zero

Mixed Dataset:


Pass
Fail
Pass
Fail

Impurity:


High

Why Decision Trees Need the Gini Index

Decision Trees try to create groups that are as pure as possible.

Example:

Before Split:


Pass
Fail
Pass
Fail

After Split:


Pass
Pass

Fail
Fail

The classes become cleaner.

Impurity decreases.

The Gini Index helps identify such splits.

Gini Index Formula

For a classification problem:

$Gini=1-\sum p_i^2$

Where:

$p_i$ = Probability of class $i$

The formula measures class impurity.

Understanding the Formula

Focus on intuition first.

When one class dominates:


90%
10%

Impurity decreases.

When classes are balanced:


50%
50%

Impurity increases.

Example 1: Pure Dataset

Dataset:


Pass
Pass
Pass
Pass

Probabilities:

P(Pass)=1

Gini:

1-(1)^2

0

Interpretation

Gini:

0

means:

Perfect purity.

Example 2: Balanced Dataset

Dataset:


Pass
Fail
Pass
Fail

Probabilities:

P(Pass)=0.5

P(Fail)=0.5

Gini:

1-(0.5^2+0.5^2)

1-(0.25+0.25)

0.5

Interpretation

Gini:

0.5

Maximum impurity for binary classification.

Example 3: Mostly One Class

Dataset:


Pass
Pass
Pass
Pass
Fail

Probabilities:

0.8

and

0.2

Gini:

1-(0.8^2+0.2^2)

1-(0.64+0.04)

0.32

Lower than 0.5 because the dataset is more pure.
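The three examples above can be checked with a short Python sketch. The function name `gini` is chosen here for illustration. It applies the formula directly to a list of labels.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


pure = ["Pass", "Pass", "Pass", "Pass"]
balanced = ["Pass", "Fail", "Pass", "Fail"]
mostly_pass = ["Pass", "Pass", "Pass", "Pass", "Fail"]

print(round(gini(pure), 2))         # 0.0
print(round(gini(balanced), 2))     # 0.5
print(round(gini(mostly_pass), 2))  # 0.32
```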

Gini Index Range

For binary classification:

Gini Value	Meaning
0	Completely Pure
0.1	Very Low Impurity
0.3	Moderate Impurity
0.5	Maximum Impurity

Visualizing Impurity


100%-0% → Gini = 0

90%-10% → Gini ≈ 0.18

80%-20% → Gini ≈ 0.32

50%-50% → Gini = 0.5

As classes become balanced, impurity increases.
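For two classes, the formula simplifies to a single expression. Writing $p$ for the proportion of one class (a symbol introduced here for the derivation), the other class has proportion $1-p$:

```latex
Gini = 1 - p^2 - (1-p)^2 = 2p(1-p)
```

This is 0 when $p=0$ or $p=1$ and reaches its maximum of 0.5 at $p=0.5$, which matches the values listed above (for example, $2 \times 0.9 \times 0.1 = 0.18$).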

How Decision Trees Use Gini Index

Suppose we have:


Age
Income
Credit Score

as candidate features.

For each feature:

Perform a split.
Calculate Gini impurity.
Compute weighted impurity.
Select the feature with the lowest impurity.

Weighted Gini Formula

After splitting:

$Weighted\ Gini=\sum\frac{|Child|}{|Parent|}\times Gini(Child)$

The split producing the lowest weighted Gini is selected.

Example

Parent Dataset:

100 samples.

Split:

Child A:

80 samples

Gini:

0.2

Child B:

20 samples

Gini:

0.1

Weighted Gini:

$\frac{80}{100}(0.2)+\frac{20}{100}(0.1)=0.16+0.02=0.18$

The Decision Tree compares this value with other possible splits.
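That comparison can be sketched as follows. The first candidate uses the numbers from the example above. The second candidate is hypothetical, invented only to have something to compare against.

```python
def weighted_gini(children):
    """children: list of (sample_count, gini) pairs, one per child node."""
    total = sum(count for count, _ in children)
    return sum(count / total * g for count, g in children)


candidate_splits = {
    "split_from_example": [(80, 0.2), (20, 0.1)],
    "hypothetical_alternative": [(50, 0.4), (50, 0.3)],  # made-up values
}

scores = {name: weighted_gini(children) for name, children in candidate_splits.items()}
best = min(scores, key=scores.get)  # lowest weighted Gini wins

print({name: round(score, 2) for name, score in scores.items()})
print(best)  # split_from_example
```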

Gini Reduction

Just as Entropy uses Information Gain,

Gini uses:


Impurity Reduction

Good splits significantly reduce impurity.

Example

Before Split:


Pass
Fail
Pass
Fail

Gini:

0.5 (high).

After Split:


Pass
Pass

Fail
Fail

Gini:

0 for both groups.

Excellent split.
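Written in the same form as the Information Gain formula, the impurity reduction for this split works out as:

```latex
\begin{aligned}
Gini\ Reduction &= Gini(Parent) - Weighted\ Gini(Children) \\
&= 0.5 - \left(\frac{2}{4}\times 0 + \frac{2}{4}\times 0\right) \\
&= 0.5
\end{aligned}
```

A reduction of 0.5 is the largest possible for a binary problem, since the parent started at maximum impurity and both children are pure.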

Gini Index vs Entropy

Both measure impurity.

However, they use different formulas.

Entropy Formula

$Entropy=-\sum p_i\log_2(p_i)$

Gini Formula

$Gini=1-\sum p_i^2$

Comparison

Property	Entropy	Gini
Measures Impurity	Yes	Yes
Uses Logarithms	Yes	No
Computational Cost	Higher	Lower
Common in ID3/C4.5	Yes	No
Common in CART	No	Yes
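To see how closely the two measures track each other, here is a small sketch that evaluates both for a binary problem. The function names and the chosen proportions are for illustration only.

```python
from math import log2


def binary_entropy(p):
    """Entropy (in bits) when one class has proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def binary_gini(p):
    """Gini impurity when one class has proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)


for p in (0.0, 0.1, 0.2, 0.5):
    print(f"p={p:.1f}  entropy={binary_entropy(p):.3f}  gini={binary_gini(p):.3f}")
```

Both measures are 0 for a pure node, rise as the classes become more balanced, and peak at a 50/50 mix (entropy at 1, Gini at 0.5). They differ in scale and shape, but they rank most splits similarly.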

Why CART Uses Gini

Gini is computationally simpler.

No logarithms are required.

This makes training slightly faster.

As a result:

Many modern implementations use:


criterion = "gini"

as the default.

Does Gini Perform Better Than Entropy?

In practice:

Performance differences are usually small.

Most datasets produce very similar trees.

Therefore:

There is rarely a universally better choice.

Example: Loan Approval

Features:

Income
Credit Score
Age

Target:


Approved
Rejected

The Decision Tree evaluates Gini impurity for each split and chooses the feature producing the purest groups.

Example: Spam Detection

Features:

Number of Links
Sender Reputation
Email Length

The split reducing impurity the most becomes the next node.

Multi-Class Gini Index

The Gini Index naturally extends to multiple classes.

Example:


Cat
Dog
Horse
Bird

Formula remains identical.

Simply include all class probabilities.
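As a worked illustration, assume the four classes above are equally likely (an assumption made here only for the arithmetic):

```latex
Gini = 1 - \sum_{i=1}^{4}\left(\frac{1}{4}\right)^2 = 1 - \frac{4}{16} = 0.75
```

The maximum therefore depends on the number of classes: for $k$ equally likely classes it is $1-\frac{1}{k}$, which gives 0.5 for two classes and 0.75 for four.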

Python Example

Using Scikit-Learn:


from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion="gini"
)

Train (here X_train and y_train stand for your training features and labels):


model.fit(X_train, y_train)

Using Entropy Instead


model = DecisionTreeClassifier(
    criterion="entropy"
)

Both approaches are supported.
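A runnable sketch comparing the two criteria on a tiny dataset. The data is hypothetical (hours studied mapped to a Pass/Fail result), and reading the root impurity from `tree_.impurity[0]` is just one convenient way to inspect what the fitted tree computed.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: hours studied -> result (0 = Fail, 1 = Pass).
X = [[1], [2], [8], [9]]
y = [0, 0, 1, 1]

results = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, random_state=0)
    model.fit(X, y)
    results[criterion] = {
        # Impurity of the root node, before any split.
        "root_impurity": float(model.tree_.impurity[0]),
        "predictions": model.predict([[1.5], [8.5]]).tolist(),
    }

for criterion, info in results.items():
    print(criterion, info)
```

On this balanced dataset the root impurity should be 0.5 under Gini and 1.0 under entropy, yet both trees make the same predictions, which is consistent with the point that the two criteria usually lead to very similar trees.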

Real-World Applications

Medical Diagnosis

Creating disease prediction trees.

Credit Scoring

Loan approval systems.

Customer Churn Prediction

Identifying customers likely to leave.

Fraud Detection

Classifying suspicious transactions.

Marketing

Customer segmentation and targeting.

Common Mistakes

Assuming Gini Measures Accuracy

Gini measures impurity, not prediction accuracy.

Thinking Lower Gini Means Better Model

Lower Gini at a node is good.

Overall model quality still requires validation.

Assuming Entropy and Gini Produce Completely Different Trees

In practice, differences are often minor.

Best Practices

Understand impurity before calculations
Compare Gini and Entropy experimentally
Monitor overfitting
Use pruning when necessary
Validate on unseen data

Gini Index Summary

Dataset Type	Gini
Completely Pure	0
Mostly One Class	Low
Balanced Classes	High
Maximum Binary Impurity	0.5

Decision Tree Splitting Workflow

Calculate Gini impurity
Evaluate possible splits
Compute weighted impurity
Select lowest impurity split
Create child nodes
Repeat recursively

Why the Gini Index is Important

The Gini Index is one of the most widely used impurity measures in Machine Learning and serves as the default splitting criterion in many Decision Tree implementations. It provides a simple yet effective way to measure how mixed a dataset is and helps Decision Trees create increasingly pure groups through recursive splitting.

Understanding the Gini Index is essential because it forms the foundation of CART Decision Trees, Random Forests, and many ensemble learning methods. By measuring impurity efficiently, it enables Decision Trees to build accurate and interpretable classification models.

In the next article, we will study the Decision Tree Algorithm, where we will combine Entropy, Information Gain, and Gini Index to understand how Decision Trees are actually constructed from data and used for prediction.
