In the previous article, we learned about Bagging (Bootstrap Aggregating), where multiple models are trained on different bootstrap samples and their predictions are combined.
We also discovered that Decision Trees benefit greatly from Bagging because they tend to have high variance.
This naturally leads to one of the most successful Machine Learning algorithms ever created:
Random Forest
Random Forest builds upon Bagging and introduces an additional idea:
Random Feature Selection
This simple improvement makes Random Forest more diverse, more robust, and often more accurate than a standard bagged collection of Decision Trees.
Today, Random Forest is widely used in:
- Finance
- Healthcare
- Fraud Detection
- Recommendation Systems
- Customer Analytics
because it provides excellent performance with relatively little tuning.
What is Random Forest?
Random Forest is an ensemble learning algorithm that combines multiple Decision Trees and aggregates their predictions.
The name comes from:
Many Decision Trees
↓
Forest
Instead of relying on a single tree:
One Tree
↓
Prediction
Random Forest uses:
Tree 1
Tree 2
Tree 3
Tree 4
...
Tree N
↓
Combined Prediction
Why Not Use One Decision Tree?
Decision Trees have a major weakness:
High Variance
Small changes in data can produce very different trees.
Example:
Dataset A:
Tree A
Dataset B:
Tree B
Predictions may differ significantly.
Random Forest reduces this instability.
Core Idea Behind Random Forest
Random Forest combines:
Bagging
+
Random Feature Selection
This creates a collection of diverse trees.
Diverse trees make different mistakes.
Combining them improves overall performance.
Step 1: Bootstrap Sampling
Random Forest starts by creating multiple bootstrap datasets.
Original Dataset:
A B C D E
Bootstrap Sample 1:
A C C D E
Bootstrap Sample 2:
B B C E D
Bootstrap Sample 3:
A A D E C
Each tree receives a different dataset.
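The sampling step above can be sketched with the standard library alone. This is a minimal illustration, not how any particular library implements it; the helper name `bootstrap_sample` and the seed are assumptions made for the example.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(0)  # fixed seed so the run is repeatable
original = ["A", "B", "C", "D", "E"]

# One bootstrap dataset per tree; three trees here for illustration.
samples = [bootstrap_sample(original, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"Bootstrap Sample {i}: {' '.join(sample)}")
```

Because items are drawn with replacement, a sample can contain duplicates and can miss some of the original items entirely.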
Step 2: Train Multiple Trees
Each bootstrap dataset trains a separate Decision Tree.
Workflow:
Dataset 1 → Tree 1
Dataset 2 → Tree 2
Dataset 3 → Tree 3
So far, this is standard Bagging.
Step 3: Random Feature Selection
This is the key innovation.
Suppose we have:
Age
Salary
Experience
Education
Credit Score
Five features.
When building a split:
A normal Decision Tree considers:
All Features
Random Forest considers:
Random Subset
Example:
Salary
Education
Only these features compete for the split.
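A rough sketch of how a candidate subset might be drawn at each split is shown below. The subset size uses the square root of the feature count, which is a common choice for classification (it matches the `max_features="sqrt"` setting in scikit-learn); the helper `candidate_features` is a hypothetical name used only for this illustration.

```python
import math
import random

features = ["Age", "Salary", "Experience", "Education", "Credit Score"]


def candidate_features(all_features, rng):
    """Pick the features allowed to compete for one split."""
    k = max(1, int(math.sqrt(len(all_features))))  # 5 features -> 2 candidates
    return rng.sample(all_features, k)  # sampled without replacement


rng = random.Random(1)
# A fresh subset is drawn for every split, not once per tree.
for split in range(3):
    print(f"Split {split + 1} candidates:", candidate_features(features, rng))
```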
Why Random Feature Selection Helps
Without randomness:
If one feature is a much stronger predictor than the others, most trees choose it for their first split.
Example:
Credit Score
becomes the root node in every tree.
Trees become highly similar.
Random feature selection forces diversity.
Example
Tree 1:
Root = Credit Score
Tree 2:
Root = Salary
Tree 3:
Root = Education
Trees become less correlated.
Step 4: Prediction
Each tree generates a prediction.
Example:
Tree 1 → Fraud
Tree 2 → Fraud
Tree 3 → Genuine
Tree 4 → Fraud
Tree 5 → Fraud
Majority Voting
Votes:
Fraud = 4
Genuine = 1
Final Prediction:
Fraud
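The vote count above can be reproduced in a few lines. This is a sketch of plain majority (hard) voting using the five tree outputs from the example.

```python
from collections import Counter

# Predictions from the five trees in the example above.
tree_votes = ["Fraud", "Fraud", "Genuine", "Fraud", "Fraud"]

vote_counts = Counter(tree_votes)
# most_common(1) returns the single (label, count) pair with the most votes.
final_prediction, n_votes = vote_counts.most_common(1)[0]

print(vote_counts)       # Counter({'Fraud': 4, 'Genuine': 1})
print(final_prediction)  # Fraud
```

Note that some implementations differ in the details: scikit-learn, for instance, averages the class probabilities predicted by each tree rather than counting hard votes.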
Regression in Random Forest
For regression:
Predictions are averaged.
Example:
Tree 1 → 45
Tree 2 → 50
Tree 3 → 55
Prediction:
(45 + 50 + 55) / 3 = 50
Random Forest Workflow
Original Dataset
↓
Bootstrap Samples
↓
Multiple Trees
↓
Random Features
↓
Predictions
↓
Voting / Averaging
↓
Final Output
Why Random Forest Works
Each tree sees:
- Different data
- Different features
Therefore:
Different Errors
Combining predictions reduces overall error.
Variance Reduction
Single Tree:
High Variance
Random Forest:
Lower Variance
because many trees are averaged together.
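A standard result makes this more precise. Assume, purely for illustration, that the forest averages $B$ tree predictions $T_b(x)$, each with variance $\sigma^2$, and that any two trees have correlation $\rho$. These symbols are notation introduced here, not part of the algorithm itself.

```latex
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as more trees are added, but the first does not. Under these assumptions, that is why lowering the correlation between trees through random feature selection matters in addition to simply adding trees.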
Example: Exam Prediction
Single Tree:
Accuracy = 78%
Random Forest:
Accuracy = 88%
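These figures are illustrative. The improvement comes from averaging many independently trained trees, so that their individual errors partly cancel out.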
Out-of-Bag (OOB) Samples
Remember:
Because bootstrap sampling draws with replacement, some observations are never selected for a given tree.
Example:
Original:
A B C D E
Bootstrap Sample:
A C C D E
Sample:
B
is excluded.
This becomes an:
Out-of-Bag Sample
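The idea can be sketched with the standard library: whatever a tree's bootstrap sample never drew is out-of-bag for that tree. The seed and variable names are assumptions for the example, so the exact items left out will depend on the draw.

```python
import random

rng = random.Random(0)
original = ["A", "B", "C", "D", "E"]

bootstrap = rng.choices(original, k=len(original))
# Anything never drawn into the bootstrap sample is out-of-bag for this tree.
out_of_bag = [item for item in original if item not in bootstrap]

print("Bootstrap sample:", bootstrap)
print("Out-of-bag:", out_of_bag)

# With n rows, a given row is missed with probability (1 - 1/n) ** n,
# which approaches 1/e (about 0.368) as n grows.
n = 10_000
print(round((1 - 1 / n) ** n, 3))  # 0.368
```

So for a reasonably large dataset, each tree leaves roughly a third of the rows unseen, which is what makes them usable for evaluation.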
OOB Evaluation
Each observation can be scored using only the trees that did not see it during training, which gives an estimate of performance on unseen data.
Benefits:
- No separate validation set required
- Efficient evaluation
- Built into Random Forest
Feature Importance
One major advantage of Random Forest:
Feature Importance
The algorithm estimates how useful each feature is.
Example:
| Feature | Importance |
|---|---|
| Credit Score | 0.42 |
| Income | 0.30 |
| Age | 0.18 |
| Location | 0.10 |
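Higher importance means the feature contributed more to the trees' splits. In scikit-learn, these impurity-based scores are normalized to sum to 1.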
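A table like the one above could be produced along the following lines. This is a sketch on synthetic data from scikit-learn's `make_classification`; the feature names `f0` to `f3` are placeholders, and the actual scores will differ from the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 features, the first 2 informative (illustrative only).
X, y = make_classification(
    n_samples=500,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    shuffle=False,  # keep the informative features in the first columns
    random_state=0,
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```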
Classification Example
Predict:
Spam
Not Spam
Trees vote.
Majority class wins.
Regression Example
Predict:
House Price
Trees estimate prices.
Average prediction becomes final output.
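To check the averaging claim, one can compare a forest's output with the mean of its individual trees. The sketch below assumes scikit-learn's `RandomForestRegressor` and synthetic data from `make_regression`, standing in for real house prices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (a stand-in for a house-price dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

x_new = X[:3]
# One row of predictions per tree: shape (n_trees, n_samples).
per_tree = np.array([tree.predict(x_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Mean of tree predictions:", manual_average)
print("Forest prediction:       ", forest.predict(x_new))
```

The two printed arrays should agree, since the forest's prediction is the average over its trees.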
Advantages of Random Forest
High Accuracy
Often performs well without extensive tuning.
Reduced Overfitting
More robust than a single Decision Tree.
Handles Non-Linear Relationships
Captures complex patterns.
Feature Importance
Provides useful insights.
Works with Large Datasets
Scales reasonably well, because the trees are independent and can be trained in parallel.
Handles Missing Values Better
Some implementations tolerate missing values directly, but support varies by library and version, so imputation may still be needed.
Limitations of Random Forest
Reduced Interpretability
A forest of hundreds of trees is difficult to explain.
Increased Computational Cost
Training many trees requires more resources.
Larger Memory Usage
Many trees must be stored.
Slower Predictions
Compared to a single tree.
Random Forest vs Decision Tree
| Decision Tree | Random Forest |
|---|---|
| Single Tree | Many Trees |
| High Variance | Lower Variance |
| Easier to Interpret | Harder to Interpret |
| Faster | Slower |
| More Overfitting Risk | Less Overfitting Risk |
Random Forest vs Bagging
| Bagging | Random Forest |
|---|---|
| Bootstrap Sampling | Bootstrap Sampling |
| Multiple Trees | Multiple Trees |
| Considers All Features at Each Split | Considers a Random Feature Subset at Each Split |
| Less Diversity | More Diversity |
Random Forest is essentially Bagging of Decision Trees with one addition: a random subset of features is considered at each split.
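One way to see the difference in practice is to toggle the `max_features` parameter of scikit-learn's `RandomForestClassifier`: with `max_features=None` every split sees all features, which behaves like bagged trees, while `"sqrt"` restricts each split to a random subset. The sketch below uses synthetic data, so the scores are illustrative and either variant may come out ahead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

models = {
    # Every split considers all 20 features (bagged trees).
    "bagged trees": RandomForestClassifier(
        n_estimators=100, max_features=None, random_state=0
    ),
    # Every split considers a random subset of features (Random Forest).
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=0
    ),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```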
Choosing Number of Trees
Parameter:
n_estimators
Example:
100 Trees
More trees generally improve stability but increase computation.
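A simple way to explore this trade-off is to fit forests of different sizes and compare their out-of-bag accuracy. The tree counts and the synthetic dataset below are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

oob_by_size = {}
for n_trees in (25, 100, 300):
    model = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=0
    )
    model.fit(X, y)
    # oob_score_ is the accuracy measured on each tree's out-of-bag rows.
    oob_by_size[n_trees] = model.oob_score_
    print(f"{n_trees} trees: OOB accuracy {model.oob_score_:.3f}")
```

Typically the score settles as the forest grows, while training time keeps increasing with the number of trees.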
Choosing Maximum Depth
Parameter:
max_depth
Controls tree complexity.
Helps prevent overfitting.
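The effect of `max_depth` can be inspected by looking at how deep the fitted trees actually grow. This sketch uses synthetic data, and the helper `deepest_tree` is a hypothetical name introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_depth=None lets each tree grow until its leaves are pure (the default).
unlimited = RandomForestClassifier(
    n_estimators=50, max_depth=None, random_state=0
).fit(X, y)
# max_depth=3 caps every tree at three levels of splits.
shallow = RandomForestClassifier(
    n_estimators=50, max_depth=3, random_state=0
).fit(X, y)


def deepest_tree(forest):
    """Return the depth of the deepest tree in a fitted forest."""
    return max(tree.get_depth() for tree in forest.estimators_)


print("max_depth=None:", deepest_tree(unlimited))
print("max_depth=3:   ", deepest_tree(shallow))
```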
Python Implementation
Import:
from sklearn.ensemble import RandomForestClassifier
Create Model:
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
Train:
model.fit(X_train, y_train)
Predict:
predictions = model.predict(X_test)
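The steps above assume that `X_train`, `X_test`, and `y_train` already exist. A self-contained version might look like the following, where the synthetic dataset from `make_classification` and the 80/20 split are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for a real dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```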
Out-of-Bag Evaluation
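model = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,
    random_state=42
)
Train:
# oob_score_ is only available after the model has been fitted
model.fit(X_train, y_train)
View Score:
print(model.oob_score_)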
Feature Importance
print(model.feature_importances_)
Real-World Applications
Healthcare
Disease diagnosis.
Finance
Credit scoring.
Fraud Detection
Transaction monitoring.
E-Commerce
Purchase prediction.
Marketing
Customer churn prediction.
Manufacturing
Equipment failure prediction.
Common Mistakes
Using Too Few Trees
Performance may become unstable.
Ignoring Hyperparameter Tuning
Depth and tree count matter.
Assuming Feature Importance Means Causation
Importance indicates usefulness, not causality.
Using Random Forest When Explainability is Critical
Single trees may be preferable.
Best Practices
- Use sufficient trees
- Monitor OOB score
- Tune maximum depth
- Analyze feature importance
- Validate performance on unseen data
Random Forest Summary
| Component | Purpose |
|---|---|
| Bootstrap Sampling | Dataset Diversity |
| Multiple Trees | Reduce Variance |
| Random Features | Tree Diversity |
| Voting | Classification |
| Averaging | Regression |
| OOB Samples | Validation |
Random Forest Workflow Summary
- Create bootstrap samples
- Train multiple trees
- Randomly select features at each split
- Generate predictions
- Vote or average
- Produce final prediction
- Evaluate performance
Why Random Forest is Important
Random Forest is one of the most widely used Machine Learning algorithms because it combines the simplicity of Decision Trees with the power of ensemble learning. By training many diverse trees and aggregating their predictions, it achieves strong performance while reducing overfitting and improving robustness.
Its ability to handle classification and regression tasks, provide feature importance estimates, and perform well with minimal tuning has made it a standard tool in both industry and research.
In the next article, we will study Boosting Intuition, a fundamentally different ensemble technique where models are trained sequentially and each new model focuses on correcting the mistakes made by previous models.