Gradient Boosting in Machine Learning

Last updated: Jun 14, 2026

Author: Christy Harshitha Dakarapu

In the previous article, we learned about AdaBoost, where each new model focuses more on samples that were incorrectly classified by previous models.

AdaBoost was revolutionary, but researchers wanted a more general and powerful framework.

This led to:

Gradient Boosting

Gradient Boosting is one of the most widely used machine learning algorithms for structured (tabular) data and forms the foundation of:

XGBoost
LightGBM
CatBoost

Unlike AdaBoost, which focuses on sample weights, Gradient Boosting focuses on:


Prediction Errors

and continuously tries to correct them.

Why Do We Need Gradient Boosting?

Suppose we want to predict house prices.

Actual Prices:


₹50L
₹60L
₹70L

Predictions:


₹45L
₹58L
₹80L

Errors (Actual - Predicted):


+5L
+2L
-10L

The predictions are not perfect.

What if we train another model specifically to predict these errors?

That is the core idea behind Gradient Boosting.

What is Gradient Boosting?

Gradient Boosting is an ensemble learning technique where each new model is trained to correct the errors made by previous models.

Workflow:


Model 1
    ↓
Calculate Errors
    ↓
Model 2 Learns Errors
    ↓
Calculate Remaining Errors
    ↓
Model 3 Learns Remaining Errors

Over time:


Errors Become Smaller

Predictions improve.

Intuition: Learning From Mistakes

Imagine throwing darts at a target.

First Throw:


Misses Center

Second Throw:


Adjust Based on Error

Third Throw:


Adjust Again

Each attempt moves closer to the target.

Gradient Boosting follows the same principle.

Understanding Residuals

Residuals are simply prediction errors.

Formula:

$Residual=Actual-Predicted$

Example

Actual Price:


₹60L

Predicted Price:


₹55L

Residual:

60-55 = 5

Error:

+5
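As a quick check of the formula, here is a minimal sketch that applies Residual = Actual - Predicted to the three houses from the earlier example. The variable names are illustrative only.

```python
# Prices in lakhs, taken from the house-price example earlier in the article.
actual = [50, 60, 70]
predicted = [45, 58, 80]

# Residual = Actual - Predicted, computed per sample.
residuals = [a - p for a, p in zip(actual, predicted)]

print(residuals)  # [5, 2, -10]
```

A positive residual means the model predicted too low; a negative one means it predicted too high.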

Why Residuals Matter

Residuals tell us:


How Wrong
The Model Is

If we can predict residuals,

we can improve predictions.

Step 1: Train Initial Model

Gradient Boosting begins with a very simple prediction.

For regression with squared-error loss, this is usually:


Mean of Target Variable

Example

House Prices:


50
60
70

Mean:

60

Initial Prediction:


60
60
60

for every sample.

Step 2: Calculate Residuals

Actual:


50
60
70

Prediction:


60
60
60

Residuals:


-10
0
10

These residuals become the target for the next model.
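Steps 1 and 2 can be reproduced in a few lines. This is a minimal sketch using NumPy, with the three prices from the example; the variable names are assumptions made for illustration.

```python
import numpy as np

# House prices from the example above.
y = np.array([50.0, 60.0, 70.0])

# Step 1: the initial model predicts the mean for every sample.
initial_prediction = np.full(len(y), y.mean())

# Step 2: residuals are what the initial model still gets wrong.
residuals = y - initial_prediction

print(initial_prediction.tolist())  # [60.0, 60.0, 60.0]
print(residuals.tolist())           # [-10.0, 0.0, 10.0]
```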

Step 3: Train a New Tree

The second tree learns:


Residuals

instead of original labels.

Its job is:


Predict Errors

Step 4: Update Predictions

New Prediction:

$New\ Prediction=Old\ Prediction+Residual\ Prediction$

Predictions become more accurate.
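Steps 3 and 4 might look like the sketch below. It assumes scikit-learn's DecisionTreeRegressor as the tree and invents a single feature (house size) purely for illustration, since the article lists only the prices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical feature (house size in sq ft); only the prices come from the article.
X = np.array([[1000], [1500], [2000]])
y = np.array([50.0, 60.0, 70.0])

old_prediction = np.full(len(y), y.mean())  # Step 1: 60 for every house
residuals = y - old_prediction              # Step 2: -10, 0, 10

# Step 3: a depth-1 tree is trained on the residuals, not on the prices.
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, residuals)

# Step 4: New Prediction = Old Prediction + Residual Prediction
new_prediction = old_prediction + tree.predict(X)

mse_before = np.mean((y - old_prediction) ** 2)
mse_after = np.mean((y - new_prediction) ** 2)
print(mse_before, mse_after)  # the error after the update is smaller
```

A single shallow tree cannot remove all of the error here, which is why the process is repeated.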

Step 5: Repeat

After updating predictions:


Calculate New Residuals

Train another tree.

Repeat.

Each tree fixes remaining mistakes.
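The five steps can be put together in a short loop. The following is a simplified sketch, not a production implementation: it assumes squared-error loss, uses scikit-learn's DecisionTreeRegressor for each tree, and trains on synthetic data. The function names fit_boosted_trees and predict_boosted are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=5, max_depth=2):
    """Fit trees one after another, each on the residuals left by the previous ones."""
    base = y.mean()                          # Step 1: initial prediction
    prediction = np.full(len(y), base)
    trees, mse_history = [], []
    for _ in range(n_rounds):
        residuals = y - prediction           # Step 2: current errors
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)               # Step 3: tree learns the errors
        prediction = prediction + tree.predict(X)  # Step 4: update predictions
        trees.append(tree)
        mse_history.append(float(np.mean((y - prediction) ** 2)))
    return base, trees, mse_history          # Step 5 is the loop itself


def predict_boosted(base, trees, X):
    """Final prediction = initial prediction + sum of all tree corrections."""
    return base + sum(tree.predict(X) for tree in trees)


# Synthetic data, used only to exercise the loop.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

base, trees, mse_history = fit_boosted_trees(X, y)
print([round(m, 4) for m in mse_history])  # training error after each round
```

The training error recorded after each round should not increase, because every tree is fit to whatever error the previous trees left behind.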

Visualizing Gradient Boosting


Tree 1
   ↓
Residuals
   ↓
Tree 2
   ↓
Residuals
   ↓
Tree 3
   ↓
Residuals
   ↓
Final Model

Why Is It Called Gradient Boosting?

The name comes from:


Gradient Descent

Recall from optimization:

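Gradient Descent minimizes a loss by repeatedly taking a small step in the direction of the negative gradient, the direction in which the loss decreases fastest.

Gradient Boosting applies a similar idea.

Each tree is fit to the negative gradient of the loss with respect to the current predictions, so adding it moves the predictions in the direction that reduces the loss. For squared-error loss, that negative gradient is simply the residual.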

Relationship with Gradient Descent

Gradient Descent:


Update Parameters

Gradient Boosting:


Add New Trees

Both aim to minimize loss.
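A short derivation makes the connection concrete. It assumes squared-error loss with a conventional factor of 1/2, and introduces symbols that do not appear elsewhere in this article: $F$ for the current prediction, $h_m$ for the tree added at round $m$, $\theta$ for model parameters, and $\alpha$ for the gradient descent step size.

```latex
% Squared-error loss for one sample, with F the current prediction
L(y, F) = \tfrac{1}{2}\,(y - F)^2

% Negative gradient with respect to the prediction F
-\frac{\partial L}{\partial F} = y - F = \text{Residual}

% Gradient descent updates parameters; gradient boosting adds a tree fit to the residuals
\theta \leftarrow \theta - \alpha\,\nabla_\theta L
\qquad\text{vs.}\qquad
F_m(x) = F_{m-1}(x) + h_m(x)
```

Under this loss, training a tree on residuals is the same as training it on the negative gradient, which is where the name comes from.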

Learning Rate

Instead of applying corrections completely,

Gradient Boosting often applies only a fraction.

Formula:

$New\ Prediction=Old\ Prediction+\eta\times Residual\ Prediction$

Where $\eta$ is the learning rate.

Why Use a Learning Rate?

Without control:


Large Corrections

may cause overfitting.

Learning rate creates gradual improvement.

Example

Learning Rate:

0.1

Only 10% of the correction is applied.

More trees are needed to reach the same training error, so training becomes slower, but the final model often generalizes better.
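The arithmetic behind the learning rate formula can be traced with plain Python. This sketch assumes, for illustration only, that each tree predicts the remaining residual exactly; real trees only approximate it.

```python
learning_rate = 0.1   # eta
actual = 70.0
old_prediction = 60.0

# Assumption for illustration: the tree predicts the residual exactly.
residual_prediction = actual - old_prediction  # 10.0

# New Prediction = Old Prediction + eta * Residual Prediction
new_prediction = old_prediction + learning_rate * residual_prediction
print(new_prediction)  # 61.0

# Repeating the same update shrinks the remaining error by 10% each round.
prediction = old_prediction
remaining = []
for _ in range(20):
    prediction += learning_rate * (actual - prediction)
    remaining.append(actual - prediction)

print(round(remaining[0], 3), round(remaining[-1], 3))
```

Each round removes only a fraction of the error, so many rounds are needed, but no single tree can pull the prediction far off course.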

Why Small Trees Are Used

Gradient Boosting usually uses:


Shallow Decision Trees

Examples:


Depth = 3
Depth = 4
Depth = 5

Each tree learns a small correction.

Classification in Gradient Boosting

Gradient Boosting is not limited to regression.

It also works for:

Spam Detection
Fraud Detection
Disease Prediction

Instead of raw residuals, each tree is fit to the negative gradient of a classification loss such as log loss. These values are often called pseudo-residuals.
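For binary classification with log loss, the quantity each tree learns works out to the label minus the predicted probability. The sketch below shows the first round only, with made-up labels and the common choice of starting from the log-odds of the positive class; the variable names are illustrative.

```python
import numpy as np

# Hypothetical labels: 1 = spam, 0 = not spam.
y = np.array([1, 0, 0, 1, 0])

# Initial raw score: the log-odds of the positive class, same for every sample.
p0 = y.mean()
raw_score = np.full(len(y), np.log(p0 / (1 - p0)))

# Convert raw scores to probabilities with the sigmoid function.
prob = 1.0 / (1.0 + np.exp(-raw_score))

# Negative gradient of log loss with respect to the raw score: label - probability.
pseudo_residuals = y - prob

print(np.round(pseudo_residuals, 2).tolist())  # [0.6, -0.4, -0.4, 0.6, -0.4]
```

The next tree would be trained on these values, just as a regression tree is trained on ordinary residuals.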

Example: Spam Detection

Tree 1:

Detects obvious spam.

Tree 2:

Corrects mistakes.

Tree 3:

Handles difficult emails.

Combined prediction improves.

Example: Customer Churn

Tree 1:

Predicts likely churners.

Tree 2:

Focuses on missed customers.

Tree 3:

Improves difficult cases.

Gradient Boosting Workflow


Initial Prediction
        ↓
Calculate Residuals
        ↓
Train Tree
        ↓
Update Predictions
        ↓
Calculate New Residuals
        ↓
Repeat

Why Gradient Boosting Works

Each tree specializes in:


Remaining Errors

Rather than repeating the same work.

As a result:


Bias Decreases

and performance improves.

Important Hyperparameters

n_estimators

Number of trees.

Example:


100
200
500

learning_rate

Controls update size.

Example:


0.1
0.05
0.01

max_depth

Controls tree complexity.

Example:


3
4
5
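One way to see how n_estimators interacts with the other two settings is to track validation error as trees are added. This is a sketch on synthetic data, using scikit-learn's GradientBoostingRegressor and its staged_predict method; the dataset and the specific values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in for a real dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 200 trees.
val_errors = [
    mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)
]
best_n = int(np.argmin(val_errors)) + 1
print(best_n, val_errors[best_n - 1])
```

If the best number of trees is well below n_estimators, the extra trees are not helping on unseen data.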

Advantages of Gradient Boosting

High Predictive Accuracy

Often outperforms many traditional algorithms.

Handles Complex Relationships

Captures non-linear patterns.

Flexible

Works for regression and classification.

Feature Importance

Provides feature importance scores.

Strong Foundation

Basis for XGBoost, LightGBM, and CatBoost.

Limitations of Gradient Boosting

Slower Training

Trees are built sequentially.

Sensitive to Hyperparameters

Requires tuning.

Can Overfit

If too many trees are used.

Less Parallelizable

Compared to Random Forest.

Gradient Boosting vs AdaBoost

AdaBoost	Gradient Boosting
Focuses on Sample Weights	Focuses on Residuals
Mainly Classification	Classification & Regression
Simpler	More Flexible
Earlier Algorithm	More General Framework

Gradient Boosting vs Random Forest

Random Forest	Gradient Boosting
Parallel Trees	Sequential Trees
Reduces Variance	Reduces Bias
Faster Training	Slower Training
More Robust	Often More Accurate

Python Implementation

Import:


from sklearn.ensemble import GradientBoostingClassifier

Create Model:


model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3
)

Train:


model.fit(X_train, y_train)

Predict:


predictions = model.predict(X_test)

Feature Importance


print(model.feature_importances_)
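The snippets above assume that X_train, y_train, and X_test already exist. A minimal end-to-end sketch, using a synthetic dataset from make_classification as a stand-in for real data, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a placeholder for a real dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,  # fixed seed so repeated runs give the same model
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy:", accuracy)

# One importance score per feature; the scores are normalized to sum to 1.
print(model.feature_importances_)
```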

Real-World Applications

Healthcare

Disease diagnosis.

Finance

Credit scoring.

Fraud Detection

Suspicious transaction detection.

Marketing

Customer response prediction.

E-Commerce

Purchase prediction.

Recommendation Systems

Personalized recommendations.

Common Mistakes

Using Too Many Trees

May increase overfitting.

Using Large Learning Rates

Each correction is applied at close to full strength, which can make training unstable and cause overfitting.

Ignoring Validation Data

Without held-out validation data, overfitting goes unnoticed and hyperparameters cannot be tuned reliably.

Using Deep Trees

Often leads to overfitting.

Best Practices

Start with shallow trees
Use small learning rates
Monitor validation performance
Tune hyperparameters carefully
Use early stopping when possible
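Several of these practices can be combined in scikit-learn. The sketch below relies on the n_iter_no_change and validation_fraction parameters of GradientBoostingClassifier for early stopping; the dataset is synthetic and the specific values are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data as a stand-in for a real dataset.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper limit on the number of trees
    learning_rate=0.05,       # small learning rate
    max_depth=3,              # shallow trees
    validation_fraction=0.2,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
model.fit(X, y)

# n_estimators_ reports how many trees were actually fit before stopping.
print("Trees used:", model.n_estimators_)
```

Setting a generous upper limit and letting the validation score decide when to stop avoids guessing the number of trees in advance.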

Gradient Boosting Summary

Concept	Purpose
Residuals	Measure Errors
Sequential Trees	Learn Corrections
Learning Rate	Control Updates
Shallow Trees	Incremental Learning
Boosting	Reduce Bias

Gradient Boosting Workflow Summary

Make initial prediction
Calculate residuals
Train tree on residuals
Update predictions
Calculate new residuals
Repeat many times
Combine all trees
Generate final prediction

Why Gradient Boosting is Important

Gradient Boosting transformed boosting from a simple weighting strategy into a powerful optimization framework. By repeatedly fitting models to residual errors and progressively improving predictions, it achieves remarkable performance on structured data.

Understanding Gradient Boosting is essential because it forms the foundation of modern state-of-the-art algorithms such as XGBoost, LightGBM, and CatBoost, which dominate many real-world machine learning applications and competitions.

In the next article, we will study XGBoost (Extreme Gradient Boosting), the algorithm that made Gradient Boosting faster, more scalable, and one of the most successful machine learning techniques ever developed.