In the previous article, we learned about AdaBoost, where each new model focuses more on samples that were incorrectly classified by previous models.

AdaBoost was revolutionary, but researchers wanted a more general and powerful framework.

This led to:

Gradient Boosting

Gradient Boosting is one of the most widely used machine learning algorithms for structured data and forms the foundation of:

  • XGBoost
  • LightGBM
  • CatBoost

Unlike AdaBoost, which focuses on sample weights, Gradient Boosting focuses on:

Prediction Errors

and continuously tries to correct them.

Why Do We Need Gradient Boosting?

Suppose we want to predict house prices.

Actual Prices:

₹50L
₹60L
₹70L

Predictions:

₹45L
₹58L
₹80L

Errors (Actual - Predicted):

+₹5L
+₹2L
-₹10L

The predictions are not perfect.

What if we train another model specifically to predict these errors?

That is the core idea behind Gradient Boosting.

What is Gradient Boosting?

Gradient Boosting is an ensemble learning technique where each new model is trained to correct the errors made by previous models.

Workflow:

Model 1

Calculate Errors

Model 2 Learns Errors

Calculate Remaining Errors

Model 3 Learns Remaining Errors

Over time:

Errors Become Smaller

Predictions improve.

Intuition: Learning From Mistakes

Imagine throwing darts at a target.

First Throw:

Misses Center

Second Throw:

Adjust Based on Error

Third Throw:

Adjust Again

Each attempt moves closer to the target.

Gradient Boosting follows the same principle.

Understanding Residuals

Residuals are simply prediction errors.

Formula:

$$\text{Residual} = \text{Actual} - \text{Predicted}$$

Example

Actual Price:

₹60L

Predicted Price:

₹55L

Residual:

$$60 - 55 = 5$$

Error:

+5

Why Residuals Matter

Residuals tell us:

How Wrong the Model Is

If we can predict residuals, we can improve predictions.

Step 1: Train Initial Model

Gradient Boosting begins with a very simple prediction.

For regression, this is usually:

Mean of Target Variable

Example

House Prices:

50
60
70

Mean:

60

Initial Prediction:

60
60
60

for every sample.

Step 2: Calculate Residuals

Actual:

50
60
70

Prediction:

60
60
60

Residuals:

-10
0
10

These residuals become the target for the next model.
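Steps 1 and 2 can be reproduced in a few lines of plain Python. This is a minimal sketch using the three house prices from the example; the variable names are chosen here for illustration only.

```python
# House prices from the example above (in lakhs of rupees).
actual = [50, 60, 70]

# Step 1: the initial prediction is the mean of the target, repeated for every sample.
initial_prediction = sum(actual) / len(actual)
predictions = [initial_prediction] * len(actual)

# Step 2: residual = actual - predicted, computed per sample.
residuals = [a - p for a, p in zip(actual, predictions)]

print(initial_prediction)  # 60.0
print(residuals)           # [-10.0, 0.0, 10.0]
```

The `residuals` list is what the next tree would be trained to predict.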

Step 3: Train a New Tree

The second tree learns:

Residuals

instead of original labels.

Its job is:

Predict Errors

Step 4: Update Predictions

New Prediction:

$$\text{New Prediction} = \text{Old Prediction} + \text{Residual Prediction}$$

Predictions become more accurate.

Step 5: Repeat

After updating predictions:

Calculate New Residuals

Train another tree.

Repeat.

Each tree fixes remaining mistakes.
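Steps 1 to 5 can be sketched as a short training loop. The code below is an illustrative toy, not a production implementation: it assumes squared-error regression, uses scikit-learn's `DecisionTreeRegressor` as the tree learner, applies each correction in full as in Step 4, and runs on synthetic data. The function names `fit_boosted_trees` and `predict_boosted_trees` are hypothetical helpers made up for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=20, max_depth=2):
    """Toy gradient boosting for squared error: each tree is fit to the residuals."""
    base = float(np.mean(y))                  # Step 1: start from the mean
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # Step 2: current errors
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)                # Step 3: the tree learns the residuals
        prediction = prediction + tree.predict(X)  # Step 4: update predictions
        trees.append(tree)                    # Step 5: repeat
    return base, trees


def predict_boosted_trees(base, trees, X):
    """Final prediction = initial prediction + sum of all tree corrections."""
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction = prediction + tree.predict(X)
    return prediction


# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

base, trees = fit_boosted_trees(X, y)
mse_start = float(np.mean((y - base) ** 2))
mse_end = float(np.mean((y - predict_boosted_trees(base, trees, X)) ** 2))
print(mse_start, mse_end)  # training error before and after boosting
```

The training error after boosting is lower than the error of the mean-only prediction, because every tree removes part of what the previous trees left unexplained.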

Visualizing Gradient Boosting

Tree 1

Residuals

Tree 2

Residuals

Tree 3

Residuals

Final Model

Why Is It Called Gradient Boosting?

The name comes from:

Gradient Descent

Recall from optimization:

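Gradient Descent minimizes a loss by repeatedly stepping in the direction of the negative gradient, the direction in which the loss decreases fastest.

Gradient Boosting applies the same idea to predictions instead of parameters.

Each tree is fit to the negative gradient of the loss with respect to the current predictions. For squared-error loss (with the conventional factor of 1/2), that negative gradient is exactly the residual, so fitting residuals is a gradient step that reduces error.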

Relationship with Gradient Descent

Gradient Descent:

Update Parameters

Gradient Boosting:

Add New Trees

Both aim to minimize loss.
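One way to see the connection is to check numerically that, for squared-error loss, the residual equals the negative gradient of the loss with respect to the prediction. The sketch below assumes the loss is written as 0.5 × (actual - predicted)², and reuses the ₹60L / ₹55L example; the function names are illustrative.

```python
def squared_loss(actual, predicted):
    """Squared-error loss with the conventional factor of 1/2."""
    return 0.5 * (actual - predicted) ** 2


def numerical_gradient(actual, predicted, eps=1e-6):
    """Central-difference estimate of d(loss) / d(predicted)."""
    upper = squared_loss(actual, predicted + eps)
    lower = squared_loss(actual, predicted - eps)
    return (upper - lower) / (2 * eps)


actual, predicted = 60.0, 55.0
residual = actual - predicted
negative_gradient = -numerical_gradient(actual, predicted)

print(residual)                     # 5.0
print(round(negative_gradient, 4))  # 5.0
```

Gradient Descent would use this gradient to nudge a parameter. Gradient Boosting instead trains a tree to predict it and adds that tree to the model.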

Learning Rate

Instead of applying corrections completely, Gradient Boosting often applies only a fraction.

Formula:

$$\text{New Prediction} = \text{Old Prediction} + \eta \times \text{Residual Prediction}$$

where $\eta$ is the learning rate.

Why Use a Learning Rate?

Without control:

Large Corrections

may cause overfitting.

Learning rate creates gradual improvement.

Example

Learning Rate:

0.1

Only 10% of the correction is applied.

More trees are needed to reach the same fit, so training is slower, but the final model often generalizes better.
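The effect of a learning rate of 0.1 can be traced by hand. The sketch below makes a simplifying assumption that no real tree satisfies: each tree predicts the current residual exactly. Under that assumption, a sample with actual value 70 and initial prediction 60 moves toward the target by 10% of the remaining error in each round.

```python
actual = 70.0
prediction = 60.0
learning_rate = 0.1

history = []
for round_number in range(1, 4):
    residual = actual - prediction
    # Assumption for illustration: the tree predicts the residual exactly.
    residual_prediction = residual
    # Apply only a fraction (learning_rate) of the correction.
    prediction = prediction + learning_rate * residual_prediction
    history.append(round(prediction, 2))

print(history)  # [61.0, 61.9, 62.71]
```

Each round closes a tenth of the remaining gap, which is why a small learning rate is normally paired with a larger number of trees.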

Why Small Trees Are Used

Gradient Boosting usually uses:

Shallow Decision Trees

Examples:

Depth = 3
Depth = 4
Depth = 5

Each tree learns a small correction.

Classification in Gradient Boosting

Gradient Boosting is not limited to regression.

It also works for:

  • Spam Detection
  • Fraud Detection
  • Disease Prediction

Instead of raw residuals, each tree fits the negative gradient of a classification loss such as log loss. These gradients are often called pseudo-residuals.
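For binary classification with log loss, the quantity each tree is trained on works out to the label minus the predicted probability. The sketch below illustrates this under stated assumptions: labels are 0 or 1, the model keeps a raw score (log-odds) per sample, and every score starts at 0 for simplicity, which is not necessarily how a given library initializes it.

```python
import math


def sigmoid(score):
    """Convert a raw log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))


labels = [1, 0, 1]        # 1 = spam, 0 = not spam
scores = [0.0, 0.0, 0.0]  # assumed starting scores (log-odds)

probabilities = [sigmoid(s) for s in scores]

# Negative gradient of log loss with respect to the score: label - probability.
pseudo_residuals = [y - p for y, p in zip(labels, probabilities)]

print(probabilities)     # [0.5, 0.5, 0.5]
print(pseudo_residuals)  # [0.5, -0.5, 0.5]
```

A positive value pushes the score for that email up (more likely spam); a negative value pushes it down.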

Example: Spam Detection

Tree 1:

Detects obvious spam.

Tree 2:

Corrects mistakes.

Tree 3:

Handles difficult emails.

Combined prediction improves.

Example: Customer Churn

Tree 1:

Predicts likely churners.

Tree 2:

Focuses on missed customers.

Tree 3:

Improves difficult cases.

Gradient Boosting Workflow

Initial Prediction

Calculate Residuals

Train Tree

Update Predictions

Calculate New Residuals

Repeat

Why Gradient Boosting Works

Each tree specializes in:

Remaining Errors

Rather than repeating the same work.

As a result:

Bias Decreases

and performance improves.

Important Hyperparameters

n_estimators

Number of trees.

Example:

100
200
500

learning_rate

Controls update size.

Example:

0.1
0.05
0.01

max_depth

Controls tree complexity.

Example:

3
4
5
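These three hyperparameters interact, so they are usually tuned together. One possible approach is a small cross-validated grid search with scikit-learn's `GridSearchCV`. The dataset here is synthetic and the grid values are only examples taken from the ranges above, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A deliberately small grid to keep the search fast.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation for each combination
)
search.fit(X, y)

print(search.best_params_)  # best combination found on this data
print(search.best_score_)   # its mean cross-validated accuracy
```

A lower `learning_rate` generally needs a higher `n_estimators`, which is why the two are searched jointly rather than one at a time.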

Advantages of Gradient Boosting

High Predictive Accuracy

Often outperforms many traditional algorithms.

Handles Complex Relationships

Captures non-linear patterns.

Flexible

Works for regression and classification.

Feature Importance

Provides feature importance scores.

Strong Foundation

Basis for XGBoost, LightGBM, and CatBoost.

Limitations of Gradient Boosting

Slower Training

Trees are built sequentially.

Sensitive to Hyperparameters

Requires tuning.

Can Overfit

If too many trees are used.

Less Parallelizable

Compared to Random Forest.
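The overfitting risk from using too many trees can be checked by measuring validation error after each added tree. The sketch below uses `staged_predict` from scikit-learn's `GradientBoostingRegressor` on synthetic data; the dataset and settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with some noise.
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields predictions after 1 tree, 2 trees, ..., 200 trees.
val_errors = [
    mean_squared_error(y_val, stage_prediction)
    for stage_prediction in model.staged_predict(X_val)
]

best_n_trees = int(np.argmin(val_errors)) + 1
print(best_n_trees, min(val_errors))
```

If the validation error stops improving well before the last tree, the extra trees are adding training cost without improving generalization.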

Gradient Boosting vs AdaBoost

| AdaBoost | Gradient Boosting |
|---|---|
| Focuses on Sample Weights | Focuses on Residuals |
| Mainly Classification | Classification & Regression |
| Simpler | More Flexible |
| Earlier Algorithm | More General Framework |

Gradient Boosting vs Random Forest

| Random Forest | Gradient Boosting |
|---|---|
| Parallel Trees | Sequential Trees |
| Reduces Variance | Reduces Bias |
| Faster Training | Slower Training |
| More Robust | Often More Accurate |

Python Implementation

Import:

from sklearn.ensemble import GradientBoostingClassifier

Create Model:

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3
)

Train:

model.fit(X_train, y_train)

Predict:

predictions = model.predict(X_test)

Feature Importance

print(model.feature_importances_)
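The snippets above assume that `X_train`, `y_train`, and `X_test` already exist. For a version that runs end to end, the sketch below generates a synthetic dataset with `make_classification` and splits it with `train_test_split`; the dataset and the accuracy check are additions for illustration, not part of the original steps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (8 features, 4 of them informative).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=0,
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")

# One importance score per feature; the scores are normalized to sum to 1.
importances = model.feature_importances_
for index, score in enumerate(importances):
    print(f"feature {index}: {score:.3f}")
```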

Real-World Applications

Healthcare

Disease diagnosis.

Finance

Credit scoring.

Fraud Detection

Suspicious transaction detection.

Marketing

Customer response prediction.

E-Commerce

Purchase prediction.

Recommendation Systems

Personalized recommendations.

Common Mistakes

Using Too Many Trees

May increase overfitting.

Using Large Learning Rates

Can make training unstable.

Ignoring Validation Data

Without a held-out validation set, overfitting and poor hyperparameter choices go unnoticed.

Using Deep Trees

Often leads to overfitting.

Best Practices

  • Start with shallow trees
  • Use small learning rates
  • Monitor validation performance
  • Tune hyperparameters carefully
  • Use early stopping when possible
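Several of these practices can be combined in scikit-learn through the `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingClassifier`, which hold out part of the training data and stop adding trees once the validation score stops improving. The sketch below uses synthetic data, and the specific values are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper limit on the number of trees
    learning_rate=0.05,       # small learning rate
    max_depth=3,              # shallow trees
    validation_fraction=0.2,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X, y)

# n_estimators_ is the number of trees actually fitted before stopping.
print(model.n_estimators_)
```

Setting a generous `n_estimators` and letting early stopping choose the actual count avoids having to guess the right number of trees in advance.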

Gradient Boosting Summary

| Concept | Purpose |
|---|---|
| Residuals | Measure Errors |
| Sequential Trees | Learn Corrections |
| Learning Rate | Control Updates |
| Shallow Trees | Incremental Learning |
| Boosting | Reduce Bias |

Gradient Boosting Workflow Summary

  1. Make initial prediction
  2. Calculate residuals
  3. Train tree on residuals
  4. Update predictions
  5. Calculate new residuals
  6. Repeat many times
  7. Combine all trees
  8. Generate final prediction

Why Gradient Boosting is Important

Gradient Boosting transformed boosting from a simple weighting strategy into a powerful optimization framework. By repeatedly fitting models to residual errors and progressively improving predictions, it achieves remarkable performance on structured data.

Understanding Gradient Boosting is essential because it forms the foundation of modern state-of-the-art algorithms such as XGBoost, LightGBM, and CatBoost, which dominate many real-world machine learning applications and competitions.

In the next article, we will study XGBoost (Extreme Gradient Boosting), the algorithm that made Gradient Boosting faster, more scalable, and one of the most successful machine learning techniques ever developed.