Gradient Descent in Machine Learning

Last updated: Jun 12, 2026

Author: Christy Harshitha Dakarapu

In the previous article, we learned about Cost Functions and how they measure the error made by a Machine Learning model.

We also learned that the goal of training is:

Find the model parameters that minimize the cost function.

A natural question now arises:

How does the model actually find those optimal parameters?

Imagine trying to find the lowest point in a large mountain range while blindfolded.

You cannot see the entire landscape.

You can only determine whether the ground slopes upward or downward at your current location.

By repeatedly moving downhill, you eventually reach the valley.

This is exactly how Gradient Descent works.

Gradient Descent is one of the most important optimization algorithms in Machine Learning. It is used to train:

Linear Regression
Logistic Regression
Neural Networks
Deep Learning Models
Recommendation Systems
Many Advanced ML Algorithms

In this article, we will develop a strong intuition for Gradient Descent, understand how it minimizes cost functions, and learn about different variants used in modern Machine Learning.

What is Gradient Descent?

Gradient Descent is an optimization algorithm used to minimize a cost function by iteratively updating model parameters.

Goal:


Reduce Error
      ↓
Reduce Cost
      ↓
Improve Model

The algorithm repeatedly adjusts model parameters until the cost becomes as small as possible.

Why Do We Need Gradient Descent?

Suppose we have a regression equation:

$y = mx + b$

Question:

What values of:

m (slope)
b (intercept)

produce the best predictions?

There are infinitely many possible combinations.

Gradient Descent helps us find the optimal values efficiently.
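To make the search concrete, the sketch below scores a few candidate (m, b) pairs by their mean squared error. The toy data and the candidate values are made up for illustration and do not come from a real dataset.

```python
# Hypothetical toy data generated from y = 2x + 1 (an assumption for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def mse(m, b):
    """Mean squared error of the line y = m*x + b on the toy data."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# A handful of guesses; trying every possible combination is not feasible
candidates = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0), (2.0, 1.0)]
for m, b in candidates:
    print(f"m={m}, b={b}, cost={mse(m, b):.2f}")
```

Guessing like this does not scale, which is why a systematic procedure for moving toward lower cost is needed.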

Mountain Analogy

Imagine standing on a mountain.

Goal:

Reach the lowest valley.

Current Position:


       Peak
         *
        / \
       /   \
      /     \
     /       \
    *         \
   You

You cannot see the entire mountain.

However, you can feel which direction slopes downward.

So you:

Take a step downhill.
Check again.
Take another step downhill.

Eventually:

You reach the lowest point.

Gradient Descent follows exactly the same idea.

Understanding the Cost Landscape

Suppose:

Cost depends on parameter values.

Graph:


Cost
 ^
 |
 | *           *
 |   *       *
 |     *   *
 |       *
 +-------------------->
      Parameter

Some parameter values produce high costs.

Others produce lower costs.

The goal is to reach the minimum point.

What is a Gradient?

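A gradient measures:

How quickly the cost changes as a parameter changes, and in which direction.

Mathematically:

The gradient is the derivative of the cost function with respect to the parameters.

If:

Gradient is positive

Move left (decrease the parameter).

If:

Gradient is negative

Move right (increase the parameter).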

The gradient tells us the direction of steepest increase.

Gradient Descent moves in the opposite direction.
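A small sketch of this rule, assuming a hypothetical one-parameter cost with its minimum at theta = 3. The derivative is estimated with a central difference, which is only one of several ways to obtain a gradient.

```python
def cost(theta):
    # Hypothetical cost function, chosen so the minimum is at theta = 3
    return (theta - 3.0) ** 2

def numerical_gradient(f, theta, h=1e-6):
    # Central difference approximation of the derivative of f at theta
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

for theta in (5.0, 1.0):
    g = numerical_gradient(cost, theta)
    # Positive gradient: cost rises to the right, so step left (and vice versa)
    move = "left (decrease theta)" if g > 0 else "right (increase theta)"
    print(f"theta={theta}: gradient={g:+.3f}, move {move}")
```

To the right of the minimum the gradient is positive, and to the left it is negative, so stepping against the sign of the gradient heads toward the minimum from either side.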

Why Move Opposite to the Gradient?

The gradient points uphill.

We want to go downhill.

Therefore:


Gradient → Uphill

Gradient Descent → Downhill

The Core Idea

Repeat:

Calculate gradient
Move opposite to gradient
Recalculate cost
Repeat

Eventually:

Reach minimum cost.
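The loop above can be sketched in a few lines. The cost function, the starting point, and the fixed step size of 0.1 are all assumptions chosen for illustration.

```python
def cost(theta):
    # Hypothetical cost with its minimum at theta = 3
    return (theta - 3.0) ** 2

def gradient(theta):
    # Analytic derivative of (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0       # arbitrary starting point
step_size = 0.1   # assumed fixed step size

for step in range(1, 26):
    theta -= step_size * gradient(theta)   # move opposite to the gradient
    if step % 5 == 0:
        # Recalculate the cost to watch it fall
        print(f"step {step:2d}: theta={theta:.4f}, cost={cost(theta):.6f}")
```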

Cost Function Reminder

For Linear Regression:

$J(\theta)$

represents the cost.

Goal:

$\min_{\theta} J(\theta)$

Gradient Descent helps achieve this.

Gradient Descent Update Rule

The fundamental equation is:

$\theta=\theta-\alpha\frac{\partial J}{\partial \theta}$

Where:

$\theta$ = Parameter
$\alpha$ = Learning Rate
$\frac{\partial J}{\partial \theta}$ = Gradient

This equation is the heart of Gradient Descent.

Understanding the Update Rule

The formula says:


New Parameter
=
Old Parameter
-
Step Size
×
Gradient

With a suitable learning rate, each iteration moves the parameter closer to the minimum.
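As a worked example of a single update, assume a hypothetical cost $J(\theta) = (\theta - 3)^2$, a starting value $\theta = 0$, and a learning rate $\alpha = 0.1$. None of these values come from a real model; they only make the arithmetic easy to follow.

```latex
\begin{align*}
J(\theta) &= (\theta - 3)^2, \qquad \theta = 0, \qquad \alpha = 0.1 \\
\frac{\partial J}{\partial \theta} &= 2(\theta - 3) = 2(0 - 3) = -6 \\
\theta_{\text{new}} &= \theta - \alpha \frac{\partial J}{\partial \theta} = 0 - 0.1 \times (-6) = 0.6 \\
J(0) &= 9, \qquad J(0.6) = (0.6 - 3)^2 = 5.76
\end{align*}
```

The gradient is negative, so the parameter increases, and the cost drops from 9 to 5.76 after one step.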

What is Learning Rate?

Learning Rate determines:

How large each step should be.

Symbol:

$\alpha$

Small Learning Rate

Example:

$\alpha = 0.0001$

Behavior:


Tiny Step
Tiny Step
Tiny Step
Tiny Step

Advantages:

Stable

Disadvantages:

Very slow

Large Learning Rate

Example:

$\alpha = 10$

Behavior:


Huge Step
Huge Step
Huge Step

Advantages:

Faster movement

Disadvantages:

May overshoot the minimum

Visualizing Learning Rate

Too Small:


*--*--*--*--*--*

Very slow progress.

Too Large:


*--------*--------*

May jump over the optimum.

Ideal:


*----*----*----*

Steady convergence.
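The three cases can be compared with a short experiment. This is a sketch on a hypothetical cost (theta - 3)^2; the specific learning rates 0.001, 0.1, and 1.1 are assumptions that happen to show each behavior on this particular function, and other cost functions would need different values.

```python
def run(learning_rate, steps=50, theta=0.0):
    """Run gradient descent on the hypothetical cost (theta - 3)^2."""
    for _ in range(steps):
        gradient = 2.0 * (theta - 3.0)
        theta -= learning_rate * gradient
    return theta

for lr in (0.001, 0.1, 1.1):
    theta = run(lr)
    print(f"lr={lr}: theta={theta:.4f}, distance from minimum={abs(theta - 3.0):.4f}")
```

With the smallest rate the parameter has barely moved after 50 steps, with the middle rate it is very close to the minimum, and with the largest rate each step overshoots by more than the last, so the distance grows instead of shrinking.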

Convergence

Convergence occurs when:

Parameter updates become very small.

Cost stops decreasing significantly.

Graph:


Cost
 ^
 |
 |\
 | \
 |  \
 |   \______
 |
 +------------>
 Iterations

The curve eventually flattens.
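One simple way to detect convergence is to stop when the cost improves by less than a small tolerance. The sketch below assumes a hypothetical cost (theta - 3)^2 and a tolerance of 1e-8; both are illustrative choices, and real training code often combines several stopping conditions.

```python
def cost(theta):
    # Hypothetical cost with its minimum at theta = 3
    return (theta - 3.0) ** 2

def gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1
tolerance = 1e-8          # assumed threshold for "cost stopped decreasing"
max_iterations = 10_000   # safety limit in case convergence never happens

previous_cost = cost(theta)
for iteration in range(1, max_iterations + 1):
    theta -= learning_rate * gradient(theta)
    current_cost = cost(theta)
    if abs(previous_cost - current_cost) < tolerance:
        break  # improvement is negligible, treat as converged
    previous_cost = current_cost

print(f"stopped after {iteration} iterations, theta={theta:.5f}")
```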

Local Minimum

A local minimum is:

The lowest point within a nearby region.

Gradient Descent may stop at a local minimum, even when a lower point exists elsewhere on the cost surface.

Global Minimum

The global minimum is:

The absolute lowest point of the entire cost function.

Example:


 \      /\        /
  \    /  \      /
   \__/    \    /
  (local)   \__/
          (global)

The lowest point overall.
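The difference between the two kinds of minimum shows up when the same algorithm starts from different points. The sketch below uses a hypothetical non-convex function, theta^4 - 3*theta^2 + theta, picked only because it has one shallow and one deep valley. The learning rate and step count are assumed values.

```python
def f(theta):
    # Hypothetical non-convex cost with two valleys
    return theta ** 4 - 3.0 * theta ** 2 + theta

def gradient(theta):
    # Derivative of f
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0

def descend(theta, learning_rate=0.01, steps=500):
    for _ in range(steps):
        theta -= learning_rate * gradient(theta)
    return theta

left = descend(-2.0)   # starts on the left slope
right = descend(2.0)   # starts on the right slope
print(f"start -2.0 -> theta={left:.4f}, cost={f(left):.4f}")
print(f"start +2.0 -> theta={right:.4f}, cost={f(right):.4f}")
```

Both runs stop where the gradient is close to zero, but the run that starts on the right settles in the shallower valley and ends with a higher cost than the run that starts on the left.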

Why Linear Regression is Easier

With the Mean Squared Error cost, Linear Regression produces a convex cost function.

Convex Shape (bowl-like):

Characteristics:

A single global minimum
No other local minima
Gradient Descent converges to the optimum when the learning rate is suitable
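A quick check of this property: on a convex cost, different starting points should lead to the same answer. The sketch below fits a line to made-up data from two very different starting points and compares both results with the closed-form least-squares solution. The data, learning rate, and step count are assumptions for illustration.

```python
# Hypothetical toy data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
n = len(xs)

def fit(m, b, learning_rate=0.05, steps=5000):
    """Gradient descent on mean squared error, starting from (m, b)."""
    for _ in range(steps):
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        grad_m = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Closed-form least-squares solution for a single feature
x_mean = sum(xs) / n
y_mean = sum(ys) / n
m_exact = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
           / sum((x - x_mean) ** 2 for x in xs))
b_exact = y_mean - m_exact * x_mean

print("start (0, 0):    ", fit(0.0, 0.0))
print("start (10, -10): ", fit(10.0, -10.0))
print("closed form:     ", (m_exact, b_exact))
```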

Training Process

Machine Learning training:


Initialize Parameters
          ↓
Calculate Predictions
          ↓
Calculate Cost
          ↓
Calculate Gradient
          ↓
Update Parameters
          ↓
Repeat
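The steps in this diagram map directly onto a training loop. Below is a minimal sketch for a one-feature linear model with a mean squared error cost; the toy data, the learning rate of 0.05, and the 2000 iterations are assumptions chosen so the example runs quickly.

```python
# Hypothetical toy data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
n = len(xs)

# 1. Initialize parameters
m, b = 0.0, 0.0
learning_rate = 0.05
cost_history = []

for iteration in range(2000):
    # 2. Calculate predictions
    predictions = [m * x + b for x in xs]

    # 3. Calculate cost (mean squared error)
    errors = [p - y for p, y in zip(predictions, ys)]
    cost = sum(e * e for e in errors) / n
    cost_history.append(cost)

    # 4. Calculate gradient of the cost with respect to m and b
    grad_m = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)

    # 5. Update parameters, then repeat
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(f"m={m:.3f}, b={b:.3f}")
print(f"first cost={cost_history[0]:.3f}, last cost={cost_history[-1]:.4f}")
```

Keeping the cost for every iteration in a list makes it easy to plot or inspect the curve afterwards.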

Example Iterations

Suppose:

Initial Cost:

100

After Iteration 1:

70

After Iteration 2:

40

After Iteration 3:

20

After Iteration 4:

10

Cost decreases steadily.

Batch Gradient Descent

Uses:

Entire dataset

for every update.

Workflow:


All Training Data
       ↓
Compute Gradient
       ↓
Update Parameters

Advantages:

Stable
Accurate gradients

Disadvantages:

Slow on large datasets

Stochastic Gradient Descent (SGD)

Uses:

One training example at a time.

Workflow:


One Sample
     ↓
Gradient
     ↓
Update

Advantages:

Faster individual updates
Suitable for large datasets

Disadvantages:

Noisy updates

Mini-Batch Gradient Descent

Most commonly used approach.

Uses:

Small batches

Example:

32 samples

64 samples

128 samples

Workflow:


Mini Batch
      ↓
Gradient
      ↓
Update

Advantages:

Faster
Stable
Efficient

Comparing Gradient Descent Variants

Type	Data Used
Batch GD	Entire Dataset
SGD	One Sample
Mini-Batch GD	Small Batch
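The three variants in this table differ only in how many samples feed each update, so one function with a batch_size argument can sketch all of them. The function name train, the noiseless toy data, and the hyperparameters below are assumptions for illustration. Because the data is noiseless, all three variants reach the same line; on noisy data, the smaller batch sizes would typically hover around the minimum rather than settle exactly on it.

```python
import random

# Hypothetical noiseless toy data from y = 2x + 1
xs = [i / 20 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

def train(batch_size, learning_rate=0.1, epochs=2000, seed=0):
    """Gradient descent on mean squared error; batch_size selects the variant."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [m * xs[i] + b - ys[i] for i in batch]
            # Gradient estimated from the current batch only
            grad_m = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            m -= learning_rate * grad_m
            b -= learning_rate * grad_b
    return m, b

print("batch GD:     ", train(batch_size=len(xs)))  # one update per epoch
print("SGD:          ", train(batch_size=1))        # one update per sample
print("mini-batch GD:", train(batch_size=4))        # one update per 4 samples
```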

Gradient Descent in Python

Using Scikit-Learn:


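from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic data standing in for a real training set
X_train, y_train = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# SGDRegressor fits a linear model using stochastic gradient descent
model = SGDRegressor()
model.fit(X_train, y_train)

The parameter updates happen inside fit(), so no manual update loop is needed.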

Why Deep Learning Depends on Gradient Descent

Neural Networks may contain:

Thousands of parameters
Millions of parameters
Billions of parameters

Manually finding optimal values is impossible.

Gradient Descent enables learning at scale.

Challenges with Gradient Descent

Poor Learning Rate

Too Small:

Slow training.

Too Large:

Divergence.

Feature Scaling Issues

Features:

Feature	Range
Age	0-100
Salary	0-1000000

Different scales slow convergence.

Solution:

Feature Scaling (for example, standardization or min-max normalization).
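One common form of feature scaling is standardization, which rescales each feature to have mean 0 and standard deviation 1. The sketch below applies it to made-up Age and Salary values; the numbers and the helper name standardize are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical raw feature values on very different scales
ages = [25, 40, 58, 33]
salaries = [30_000, 85_000, 120_000, 52_000]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled_ages = standardize(ages)
scaled_salaries = standardize(salaries)

print([round(v, 2) for v in scaled_ages])
print([round(v, 2) for v in scaled_salaries])
```

After scaling, both features cover a similar numeric range, so a single learning rate suits both parameters.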

Cost Surface Complexity

Deep Learning models may have:

Local minima
Saddle points
Complex landscapes

Advanced optimizers help mitigate these issues.

Common Optimizers Based on Gradient Descent

Modern Machine Learning often uses:

Gradient Descent
Momentum
RMSProp
Adam
AdaGrad

All build upon the same core idea.
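As one example of how these optimizers extend the basic update, here is a minimal sketch of the classical momentum update, which keeps a running velocity built from past gradients. The cost function, the learning rate of 0.1, and the momentum coefficient of 0.9 are assumed values, and this is not how any particular library implements it.

```python
def gradient(theta):
    # Derivative of the hypothetical cost (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
velocity = 0.0
learning_rate = 0.1   # assumed value
momentum = 0.9        # assumed value; fraction of the previous velocity kept

for _ in range(300):
    # Blend the previous velocity with the new downhill step
    velocity = momentum * velocity - learning_rate * gradient(theta)
    theta += velocity

print(f"theta={theta:.5f}")
```

Setting momentum to 0 reduces this to the plain Gradient Descent update, which is why these methods are described as building on the same idea.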

Why Feature Scaling Helps Gradient Descent

Without scaling:


Narrow Zigzag Path

With scaling:


Smooth Direct Path

Training becomes much faster.

Real-World Example

House Price Prediction:

Features:

Area
Bedrooms
Age

Initial Model:

Large prediction errors.

Gradient Descent:

Adjusts coefficients
Reduces cost
Improves predictions

After many iterations:

Coefficients close to the optimum are found.

Common Mistakes

Learning Rate Too High

Training becomes unstable.

Learning Rate Too Low

Training becomes extremely slow.

Not Scaling Features

Gradient Descent converges slowly.

Stopping Too Early

The model may not reach the minimum.

Best Practices

Scale numerical features
Start with reasonable learning rates
Monitor cost during training
Use mini-batch gradient descent for large datasets
Visualize convergence whenever possible

Gradient Descent Workflow

A typical workflow is:

Initialize parameters
Compute predictions
Calculate cost
Compute gradients
Update parameters
Reduce cost
Repeat until convergence

Why Gradient Descent is Important

Gradient Descent is the engine that powers learning in Machine Learning. Cost Functions tell the model how wrong it is, while Gradient Descent tells the model how to improve.

Without Gradient Descent, modern Machine Learning and Deep Learning would be impractical, because searching for good values of millions or billions of parameters by trial and error would be computationally infeasible.

Understanding Gradient Descent is essential because it forms the foundation of optimization, neural networks, deep learning, and many advanced Machine Learning algorithms that power today's AI systems.