Cost Function in Machine Learning

Last updated: Jun 12, 2026

Author: Christy Harshitha Dakarapu

In the previous articles, we learned:

What Regression is
How Simple Linear Regression works
How Multiple Linear Regression makes predictions

A natural question arises:

How does the model know whether its predictions are good or bad?

Suppose two different regression models make predictions.

Model A:

Actual	Predicted
50	49
70	72
100	98

Model B:

Actual	Predicted
50	20
70	130
100	60

Clearly, Model A is better.

But how can a computer quantify this difference mathematically?

The answer is:

Cost Function

A Cost Function measures how wrong a model's predictions are. It provides a numerical value representing the overall prediction error.

Machine Learning models learn by minimizing this cost.

In this article, we will understand Cost Functions intuitively, learn why they are necessary, explore popular regression cost functions, and understand their role in model training.

What is a Cost Function?

A Cost Function is a mathematical function that measures the difference between:

Actual Values
Predicted Values

It tells us:

How badly the model is performing.

Lower Cost:

Better Model

Higher Cost:

Worse Model

Why Do We Need a Cost Function?

Consider two models:

Model A:

Actual	Predicted
100	99
200	201

Model B:

Actual	Predicted
100	50
200	300

Which model is better?

Humans can easily tell.

Computers need a numerical measure.

The Cost Function provides that measure.

Understanding Prediction Error

Suppose:

Actual House Price:

100

Predicted House Price:

90

Error:

10

Prediction Error measures how far the prediction is from reality.

Residual

The difference between actual and predicted values is called a residual.

Formula:

$Residual=y_{actual}-y_{predicted}$

Example:

Actual:

80

Predicted:

70

Residual:

10

Multiple Predictions

Suppose:

Actual	Predicted
50	45
70	72
100	95

Residuals:

Actual	Predicted	Residual
50	45	5
70	72	-2
100	95	5

Now we need a way to combine all residuals into a single number.
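As a minimal sketch, the residuals in the table above can be computed in a few lines of Python. The variable names here are illustrative choices, not part of any library.

```python
# Actual and predicted values from the table above.
actual = [50, 70, 100]
predicted = [45, 72, 95]

# Residual = actual - predicted, computed pair by pair.
residuals = [a - p for a, p in zip(actual, predicted)]

print(residuals)  # [5, -2, 5]
```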

Why We Cannot Simply Add Errors

Consider:

Residuals:

5

and

-5

Sum:

0

This incorrectly suggests perfect predictions.

Positive and negative errors cancel each other.

We need a better solution.

Squaring the Errors

One solution:

Square every residual.

Example:

Error	Squared Error
5	25
-5	25

Now all values become positive.

This prevents cancellation.
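The cancellation problem and the effect of squaring can be checked directly. This is a small sketch using the two residuals from the example; the variable names are illustrative.

```python
residuals = [5, -5]

# Plain sum: the positive and negative errors cancel out.
plain_sum = sum(residuals)

# Squared errors: every term is non-negative, so nothing cancels.
squared = [r ** 2 for r in residuals]
squared_sum = sum(squared)

print(plain_sum)    # 0
print(squared)      # [25, 25]
print(squared_sum)  # 50
```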

Mean Squared Error (MSE)

The most common regression cost function is:

Mean Squared Error (MSE)

Formula:

$MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$

Where:

$y_i$ = Actual Value
$\hat y_i$ = Predicted Value
$n$ = Number of observations

Understanding MSE Step-by-Step

Dataset:

Actual	Predicted
10	8
20	18
30	35

Step 1:

Calculate Errors

Error
2
2
-5

Step 2:

Square Errors

Squared Error
4
4
25

Step 3:

Take Mean

$MSE=\frac{4+4+25}{3}=\frac{33}{3}$

$MSE=11$
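The three steps above translate almost line for line into code. The function below is a plain-Python sketch (the name `mean_squared_error` is our own choice for illustration), applied to the same dataset.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Step 1 and Step 2: compute each error and square it.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Step 3: take the mean.
    return sum(squared_errors) / len(squared_errors)


actual = [10, 20, 30]
predicted = [8, 18, 35]

print(mean_squared_error(actual, predicted))  # 11.0
```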

Why MSE is Popular

Advantages:

Never negative (zero only when every prediction is exact)
Easy to calculate
Smooth mathematical properties
Penalizes large errors heavily

This makes optimization easier.

Understanding Error Penalty

Consider:

Error:

2

Squared Error:

4

Error:

10

Squared Error:

100

Large mistakes receive much larger penalties.

This encourages the model to avoid big prediction errors.
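A quick calculation makes the size of the penalty concrete: in the example, the error grows by a factor of 5 while the squared penalty grows by a factor of 25. The variable names are illustrative.

```python
small_error, large_error = 2, 10

# How many times larger the big error is than the small one.
error_ratio = large_error / small_error

# How many times larger its squared penalty is.
penalty_ratio = large_error ** 2 / small_error ** 2

print(error_ratio)    # 5.0
print(penalty_ratio)  # 25.0
```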

Visualizing Cost

Suppose:

Model A:

MSE = 5

Model B:

MSE = 20

Interpretation:

Model A performs better.

The lower the cost, the better the model.

Cost Function and the Best Fit Line

Remember:

Linear Regression tries to find the best-fit line.

Question:

What defines "best"?

Answer:

The line with the lowest cost.

Among all the possible lines, the model chooses the one producing the smallest MSE.

Example

Line 1:

y=2x+1

Cost:

50

Line 2:

y=1.8x+2

Cost:

10

Line 2 is better because its cost is lower.
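To see how such a comparison might be carried out, here is a sketch that scores both lines on a small dataset. The data points are hypothetical and invented for illustration, so the resulting costs will not match the figures 50 and 10 used above; only the comparison logic is the point.

```python
def mse_of_line(slope, intercept, xs, ys):
    """MSE of the line y = slope * x + intercept on the given data."""
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data points (not from the article).
xs = [1, 2, 3, 4]
ys = [4, 5.5, 7.5, 9]

cost_line_1 = mse_of_line(2, 1, xs, ys)    # y = 2x + 1
cost_line_2 = mse_of_line(1.8, 2, xs, ys)  # y = 1.8x + 2

# The line with the lower cost is the better fit for this data.
better = "Line 1" if cost_line_1 < cost_line_2 else "Line 2"
print(better)  # Line 2
```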

Cost Function Visualization

Imagine:


Cost
 ^
 |*               *
 | *             *
 |   *         *
 |      * * *
 +------------------->
     Model Parameters

The goal is to reach the lowest point.

This point corresponds to the optimal model.
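The idea of a lowest point can be explored numerically by computing the cost for a range of parameter values. The sketch below assumes a simplified model with a single parameter (a slope and no intercept) and hypothetical data that follows y = 2x exactly.

```python
def mse_for_slope(slope, xs, ys):
    """MSE of the line y = slope * x (no intercept) on the data."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

# Candidate slopes 0.0, 0.5, ..., 4.0.
slopes = [i * 0.5 for i in range(9)]
costs = [mse_for_slope(s, xs, ys) for s in slopes]

for s, c in zip(slopes, costs):
    print(f"slope={s:.1f}  cost={c:.2f}")

# The cost falls as the slope approaches 2, then rises again.
best_slope = slopes[costs.index(min(costs))]
print("lowest cost at slope", best_slope)  # lowest cost at slope 2.0
```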

Loss Function vs Cost Function

These terms are often confused.

Loss Function

Measures error for a single training example.

Example:

One house prediction.

Cost Function

Measures average error across the entire dataset.

Example:

Thousands of house predictions.

Relationship:


Individual Losses
        ↓
Average
        ↓
Cost Function
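The relationship above can be sketched in code, assuming squared error as the per-example loss. The function names and the house prices are hypothetical.

```python
def squared_loss(actual, predicted):
    """Loss: the error for ONE training example."""
    return (actual - predicted) ** 2


def cost(actuals, predictions):
    """Cost: the average of the individual losses over the whole dataset."""
    losses = [squared_loss(a, p) for a, p in zip(actuals, predictions)]
    return sum(losses) / len(losses)


# Hypothetical house prices.
actuals = [100, 200, 300]
predictions = [90, 220, 290]

print(squared_loss(actuals[0], predictions[0]))  # 100 (one example)
print(cost(actuals, predictions))                # 200.0 (whole dataset)
```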

Mean Absolute Error (MAE)

Another popular cost function is MAE.

Formula:

$MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat y_i|$

Instead of squaring errors, MAE uses absolute values.

Example

Errors:

2, -3, 4

Absolute Errors:

2,3,4

MAE:

$MAE=\frac{2+3+4}{3}=3$
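The same calculation as a short Python sketch. The actual and predicted values are hypothetical, chosen only so that the errors come out as 2, -3, and 4.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical values producing the errors 2, -3, 4.
actual = [10, 20, 30]
predicted = [8, 23, 26]

print(mean_absolute_error(actual, predicted))  # 3.0
```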

MSE vs MAE

MSE	MAE
Squares errors	Uses absolute values
Punishes large errors heavily	Penalizes errors in proportion to their size
Smooth optimization	More robust to outliers
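The difference in outlier sensitivity is easy to see with numbers. In this sketch, the two lists of errors are hypothetical; they differ only in the last value, which is replaced by one large mistake.

```python
def mse_from_errors(errors):
    return sum(e ** 2 for e in errors) / len(errors)


def mae_from_errors(errors):
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical errors: identical except for one outlier.
typical = [1, -1, 2, -2]
with_outlier = [1, -1, 2, -20]

print(mse_from_errors(typical), mse_from_errors(with_outlier))  # 2.5 101.5
print(mae_from_errors(typical), mae_from_errors(with_outlier))  # 1.5 6.0

# Relative growth caused by the single outlier.
mse_growth = mse_from_errors(with_outlier) / mse_from_errors(typical)
mae_growth = mae_from_errors(with_outlier) / mae_from_errors(typical)
print(mse_growth > mae_growth)  # True
```

One outlier multiplies the MSE by roughly 40 here, while the MAE grows by a factor of 4.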

Root Mean Squared Error (RMSE)

RMSE is the square root of MSE.

Formula:

$RMSE=\sqrt{MSE}$

Example:

MSE=25

RMSE:

5

Why RMSE is Useful

MSE units:

$Price^2$

RMSE units:

Price

This makes interpretation easier.
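A minimal sketch of RMSE in Python, using the standard library's `math.sqrt`. The prices are hypothetical and chosen so that the MSE is 25, matching the example above.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error, in the same units as the target."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical prices: both predictions are off by 5, so MSE = 25.
actual = [100, 200]
predicted = [95, 205]

print(rmse(actual, predicted))  # 5.0
```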

Cost Function in Linear Regression

The cost function for Linear Regression is:

$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2$

Where:

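$J(\theta)$ = Cost
$m$ = Number of samples
$h_\theta(x_i)$ = Prediction for the $i$-th sample
$y_i$ = Actual value for the $i$-th sample

This is MSE scaled by $\frac{1}{2}$, with $m$ playing the role of $n$. The extra factor is a convention that cancels the 2 produced when the square is differentiated; it does not change which parameters give the lowest cost.

This is the function the model minimizes during training.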
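As a sketch, $J(\theta)$ can be written out for the single-feature case. This assumes a hypothesis of the form $h_\theta(x)=\theta_0+\theta_1 x$, as in Simple Linear Regression; the function names and the data are illustrative.

```python
def hypothesis(theta0, theta1, x):
    """Prediction h_theta(x) for one feature: theta0 + theta1 * x (assumed form)."""
    return theta0 + theta1 * x


def cost_j(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum of squared differences between prediction and actual."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

print(cost_j(0, 2, xs, ys))  # 0.0 (perfect fit)
print(cost_j(0, 1, xs, ys))  # larger cost for a worse line
```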

Why the Lowest Cost Matters

Suppose:

Three candidate models:

Model	Cost
A	100
B	50
C	10

Best Model:

Model C

Because:

10 < 50 < 100

Optimization Goal

Machine Learning training is essentially:


Start with Random Parameters
            ↓
Calculate Cost
            ↓
Adjust Parameters
            ↓
Reduce Cost
            ↓
Repeat

This process continues until the cost stops decreasing, that is, until the parameters converge.

Enter Gradient Descent

The next challenge is:

How do we efficiently reduce the cost?

The answer is:

Gradient Descent

Gradient Descent is the optimization algorithm that helps models find parameter values producing the minimum cost.

It is the engine that powers learning in many Machine Learning algorithms.
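As a preview, the loop from the Optimization Goal section can be sketched with a gradient descent update as the "Adjust Parameters" step. This is only an illustration under several assumptions: a single-feature line $y=wx+b$, MSE as the cost, hypothetical data, a hand-picked learning rate, and a fixed starting point of zero instead of random values so the run is reproducible.

```python
def mse_cost(w, b, xs, ys):
    """MSE of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, xs, ys, learning_rate):
    """Move w and b a small step in the direction that reduces the MSE."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2 / n) * sum(errors)
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

w, b = 0.0, 0.0                      # starting parameters
history = [mse_cost(w, b, xs, ys)]   # calculate cost

for _ in range(2000):                # repeat
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.05)  # adjust parameters
    history.append(mse_cost(w, b, xs, ys))                  # cost goes down

print(f"w={w:.3f}, b={b:.3f}")
print(f"initial cost={history[0]:.3f}, final cost={history[-1]:.6f}")
```

With these settings the parameters should approach w = 2 and b = 1. A learning rate that is too large can make the cost grow instead of shrink, which is why the value here was picked by hand for this dataset.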

Real-World Example

House Price Prediction:

Features:

Area
Bedrooms
Location Score

Model:

$Price=\beta_0+\beta_1(Area)+\beta_2(Bedrooms)+\beta_3(Location)$

Different coefficient values produce different costs.

Training aims to find the coefficients with the lowest possible cost.
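The sketch below shows how two sets of coefficients lead to two different costs. Everything in it is hypothetical: the houses, the prices, and both candidate coefficient sets were invented for illustration.

```python
def predict_price(coefficients, area, bedrooms, location):
    """Price = b0 + b1 * Area + b2 * Bedrooms + b3 * Location."""
    b0, b1, b2, b3 = coefficients
    return b0 + b1 * area + b2 * bedrooms + b3 * location


def mse_for_coefficients(coefficients, houses):
    """MSE of the model with the given coefficients over all houses."""
    squared_errors = [
        (price - predict_price(coefficients, area, bedrooms, location)) ** 2
        for area, bedrooms, location, price in houses
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical houses: (area, bedrooms, location score, actual price).
houses = [
    (1000, 2, 7, 250),
    (1500, 3, 8, 365),
    (2000, 4, 6, 470),
]

candidate_a = (10, 0.2, 10, 3)  # hypothetical coefficients
candidate_b = (0, 0.1, 5, 1)    # hypothetical coefficients

cost_a = mse_for_coefficients(candidate_a, houses)
cost_b = mse_for_coefficients(candidate_b, houses)

print(cost_a < cost_b)  # True: candidate A fits these houses better
```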

Characteristics of a Good Cost Function

A good cost function should:

Measure prediction quality accurately
Be easy to compute
Be differentiable
Guide optimization effectively

MSE satisfies these requirements, making it the most commonly used regression cost function.

Common Mistakes

Assuming Lower Training Cost Means Perfect Model

A very low training cost may indicate:

Overfitting
Memorization

Always evaluate on unseen data.
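A deliberately extreme sketch of memorization: one "model" simply stores the training answers, while the other is a simple line. The data, both models, and the fallback rule for unseen inputs are hypothetical.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data, roughly following y = 2x + 1.
train = {1: 3.1, 2: 4.9, 3: 7.2, 4: 8.8}
validation = {5: 11.1, 6: 12.9}

train_mean = sum(train.values()) / len(train)


def memorizing_model(x):
    """Returns the stored training answer; falls back to the training mean otherwise."""
    return train.get(x, train_mean)


def line_model(x):
    """A simple line that captures the overall trend."""
    return 2 * x + 1


def evaluate(model, data):
    xs = list(data)
    return mse([data[x] for x in xs], [model(x) for x in xs])


memo_train, memo_val = evaluate(memorizing_model, train), evaluate(memorizing_model, validation)
line_train, line_val = evaluate(line_model, train), evaluate(line_model, validation)

print(f"memorizing: train={memo_train:.3f}  validation={memo_val:.3f}")
print(f"line:       train={line_train:.3f}  validation={line_val:.3f}")
```

The memorizing model has a training cost of zero yet performs far worse on the validation data, which is the reason for checking unseen data.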

Comparing Different Metrics Incorrectly

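MSE, RMSE, and MAE have different scales and interpretations, so always compare models on the same metric. For example, an MSE of 25 and an RMSE of 5 describe the same level of error.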

Ignoring Outliers

MSE is sensitive to outliers because squared errors grow rapidly.

Best Practices

Understand the business problem
Choose appropriate error metrics
Monitor both training and validation errors
Investigate unusually high errors
Use RMSE for easier interpretation

Cost Function Workflow

A typical workflow is:

Make predictions
Calculate residuals
Compute cost
Measure model quality
Adjust parameters
Reduce cost
Repeat until convergence

Why Cost Functions are Important

Cost Functions are the foundation of Machine Learning optimization. They provide a numerical measure of model performance and define what it means for a model to improve.

Without a cost function, a model would have no way to determine whether its predictions are getting better or worse. Most learning algorithms ultimately depend on minimizing some form of cost, making this concept one of the most fundamental ideas in Machine Learning.

In the next article, we will learn Gradient Descent, the optimization algorithm that systematically reduces the cost function and helps Machine Learning models learn the best parameters.