In the previous articles, we learned:

  • What Regression is
  • How Simple Linear Regression works
  • How Multiple Linear Regression makes predictions

A natural question arises:

How does the model know whether its predictions are good or bad?

Suppose two different regression models make predictions.

Model A:

| Actual | Predicted |
|--------|-----------|
| 50     | 49        |
| 70     | 72        |
| 100    | 98        |

Model B:

| Actual | Predicted |
|--------|-----------|
| 50     | 20        |
| 70     | 130       |
| 100    | 60        |

Clearly, Model A is better.

But how can a computer quantify this difference mathematically?

The answer is:

Cost Function

A Cost Function measures how wrong a model's predictions are. It provides a numerical value representing the overall prediction error.

Machine Learning models learn by minimizing this cost.

In this article, we will understand Cost Functions intuitively, learn why they are necessary, explore popular regression cost functions, and understand their role in model training.

What is a Cost Function?

A Cost Function is a mathematical function that measures the difference between:

  • Actual Values
  • Predicted Values

It tells us:

How badly the model is performing.

Lower Cost:

Better Model

Higher Cost:

Worse Model

Why Do We Need a Cost Function?

Consider two models:

Model A:

| Actual | Predicted |
|--------|-----------|
| 100    | 99        |
| 200    | 201       |

Model B:

| Actual | Predicted |
|--------|-----------|
| 100    | 50        |
| 200    | 300       |

Which model is better?

Humans can easily tell.

Computers need a numerical measure.

The Cost Function provides that measure.

Understanding Prediction Error

Suppose:

Actual House Price:

100

Predicted House Price:

90

Error:

10

Prediction Error measures how far the prediction is from reality.

Residual

The difference between actual and predicted values is called a residual.

Formula:

$$Residual = y_{actual} - y_{predicted}$$

Example:

Actual:

80

Predicted:

70

Residual:

10

Multiple Predictions

Suppose:

| Actual | Predicted |
|--------|-----------|
| 50     | 45        |
| 70     | 72        |
| 100    | 95        |

Residuals:

| Actual | Predicted | Residual |
|--------|-----------|----------|
| 50     | 45        | 5        |
| 70     | 72        | -2       |
| 100    | 95        | 5        |
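The residual column can be reproduced in a few lines of Python. This is a minimal sketch; the variable names `actual` and `predicted` are illustrative and not part of any library.

```python
# Actual and predicted values from the example table.
actual = [50, 70, 100]
predicted = [45, 72, 95]

# Residual = actual - predicted, computed pair by pair.
residuals = [a - p for a, p in zip(actual, predicted)]

print(residuals)  # [5, -2, 5]
```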

Now we need a way to combine all residuals into a single number.

Why We Cannot Simply Add Errors

Consider:

Residuals:

5

and

-5

Sum:

0

This incorrectly suggests perfect predictions.

Positive and negative errors cancel each other.

We need a better solution.

Squaring the Errors

One solution:

Square every residual.

Example:

| Error | Squared Error |
|-------|---------------|
| 5     | 25            |
| -5    | 25            |

Now all values become positive.

This prevents cancellation.
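A short sketch makes the cancellation problem and the effect of squaring concrete. The variable names here are illustrative.

```python
residuals = [5, -5]

# Plain sum: the positive and negative errors cancel out.
plain_sum = sum(residuals)

# Squaring first makes every term non-negative, so nothing cancels.
squared = [r ** 2 for r in residuals]
sum_of_squares = sum(squared)

print(plain_sum)       # 0
print(squared)         # [25, 25]
print(sum_of_squares)  # 50
```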

Mean Squared Error (MSE)

The most common regression cost function is:

Mean Squared Error (MSE)

Formula:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat y_i)^2$$

Where:

  • $y_i$ = Actual Value
  • $\hat y_i$ = Predicted Value
  • $n$ = Number of observations

Understanding MSE Step-by-Step

Dataset:

| Actual | Predicted |
|--------|-----------|
| 10     | 8         |
| 20     | 18        |
| 30     | 35        |

Step 1:

Calculate Errors

Error
2
2
-5

Step 2:

Square Errors

Squared Error
4
4
25

Step 3:

Take Mean

$$MSE = \frac{4+4+25}{3}$$

$$MSE = 11$$
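The three steps can be checked with a small hand-written helper. This is a sketch for illustration: `mean_squared_error` is defined here from scratch and is not imported from any library.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Step 1 and 2: compute each error and square it.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Step 3: take the mean.
    return sum(squared_errors) / len(squared_errors)


mse = mean_squared_error([10, 20, 30], [8, 18, 35])
print(mse)  # 11.0
```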

Why MSE is Popular

Advantages:

  • Always non-negative (zero only when every prediction is exact)
  • Easy to calculate
  • Smooth mathematical properties
  • Penalizes large errors heavily

This makes optimization easier.

Understanding Error Penalty

Consider:

Error:

2

Squared Error:

4

Error:

10

Squared Error:

100

Large mistakes receive much larger penalties.

This encourages the model to avoid big prediction errors.
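The same comparison in code shows how the penalty grows with the square of the error. A minimal sketch using the two errors from this example:

```python
small_error, large_error = 2, 10

# The large error is 5 times bigger...
error_ratio = large_error / small_error

# ...but its squared penalty is 25 times bigger (5 squared).
penalty_ratio = large_error ** 2 / small_error ** 2

print(error_ratio)    # 5.0
print(penalty_ratio)  # 25.0
```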

Visualizing Cost

Suppose:

Model A:

MSE = 5

Model B:

MSE = 20

Interpretation:

Model A performs better.

The lower the cost, the better the model.

Cost Function and the Best Fit Line

Remember:

Linear Regression tries to find the best-fit line.

Question:

What defines "best"?

Answer:

The line with the lowest cost.

Among all possible lines, the model chooses the one that produces the smallest MSE.

Example

Line 1:

$$y = 2x + 1$$

Cost:

50

Line 2:

$$y = 1.8x + 2$$

Cost:

10

Line 2 is better because its cost is lower.
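Comparing two candidate lines can be sketched as follows. The data points below are made up for illustration, so the costs this code produces will not match the 50 and 10 quoted in the example; only the comparison logic carries over.

```python
# Hypothetical data points (assumed for this sketch, not from the article).
xs = [1, 2, 3, 4]
ys = [4.0, 5.5, 7.5, 9.0]


def line_cost(slope, intercept):
    """MSE of the line y = slope * x + intercept on the data above."""
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)


cost_line_1 = line_cost(2, 1)     # y = 2x + 1
cost_line_2 = line_cost(1.8, 2)   # y = 1.8x + 2

# The line with the lower cost is the better fit for this data.
better = "Line 1" if cost_line_1 < cost_line_2 else "Line 2"
print(cost_line_1, cost_line_2, better)
```

On this particular made-up dataset, Line 2 also comes out ahead, but a different dataset could favor Line 1.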

Cost Function Visualization

Imagine:

Cost
^
|
| *
| *
| *
|*
+------------------->
Model Parameters

The goal is to reach the lowest point.

This point corresponds to the optimal model.

Loss Function vs Cost Function

These terms are often confused.

Loss Function

Measures error for a single training example.

Example:

One house prediction.

Cost Function

Measures average error across the entire dataset.

Example:

Thousands of house predictions.

Relationship:

Individual Losses → Average → Cost Function

Mean Absolute Error (MAE)

Another popular cost function is MAE.

Formula:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat y_i|$$

Instead of squaring errors, MAE uses absolute values.

Example

Errors:

2, -3, 4

Absolute Errors:

2, 3, 4

MAE:

$$\frac{2+3+4}{3} = 3$$
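The MAE calculation is short enough to write by hand. In this sketch, `mean_absolute_error` is a helper defined here for illustration and takes the list of errors directly.

```python
def mean_absolute_error(errors):
    """Average of the absolute values of the errors."""
    return sum(abs(e) for e in errors) / len(errors)


mae = mean_absolute_error([2, -3, 4])
print(mae)  # 3.0
```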

MSE vs MAE

| MSE                           | MAE                     |
|-------------------------------|-------------------------|
| Squares errors                | Uses absolute values    |
| Punishes large errors heavily | Treats errors equally   |
| Smooth optimization           | More robust to outliers |
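The difference in outlier sensitivity is easy to see numerically. The error values below are invented for this sketch: four small errors, then the same four plus one large outlier.

```python
def mse(errors):
    return sum(e ** 2 for e in errors) / len(errors)


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


clean = [1, -1, 2, -2]
with_outlier = clean + [20]   # one unusually large error added

print(mse(clean), mse(with_outlier))  # 2.5 82.0
print(mae(clean), mae(with_outlier))  # 1.5 5.2
```

A single outlier multiplies the MSE by more than 30 here, while the MAE grows by less than a factor of 4.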

Root Mean Squared Error (RMSE)

RMSE is the square root of MSE.

Formula:

$$RMSE = \sqrt{MSE}$$

Example:

$$MSE = 25$$

RMSE:

5

Why RMSE is Useful

MSE units:

$Price^2$

RMSE units:

$Price$

This makes interpretation easier.
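Converting an MSE to an RMSE is a one-line operation. A minimal sketch using the value from the example:

```python
import math

mse = 25                 # in squared price units
rmse = math.sqrt(mse)    # back in the original price units

print(rmse)  # 5.0
```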

Cost Function in Linear Regression

The cost function for Linear Regression is:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i) - y_i)^2$$

Where:

  • $J(\theta)$ = Cost
  • $m$ = Number of samples
  • $h_\theta(x)$ = Prediction

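This is the function the model minimizes during training. It is the MSE scaled by 1/2; the extra factor is a convention that simplifies the derivative and does not change which parameters give the minimum.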
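The formula translates almost directly into code. The sketch below assumes a single input feature, so that the prediction is h(x) = theta0 + theta1 * x; the function name and the tiny dataset are illustrative.

```python
def linear_regression_cost(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum of (h(x_i) - y_i)^2, with h(x) = theta0 + theta1 * x."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Illustrative data generated from y = 2x + 1.
xs = [1, 2, 3]
ys = [3, 5, 7]

print(linear_regression_cost(1, 2, xs, ys))  # 0.0 (perfect fit)
print(linear_regression_cost(0, 2, xs, ys))  # 0.5 (every prediction is off by 1)
```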

Why the Lowest Cost Matters

Suppose:

Three candidate models:

| Model | Cost |
|-------|------|
| A     | 100  |
| B     | 50   |
| C     | 10   |

Best Model:

Model C

Because:

$$10 < 50 < 100$$

Optimization Goal

Machine Learning training is essentially:

Start with Random Parameters → Calculate Cost → Adjust Parameters → Reduce Cost → Repeat

This process continues until the cost is minimized.

Enter Gradient Descent

The next challenge is:

How do we efficiently reduce the cost?

The answer is:

Gradient Descent

Gradient Descent is the optimization algorithm that helps models find parameter values producing the minimum cost.

It is the engine that powers learning in many Machine Learning algorithms.
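To make the training loop concrete, here is a rough sketch of the start, measure, adjust, repeat cycle on a single-feature model. The dataset, learning rate, and step count are assumptions chosen for illustration, and the update rule shown is a basic form of gradient descent rather than a tuned implementation.

```python
import random

# Illustrative data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m = len(xs)


def cost(w, b):
    """J = 1/(2m) * sum of squared prediction errors for the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


random.seed(0)
w, b = random.random(), random.random()   # start with random parameters
initial_cost = cost(w, b)

learning_rate = 0.05                      # assumed step size
for _ in range(2000):
    # Slope of the cost with respect to each parameter.
    grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
    # Adjust the parameters in the direction that reduces the cost.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_cost = cost(w, b)
print(round(w, 3), round(b, 3), final_cost < initial_cost)
```

With these settings the parameters should settle close to w = 2 and b = 1, the values used to generate the data.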

Real-World Example

House Price Prediction:

Features:

  • Area
  • Bedrooms
  • Location Score

Model:

$$Price = \beta_0 + \beta_1(Area) + \beta_2(Bedrooms) + \beta_3(Location)$$

Different coefficient values produce different costs.

Training aims to find the coefficients with the lowest possible cost.
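The sketch below evaluates two candidate coefficient sets on a small dataset. Every number in it is hypothetical: the houses, their prices, and both sets of coefficients are invented to show how different coefficients lead to different costs.

```python
# Hypothetical houses: (area, bedrooms, location score, actual price).
houses = [
    (1000, 2, 5, 210),
    (1500, 3, 7, 320),
    (2000, 4, 8, 420),
]


def predict(coeffs, area, bedrooms, location):
    """Price = b0 + b1*Area + b2*Bedrooms + b3*Location."""
    b0, b1, b2, b3 = coeffs
    return b0 + b1 * area + b2 * bedrooms + b3 * location


def mse(coeffs):
    """Mean squared error of the model with these coefficients on `houses`."""
    errors = [predict(coeffs, a, beds, loc) - price for a, beds, loc, price in houses]
    return sum(e ** 2 for e in errors) / len(errors)


candidate_1 = (0, 0.1, 10, 5)   # hypothetical coefficients
candidate_2 = (0, 0.2, 5, 0)    # hypothetical coefficients

print(mse(candidate_1), mse(candidate_2))
```

On this made-up data the second candidate has a much lower cost, so training would prefer it over the first.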

Characteristics of a Good Cost Function

A good cost function should:

  • Measure prediction quality accurately
  • Be easy to compute
  • Be differentiable
  • Guide optimization effectively

MSE satisfies these requirements, making it the most commonly used regression cost function.

Common Mistakes

Assuming Lower Training Cost Means Perfect Model

A very low training cost may indicate:

  • Overfitting
  • Memorization

Always evaluate on unseen data.

Comparing Different Metrics Incorrectly

MSE, RMSE, and MAE have different scales and interpretations, so compare models only on the same metric.

Ignoring Outliers

MSE is sensitive to outliers because squared errors grow rapidly.

Best Practices

  • Understand the business problem
  • Choose appropriate error metrics
  • Monitor both training and validation errors
  • Investigate unusually high errors
  • Use RMSE for easier interpretation

Cost Function Workflow

A typical workflow is:

  1. Make predictions
  2. Calculate residuals
  3. Compute cost
  4. Measure model quality
  5. Adjust parameters
  6. Reduce cost
  7. Repeat until convergence

Why Cost Functions are Important

Cost Functions are the foundation of Machine Learning optimization. They provide a numerical measure of model performance and define what it means for a model to improve.

Without a cost function, a model would have no way to determine whether its predictions are getting better or worse. Every learning algorithm ultimately depends on minimizing some form of cost, making this concept one of the most fundamental ideas in Machine Learning.

In the next article, we will learn Gradient Descent, the optimization algorithm that systematically reduces the cost function and helps Machine Learning models learn the best parameters.