In the previous articles, we learned:
- What Regression is
- How Simple Linear Regression works
- How Multiple Linear Regression makes predictions
A natural question arises:
How does the model know whether its predictions are good or bad?
Suppose two different regression models make predictions.
Model A:
| Actual | Predicted |
|---|---|
| 50 | 49 |
| 70 | 72 |
| 100 | 98 |
Model B:
| Actual | Predicted |
|---|---|
| 50 | 20 |
| 70 | 130 |
| 100 | 60 |
Clearly, Model A is better.
But how can a computer quantify this difference mathematically?
The answer is:
Cost Function
A Cost Function measures how wrong a model's predictions are. It provides a numerical value representing the overall prediction error.
Machine Learning models learn by minimizing this cost.
In this article, we will understand Cost Functions intuitively, learn why they are necessary, explore popular regression cost functions, and understand their role in model training.
What is a Cost Function?
A Cost Function is a mathematical function that measures the difference between:
- Actual Values
- Predicted Values
It tells us:
How badly the model is performing.
Lower Cost:
Better Model
Higher Cost:
Worse Model
Why Do We Need a Cost Function?
Consider two models:
Model A:
| Actual | Predicted |
|---|---|
| 100 | 99 |
| 200 | 201 |
Model B:
| Actual | Predicted |
|---|---|
| 100 | 50 |
| 200 | 300 |
Which model is better?
Humans can easily tell.
Computers need a numerical measure.
The Cost Function provides that measure.
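As a rough sketch of what such a numerical measure could look like, the snippet below scores both models with the average squared difference between actual and predicted values (this is the Mean Squared Error covered later in this article). The function name `average_squared_error` is an illustrative choice, not a library function.

```python
def average_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)

actual = [100, 200]
model_a = [99, 201]   # predictions from the Model A table
model_b = [50, 300]   # predictions from the Model B table

cost_a = average_squared_error(actual, model_a)
cost_b = average_squared_error(actual, model_b)

print(cost_a)  # 1.0
print(cost_b)  # 6250.0
```

The much smaller number for Model A matches what a human sees at a glance.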
Understanding Prediction Error
Suppose:
Actual House Price:
Predicted House Price:
Error:
Prediction Error measures how far the prediction is from reality.
Residual
The difference between actual and predicted values is called a residual.
Formula:
$$\text{Residual} = \text{Actual} - \text{Predicted}$$
Example:
Actual:
Predicted:
Residual:
Multiple Predictions
Suppose:
| Actual | Predicted |
|---|---|
| 50 | 45 |
| 70 | 72 |
| 100 | 95 |
Residuals:
| Actual | Predicted | Residual |
|---|---|---|
| 50 | 45 | 5 |
| 70 | 72 | -2 |
| 100 | 95 | 5 |
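The residual column above can be reproduced with a few lines of Python. This is a minimal sketch using plain lists; the variable names are illustrative.

```python
actual = [50, 70, 100]
predicted = [45, 72, 95]

# Residual = actual value minus predicted value, one per observation.
residuals = [a - p for a, p in zip(actual, predicted)]

print(residuals)  # [5, -2, 5]
```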
Now we need a way to combine all residuals into a single number.
Why We Cannot Simply Add Errors
Consider:
Residuals: 5 and -5
Sum: 5 + (-5) = 0
This incorrectly suggests perfect predictions.
Positive and negative errors cancel each other.
We need a better solution.
Squaring the Errors
One solution:
Square every residual.
Example:
| Error | Squared Error |
|---|---|
| 5 | 25 |
| -5 | 25 |
Now all values become positive.
This prevents cancellation.
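A short check of the cancellation problem, using the errors 5 and -5 from the table above. The sketch only contrasts a plain sum with a sum of squares.

```python
errors = [5, -5]

plain_sum = sum(errors)                    # positive and negative errors cancel
squared_sum = sum(e ** 2 for e in errors)  # every squared term is non-negative

print(plain_sum)    # 0
print(squared_sum)  # 50
```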
Mean Squared Error (MSE)
The most common regression cost function is:
Mean Squared Error (MSE)
Formula:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Where:
- $y_i$ = Actual Value
- $\hat{y}_i$ = Predicted Value
- $n$ = Number of observations
Understanding MSE Step-by-Step
Dataset:
| Actual | Predicted |
|---|---|
| 10 | 8 |
| 20 | 18 |
| 30 | 35 |
Step 1:
Calculate Errors
| Error |
|---|
| 2 |
| 2 |
| -5 |
Step 2:
Square Errors
| Squared Error |
|---|
| 4 |
| 4 |
| 25 |
Step 3:
Take Mean
$$\text{MSE} = \frac{4 + 4 + 25}{3} = \frac{33}{3} = 11$$
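The three steps can be sketched in plain Python as follows. The function name `mean_squared_error` is chosen here for illustration and is written from scratch rather than imported from a library.

```python
def mean_squared_error(actual, predicted):
    errors = [a - p for a, p in zip(actual, predicted)]  # Step 1: errors
    squared = [e ** 2 for e in errors]                   # Step 2: square them
    return sum(squared) / len(squared)                   # Step 3: take the mean

mse = mean_squared_error([10, 20, 30], [8, 18, 35])
print(mse)  # 11.0
```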
Why MSE is Popular
Advantages:
- Never negative (zero only for perfect predictions)
- Easy to calculate
- Smooth mathematical properties
- Penalizes large errors heavily
The smoothness in particular makes optimization easier, because the function is differentiable everywhere.
Understanding Error Penalty
Consider:
Error:
Squared Error:
Error:
Squared Error:
Large mistakes receive much larger penalties.
This encourages the model to avoid big prediction errors.
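To see how quickly the penalty grows, the sketch below squares a handful of error sizes. The error values are illustrative assumptions, not figures from the article.

```python
errors = [1, 2, 5, 10]   # illustrative values, not taken from the article

squared = [e ** 2 for e in errors]
print(squared)  # [1, 4, 25, 100]

# Doubling an error (5 -> 10) quadruples its squared penalty.
ratio = (2 * 5) ** 2 / 5 ** 2
print(ratio)  # 4.0
```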
Visualizing Cost
Suppose:
Model A:
MSE = 5
Model B:
MSE = 20
Interpretation:
Model A performs better.
The lower the cost, the better the model.
Cost Function and the Best Fit Line
Remember:
Linear Regression tries to find the best-fit line.
Question:
What defines "best"?
Answer:
The line with the lowest cost.
Among all possible lines, the model chooses the one producing the smallest MSE.
Example
Line 1:
Cost:
Line 2:
Cost:
Line 2 is better because its cost is lower.
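Comparing two candidate lines by cost might look like the sketch below. Both the data and the two lines are hypothetical, chosen only to illustrate the comparison, and `line_cost` is an illustrative helper name.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def line_cost(slope, intercept, xs, ys):
    """MSE of the line y = slope * x + intercept on the data (xs, ys)."""
    predictions = [slope * x + intercept for x in xs]
    return mse(ys, predictions)

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

cost_line_1 = line_cost(1.0, 0.0, xs, ys)  # candidate line y = x
cost_line_2 = line_cost(2.0, 1.0, xs, ys)  # candidate line y = 2x + 1

print(cost_line_1)  # 13.5
print(cost_line_2)  # 0.0
```

Here the second line passes through every point, so its cost is zero and it would be preferred.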
Cost Function Visualization
Imagine:
[Plot: Cost on the vertical axis against Model Parameters on the horizontal axis]
The goal is to reach the lowest point.
This point corresponds to the optimal model.
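One way to picture the lowest point is to compute the cost for a range of parameter values and pick the smallest. The sketch below assumes a one-parameter model `y = w * x` and hypothetical data; a simple grid of candidate slopes stands in for a real optimizer.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

# Candidate slopes 0.0, 0.5, ..., 4.0 for the model y = w * x.
candidate_slopes = [i * 0.5 for i in range(9)]
costs = [mse(ys, [w * x for x in xs]) for w in candidate_slopes]

# The slope with the smallest cost sits at the bottom of the curve.
best_slope = candidate_slopes[costs.index(min(costs))]
print(best_slope)  # 2.0
```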
Loss Function vs Cost Function
These terms are often confused.
Loss Function
Measures error for a single training example.
Example:
One house prediction.
Cost Function
Measures average error across the entire dataset.
Example:
Thousands of house predictions.
Relationship:
Individual Losses
↓
Average
↓
Cost Function
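The relationship can be sketched with the dataset from the MSE walkthrough, assuming squared error as the per-example loss.

```python
actual = [10, 20, 30]
predicted = [8, 18, 35]

# Loss: squared error of each individual training example.
losses = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# Cost: average of the individual losses over the whole dataset.
cost = sum(losses) / len(losses)

print(losses)  # [4, 4, 25]
print(cost)    # 11.0
```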
Mean Absolute Error (MAE)
Another popular cost function is MAE.
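Formula:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.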
Instead of squaring errors, MAE uses absolute values.
Example
Errors:
Absolute Errors:
MAE:
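A minimal sketch of MAE in plain Python, applied to the dataset from the MSE walkthrough (actual 10, 20, 30 and predicted 8, 18, 35). The function name is illustrative and written from scratch.

```python
def mean_absolute_error(actual, predicted):
    # Average of the absolute differences between actual and predicted values.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

mae = mean_absolute_error([10, 20, 30], [8, 18, 35])
print(mae)  # 3.0
```

The absolute errors are 2, 2, and 5, so the mean is 9 / 3 = 3.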
MSE vs MAE
| MSE | MAE |
|---|---|
| Squares errors | Uses absolute values |
| Punishes large errors heavily | Penalizes errors in proportion to their size |
| Smooth optimization | More robust to outliers |
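The difference in outlier sensitivity can be illustrated with two sets of hypothetical residuals that are identical except for one large error. The numbers are assumptions chosen for the illustration.

```python
def mse(errors):
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical residuals: identical except for one large outlier.
typical = [1, -1, 1, -1]
with_outlier = [1, -1, 1, -10]

print(mse(typical), mse(with_outlier))  # 1.0 25.75
print(mae(typical), mae(with_outlier))  # 1.0 3.25
```

A single outlier raises the MSE far more than the MAE in this example.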
Root Mean Squared Error (RMSE)
RMSE is the square root of MSE.
Formula:
$$\text{RMSE} = \sqrt{\text{MSE}}$$
Example:
RMSE:
Why RMSE is Useful
MSE units: squared units of the target variable
RMSE units: the same units as the target variable
This makes interpretation easier.
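A minimal sketch of RMSE, again using the dataset from the MSE walkthrough, where the MSE came out to 11. The function name `rmse` is illustrative.

```python
import math

def rmse(actual, predicted):
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)  # back in the same units as the target

value = rmse([10, 20, 30], [8, 18, 35])
print(round(value, 3))  # 3.317
```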
Cost Function in Linear Regression
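The cost function for Linear Regression is:
$$J = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$
Where:
- $J$ = Cost
- $m$ = Number of samples
- $\hat{y}_i$ = Prediction
- $y_i$ = Actual Value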
This is the function the model minimizes during training.
Why the Lowest Cost Matters
Suppose:
Three candidate models:
| Model | Cost |
|---|---|
| A | 100 |
| B | 50 |
| C | 10 |
Best Model:
Model C
Because its cost is the lowest: 10 < 50 < 100.
Optimization Goal
Machine Learning training is essentially:
Start with Random Parameters
↓
Calculate Cost
↓
Adjust Parameters
↓
Reduce Cost
↓
Repeat
This process continues until the cost is minimized.
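The loop above can be sketched for a one-parameter model `y = w * x`. This is an illustration under several assumptions: the data is hypothetical, the starting value is fixed at 0 instead of random so the run is reproducible, and the "adjust parameters" step uses a plain gradient step with an assumed learning rate.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

w = 0.0               # fixed starting parameter (random in practice)
learning_rate = 0.01  # assumed step size
history = []          # cost recorded at every iteration

for _ in range(200):
    predictions = [w * x for x in xs]
    history.append(mse(ys, predictions))  # calculate cost
    # Gradient of the MSE with respect to w for the model y = w * x.
    gradient = 2 * sum((p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
    w -= learning_rate * gradient         # adjust parameter to reduce cost

print(round(w, 4))               # 2.0
print(history[0] > history[-1])  # True
```

The cost starts at 30 and shrinks toward zero as `w` approaches 2.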
Enter Gradient Descent
The next challenge is:
How do we efficiently reduce the cost?
The answer is:
Gradient Descent
Gradient Descent is the optimization algorithm that helps models find parameter values producing the minimum cost.
It is the engine that powers learning in many Machine Learning algorithms.
Real-World Example
House Price Prediction:
Features:
- Area
- Bedrooms
- Location Score
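Model:
$$\text{Price} = \beta_0 + \beta_1 \cdot \text{Area} + \beta_2 \cdot \text{Bedrooms} + \beta_3 \cdot \text{Location Score}$$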
Different coefficient values produce different costs.
Training aims to find the coefficients with the lowest possible cost.
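As a sketch of how different coefficients lead to different costs, the snippet below evaluates two coefficient guesses on a few hypothetical houses. The data, the coefficient values, and the helper names `predict` and `cost` are all assumptions made for illustration.

```python
def predict(coefficients, features):
    """Linear prediction: intercept plus a weighted sum of the features."""
    intercept, *weights = coefficients
    return intercept + sum(w * f for w, f in zip(weights, features))

def cost(coefficients, rows, prices):
    """MSE of the model defined by `coefficients` on the given houses."""
    predictions = [predict(coefficients, row) for row in rows]
    return sum((y - p) ** 2 for y, p in zip(prices, predictions)) / len(prices)

# Hypothetical houses: (area, bedrooms, location score) and their prices.
rows = [(1000, 2, 5), (1500, 3, 7), (2000, 4, 8)]
prices = [150, 230, 300]

# Each guess is (intercept, area weight, bedrooms weight, location weight).
guess_1 = (0, 0.1, 10, 5)
guess_2 = (0, 0.1, 10, 1)

cost_1 = cost(guess_1, rows, prices)
cost_2 = cost(guess_2, rows, prices)
print(cost_1 < cost_2)  # True
```

Training would keep searching for coefficients whose cost is lower than either of these guesses.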
Characteristics of a Good Cost Function
A good cost function should:
- Measure prediction quality accurately
- Be easy to compute
- Be differentiable
- Guide optimization effectively
MSE satisfies these requirements, making it the most commonly used regression cost function.
Common Mistakes
Assuming Lower Training Cost Means Perfect Model
A very low training cost may indicate:
- Overfitting
- Memorization
Always evaluate on unseen data.
Comparing Different Metrics Incorrectly
MSE, RMSE, and MAE have different scales and interpretations.
Ignoring Outliers
MSE is sensitive to outliers because squared errors grow rapidly.
Best Practices
- Understand the business problem
- Choose appropriate error metrics
- Monitor both training and validation errors
- Investigate unusually high errors
- Use RMSE for easier interpretation
Cost Function Workflow
A typical workflow is:
- Make predictions
- Calculate residuals
- Compute cost
- Measure model quality
- Adjust parameters
- Reduce cost
- Repeat until convergence
Why Cost Functions are Important
Cost Functions are the foundation of Machine Learning optimization. They provide a numerical measure of model performance and define what it means for a model to improve.
Without a cost function, a model would have no way to determine whether its predictions are getting better or worse. Most learning algorithms ultimately depend on minimizing some form of cost, making this concept one of the most fundamental ideas in Machine Learning.
In the next article, we will learn Gradient Descent, the optimization algorithm that systematically reduces the cost function and helps Machine Learning models learn the best parameters.