Loss Functions

When a machine learning model makes a prediction, we need a way to measure how wrong that prediction is. That measurement is the job of the loss function. It turns the gap between the model's prediction and the correct answer into a single number, and the goal of training is to make that number as small as possible.

In other words, the loss function is the model's report card. A high loss means the model is doing badly; a low loss means it's doing well. Training works by continually adjusting the model's weights to reduce the loss.

💡 In one line: A loss function measures how wrong a model's predictions are, and training is the process of minimising it.

What is a Loss Function?

A loss function compares the model's predicted output with the actual (true) value and outputs a number representing the error:

  • Large loss → predictions are far from the truth (bad).
  • Small loss → predictions are close to the truth (good).
  • Zero loss → perfect predictions (rare in practice).

You'll also hear the term cost function. The two are closely related:

  • Loss = the error for a single example.
  • Cost = the average loss across the whole dataset.

They're often used interchangeably, but that's the technical difference.
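To make the distinction concrete, here is a minimal sketch in plain Python. It assumes squared error as the per-example loss; the function names `squared_error` and `cost` are illustrative, not taken from any library.

```python
def squared_error(y_true, y_pred):
    """Loss for a single example: the squared difference."""
    return (y_true - y_pred) ** 2


def cost(y_true_list, y_pred_list):
    """Cost: the average of the per-example losses over the dataset."""
    losses = [squared_error(y, p) for y, p in zip(y_true_list, y_pred_list)]
    return sum(losses) / len(losses)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

# One loss value per example...
per_example = [squared_error(y, p) for y, p in zip(y_true, y_pred)]
print(per_example)           # [1.0, 0.0, 4.0]

# ...and one cost value for the whole dataset.
print(cost(y_true, y_pred))  # 1.6666666666666667
```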

Why Are Loss Functions Important?

The loss function does two essential jobs:

  1. It defines what "good" means. Choosing a loss function tells the model exactly what to optimise for: closeness to a target number, quality of the predicted probabilities, size of the margin between classes, and so on.
  2. It guides training. During backpropagation, the model calculates how the loss changes with each weight, then gradient descent nudges the weights in the direction that lowers the loss. Without a loss function, the model would have no signal to learn from.

Loss Functions for Regression

Regression predicts continuous numbers, so loss is based on how far off the predictions are.

Mean Squared Error (MSE)

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

Squares each error before averaging, so large errors are punished much more than small ones. The most common regression loss.
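The formula above can be sketched in a few lines of plain Python. The example data is made up for illustration: both prediction sets have the same total absolute error (4), but MSE scores the one with a single large miss much worse.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


y_true = [10.0, 20.0, 30.0, 40.0]
small_errors = [11.0, 19.0, 31.0, 39.0]   # every prediction off by 1
one_big_error = [10.0, 20.0, 30.0, 44.0]  # one prediction off by 4

print(mse(y_true, small_errors))   # 1.0
print(mse(y_true, one_big_error))  # 4.0
```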

Mean Absolute Error (MAE)

MAE = (1/n) Σ |yᵢ − ŷᵢ|

Averages the absolute errors. Each error contributes in proportion to its size, so a single extreme value cannot dominate the loss the way it can with MSE. That makes MAE more robust to outliers.
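A small sketch of the outlier effect, using invented numbers. The `mse` helper is repeated here only so the example runs on its own. Replacing one target with an extreme value raises MAE noticeably but inflates MSE by orders of magnitude.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error, included for comparison."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


y_pred = [11.0, 19.0, 31.0, 39.0]
y_clean = [10.0, 20.0, 30.0, 40.0]     # every prediction is off by 1
y_outlier = [10.0, 20.0, 30.0, 140.0]  # last target is an outlier

print(mae(y_clean, y_pred), mse(y_clean, y_pred))      # 1.0 1.0
print(mae(y_outlier, y_pred), mse(y_outlier, y_pred))  # 26.0 2551.0
```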

Huber Loss

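A hybrid controlled by a threshold δ. It is quadratic (like MSE) for errors smaller than δ and linear (like MAE) for larger ones, so it stays smooth near zero while limiting the influence of outliers.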
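A minimal sketch of the standard Huber formulation: for an error e and threshold `delta`, the per-example loss is 0.5·e² when |e| ≤ delta and delta·(|e| − 0.5·delta) otherwise. The default `delta=1.0` is an assumption for illustration; in practice it is a tuning parameter.

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = abs(y - p)
        if err <= delta:
            total += 0.5 * err ** 2              # MSE-like region
        else:
            total += delta * (err - 0.5 * delta)  # MAE-like region
    return total / len(y_true)


print(huber([0.0], [0.5]))   # 0.125 (small error, quadratic branch)
print(huber([0.0], [10.0]))  # 9.5   (large error, linear branch)
```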

Loss  | Behaviour                     | Best when
MSE   | Punishes large errors heavily | Outliers are rare and large mistakes matter
MAE   | Treats errors evenly          | Data has outliers
Huber | Mix of both                   | You want balance

Loss Functions for Classification

Classification predicts categories, so loss is based on how good the predicted probabilities are.

Binary Cross-Entropy (Log Loss)

Loss = −(1/n) Σ [ yᵢ·log(ŷᵢ) + (1−yᵢ)·log(1−ŷᵢ) ]

Used for two-class problems (yes/no). It heavily penalises predictions that are confident and wrong.
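The formula can be sketched as follows. The clipping constant `eps` is an added assumption, not part of the formula: it keeps log(0) from occurring when a predicted probability is exactly 0 or 1. The three calls show how the loss grows as a prediction for a true positive moves from confident and right to confident and wrong.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# True label is 1 in every case; only the predicted probability changes.
confident_right = binary_cross_entropy([1], [0.99])
unsure = binary_cross_entropy([1], [0.6])
confident_wrong = binary_cross_entropy([1], [0.01])

print(round(confident_right, 3))  # 0.01
print(round(unsure, 3))           # 0.511
print(round(confident_wrong, 3))  # 4.605
```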

Categorical Cross-Entropy

The multi-class version, used with a Softmax output layer for problems with three or more classes.
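As a rough sketch, softmax turns raw scores (logits) into probabilities that sum to 1, and the per-example loss is the negative log of the probability assigned to the true class. The function names and the example logits are illustrative; `eps` is an assumed safeguard against log(0).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(max(probs[true_class], eps))


logits = [2.0, 1.0, 0.1]  # scores for three classes
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

# Low loss if the true class is the one the model favours, high otherwise.
print(round(categorical_cross_entropy(0, probs), 3))  # 0.417
print(round(categorical_cross_entropy(2, probs), 3))  # 2.317
```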

Hinge Loss

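Used mainly with Support Vector Machines for classification. It expects labels of −1 and +1 and a raw score rather than a probability, and it is zero only when an example sits on the correct side of the decision boundary with a margin of at least 1.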
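A minimal sketch of the usual hinge formulation, max(0, 1 − y·s), where y is a label in {−1, +1} and s is the model's raw score. The function name and sample scores are illustrative.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss for labels in {-1, +1} and raw (unbounded) scores."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)) / len(y_true)


print(hinge_loss([1], [2.0]))   # 0.0 (correct, outside the margin)
print(hinge_loss([1], [0.5]))   # 0.5 (correct, but inside the margin)
print(hinge_loss([-1], [0.5]))  # 1.5 (wrong side of the boundary)
```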

Loss                      | Used for
Binary Cross-Entropy      | Two-class classification
Categorical Cross-Entropy | Multi-class classification
Hinge Loss                | SVMs

How Loss Connects to Training

The loss function sits at the heart of the training loop:

  1. The model makes a prediction (forward pass).
  2. The loss function measures how wrong it is.
  3. Backpropagation calculates how each weight affected the loss.
  4. Gradient descent adjusts the weights to reduce the loss.
  5. Repeat over many epochs until the loss stops improving.
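The loop above can be sketched with the smallest possible model: a single weight `w` in y = w·x, trained with MSE. This is a toy illustration under stated assumptions. The data, learning rate, and epoch count are invented, and with only one weight the gradient is written out by hand rather than computed by backpropagation through layers.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the ideal weight is 2

w = 0.0               # initial weight
learning_rate = 0.01
losses = []

for epoch in range(200):
    # 1. Forward pass: make predictions with the current weight.
    preds = [w * x for x in xs]

    # 2. Loss: mean squared error between targets and predictions.
    loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)
    losses.append(loss)

    # 3. Gradient of the loss with respect to w (derived by hand for this model).
    grad = sum(-2.0 * x * (y - p) for x, y, p in zip(xs, ys, preds)) / len(xs)

    # 4. Gradient descent: step against the gradient to reduce the loss.
    w -= learning_rate * grad

print(losses[0])    # 30.0
print(round(w, 3))  # 2.0
```

The loss starts at 30.0 with `w = 0` and shrinks towards zero as `w` approaches 2, which is the sense in which training means finding the weights that minimise the loss.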

📌 The big picture: "Training a model" really means "finding the weights that minimise the loss function."

Choosing the Right Loss Function

A quick guide:

  • Regression: MSE by default; use MAE if your data has outliers.
  • Binary classification: Binary Cross-Entropy.
  • Multi-class classification: Categorical Cross-Entropy.

Matching the loss function to the task is essential — the wrong choice tells the model to optimise for the wrong thing.

Summary

  • A loss function measures how wrong a model's predictions are, as a single number.
  • Loss is the error for one example; cost is the average loss over the dataset.
  • Regression uses MSE, MAE, or Huber; classification uses Cross-Entropy or Hinge loss.
  • MSE punishes large errors heavily; MAE is more robust to outliers.
  • Training works by minimising the loss through backpropagation and gradient descent — choosing the right loss for the task is critical.