In the previous article, we learned how Logistic Regression predicts probabilities using the Sigmoid Function.
For example:
| Actual Class | Predicted Probability |
|---|---|
| 1 | 0.95 |
| 1 | 0.80 |
| 0 | 0.10 |
These predictions appear good because the probabilities are close to the true labels.
But consider another model:
| Actual Class | Predicted Probability |
|---|---|
| 1 | 0.51 |
| 1 | 0.55 |
| 0 | 0.49 |
This model still predicts the correct classes, but it is much less confident.
This raises an important question:
How do we measure the quality of probability predictions?
In Regression problems, we used:
- MAE
- MSE
- RMSE
However, classification models require a different approach.
The most widely used loss function for classification is:
Cross Entropy Loss
It is the loss function behind:
- Logistic Regression
- Neural Networks
- Deep Learning Models
- Binary Classification Systems
- Multi-Class Classification Systems
What is Cross Entropy Loss?
Cross Entropy Loss measures how different the predicted probabilities are from the actual labels.
It evaluates:
Actual Outcome
vs
Predicted Probability
The better the probability prediction,
the lower the loss.
Why Do We Need Cross Entropy Loss?
Suppose:
Actual Label:
1
Model A:
Predicted Probability = 0.95
Model B:
Predicted Probability = 0.55
Both predict Class 1.
However:
Model A is much more confident.
Cross Entropy Loss assigns Model A a much lower loss than Model B.
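A quick numeric check makes the difference concrete. This is a minimal sketch that assumes the natural logarithm and uses the simplified loss for a positive example, -log(p), which is derived later in this article; the variable names are illustrative only.

```python
import math

# Loss for a sample whose actual label is 1: L = -log(p)
loss_model_a = -math.log(0.95)  # Model A predicts 0.95
loss_model_b = -math.log(0.55)  # Model B predicts 0.55

print(f"Model A loss: {loss_model_a:.3f}")
print(f"Model B loss: {loss_model_b:.3f}")
```

Model A ends up with a loss of about 0.05 and Model B with about 0.60, even though both predict the same class.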
Why Mean Squared Error is Not Ideal
Some beginners wonder:
"Why not use MSE for classification?"
Example:
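Actual:
1
Prediction:
0.9
MSE:
\[
(1 - 0.9)^2 = 0.01
\]
While MSE works mathematically, it creates optimization difficulties and slower learning for classification models. Combined with the Sigmoid Function, its gradient shrinks toward zero whenever the predicted probability is close to 0 or 1, even when the prediction is wrong.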
Cross Entropy Loss provides better gradients and faster learning.
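One way to see the gradient difference is to compare the derivative of each loss with respect to the raw score z, assuming the probability comes from a sigmoid, p = sigmoid(z). The sketch below uses the standard closed-form derivatives for a single sample; the function names are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(z, y):
    # d/dz of (p - y)^2 with p = sigmoid(z): 2 * (p - y) * p * (1 - p)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def ce_grad(z, y):
    # d/dz of binary cross entropy with p = sigmoid(z): p - y
    return sigmoid(z) - y

# Actual label is 1; a very negative z means a confidently wrong prediction
for z in (-6.0, -2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  MSE grad={mse_grad(z, 1):+.4f}  CE grad={ce_grad(z, 1):+.4f}")
```

At z = -6 the prediction is confidently wrong, yet the MSE gradient is roughly 200 times smaller than the Cross Entropy gradient, so a model trained with MSE would correct this mistake much more slowly.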
Understanding Probability Quality
Consider:
Actual Label:
1
Prediction:
0.99
Excellent prediction.
Now:
Prediction:
0.51
Still correct, but barely.
A good loss function should:
- Reward confident correct predictions
- Penalize confident wrong predictions
Cross Entropy does exactly this.
Binary Classification Setup
Suppose:
Pass = 1
Fail = 0
Actual Labels:
| Student | Actual |
|---|---|
| A | 1 |
| B | 0 |
The model predicts probabilities.
Binary Cross Entropy Formula
The Binary Cross Entropy Loss is:
\[
L = -\bigl(y \log(p) + (1 - y)\log(1 - p)\bigr)
\]
Where:
- y = Actual Label (0 or 1)
- p = Predicted Probability of Class 1
- L = Loss
- log = Natural Logarithm
This is the most common loss function for binary classification.
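The formula translates almost directly into code. The following is a minimal sketch for a single sample, not a library implementation; the function name and the `eps` clipping value are assumptions added here so that `log()` never receives exactly 0.

```python
import math

def binary_cross_entropy(y, p, eps=1e-15):
    """Loss for one sample: y is the actual label (0 or 1), p the predicted probability."""
    # Clip p away from exactly 0 and 1 (eps is an assumed safeguard, not part of the formula)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Student A passed (y = 1), Student B failed (y = 0); both get a predicted probability of 0.9
loss_student_a = binary_cross_entropy(1, 0.9)
loss_student_b = binary_cross_entropy(0, 0.9)
print(round(loss_student_a, 3), round(loss_student_b, 3))
```

The same predicted probability of 0.9 gives a small loss (about 0.105) when the student actually passed and a much larger one (about 2.303) when the student actually failed.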
Understanding the Formula
The formula looks intimidating at first.
Let's simplify it.
There are only two cases:
Case 1: Actual Class = 1
Formula becomes:
\[
L = -\log(p)
\]
Case 2: Actual Class = 0
Formula becomes:
\[
L = -\log(1 - p)
\]
These two equations are much easier to understand.
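Substituting the two possible labels into the Binary Cross Entropy formula shows where the two cases come from. This is a worked substitution using the same symbols y, p, and L:

\[
y = 1:\quad L = -\bigl(1 \cdot \log(p) + (1 - 1)\log(1 - p)\bigr) = -\log(p)
\]
\[
y = 0:\quad L = -\bigl(0 \cdot \log(p) + (1 - 0)\log(1 - p)\bigr) = -\log(1 - p)
\]

In each case one of the two terms is multiplied by zero and disappears, so only the term for the actual class contributes to the loss.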
Example 1: Correct Prediction
Actual:
1
Predicted:
0.99
Loss:
\[
-\log(0.99) \approx 0.01
\]
Very small loss.
The model is rewarded.
Example 2: Moderate Prediction
Actual:
1
Predicted:
0.70
Loss:
\[
-\log(0.70) \approx 0.357
\]
Higher loss.
Less confidence.
Example 3: Wrong Prediction
Actual:
1
Predicted:
0.01
Loss:
\[
-\log(0.01) \approx 4.605
\]
Huge loss.
The model is heavily penalized.
# Why Logarithms Are Used
The logarithm creates an important behavior.
Correct predictions:
```text
Probability → 1
Loss → 0
```
Incorrect predictions:
```text
Probability → 0
Loss → Very Large
```
This strongly discourages confident mistakes.
Visualizing Loss
For Actual Class = 1:
```text
Loss
^
|
|\
| \
|  \
|   \
|    \____
+------------------>
        Probability
```
As probability increases,
loss decreases rapidly.
Understanding the Penalty
Consider:
| Predicted Probability (Actual = 1) | Loss |
|---|---|
| 0.99 | 0.01 |
| 0.90 | 0.11 |
| 0.80 | 0.22 |
| 0.50 | 0.69 |
| 0.10 | 2.30 |
| 0.01 | 4.61 |
Notice:
Wrong confident predictions receive massive penalties.
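The penalty values can be recomputed in a few lines. This is a small sketch assuming the actual class is 1 and the natural logarithm, so the loss for each prediction is -log(p); the list names are illustrative.

```python
import math

# Predicted probabilities for a sample whose actual class is 1
probabilities = [0.99, 0.90, 0.80, 0.50, 0.10, 0.01]

# Loss for each prediction: L = -log(p)
losses = [-math.log(p) for p in probabilities]

for p, loss in zip(probabilities, losses):
    print(f"p = {p:.2f}  loss = {loss:.2f}")
```

The loss grows slowly while the prediction stays on the correct side of 0.5 and then climbs sharply as the probability approaches 0.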
Why This is Useful
Suppose two models:
Model A:
Probability = 0.95
Model B:
Probability = 0.55
Both predict the same class.
Cross Entropy identifies that Model A is clearly better.
Average Loss Over Dataset
For multiple observations:
Cross Entropy Loss is averaged.
Formula:
\[
J = -\frac{1}{m} \sum_{i=1}^{m} \bigl[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \bigr]
\]
Where:
- m = Number of samples
This becomes the cost function optimized during training.
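As an illustration, the averaged loss can be applied to the two sets of predictions from the tables at the start of this article. This is a minimal sketch; the function and variable names are assumptions made for this example, and no clipping is applied because none of the probabilities is exactly 0 or 1.

```python
import math

def mean_binary_cross_entropy(y_true, y_prob):
    """Average binary cross entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

y_true = [1, 1, 0]
confident_probs = [0.95, 0.80, 0.10]  # first table
hesitant_probs = [0.51, 0.55, 0.49]   # second table

loss_confident = mean_binary_cross_entropy(y_true, confident_probs)
loss_hesitant = mean_binary_cross_entropy(y_true, hesitant_probs)
print(f"{loss_confident:.3f} {loss_hesitant:.3f}")
```

Both models classify every sample correctly, but the averaged loss is roughly 0.13 for the confident model and roughly 0.65 for the hesitant one.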
Relationship with Logistic Regression
Logistic Regression:
Step 1:
Compute the linear score z.
Step 2:
Apply Sigmoid to obtain the probability p.
Step 3:
Calculate Cross Entropy Loss.
Step 4:
Use Gradient Descent to reduce the loss.
Workflow:
Features
↓
Linear Equation
↓
Sigmoid Function
↓
Probability
↓
Cross Entropy Loss
↓
Gradient Descent
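The workflow above can be sketched end to end in plain Python. This is a simplified illustration under several assumptions: a single feature, a hypothetical toy dataset (hours studied versus pass or fail), a fixed learning rate of 0.1, and 1000 iterations. It is not how a production library implements Logistic Regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, xs, ys):
    """Average binary cross entropy for the model p = sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)

# Hypothetical toy data: hours studied -> pass (1) or fail (0)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0        # start with all parameters at zero
learning_rate = 0.1    # assumed value for this sketch
initial_loss = mean_loss(w, b, xs, ys)

for _ in range(1000):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)   # linear equation, then sigmoid
        grad_w += (p - y) * x    # gradient of the loss with respect to w
        grad_b += (p - y)        # gradient of the loss with respect to b
    # Gradient Descent update using the averaged gradients
    w -= learning_rate * grad_w / len(xs)
    b -= learning_rate * grad_b / len(xs)

final_loss = mean_loss(w, b, xs, ys)
print(f"loss before: {initial_loss:.3f}, loss after: {final_loss:.3f}")
```

With all parameters at zero every prediction is 0.5, so the initial loss is log(2), about 0.693; each Gradient Descent step then lowers the Cross Entropy Loss.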
Multi-Class Classification
For more than two classes, we use:
Categorical Cross Entropy
Examples:
- Cat
- Dog
- Horse
- Bird
The model predicts probabilities for all classes.
The loss evaluates how close those probabilities are to the correct class.
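For a one-hot encoded label, Categorical Cross Entropy is commonly written as the negative sum of y_k * log(p_k) over all classes, which reduces to -log of the probability assigned to the correct class. The sketch below assumes that definition; the predicted probabilities are made-up values for illustration.

```python
import math

def categorical_cross_entropy(y_one_hot, probs):
    """-sum(y_k * log(p_k)); only the correct class (y_k = 1) contributes."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, probs) if y > 0)

classes = ["Cat", "Dog", "Horse", "Bird"]
probs = [0.70, 0.15, 0.10, 0.05]  # hypothetical model output, sums to 1

loss_if_cat = categorical_cross_entropy([1, 0, 0, 0], probs)   # actual class: Cat
loss_if_bird = categorical_cross_entropy([0, 0, 0, 1], probs)  # actual class: Bird
print(f"{loss_if_cat:.3f} {loss_if_bird:.3f}")
```

If the image really is a Cat, the loss is small (about 0.357); if it is actually a Bird, the same prediction costs about 2.996.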
Why Cross Entropy Works So Well
Cross Entropy has several advantages:
- Strong penalties for wrong predictions
- Smooth gradients
- Fast optimization
- Probability-based evaluation
- Works naturally with Logistic Regression
Python Example
Using Scikit-Learn:
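from sklearn.metrics import log_loss

y_true = [1, 1, 0]                # actual labels (example values)
y_pred_prob = [0.95, 0.80, 0.10]  # predicted probabilities of class 1

loss = log_loss(y_true, y_pred_prob)  # mean cross entropy over all samples
print(loss)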
TensorFlow Example
from tensorflow.keras.losses import BinaryCrossentropy
loss_fn = BinaryCrossentropy()
PyTorch Example
import torch.nn as nn
criterion = nn.BCELoss()
Example: Spam Detection
Actual:
Spam = 1
Predicted Probabilities:
Model A:
0.95
Model B:
0.55
Cross Entropy Loss prefers Model A because it demonstrates greater confidence.
Example: Disease Prediction
Patient:
Actually has disease.
Predictions:
| Model | Probability |
|---|---|
| A | 0.98 |
| B | 0.60 |
Model A receives much lower loss.
Advantages of Cross Entropy Loss
- Ideal for classification
- Probability-aware
- Differentiable
- Works well with Gradient Descent
- Encourages confident correct predictions
Limitations of Cross Entropy Loss
- Sensitive to mislabeled data
- Large penalties can sometimes amplify noisy labels
- Less interpretable than accuracy
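The sensitivity to mislabeled data can be illustrated with a small experiment. The sketch below uses hypothetical predictions from a well-trained model and flips a single label to imitate a labeling error; all names and values are assumptions for this example.

```python
import math

def mean_bce(y_true, y_prob):
    """Average binary cross entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_prob = [0.98, 0.97, 0.99, 0.02, 0.01]  # hypothetical confident predictions
clean_labels = [1, 1, 1, 0, 0]
noisy_labels = [1, 1, 1, 0, 1]           # last label flipped (simulated labeling error)

loss_clean = mean_bce(clean_labels, y_prob)
loss_noisy = mean_bce(noisy_labels, y_prob)
print(f"clean: {loss_clean:.3f}, noisy: {loss_noisy:.3f}")
```

One flipped label out of five raises the average loss from about 0.02 to about 0.94, because the model's confident prediction is now treated as a confident mistake.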
Common Mistakes
Using Accuracy as a Loss Function
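Accuracy cannot be optimized directly with Gradient Descent: it changes only when a prediction crosses the decision threshold, so its gradient is zero almost everywhere.
Cross Entropy solves this problem by providing a smooth, differentiable measure of the same predictions.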
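A small experiment shows why accuracy gives Gradient Descent nothing to work with. The sketch assumes a one-parameter model p = sigmoid(w * x), a 0.5 decision threshold, and a hypothetical four-point dataset.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w, xs, ys):
    # Predict class 1 when the probability is at least 0.5
    correct = sum(1 for x, y in zip(xs, ys) if (sigmoid(w * x) >= 0.5) == (y == 1))
    return correct / len(xs)

def mean_bce(w, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)

# Hypothetical toy data
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for w in (0.5, 1.0, 2.0):
    print(f"w={w}: accuracy={accuracy(w, xs, ys):.2f}, loss={mean_bce(w, xs, ys):.3f}")
```

Accuracy stays at 1.00 for every value of w tried here, so it cannot indicate which direction improves the model, while the Cross Entropy Loss keeps decreasing as w grows.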
Confusing Loss with Accuracy
Low loss usually indicates good performance.
However:
Low loss does not always mean perfect classification.
Ignoring Probabilities
Two models with identical accuracy may have very different losses.
Best Practices
- Use Binary Cross Entropy for binary classification
- Use Categorical Cross Entropy for multi-class classification
- Monitor both loss and accuracy
- Analyze probability outputs
- Use proper validation datasets
Cross Entropy Loss Workflow
- Compute probabilities
- Compare with actual labels
- Calculate loss
- Measure prediction quality
- Apply Gradient Descent
- Update parameters
- Repeat until convergence
Cross Entropy vs MSE
| Property | MSE | Cross Entropy |
|---|---|---|
| Designed for Regression | Yes | No |
| Designed for Classification | No | Yes |
| Probability-Based | No | Yes |
| Faster Learning | No | Yes |
| Common in Deep Learning | Rarely | Extremely Common |
Why Cross Entropy Loss is Important
Cross Entropy Loss is the foundation of modern classification systems. It provides a mathematically sound way to measure the quality of probability predictions and strongly encourages models to make confident and correct decisions.
From Logistic Regression to state-of-the-art Deep Learning systems, Cross Entropy remains one of the most important loss functions in Machine Learning because it directly connects probability estimation with effective learning.
In the next article, we will study the Confusion Matrix, the fundamental evaluation tool used to understand exactly how classification models make correct and incorrect predictions.