Cross Entropy Loss in Machine Learning

Last updated: Jun 13, 2026

Author: Christy Harshitha Dakarapu

In the previous article, we learned how Logistic Regression predicts probabilities using the Sigmoid Function.

For example:

Actual Class	Predicted Probability
1	0.95
1	0.80
0	0.10

These predictions appear good because the probabilities are close to the true labels.

But consider another model:

Actual Class	Predicted Probability
1	0.51
1	0.55
0	0.49

This model still predicts the correct classes, but it is much less confident.

This raises an important question:

How do we measure the quality of probability predictions?

In Regression problems, we used:

MAE
MSE
RMSE

However, classification models require a different approach.

The most widely used loss function for classification is:

Cross Entropy Loss

It is the standard training loss for:

Logistic Regression
Neural Networks
Deep Learning Models
Binary Classification Systems
Multi-Class Classification Systems

What is Cross Entropy Loss?

Cross Entropy Loss measures how different the predicted probabilities are from the actual labels.

It evaluates:


Actual Outcome
       vs
Predicted Probability

The better the probability prediction, the lower the loss.

Why Do We Need Cross Entropy Loss?

Suppose:

Actual Label: 1

Model A:

Predicted Probability = 0.95

Model B:

Predicted Probability = 0.55

Both predict Class 1.

However:

Model A is much more confident.

Cross Entropy Loss captures this difference: Model A receives a much lower loss than Model B.
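To put numbers on this comparison, here is a minimal sketch. It assumes the actual label is 1 and uses the natural logarithm, so the loss for one sample is $-\log(p)$ (the formula is introduced later in this article). The variable names are illustrative only.

```python
import math

# Assumption: the actual label is 1, so the loss for one sample is -ln(p).
model_a_prob = 0.95
model_b_prob = 0.55

loss_a = -math.log(model_a_prob)
loss_b = -math.log(model_b_prob)

print(f"Model A loss: {loss_a:.3f}")  # about 0.051
print(f"Model B loss: {loss_b:.3f}")  # about 0.598
```

Both models pick the same class, but Model B's loss is more than ten times larger.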

Why Mean Squared Error is Not Ideal

Some beginners wonder:

"Why not use MSE for classification?"

Example:

Actual:

1

Prediction:

0.9

MSE:

(1-0.9)^2 = 0.01

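MSE can be computed for classification, but combined with a Sigmoid output it makes optimization harder. The loss is no longer convex in the model parameters, and its gradient contains the factor $p(1-p)$, which shrinks toward zero when the model is confidently wrong. Learning then slows down exactly where the largest correction is needed.

With Cross Entropy Loss, the gradient with respect to the Sigmoid input is simply $p-y$, so a large error produces a large update and learning is faster.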
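The difference in gradients can be checked numerically. The sketch below assumes a single sample with actual label 1 and a Sigmoid output, and compares the derivative of each loss with respect to the Sigmoid input `z`. The function names and the value `z = -5.0` are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_z(y, z):
    # d/dz of (y - p)^2 with p = sigmoid(z): 2 * (p - y) * p * (1 - p)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def ce_grad_wrt_z(y, z):
    # d/dz of -(y*ln(p) + (1-y)*ln(1-p)) with p = sigmoid(z): p - y
    return sigmoid(z) - y

y = 1
z = -5.0  # confidently wrong: sigmoid(-5) is about 0.007

mse_grad = mse_grad_wrt_z(y, z)
ce_grad = ce_grad_wrt_z(y, z)

print(f"MSE gradient:           {mse_grad:.3f}")  # about -0.013
print(f"Cross entropy gradient: {ce_grad:.3f}")   # about -0.993
```

For this confidently wrong prediction, the MSE gradient is tiny while the Cross Entropy gradient stays close to -1. This is the behavior described above.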

Understanding Probability Quality

Consider:

Actual Label: 1

Prediction: 0.99

Excellent prediction.

Now:

Prediction: 0.51

Still correct, but barely.

A good loss function should:

Reward confident correct predictions
Penalize confident wrong predictions

Cross Entropy does exactly this.

Binary Classification Setup

Suppose:


Pass = 1

Fail = 0

Actual Labels:

Student	Actual
A	1
B	0

The model predicts probabilities.

Binary Cross Entropy Formula

The Binary Cross Entropy Loss is:

$L=-(y\log(p)+(1-y)\log(1-p))$

Where:

$y$ = Actual Label
$p$ = Predicted Probability
$L$ = Loss
$\log$ = Natural logarithm (base $e$)

This is the most common loss function for binary classification.
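A minimal Python sketch of this formula for a single sample is shown below. The function name and the small clipping constant `eps` are assumptions added here so that `log` never receives exactly 0. They are not part of the formula itself.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one sample: y is the actual label (0 or 1),
    p is the predicted probability of class 1."""
    # Assumption: clip p away from exactly 0 or 1 to keep log() finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(f"{binary_cross_entropy(1, 0.9):.3f}")  # about 0.105
print(f"{binary_cross_entropy(0, 0.9):.3f}")  # about 2.303
```

The same prediction of 0.9 gives a small loss when the actual label is 1 and a large loss when it is 0.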

Understanding the Formula

The formula looks intimidating at first.

Let's simplify it.

There are only two cases:

Case 1: Actual Class = 1

Formula becomes:

$L=-\log(p)$

Case 2: Actual Class = 0

Formula becomes:

$L=-\log(1-p)$

These two equations are much easier to understand.
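Both cases follow from substituting the label into the Binary Cross Entropy formula. A short worked substitution, using the same symbols:

$y=1:\quad L=-(1\cdot\log(p)+(1-1)\log(1-p))=-\log(p)$

$y=0:\quad L=-(0\cdot\log(p)+(1-0)\log(1-p))=-\log(1-p)$

In each case, one of the two terms is multiplied by zero and disappears.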

Example 1: Correct Prediction

Actual: 1

Predicted: 0.99

Loss:

$L=-\log(0.99)\approx 0.01$

Very small loss.

The model is rewarded.

Example 2: Moderate Prediction

Actual: 1

Predicted: 0.70

Loss:

$L=-\log(0.70)\approx 0.357$

Higher loss.

Less confidence.

Example 3: Wrong Prediction

Actual: 1

Predicted: 0.01

Loss:

$L=-\log(0.01)\approx 4.605$

Huge loss.

The model is heavily penalized.

Why Logarithms Are Used

The logarithm creates an important behavior.

For a sample whose actual class is 1, correct predictions give:

Probability → 1
Loss → 0

Incorrect predictions give:

Probability → 0
Loss → Very Large

This strongly discourages confident mistakes.

Visualizing Loss

For Actual Class = 1:


Loss
 ^
 |
 |\
 | \
 |  \
 |   \
 |     \____
 +------------------>
      Probability

As the predicted probability increases, the loss falls steeply at first and then flattens out, reaching 0 at a probability of 1.

Understanding the Penalty

Consider:

Predicted Probability (Actual Class = 1)	Loss
0.99	0.01
0.90	0.11
0.80	0.22
0.50	0.69
0.10	2.30
0.01	4.61

Notice:

Confidently wrong predictions receive massive penalties.
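The values in the table can be reproduced with a few lines of Python. This sketch assumes the actual class is 1 and the natural logarithm, and prints three decimal places so the rounding is visible.

```python
import math

# Assumption: the actual class is 1, so each loss is -ln(p).
predictions = [0.99, 0.90, 0.80, 0.50, 0.10, 0.01]
losses = [-math.log(p) for p in predictions]

print("Prediction  Loss")
for p, loss in zip(predictions, losses):
    print(f"{p:<10.2f}  {loss:.3f}")
```

Dropping the prediction from 0.99 to 0.50 adds about 0.68 to the loss, while dropping it from 0.50 to 0.01 adds about 3.9.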

Why This is Useful

Suppose two models:

Model A:


Probability = 0.95

Model B:


Probability = 0.55

Both predict the same class.

Cross Entropy identifies that Model A is clearly better.

Average Loss Over Dataset

For multiple observations, the Cross Entropy Loss is averaged over all samples.

Formula:

$J=-\frac{1}{m}\sum_{i=1}^{m}[y_i\log(p_i)+(1-y_i)\log(1-p_i)]$

Where:

$m$ = Number of samples
$y_i$ = Actual label of sample $i$
$p_i$ = Predicted probability for sample $i$

This becomes the cost function optimized during training.
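As a rough sketch, the averaged loss can be computed as follows. The labels and probabilities are taken from the first table in this article. The function name is illustrative, and the natural logarithm is assumed.

```python
import math

def mean_binary_cross_entropy(y_true, y_prob):
    """Average binary cross entropy J over m samples (natural log)."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m

# Labels and probabilities from the first table in this article.
y_true = [1, 1, 0]
y_prob = [0.95, 0.80, 0.10]

cost = mean_binary_cross_entropy(y_true, y_prob)
print(f"{cost:.3f}")  # about 0.127
```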

Relationship with Logistic Regression

Logistic Regression:

Step 1:

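Compute the linear score from the features $x$, weights $w$, and bias $b$:

$z=w^{T}x+b$

Step 2:

Apply Sigmoid:

$p=\sigma(z)=\frac{1}{1+e^{-z}}$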

Step 3:

Calculate Cross Entropy Loss.

Step 4:

Use Gradient Descent to reduce the loss.

Workflow:


Features
    ↓
Linear Equation
    ↓
Sigmoid Function
    ↓
Probability
    ↓
Cross Entropy Loss
    ↓
Gradient Descent
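The first steps of this workflow can be sketched for a single sample as shown below. The weights, bias, and feature values are hypothetical numbers chosen only to make the example concrete.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias, and feature values (illustration only).
weights = [0.8, -0.4]
bias = 0.1
features = [2.0, 1.0]
y = 1  # actual label

z = sum(w * x for w, x in zip(weights, features)) + bias  # linear equation
p = sigmoid(z)                                            # probability
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))     # cross entropy loss

print(f"z = {z:.2f}, p = {p:.3f}, loss = {loss:.3f}")
```

Gradient Descent would then adjust `weights` and `bias` to reduce `loss`.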

Multi-Class Classification

For more than two classes, we use:

Categorical Cross Entropy

Examples:

Cat
Dog
Horse
Bird

The model predicts probabilities for all classes.

The loss evaluates how close those probabilities are to the correct class.
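In its standard form, Categorical Cross Entropy for one sample is $L=-\sum_{k} y_k\log(p_k)$, where $y_k$ is 1 for the correct class and 0 otherwise. The loss therefore reduces to the negative log of the probability assigned to the correct class. The sketch below assumes hypothetical probabilities for one sample whose actual class is Dog. The function name is illustrative.

```python
import math

def categorical_cross_entropy(true_class, probs):
    """Loss for one sample: negative log of the probability
    assigned to the true class (natural log)."""
    return -math.log(probs[true_class])

# Hypothetical predicted probabilities for one sample whose actual class is "Dog".
probs = {"Cat": 0.10, "Dog": 0.70, "Horse": 0.15, "Bird": 0.05}

loss = categorical_cross_entropy("Dog", probs)
print(f"{loss:.3f}")  # about 0.357
```

If the actual class were Bird instead, the same probabilities would give a loss of about 3.0.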

Why Cross Entropy Works So Well

Cross Entropy has several advantages:

Strong penalties for confidently wrong predictions
Smooth gradients
Fast optimization
Probability-based evaluation
Works naturally with Logistic Regression

Python Example

Using Scikit-Learn:


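from sklearn.metrics import log_loss

# Actual labels and predicted probabilities of class 1
# (values from the first table in this article)
y_true = [1, 1, 0]
y_pred_prob = [0.95, 0.80, 0.10]

# log_loss returns the mean cross entropy over all samples
loss = log_loss(y_true, y_pred_prob)

print(loss)  # about 0.127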

TensorFlow Example


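from tensorflow.keras.losses import BinaryCrossentropy

# By default this expects probabilities (for example, Sigmoid outputs);
# pass from_logits=True if the model outputs raw scores instead.
loss_fn = BinaryCrossentropy()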

PyTorch Example


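import torch.nn as nn

# BCELoss expects probabilities (for example, Sigmoid outputs);
# nn.BCEWithLogitsLoss accepts raw scores and applies the Sigmoid itself.
criterion = nn.BCELoss()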

Example: Spam Detection

Actual:


Spam = 1

Predicted Probabilities:

Model A:


0.95

Model B:


0.55

Cross Entropy Loss prefers Model A because it assigns a higher probability to the true class: its loss is about 0.05, compared with about 0.60 for Model B.

Example: Disease Prediction

Patient: actually has the disease (actual label = 1).

Predictions:

Model	Probability
A	0.98
B	0.60

Model A receives a much lower loss: about 0.02, compared with about 0.51 for Model B.

Advantages of Cross Entropy Loss

Ideal for classification
Probability-aware
Differentiable
Works well with Gradient Descent
Encourages confident correct predictions

Limitations of Cross Entropy Loss

Sensitive to mislabeled data
Large penalties on confident predictions can amplify the effect of noisy labels
Less interpretable than accuracy

Common Mistakes

Using Accuracy as a Loss Function

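Accuracy cannot be optimized directly with Gradient Descent. It changes only when a prediction crosses the decision threshold, so its gradient is zero almost everywhere.

Cross Entropy solves this problem by providing a smooth, differentiable loss that improves whenever the predicted probabilities move toward the correct labels.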
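The contrast can be seen on a single sample. The sketch below assumes an actual label of 1 and a decision threshold of 0.5, and sweeps a few hypothetical predicted probabilities.

```python
import math

y = 1            # actual label
threshold = 0.5  # assumed decision threshold
rows = []

for p in (0.30, 0.45, 0.49, 0.51, 0.55, 0.70):
    correct = int(p >= threshold) == y   # True or False, nothing in between
    loss = -math.log(p)                  # cross entropy for y = 1
    rows.append((p, correct, loss))
    print(f"p = {p:.2f}  correct = {correct}  loss = {loss:.3f}")
```

Correctness flips once, between 0.49 and 0.51, and is flat everywhere else. The loss decreases at every step, which gives Gradient Descent a direction to follow.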

Confusing Loss with Accuracy

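Low loss usually indicates good performance.

However:

A low average loss does not guarantee that every sample is classified correctly, and a model with high accuracy can still have a relatively high loss if its correct predictions are not confident.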

Ignoring Probabilities

Two models with identical accuracy may have very different losses.
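The two models from the tables at the start of this article illustrate this. The sketch below assumes a decision threshold of 0.5 and the natural logarithm. The helper names and model labels are illustrative.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def mean_bce(y_true, y_prob):
    """Average binary cross entropy (natural log)."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))
    return -total / len(y_true)

y_true = [1, 1, 0]
confident_model = [0.95, 0.80, 0.10]
hesitant_model = [0.51, 0.55, 0.49]

for name, probs in [("confident", confident_model), ("hesitant", hesitant_model)]:
    print(f"{name}: accuracy = {accuracy(y_true, probs):.2f}, "
          f"loss = {mean_bce(y_true, probs):.3f}")
```

Both models reach an accuracy of 1.00, yet the losses are about 0.127 and 0.648.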

Best Practices

Use Binary Cross Entropy for binary classification
Use Categorical Cross Entropy for multi-class classification
Monitor both loss and accuracy
Analyze probability outputs
Use proper validation datasets

Cross Entropy Loss Workflow

Compute probabilities
Compare with actual labels
Calculate loss
Measure prediction quality
Apply Gradient Descent
Update parameters
Repeat until convergence
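The workflow above can be sketched as a small training loop. The dataset (hours studied versus pass or fail), the learning rate, and the fixed number of steps are all assumptions made for illustration. A real implementation would usually stop based on a convergence check rather than a fixed step count.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_bce(y_true, y_prob):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical one-feature dataset: hours studied -> pass (1) or fail (0).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0       # parameters, starting at zero
learning_rate = 0.1   # assumed step size
history = []

for step in range(1000):
    probs = [sigmoid(w * x + b) for x in hours]   # compute probabilities
    history.append(mean_bce(passed, probs))       # calculate loss
    # Gradients of the mean loss: averages of (p - y) * x and (p - y).
    grad_w = sum((p - y) * x for p, y, x in zip(probs, passed, hours)) / len(hours)
    grad_b = sum(p - y for p, y in zip(probs, passed)) / len(hours)
    w -= learning_rate * grad_w                   # update parameters
    b -= learning_rate * grad_b

print(f"first loss = {history[0]:.3f}, last loss = {history[-1]:.3f}")
```

The first recorded loss is $\log(2)\approx 0.693$, because every probability starts at 0.5. The loss then falls as the parameters are updated.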

Cross Entropy vs MSE

Property	MSE	Cross Entropy
Designed for Regression	Yes	No
Designed for Classification	No	Yes
Probability-Based	No	Yes
Faster Learning on Classification Tasks	No	Yes
Common in Deep Learning	Rarely	Extremely Common

Why Cross Entropy Loss is Important

Cross Entropy Loss is the foundation of modern classification systems. It provides a mathematically sound way to measure the quality of probability predictions and strongly encourages models to make confident and correct decisions.

From Logistic Regression to state-of-the-art Deep Learning systems, Cross Entropy remains one of the most important loss functions in Machine Learning because it directly connects probability estimation with effective learning.

In the next article, we will study the Confusion Matrix, the fundamental evaluation tool used to understand exactly how classification models make correct and incorrect predictions.