In the previous article, we learned how Logistic Regression predicts probabilities using the Sigmoid Function.

For example:

| Actual Class | Predicted Probability |
|---|---|
| 1 | 0.95 |
| 1 | 0.80 |
| 0 | 0.10 |

These predictions appear good because the probabilities are close to the true labels.

But consider another model:

| Actual Class | Predicted Probability |
|---|---|
| 1 | 0.51 |
| 1 | 0.55 |
| 0 | 0.49 |

This model still predicts the correct classes, but it is much less confident.

This raises an important question:

How do we measure the quality of probability predictions?

In Regression problems, we used:

  • MAE
  • MSE
  • RMSE

However, classification models require a different approach.

The most widely used loss function for classification is:

Cross Entropy Loss

It is the loss function behind:

  • Logistic Regression
  • Neural Networks
  • Deep Learning Models
  • Binary Classification Systems
  • Multi-Class Classification Systems

What is Cross Entropy Loss?

Cross Entropy Loss measures how different the predicted probabilities are from the actual labels.

It evaluates:

Actual Outcome
vs
Predicted Probability

The better the probability prediction, the lower the loss.

Why Do We Need Cross Entropy Loss?

Suppose:

Actual Label:

1

Model A:

Predicted Probability = 0.95

Model B:

Predicted Probability = 0.55

Both predict Class 1.

However:

Model A is much more confident.

Cross Entropy Loss reflects this: it assigns Model A a lower loss than Model B.

Why Mean Squared Error is Not Ideal

Some beginners wonder:

"Why not use MSE for classification?"

Example:

Actual:

1

Prediction:

0.9

MSE:

\[
(1-0.9)^2 = 0.01
\]

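MSE is a valid measure of error here, but paired with a sigmoid output it makes optimization harder: the loss is no longer convex in the model's weights, and its gradient shrinks toward zero when the sigmoid saturates, even if the prediction is badly wrong.

Cross Entropy Loss avoids this. For Logistic Regression, its gradient with respect to the linear score is simply the predicted probability minus the actual label, so confidently wrong predictions produce large updates and learning is faster.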
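The difference in gradients can be checked numerically. The sketch below is illustrative only: it assumes a single sigmoid output with raw score `z` and actual label 1, and the helper names are made up for this example.

```python
import math

def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_z(z, y):
    """Derivative of (y - sigmoid(z))**2 with respect to z (chain rule)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def cross_entropy_grad_wrt_z(z, y):
    """Derivative of -(y*log(p) + (1-y)*log(1-p)) with respect to z, p = sigmoid(z)."""
    return sigmoid(z) - y

y = 1  # actual label
for z in (-5.0, -2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  p={sigmoid(z):.4f}  "
          f"MSE grad={mse_grad_wrt_z(z, y):+.4f}  "
          f"CE grad={cross_entropy_grad_wrt_z(z, y):+.4f}")
```

At `z = -5` the model is confidently wrong, yet the MSE gradient is close to zero while the Cross Entropy gradient is close to -1.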

Understanding Probability Quality

Consider:

Actual Label:

1

Prediction:

0.99

Excellent prediction.

Now:

Prediction:

0.51

Still correct, but barely.

A good loss function should:

  • Reward confident correct predictions
  • Penalize confident wrong predictions

Cross Entropy does exactly this.

Binary Classification Setup

Suppose:

Pass = 1

Fail = 0

Actual Labels:

| Student | Actual |
|---|---|
| A | 1 |
| B | 0 |

The model predicts probabilities.

Binary Cross Entropy Formula

The Binary Cross Entropy Loss is:

\[
L = -\bigl(y\log(p) + (1-y)\log(1-p)\bigr)
\]

Where:

  • y = Actual Label (0 or 1)
  • p = Predicted Probability of class 1
  • L = Loss
  • log = natural logarithm (the worked examples in this article use it)

This is the most common loss function for binary classification.
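As a rough sketch, the formula translates almost directly into Python. The function name and the `eps` clipping value are assumptions made for this example; the clipping only exists so that `log` never receives exactly 0.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example: -(y*log(p) + (1-y)*log(1-p)), natural log."""
    # Clip p away from exactly 0 and 1 so log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(round(binary_cross_entropy(1, 0.9), 4))  # actual label 1, predicted 0.9
print(round(binary_cross_entropy(0, 0.1), 4))  # actual label 0, predicted 0.1
```

Both calls give the same loss, because in each case the model assigns probability 0.9 to the class that actually occurred.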

Understanding the Formula

The formula looks intimidating at first.

Let's simplify it.

There are only two cases:

Case 1: Actual Class = 1

Formula becomes:

\[
L = -\log(p)
\]

Case 2: Actual Class = 0

Formula becomes:

\[
L = -\log(1-p)
\]

These two equations are much easier to understand.
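Both cases follow from substituting the label into the general formula, because one of the two terms is always multiplied by zero:

```latex
\[
y = 1:\quad L = -\bigl(1\cdot\log(p) + 0\cdot\log(1-p)\bigr) = -\log(p)
\]
\[
y = 0:\quad L = -\bigl(0\cdot\log(p) + 1\cdot\log(1-p)\bigr) = -\log(1-p)
\]
```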

Example 1: Correct Prediction

Actual:

1

Predicted:

0.99

Loss:

\[
-\log(0.99) \approx 0.01
\]

Very small loss.

The model is rewarded.

Example 2: Moderate Prediction

Actual:

1

Predicted:

0.70

Loss:

\[
-\log(0.70) \approx 0.357
\]

Higher loss.

Less confidence.

Example 3: Wrong Prediction

Actual:

1

Predicted:

0.01

Loss:

\[
-\log(0.01) \approx 4.605
\]

Huge loss.

The model is heavily penalized.

Why Logarithms Are Used

The logarithm creates an important behavior.

Correct predictions:

```text
Probability → 1
Loss → 0
```

Incorrect predictions:

```text
Probability → 0
Loss → Very Large
```

This strongly discourages confident mistakes.

Visualizing Loss

For Actual Class = 1:

Loss
^
|\
| \
|  \
|   \
|    \____
+------------------>
Probability

As probability increases, loss decreases rapidly.

Understanding the Penalty

Consider:

| Predicted Probability (Actual Class = 1) | Loss |
|---|---|
| 0.99 | 0.01 |
| 0.90 | 0.11 |
| 0.80 | 0.22 |
| 0.50 | 0.69 |
| 0.10 | 2.30 |
| 0.01 | 4.61 |

Notice:

Wrong confident predictions receive massive penalties.
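The losses can be recomputed with a few lines of Python. This is a minimal sketch assuming the actual class is 1 and the natural logarithm is used; values are rounded to two decimals, so a last digit may differ slightly from a hand-rounded table.

```python
import math

# Predicted probabilities for the true class (actual label = 1).
predictions = [0.99, 0.90, 0.80, 0.50, 0.10, 0.01]

# For actual label 1 the loss is simply -log(p).
losses = [-math.log(p) for p in predictions]

for p, loss in zip(predictions, losses):
    print(f"{p:.2f}  {loss:.2f}")
```

Halving the probability from 0.99 to 0.50 adds about 0.68 to the loss, while dropping from 0.10 to 0.01 adds about 2.3: the penalty grows faster the more confidently wrong the model is.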

Why This is Useful

Suppose two models:

Model A:

Probability = 0.95

Model B:

Probability = 0.55

Both predict the same class.

Cross Entropy identifies that Model A is clearly better.

Average Loss Over Dataset

For multiple observations:

Cross Entropy Loss is averaged.

Formula:

\[
J = -\frac{1}{m}\sum_{i=1}^{m}\bigl[y_i\log(p_i) + (1-y_i)\log(1-p_i)\bigr]
\]

Where:

  • m = Number of samples

This becomes the cost function optimized during training.
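A minimal sketch of the averaged loss, assuming plain Python lists as input. The function name, the `eps` clipping value, and the four sample predictions are hypothetical and chosen only for illustration.

```python
import math

def mean_binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]

cost = mean_binary_cross_entropy(y_true, y_prob)
print(round(cost, 4))
```

The two hesitant predictions (0.6 and 0.4) contribute most of the total, which is why the average lands near 0.34 rather than near the 0.1 of the most confident sample.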

Relationship with Logistic Regression

Logistic Regression:

Step 1:

Compute the linear score z from the features.

Step 2:

Apply Sigmoid to z to obtain the probability p.

Step 3:

Calculate Cross Entropy Loss.

Step 4:

Use Gradient Descent to reduce the loss.

Workflow:

Features → Linear Equation → Sigmoid Function → Probability → Cross Entropy Loss → Gradient Descent
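The whole workflow fits in a short training loop. The sketch below is not a production implementation: the toy data (hours studied versus pass/fail), the single feature, the learning rate, and the number of steps are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, xs, ys):
    """Average binary cross entropy for parameters w and b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)

# Hypothetical toy data: hours studied and whether the student passed.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0        # start with an uninformative model
learning_rate = 0.1    # assumed step size
initial_loss = mean_loss(w, b, hours, passed)

for _ in range(2000):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(hours, passed):
        p = sigmoid(w * x + b)   # linear equation, then sigmoid
        grad_w += (p - y) * x    # gradient of the loss with respect to w
        grad_b += (p - y)        # gradient of the loss with respect to b
    # Gradient descent update using the average gradient.
    w -= learning_rate * grad_w / len(hours)
    b -= learning_rate * grad_b / len(hours)

final_loss = mean_loss(w, b, hours, passed)
print(f"loss before: {initial_loss:.3f}, loss after: {final_loss:.3f}")
```

The starting loss equals log(2), about 0.693, because an untrained model predicts 0.5 for every student. Training should drive it well below that.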

Multi-Class Classification

For more than two classes, we use:

Categorical Cross Entropy

Examples:

  • Cat
  • Dog
  • Horse
  • Bird

The model predicts probabilities for all classes.

The loss evaluates how close those probabilities are to the correct class.
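When each example has exactly one correct class, Categorical Cross Entropy reduces to the negative log of the probability assigned to that class. The sketch below assumes this one-hot setting; the function name and the two sets of model outputs are hypothetical.

```python
import math

classes = ["Cat", "Dog", "Horse", "Bird"]

def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """Return -log of the probability assigned to the correct class."""
    p_correct = probs[classes.index(true_class)]
    return -math.log(max(p_correct, eps))  # eps guards against log(0)

# Hypothetical model outputs for one image whose true class is "Dog".
good = [0.05, 0.85, 0.05, 0.05]
poor = [0.40, 0.30, 0.20, 0.10]

good_loss = categorical_cross_entropy("Dog", good)
poor_loss = categorical_cross_entropy("Dog", poor)
print(round(good_loss, 3), round(poor_loss, 3))
```

Only the probability given to the correct class enters the loss, so the model that puts 0.85 on "Dog" is penalized far less than the one that puts 0.30 on it.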

Why Cross Entropy Works So Well

Cross Entropy has several advantages:

  • Strong penalties for wrong predictions
  • Smooth gradients
  • Fast optimization
  • Probability-based evaluation
  • Works naturally with Logistic Regression

Python Example

Using Scikit-Learn:

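from sklearn.metrics import log_loss

# Sample labels and predicted probabilities of class 1
# (values taken from the first example in this article)
y_true = [1, 1, 0]
y_pred_prob = [0.95, 0.80, 0.10]

# log_loss returns the average cross entropy over all samples
loss = log_loss(y_true, y_pred_prob)

print(loss)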

TensorFlow Example

from tensorflow.keras.losses import BinaryCrossentropy

# By default (from_logits=False) the loss expects probabilities, not raw scores
loss_fn = BinaryCrossentropy()

PyTorch Example

import torch.nn as nn

# BCELoss expects probabilities (sigmoid outputs), not raw scores
criterion = nn.BCELoss()

Example: Spam Detection

Actual:

Spam = 1

Predicted Probabilities:

Model A:

0.95

Model B:

0.55

Cross Entropy Loss prefers Model A: its loss is about 0.05, compared with about 0.60 for Model B.

Example: Disease Prediction

Patient:

Actually has disease.

Predictions:

| Model | Probability |
|---|---|
| A | 0.98 |
| B | 0.60 |

Model A receives a much lower loss: about 0.02, versus about 0.51 for Model B.

Advantages of Cross Entropy Loss

  • Ideal for classification
  • Probability-aware
  • Differentiable
  • Works well with Gradient Descent
  • Encourages confident correct predictions

Limitations of Cross Entropy Loss

  • Sensitive to mislabeled data
  • Large penalties can sometimes amplify noisy labels
  • Less interpretable than accuracy

Common Mistakes

Using Accuracy as a Loss Function

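Accuracy cannot be optimized directly with Gradient Descent: it only changes when a prediction crosses the decision threshold, so its gradient is zero almost everywhere.

Cross Entropy provides a smooth, differentiable substitute.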

Confusing Loss with Accuracy

Low loss usually indicates good performance.

However:

Low loss does not always mean perfect classification.

Ignoring Probabilities

Two models with identical accuracy may have very different losses.
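The two models from the start of this article make this concrete. The sketch below assumes a decision threshold of 0.5; the helper names are made up for this example.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(a == b) for a, b in zip(preds, y_true)) / len(y_true)

def mean_cross_entropy(y_true, y_prob):
    """Average binary cross entropy (natural log)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 1, 0]
confident_model = [0.95, 0.80, 0.10]
hesitant_model = [0.51, 0.55, 0.49]

for name, probs in [("confident", confident_model), ("hesitant", hesitant_model)]:
    print(f"{name}: accuracy={accuracy(y_true, probs):.2f}  "
          f"loss={mean_cross_entropy(y_true, probs):.3f}")
```

Both models classify every sample correctly, but the hesitant model's loss is roughly five times higher.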

Best Practices

  • Use Binary Cross Entropy for binary classification
  • Use Categorical Cross Entropy for multi-class classification
  • Monitor both loss and accuracy
  • Analyze probability outputs
  • Use proper validation datasets

Cross Entropy Loss Workflow

  1. Compute probabilities
  2. Compare with actual labels
  3. Calculate loss
  4. Measure prediction quality
  5. Apply Gradient Descent
  6. Update parameters
  7. Repeat until convergence

Cross Entropy vs MSE

| Property | MSE | Cross Entropy |
|---|---|---|
| Designed for Regression | Yes | No |
| Designed for Classification | No | Yes |
| Probability-Based | No | Yes |
| Faster Learning | No | Yes |
| Common in Deep Learning | Rarely | Extremely Common |

Why Cross Entropy Loss is Important

Cross Entropy Loss is the foundation of modern classification systems. It provides a mathematically sound way to measure the quality of probability predictions and strongly encourages models to make confident and correct decisions.

From Logistic Regression to state-of-the-art Deep Learning systems, Cross Entropy remains one of the most important loss functions in Machine Learning because it directly connects probability estimation with effective learning.

In the next article, we will study the Confusion Matrix, the fundamental evaluation tool used to understand exactly how classification models make correct and incorrect predictions.