Precision, Recall, and F1 Score

Last updated: Jun 13, 2026

Author: Christy Harshitha Dakarapu

In the previous article, we learned about the Confusion Matrix, which breaks classification predictions into:

True Positives (TP)
True Negatives (TN)
False Positives (FP)
False Negatives (FN)

We also discovered an important problem:

A model can achieve very high accuracy and still be practically useless.

Consider a fraud detection system:

Transaction Type	Count
Genuine	990
Fraud	10

Suppose a model predicts:


Everything is Genuine

Accuracy:

$\frac{990}{1000} = 99\%$

This looks impressive.

However, the model detects 0 of the 10 frauds.

Clearly, accuracy alone cannot tell the complete story.
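The scenario above can be reproduced in a few lines. This is a minimal sketch in plain Python; the label encoding (1 = fraud, 0 = genuine) and the variable names are assumptions made for illustration.

```python
# Assumed encoding for illustration: 1 = fraud, 0 = genuine.
y_true = [0] * 990 + [1] * 10      # 990 genuine, 10 fraud
y_pred = [0] * len(y_true)         # a model that predicts "genuine" for everything

# Accuracy: share of predictions that match the actual label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Frauds that the model actually flagged as fraud
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.0%}")
print(f"Frauds caught: {frauds_caught} of {sum(y_true)}")
```

The accuracy is high only because the genuine class dominates the data; the count of caught frauds shows what the single number hides.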

To solve this problem, Machine Learning uses three extremely important evaluation metrics:

Precision
Recall
F1 Score

These metrics help us understand different aspects of classification performance and are widely used in:

Healthcare
Fraud Detection
Cybersecurity
Search Engines
Recommendation Systems
Deep Learning

Why Accuracy is Not Enough

Consider the following confusion matrix.

Actual / Predicted	Positive	Negative
Positive	10	90
Negative	0	900

Accuracy:

$\frac{10+900}{1000} = 91\%$

Looks good.

But the model misses 90 of the 100 actual positive cases.

This can be disastrous in applications such as disease detection.

Recap: Confusion Matrix

Actual / Predicted	Positive	Negative
Positive	TP	FN
Negative	FP	TN

Where:

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

Precision, Recall, and F1 Score are derived directly from these values.

What is Precision?

Precision answers the question:

Out of all positive predictions, how many were actually positive?

Formula:

$Precision=\frac{TP}{TP+FP}$

Understanding Precision Intuitively

Suppose a model predicts:


100 Emails are Spam

Reality:


80 Actually Spam
20 Not Spam

Precision:

$\frac{80}{100} = 0.8 = 80\%$
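The same calculation can be written as a small helper. This is a sketch, not a library function: the name `precision` and the choice to return 0.0 when nothing is predicted positive are assumptions made here, since the ratio is undefined in that case.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Convention chosen for this sketch; the ratio is undefined here.
        return 0.0
    return tp / predicted_positive

# Spam example: 100 emails flagged, 80 truly spam, 20 not spam
spam_precision = precision(tp=80, fp=20)
print(spam_precision)  # 0.8
```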

Precision Interpretation

High Precision means:


When the model predicts Positive,
it is usually correct.

Example

Fraud Detection:

Predicted Fraud:

100 transactions

Actually Fraud (among those flagged):

95 transactions

Precision:

$\frac{95}{100} = 95\%$

Very high precision.

Why Precision Matters

Precision is important when:

False Positives are costly.

Examples:

Spam Detection
Loan Approval
Legal Systems
Content Moderation

Spam Detection Example

False Positive:


Important Email
       ↓
Marked as Spam

This is undesirable.

High Precision minimizes such mistakes.

What is Recall?

Recall answers the question:

Out of all actual positives, how many did the model correctly identify?

Formula:

$Recall=\frac{TP}{TP+FN}$

Understanding Recall Intuitively

Suppose:

Actual Fraud Cases: 100

Detected Fraud Cases: 90

Recall:

$\frac{90}{100} = 0.9 = 90\%$
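Reading the 90/100 calculation as 90 detected out of 100 actual fraud cases, recall can also be counted directly from label lists. The sketch below assumes the encoding 1 = fraud, 0 = genuine; the helper name `recall_from_labels` and the 900 genuine transactions are hypothetical additions for illustration.

```python
def recall_from_labels(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN), counted directly from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# 100 actual fraud cases, 90 of them detected.
# The 900 genuine transactions are assumed; they do not affect recall.
y_true = [1] * 100 + [0] * 900
y_pred = [1] * 90 + [0] * 10 + [0] * 900

print(recall_from_labels(y_true, y_pred))  # 0.9
```

Recall only looks at the actual positives, so adding or removing genuine transactions leaves the result unchanged.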

Recall Interpretation

High Recall means:


The model finds most positive cases.

Why Recall Matters

Recall becomes critical when:

False Negatives are dangerous.

Examples:

Cancer Detection
Fraud Detection
Security Systems
Disaster Prediction

Medical Diagnosis Example

False Negative:


Patient Has Cancer
         ↓
Model Predicts Healthy

This can be life-threatening.

High Recall minimizes missed cases.

Precision vs Recall

These metrics focus on different goals.

Precision Focus


When Positive is Predicted,
Be Correct

Recall Focus


Find As Many Positives
As Possible

Example

Suppose:

Actual Positive Cases:

100

Model A:

Detects: 50 of the 100 actual positives

Precision:

100%

Recall:

50%

Model B:

Detects: 95 of the 100 actual positives

Precision:

70%

Recall:

95%

Different models prioritize different objectives.

Visualizing Precision


Predicted Positive
      ↓

Correct Positive
Incorrect Positive

      ↑
 Precision Measures This

Visualizing Recall


Actual Positive Cases
       ↓

Found Positives
Missed Positives

      ↑
 Recall Measures This

Precision Example Calculation

Confusion Matrix:

Actual / Predicted	Positive	Negative
Positive	80	20
Negative	10	90

Precision:

$\frac{80}{80+10} \approx 0.889 = 88.9\%$

Recall Example Calculation

Same matrix:

Recall:

$\frac{80}{80+20} = 0.8 = 80\%$
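Both calculations can be checked by rebuilding the matrix from label lists. This is a minimal sketch in plain Python; the helper `confusion_counts` and the label encoding (1 = positive, 0 = negative) are assumptions for illustration, and the lists are constructed to match the matrix above.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, and TN from paired actual/predicted labels."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts

# Labels arranged to give TP=80, FN=20, FP=10, TN=90
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 90

c = confusion_counts(y_true, y_pred)
precision = c["TP"] / (c["TP"] + c["FP"])
recall = c["TP"] / (c["TP"] + c["FN"])

print(c["TP"], c["FN"], c["FP"], c["TN"])      # 80 20 10 90
print(round(precision, 3), round(recall, 3))   # 0.889 0.8
```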

The Precision-Recall Tradeoff

Improving one often reduces the other.

Example:

Very Strict Model:


Predict Positive
Only When Extremely Certain

Results:

High Precision
Low Recall

Very Relaxed Model


Predict Positive Frequently

Results:

High Recall
Lower Precision
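The strict and relaxed behaviours above usually come from the same model evaluated at different decision thresholds. The sketch below illustrates this with made-up scores: the `scored` list, the threshold values, and the helper `precision_recall_at` are all hypothetical.

```python
# Hypothetical (score, actual label) pairs; score = predicted probability of "positive".
scored = [
    (0.95, 1), (0.90, 1), (0.85, 1), (0.70, 0), (0.65, 1),
    (0.55, 1), (0.45, 0), (0.40, 1), (0.30, 0), (0.20, 0),
]

def precision_recall_at(threshold, scored):
    """Predict positive when score >= threshold, then compute precision and recall."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# From strict (high threshold) to relaxed (low threshold)
for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data the strict threshold of 0.8 gives precision 1.00 and recall 0.50, while the relaxed threshold of 0.25 gives precision 0.67 and recall 1.00.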

Example: Airport Security

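Note that "strict" here describes the screening policy (flag on slight suspicion), which is the opposite of the very strict model above.

Strict Screening: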

More people flagged.

Results:

High Recall
Lower Precision

Relaxed Screening:

Fewer people flagged.

Results:

Higher Precision
Lower Recall

What is F1 Score?

Sometimes we need a balance between Precision and Recall.

F1 Score combines both into a single metric.

Formula:

$F1=2\times\frac{Precision\times Recall}{Precision+Recall}$
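Substituting the definitions of Precision and Recall gives an equivalent form in terms of the confusion matrix counts. The worked step below is added for illustration and uses only the symbols already defined in this article.

```latex
F1 = 2\times\frac{Precision\times Recall}{Precision+Recall}
   = 2\times\frac{\frac{TP}{TP+FP}\times\frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}
   = \frac{2\,TP}{2\,TP+FP+FN}
```

True Negatives do not appear in this expression, so a large negative class cannot inflate the F1 Score the way it inflates accuracy.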

Why Not Use Average?

Suppose:

Precision:

100%

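Recall:

Close to 0%

Average:

About 50%

This looks acceptable.

However:

The model finds almost none of the actual positives, so it is practically useless.

F1 uses the harmonic mean, which penalizes extreme imbalance. Here the F1 Score is also close to 0%.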
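The difference between the two averages is easy to see numerically. The following sketch compares them for a few precision/recall pairs; the pairs and function names are illustrative assumptions.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):
    # F1 is the harmonic mean of precision and recall; taken as 0 when both are 0
    return 2 * p * r / (p + r) if (p + r) else 0.0

# (precision, recall) pairs: one extremely imbalanced, one moderate, one balanced
for p, r in [(1.0, 0.01), (0.8, 0.6), (0.9, 0.9)]:
    print(
        f"P={p:.2f} R={r:.2f}  "
        f"arithmetic={arithmetic_mean(p, r):.3f}  F1={harmonic_mean(p, r):.3f}"
    )
```

For the imbalanced pair the arithmetic mean stays near 0.5 while the harmonic mean drops to about 0.02; for the balanced pair both means agree.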

Example Calculation

Precision:

0.8

Recall:

0.6

F1:

$2\times\frac{0.8\times0.6}{0.8+0.6} = \frac{0.96}{1.4} \approx 0.686$

F1 Score Interpretation

High F1 means:

High Precision
High Recall

Balanced performance.

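F1 Score Range

F1 ranges from 0 to 1. The labels below are rough guidelines; what counts as a good score depends on the task.

Value	Interpretation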
1.0	Perfect
0.8	Very Good
0.5	Moderate
0.0	Poor

Comparing Metrics

Suppose:

Metric	Value
Accuracy	95%
Precision	50%
Recall	40%
F1 Score	44%

Despite high accuracy, the classifier performs poorly: only half of its positive predictions are correct, and it finds only 40% of the actual positives.
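One set of confusion matrix counts that produces exactly these four values is shown below. The counts are hypothetical, chosen only to match the table; many other datasets would give the same percentages.

```python
# Hypothetical counts chosen to reproduce the table (1000 samples, 50 actual positives)
tp, fn, fp, tn = 20, 30, 20, 930

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.0%}")
print(f"Precision: {precision:.0%}")
print(f"Recall:    {recall:.0%}")
print(f"F1 Score:  {f1:.0%}")
```

The 930 true negatives account for almost all of the accuracy, while the positive class, which is usually the one of interest, is handled badly.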

Example: Disease Detection

Confusion Matrix:

Actual / Predicted	Disease	Healthy
Disease	90	10
Healthy	20	180

Precision:

$\frac{90}{90+20} \approx 81.8\%$

Recall:

$\frac{90}{90+10} = 90\%$

F1 Score:

$2\times\frac{0.818\times0.9}{0.818+0.9} \approx 85.7\%$

This model performs well.

Example: Spam Detection

Suppose:

Precision:

95%

Recall:

60%

Interpretation:

Most detected spam emails are truly spam.

However:

Many spam emails are still reaching the inbox.

Choosing the Right Metric

When Precision Matters Most

Examples:

Spam Detection
Loan Approval
Search Results

Goal:

Avoid false positives.

When Recall Matters Most

Examples:

Cancer Detection
Fraud Detection
Intrusion Detection

Goal:

Avoid false negatives.

When F1 Score Matters Most

Examples:

Imbalanced Datasets
General Classification Problems
Production ML Systems

Goal:

Balance precision and recall.

Python Implementation

Precision:


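from sklearn.metrics import precision_score

# Example labels for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 1]   # actual labels
y_pred = [1, 0, 1, 0, 1, 1]   # labels predicted by the model

# Fraction of predicted positives that are actually positive
print(precision_score(y_true, y_pred))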

Recall:


from sklearn.metrics import recall_score

# y_true: actual labels, y_pred: predicted labels (1 = positive, 0 = negative)
print(recall_score(y_true, y_pred))

F1 Score:


from sklearn.metrics import f1_score

# y_true: actual labels, y_pred: predicted labels (1 = positive, 0 = negative)
print(f1_score(y_true, y_pred))

Classification Report

Scikit-Learn provides all metrics together.


from sklearn.metrics import classification_report

# Prints precision, recall, F1 score, and support for every class
print(classification_report(y_true, y_pred))

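Example Output (simplified; the actual report is a table with one row per class, a support column, and overall averages):


Precision: 0.88
Recall:    0.84
F1 Score:  0.86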

Real-World Applications

Healthcare

High Recall preferred.

Missing a disease is costly.

Fraud Detection

High Recall preferred.

Missing fraud is expensive.

Search Engines

High Precision preferred.

Users want relevant results.

Recommendation Systems

Balanced Precision and Recall often desired.

Common Mistakes

Using Accuracy Alone

Accuracy can be misleading.

Ignoring Business Context

Different applications require different priorities.

Chasing Precision Only

High precision with low recall may miss important cases.

Chasing Recall Only

High recall with low precision may generate too many false alarms.

Best Practices

Always analyze the confusion matrix first
Calculate Precision and Recall together
Use F1 Score when classes are imbalanced
Select metrics based on business requirements
Evaluate models on unseen test data

Precision, Recall, and F1 Score Summary

Metric	Formula	Focus
Precision	TP / (TP + FP)	Prediction Quality
Recall	TP / (TP + FN)	Positive Detection
F1 Score	2 × Precision × Recall / (Precision + Recall)	Balance

Evaluation Workflow

Build confusion matrix
Calculate Precision
Calculate Recall
Calculate F1 Score
Compare models
Select best model
Optimize threshold if needed
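The workflow above can be sketched end to end in plain Python. Everything here is illustrative: the labels, the two sets of predictions, the helper `evaluate`, and the choice of F1 as the selection criterion are assumptions, and a real project would pick the criterion from its business requirements.

```python
def evaluate(y_true, y_pred, positive=1):
    """Build the confusion counts, then derive precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical test labels and predictions from two candidate models
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # cautious: few positives predicted
    "model_b": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # eager: many positives predicted
}

# Compare models and select the one with the highest F1 (an assumed criterion)
results = {name: evaluate(y_true, y_pred) for name, y_pred in predictions.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})

best = max(results, key=lambda name: results[name]["f1"])
print("Selected:", best)
```

On this toy data the cautious model has perfect precision but only 0.5 recall, so the eager model wins on F1 (0.8 against roughly 0.667).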

Why Precision, Recall, and F1 Score are Important

Precision, Recall, and F1 Score provide a much deeper understanding of classification performance than accuracy alone. They help reveal whether a model is generating false alarms, missing important cases, or maintaining a healthy balance between both.

These metrics are essential because real-world Machine Learning systems often operate on imbalanced datasets where accuracy can be misleading. Understanding these measures enables practitioners to design models that align with business goals and make more reliable decisions.

In the next article, we will explore ROC Curves and AUC, powerful evaluation tools that analyze classifier performance across different probability thresholds rather than relying on a single threshold such as 0.5.