In the previous article, we learned about the Confusion Matrix, which breaks classification predictions into:
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
We also discovered an important problem:
A model can achieve very high accuracy and still be practically useless.
Consider a fraud detection system:
| Transaction Type | Count |
|---|---|
| Genuine | 990 |
| Fraud | 10 |
Suppose a model predicts:
Everything is Genuine
Accuracy:
99%
This looks impressive.
However:
The model detects:
0 frauds.
Clearly, accuracy alone cannot tell the complete story.
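To make these numbers concrete, here is a minimal sketch in plain Python that rebuilds the 1,000 transactions and scores the "everything is Genuine" model. The label strings and variable names are illustrative assumptions, not part of any library.

```python
# 990 genuine and 10 fraudulent transactions, as in the table above.
y_true = ["genuine"] * 990 + ["fraud"] * 10

# A "model" that labels every transaction as genuine.
y_pred = ["genuine"] * len(y_true)

# Accuracy: share of predictions that match the actual label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Frauds caught: actual frauds that were also predicted as fraud.
frauds_caught = sum(t == "fraud" and p == "fraud" for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.0%}")        # Accuracy: 99%
print(f"Frauds caught: {frauds_caught}")  # Frauds caught: 0
```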
To solve this problem, Machine Learning uses three extremely important evaluation metrics:
- Precision
- Recall
- F1 Score
These metrics help us understand different aspects of classification performance and are widely used in:
- Healthcare
- Fraud Detection
- Cybersecurity
- Search Engines
- Recommendation Systems
- Deep Learning
Why Accuracy is Not Enough
Consider the following confusion matrix.
| Actual / Predicted | Positive | Negative |
|---|---|---|
| Positive | 10 | 90 |
| Negative | 0 | 900 |
Accuracy:
$$\text{Accuracy} = \frac{10 + 900}{1000} = 91\%$$
Looks good.
But:
The model misses:
90 positive cases.
This can be disastrous in applications such as disease detection.
Recap: Confusion Matrix
| Actual / Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
Where:
- TP = True Positive
- TN = True Negative
- FP = False Positive
- FN = False Negative
Precision, Recall, and F1 Score are derived directly from these values.
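As a rough illustration of where these four values come from, the sketch below counts them from two lists of binary labels. The function name `confusion_counts` and the sample labels are hypothetical, chosen only for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn

# Illustrative labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```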
What is Precision?
Precision answers the question:
Out of all positive predictions, how many were actually positive?
Formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
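The formula translates almost directly into code. The helper below is a small sketch; returning 0.0 when the model makes no positive predictions is an assumption made here to avoid dividing by zero, not a universal convention.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumption: treat the undefined 0/0 case as 0
    return tp / predicted_positive

# Example: 100 positive predictions, 80 of them correct.
print(precision(tp=80, fp=20))  # 0.8
```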
Understanding Precision Intuitively
Suppose a model predicts:
100 Emails are Spam
Reality:
80 Actually Spam
20 Not Spam
Precision:
$$\frac{80}{100} = 0.8 = 80\%$$
Precision Interpretation
High Precision means:
When the model predicts Positive,
it is usually correct.
Example
Fraud Detection:
Predicted Fraud:
100 transactions
Actually Fraud:
95 transactions
Precision:
95%
Very high precision.
Why Precision Matters
Precision is important when:
False Positives are costly.
Examples:
- Spam Detection
- Loan Approval
- Legal Systems
- Content Moderation
Spam Detection Example
False Positive:
Important Email
↓
Marked as Spam
This is undesirable.
High Precision minimizes such mistakes.
What is Recall?
Recall answers the question:
Out of all actual positives, how many did the model correctly identify?
Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
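A matching sketch for Recall follows. As with any hand-written helper, the zero-division handling is an assumption: here a dataset with no actual positives yields 0.0.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumption: no actual positives means nothing to find
    return tp / actual_positive

# Example: 100 actual positives, 90 of them found.
print(recall(tp=90, fn=10))  # 0.9
```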
Understanding Recall Intuitively
Suppose:
Actual Fraud Cases:
100
Detected Fraud Cases:
90
Recall:
$$\frac{90}{100} = 0.9 = 90\%$$
Recall Interpretation
High Recall means:
The model finds most positive cases.
Why Recall Matters
Recall becomes critical when:
False Negatives are dangerous.
Examples:
- Cancer Detection
- Fraud Detection
- Security Systems
- Disaster Prediction
Medical Diagnosis Example
False Negative:
Patient Has Cancer
↓
Model Predicts Healthy
This can be life-threatening.
High Recall minimizes missed cases.
Precision vs Recall
These metrics focus on different goals.
Precision Focus
When Positive is Predicted,
Be Correct
Recall Focus
Find As Many Positives
As Possible
Example
Suppose:
Actual Positive Cases:
100
Model A:
Correctly detects:
50 of the 100 positives
Precision:
100%
Recall:
50%
Model B:
Correctly detects:
95 of the 100 positives
Precision:
70%
Recall:
95%
Different models prioritize different objectives.
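One way to see what Model B gives up is to work out how many false alarms its precision implies. The sketch below assumes the detected counts are true positives and rearranges Precision = TP / (TP + FP) to solve for FP; the function name is hypothetical.

```python
def implied_false_positives(tp, precision):
    """Solve precision = TP / (TP + FP) for FP."""
    return tp * (1 - precision) / precision

# Model A: 50 true positives at 100% precision.
print(implied_false_positives(50, 1.0))          # 0.0

# Model B: 95 true positives at 70% precision.
print(round(implied_false_positives(95, 0.70)))  # 41
```

Under these assumptions, Model B finds 45 more positives than Model A but raises roughly 41 false alarms to do so.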
Visualizing Precision
Predicted Positive
↓
Correct Positive
Incorrect Positive
↑
Precision Measures This
Visualizing Recall
Actual Positive Cases
↓
Found Positives
Missed Positives
↑
Recall Measures This
Precision Example Calculation
Confusion Matrix:
| Actual / Predicted | Positive | Negative |
|---|---|---|
| Positive | 80 | 20 |
| Negative | 10 | 90 |
Precision:
$$\frac{80}{80 + 10} \approx 0.889 = 88.9\%$$
Recall Example Calculation
Same matrix:
Recall:
$$\frac{80}{80 + 20} = 0.8 = 80\%$$
The Precision-Recall Tradeoff
Improving one often reduces the other.
Example:
Very Strict Model:
Predict Positive
Only When Extremely Certain
Results:
- High Precision
- Low Recall
Very Relaxed Model
Predict Positive Frequently
Results:
- High Recall
- Lower Precision
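The strict and relaxed behaviours can be reproduced by moving a decision threshold over predicted probabilities. The sketch below uses made-up scores and labels, and a hypothetical helper `precision_recall_at`, purely to show the direction of the tradeoff.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then score the result."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# From strict (high threshold) to relaxed (low threshold).
for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.50
# threshold=0.50  precision=0.60  recall=0.75
# threshold=0.25  precision=0.57  recall=1.00
```

On this toy data, lowering the threshold raises recall at every step while precision falls. Real data will not always be this tidy, but the general pattern is the same.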
Example: Airport Security
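Strict Screening:
More people flagged, even on weak suspicion. In model terms this is the relaxed case above: Positive is predicted frequently.
Results:
- High Recall
- Lower Precision
Relaxed Screening:
Fewer people flagged, only on strong suspicion. In model terms this is the strict case above: Positive is predicted only when very certain.
Results:
- Higher Precision
- Lower Recall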
What is F1 Score?
Sometimes we need a balance between Precision and Recall.
F1 Score combines both into a single metric.
Formula:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
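It can help to see F1 written directly in terms of the confusion matrix counts. Substituting the definitions of Precision and Recall into the formula and simplifying gives the following (a worked derivation added for illustration):

```latex
\begin{aligned}
F1 &= 2 \times \frac{\frac{TP}{TP+FP} \times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} \\
   &= \frac{2\,TP^2}{TP\,(TP+FN) + TP\,(TP+FP)} \\
   &= \frac{2\,TP}{2\,TP + FP + FN}
\end{aligned}
```

True Negatives do not appear anywhere in the result, which is one reason F1 is less affected than accuracy by a large majority class.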
Why Not Use Average?
Suppose:
Precision:
100%
Recall:
1%
Average:
50.5%
This looks acceptable.
However:
The model finds almost none of the positive cases, so it is nearly useless.
F1 uses the harmonic mean, which penalizes extreme imbalance: here the F1 Score is only about 2%.
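The difference between the two averages is easy to check numerically. This sketch uses an illustrative extreme case, precision 1.0 with recall 0.01, and a hypothetical helper `f1_from`.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.0, 0.01  # illustrative extreme imbalance

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1_from(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.3f}")  # Arithmetic mean: 0.505
print(f"F1 (harmonic):   {harmonic_mean:.3f}")    # F1 (harmonic):   0.020
```

The harmonic mean stays close to the smaller of the two values, so a model cannot hide a very weak recall behind a perfect precision.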
Example Calculation
Precision:
0.8
Recall:
0.6
F1:
$$F1 = 2 \times \frac{0.8 \times 0.6}{0.8 + 0.6} \approx 0.686$$
F1 Score Interpretation
High F1 means:
- High Precision
- High Recall
Balanced performance.
F1 Score Range
| Value | Interpretation |
|---|---|
| 1.0 | Perfect |
| 0.8 | Very Good |
| 0.5 | Moderate |
| 0.0 | Poor |
Comparing Metrics
Suppose:
| Metric | Value |
|---|---|
| Accuracy | 95% |
| Precision | 50% |
| Recall | 40% |
| F1 Score | 44% |
Despite high accuracy,
the classifier performs poorly.
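A table like this is easy to produce on an imbalanced dataset. The counts below are hypothetical, chosen only because they reproduce the four values above: 1,000 samples of which 50 are positive.

```python
# Hypothetical confusion matrix counts (not from a real model).
tp, fn = 20, 30    # 50 actual positives: found vs. missed
fp, tn = 20, 930   # 950 actual negatives: false alarms vs. correct

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.0%}")   # Accuracy:  95%
print(f"Precision: {precision:.0%}")  # Precision: 50%
print(f"Recall:    {recall:.0%}")     # Recall:    40%
print(f"F1 Score:  {f1:.0%}")         # F1 Score:  44%
```

Almost all of the accuracy comes from the 930 true negatives, which says little about how well the positive class is handled.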
Example: Disease Detection
Confusion Matrix:
| Actual / Predicted | Disease | Healthy |
|---|---|---|
| Disease | 90 | 10 |
| Healthy | 20 | 180 |
Precision:
$$\frac{90}{90 + 20} \approx 81.8\%$$
Recall:
$$\frac{90}{90 + 10} = 90\%$$
F1 Score:
85.7%
This model performs well.
Example: Spam Detection
Suppose:
Precision:
95%
Recall:
60%
Interpretation:
Most detected spam emails are truly spam.
However:
Many spam emails are still reaching the inbox.
Choosing the Right Metric
When Precision Matters Most
Examples:
- Spam Detection
- Loan Approval
- Search Results
Goal:
Avoid false positives.
When Recall Matters Most
Examples:
- Cancer Detection
- Fraud Detection
- Intrusion Detection
Goal:
Avoid false negatives.
When F1 Score Matters Most
Examples:
- Imbalanced Datasets
- General Classification Problems
- Production ML Systems
Goal:
Balance precision and recall.
Python Implementation
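Precision:
```python
from sklearn.metrics import precision_score

# Illustrative labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

precision_score(y_true, y_pred)  # TP / (TP + FP)
```
Recall:
```python
from sklearn.metrics import recall_score

recall_score(y_true, y_pred)  # TP / (TP + FN)
```
F1 Score:
```python
from sklearn.metrics import f1_score

f1_score(y_true, y_pred)  # harmonic mean of precision and recall
```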
Classification Report
Scikit-Learn provides all metrics together.
```python
from sklearn.metrics import classification_report

# y_true: actual labels, y_pred: predicted labels
print(classification_report(y_true, y_pred))
```
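Example Output (abridged; the full report is a table with one row per class, showing precision, recall, f1-score, and support):
Precision: 0.88
Recall: 0.84
F1 Score: 0.86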
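To show what a per-class report contains, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation: the function name `per_class_report`, the sample labels, and the choice to return 0.0 for undefined ratios are all assumptions made for this example. Each label is treated as the positive class in turn.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall, F1, and support for each label, one label at a time."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of actual samples with this label
        }
    return report

# Illustrative labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

for label, row in per_class_report(y_true, y_pred).items():
    print(label, row)
# 0 {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'support': 2}
# 1 {'precision': 0.75, 'recall': 0.75, 'f1': 0.75, 'support': 4}
```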
Real-World Applications
Healthcare
High Recall preferred.
Missing a disease is costly.
Fraud Detection
High Recall preferred.
Missing fraud is expensive.
Search Engines
High Precision preferred.
Users want relevant results.
Recommendation Systems
Balanced Precision and Recall often desired.
Common Mistakes
Using Accuracy Alone
Accuracy can be misleading.
Ignoring Business Context
Different applications require different priorities.
Chasing Precision Only
High precision with low recall may miss important cases.
Chasing Recall Only
High recall with low precision may generate too many false alarms.
Best Practices
- Always analyze the confusion matrix first
- Calculate Precision and Recall together
- Use F1 Score when classes are imbalanced
- Select metrics based on business requirements
- Evaluate models on unseen test data
Precision, Recall, and F1 Score Summary
| Metric | Formula | Focus |
|---|---|---|
| Precision | TP / (TP + FP) | Prediction Quality |
| Recall | TP / (TP + FN) | Positive Detection |
| F1 Score | Harmonic Mean | Balance |
Evaluation Workflow
- Build confusion matrix
- Calculate Precision
- Calculate Recall
- Calculate F1 Score
- Compare models
- Select best model
- Optimize threshold if needed
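The first six steps of this workflow can be sketched end to end in a few lines. Everything below is illustrative: the labels, the two sets of predictions, the model names, and the choice of F1 as the selection criterion are assumptions, and in practice the criterion should follow the application's priorities.

```python
def evaluate(y_true, y_pred):
    """Steps 1-4: confusion counts, then precision, recall, and F1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical test labels and predictions from two candidate models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # cautious: few positives
    "model_b": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],  # more willing to flag
}

# Steps 5-6: compare the models and pick one, here by F1.
results = {name: evaluate(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(results, key=lambda name: results[name]["f1"])

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
print("Selected:", best)
# model_a {'precision': 1.0, 'recall': 0.5, 'f1': 0.67}
# model_b {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
# Selected: model_b
```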
Why Precision, Recall, and F1 Score are Important
Precision, Recall, and F1 Score provide a much deeper understanding of classification performance than accuracy alone. They help reveal whether a model is generating false alarms, missing important cases, or maintaining a healthy balance between both.
These metrics are essential because real-world Machine Learning systems often operate on imbalanced datasets where accuracy can be misleading. Understanding these measures enables practitioners to design models that align with business goals and make more reliable decisions.
In the next article, we will explore ROC Curves and AUC, powerful evaluation tools that analyze classifier performance across different probability thresholds rather than relying on a single threshold such as 0.5.