Model Evaluation Basics
Building a machine learning model is only half the job: you also need to measure how well it performs, and measure it the right way. This is called model evaluation. A model that looks impressive on its training data may fail completely on new data, so evaluation is what tells us whether the model is actually useful.
The golden rule of evaluation is simple: always test a model on data it has never seen before. Judging a model only on its training data is like grading students on the exact questions they practised — it tells you nothing about real understanding.
💡 In one line: Model evaluation measures how well a model performs on unseen data, using the right metrics for the task.
Splitting the Data
To evaluate fairly, we divide our data into separate parts:
- Training set — used to learn (usually the largest portion).
- Validation set — used to tune settings and compare models during development.
- Test set — used once, at the end, for a final, unbiased check of performance.
The test set is kept completely hidden during training, so the score it produces reflects how the model will behave on real, new data.
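A three-way split can be sketched in a few lines of plain Python. This is only an illustration: the function name `train_val_test_split`, the 70/15/15 proportions, and the fixed seed are assumptions made for the example, not values prescribed above.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows and cut them into train / validation / test lists.

    The default fractions are illustrative assumptions, not a rule.
    """
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(rows)                   # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # set aside first, used only at the end
    val = shuffled[n_test:n_test + n_val]   # used for tuning and model comparison
    train = shuffled[n_test + n_val:]       # everything else is used for learning
    return train, val, test

data = list(range(100))  # stand-in for 100 labelled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before cutting matters when the data is stored in some order (for example by date or by class); otherwise one of the parts may not be representative.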
Classification Metrics
For classification (predicting categories), most metrics come from the confusion matrix — a table comparing predictions against the actual answers:
- True Positive (TP) — correctly predicted positive
- True Negative (TN) — correctly predicted negative
- False Positive (FP) — predicted positive, but actually negative
- False Negative (FN) — predicted negative, but actually positive
From these four values we derive the key metrics:
| Metric | What it measures | Formula | Use when |
|---|---|---|---|
| Accuracy | Overall fraction correct | (TP + TN) / Total | Classes are balanced |
| Precision | Of predicted positives, how many were right | TP / (TP + FP) | False positives are costly |
| Recall | Of actual positives, how many we caught | TP / (TP + FN) | False negatives are costly |
| F1 Score | Balance of precision and recall | 2 × (P × R) / (P + R) | You need both to be good |
⚠️ Accuracy can mislead. If 99% of emails are not spam, a model that labels everything "not spam" is 99% accurate but useless. With imbalanced data, precision, recall, and F1 matter far more.
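The spam warning above can be reproduced with a small sketch. The helper `confusion_counts` and the made-up inbox of 1000 emails are assumptions for illustration; the point is only that a model which never flags anything still scores 99% accuracy while catching no spam at all.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Tally TP, TN, FP, FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical inbox: 10 spam emails among 1000, so 99% are "not spam".
actual = ["spam"] * 10 + ["not spam"] * 990
predicted = ["not spam"] * 1000  # a "model" that labels everything "not spam"

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
# Guard against dividing by zero when there are no actual positives.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(tp, tn, fp, fn)    # 0 990 0 10
print(accuracy, recall)  # 0.99 0.0
```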
A Simple Example
Suppose a model is tested on 100 emails and produces this confusion matrix:
| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actually Spam | 40 (TP) | 10 (FN) |
| Actually Not Spam | 5 (FP) | 45 (TN) |
- Accuracy = (40 + 45) / 100 = 85%
- Precision = 40 / (40 + 5) ≈ 89%
- Recall = 40 / (40 + 10) = 80%
- F1 Score = 2 × (0.89 × 0.80) / (0.89 + 0.80) ≈ 84%
So the model is fairly precise (few false alarms) but misses some real spam (lower recall).
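The same arithmetic can be checked in code. The sketch below computes the four metrics directly from the counts in the confusion matrix; the function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts taken from the spam example above.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 85.0%
# precision: 88.9%
# recall: 80.0%
# f1: 84.2%
```

F1 lands between precision and recall, slightly closer to the lower of the two, which is why it is a useful single number when both matter.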
Regression Metrics
For regression (predicting numbers), we measure how far predictions are from the true values:
| Metric | Meaning |
|---|---|
| MAE (Mean Absolute Error) | Average absolute size of the errors, ignoring their direction |
| MSE (Mean Squared Error) | Average of the squared errors; punishes big mistakes more |
| RMSE (Root Mean Squared Error) | Square root of MSE, so the error is back in the original units and easier to interpret |
| R² (R-squared) | Fraction of the variation in the target that the model explains (1.0 = perfect, 0 = no better than always predicting the mean) |
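As a rough sketch, all four regression metrics can be computed from two lists of numbers. The function name `regression_metrics` and the sample values are invented for illustration; one prediction is deliberately off by 3 to show how the squared-error metrics react to a single large mistake.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R²) for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)                      # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)        # variation the model leaves unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation in the target
    r2 = 1 - ss_res / ss_tot                   # undefined if all true values are equal
    return mae, mse, rmse, r2

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]   # the last prediction misses by 3

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
# MAE=1.25 MSE=2.75 RMSE=1.66 R2=0.45
```

RMSE (1.66) comes out larger than MAE (1.25) here because the single error of 3 contributes 9 of the 11 units of squared error.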
Cross-Validation
A single train/test split can be lucky or unlucky depending on how the data happened to divide. K-fold cross-validation reduces this risk:
- Split the data into k roughly equal parts (folds).
- Train on k − 1 folds and test on the remaining one.
- Repeat so every fold is used as the test set once.
- Average the results.
This gives a more reliable estimate of performance than a single split, because every data point is used for testing exactly once and the final score is an average over k separate runs.
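The steps above can be sketched without any library. The helpers `k_fold_indices` and `cross_validate`, and the toy "model" that always predicts the mean of its training targets, are assumptions for the example; a real project would plug in its own training and scoring code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Folds differ in size by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Run fit_and_score on every fold; return the mean score and the per-fold scores."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores

# Toy data and a toy baseline: predict the mean of the training targets,
# then score with mean absolute error on the held-out fold.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

def mean_baseline_mae(train_idx, test_idx):
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return sum(abs(y[i] - mean) for i in test_idx) / len(test_idx)

mean_score, fold_scores = cross_validate(len(y), 3, mean_baseline_mae)
print(fold_scores)  # [6.0, 1.0, 6.0]
print(round(mean_score, 2))  # 4.33
```

The per-fold scores differ a lot (6.0, 1.0, 6.0), which is exactly the luck a single split would hide. The toy data is left in sorted order to make that visible; real data is usually shuffled before the folds are cut.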
Checking for Overfitting
Evaluation is also how we catch overfitting: compare the score on the training set with the score on held-out data (the validation set during development, the test set for the final check).
- Both scores high → good fit ✅
- Training high, test low → overfitting ⚠️
- Both scores low → underfitting ⚠️
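The three rules above can be written as a small helper. This is only a sketch: the function name `diagnose_fit` and the cut-off of 0.8 for a "high" score are hypothetical, and it assumes a higher-is-better score between 0 and 1, such as accuracy. What counts as high depends entirely on the task.

```python
def diagnose_fit(train_score, test_score, high=0.8):
    """Label a (train, test) score pair using the three rules above.

    `high` is an illustrative threshold, not a universal constant.
    """
    train_high = train_score >= high
    test_high = test_score >= high
    if train_high and test_high:
        return "good fit"
    if train_high and not test_high:
        return "overfitting"
    if not train_high and not test_high:
        return "underfitting"
    # Test high but training low is unusual and not covered by the rules above.
    return "unusual: check the split"

print(diagnose_fit(0.95, 0.93))  # good fit
print(diagnose_fit(0.99, 0.60))  # overfitting
print(diagnose_fit(0.55, 0.52))  # underfitting
```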
Summary
- Model evaluation measures performance on unseen data — never judge a model by its training data alone.
- Data is split into training, validation, and test sets, with the test set used only once for a final check.
- Classification is measured with accuracy, precision, recall, and F1, all derived from the confusion matrix.
- Regression is measured with MAE, MSE, RMSE, and R².
- Cross-validation gives a more reliable estimate, and comparing train vs. test scores reveals overfitting or underfitting.