Model Evaluation Basics

Last updated: Jun 22, 2026

Author: Vinay Adari

Building a machine learning model is only half the job: you also need to measure how well it performs, and measure it the right way. This is called model evaluation. A model that looks impressive on the data it was trained on may fail completely on new data, so evaluation is what tells us whether a model is actually useful.

The golden rule of evaluation is simple: always test a model on data it has never seen before. Judging a model only on its training data is like grading students on the exact questions they practised — it tells you nothing about real understanding.

💡 In one line: Model evaluation measures how well a model performs on unseen data, using the right metrics for the task.

Splitting the Data

To evaluate fairly, we divide our data into separate parts:

Training set — used to learn (usually the largest portion).
Validation set — used to tune settings and compare models during development.
Test set — used once, at the end, for a final, unbiased check of performance.

The test set is kept completely hidden during training, so the score it produces reflects how the model will behave on real, new data.
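As a minimal sketch of the three-way split described above, assuming the data fits in a Python list: the 70/15/15 proportions, the fixed seed, and the function name `train_val_test_split` are illustrative choices, not something the article prescribes.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]               # set aside first, used once at the end
    val = shuffled[n_test:n_test + n_val]  # used for tuning and model comparison
    train = shuffled[n_test + n_val:]      # the remainder, usually the largest part
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before cutting matters when the rows arrive in some order (for example, sorted by date or by label). For time-ordered data a random shuffle may not be appropriate, which this sketch does not handle.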

Classification Metrics

For classification (predicting categories), most metrics come from the confusion matrix — a table comparing predictions against the actual answers:

True Positive (TP) — correctly predicted positive
True Negative (TN) — correctly predicted negative
False Positive (FP) — predicted positive, but actually negative
False Negative (FN) — predicted negative, but actually positive

From these four values we derive the key metrics:

Metric	What it measures	Formula	Use when
Accuracy	Overall fraction correct	(TP + TN) / (TP + TN + FP + FN)	Classes are balanced
Precision	Of predicted positives, how many were right	TP / (TP + FP)	False positives are costly
Recall	Of actual positives, how many we caught	TP / (TP + FN)	False negatives are costly
F1 Score	Harmonic mean of precision and recall	2 × (P × R) / (P + R), where P = precision and R = recall	You need both to be good

⚠️ Accuracy can mislead. If 99% of emails are not spam, a model that labels everything "not spam" is 99% accurate but catches no spam at all (its recall is 0). With imbalanced data, precision, recall, and F1 are far more informative than accuracy.
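The warning above can be checked with a few lines of code. This is a rough sketch: the helper `confusion_counts`, the 1,000-email dataset, and the label encoding (1 = spam, 0 = not spam) are all made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

# Hypothetical data: 1,000 emails, of which 1% are spam (label 1).
y_true = [1] * 10 + [0] * 990
# A "model" that labels every email as not spam (label 0).
y_pred = [0] * 1000

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(accuracy, recall)  # 0.99 0.0
```

The accuracy looks excellent, yet the recall shows that not a single spam email was caught.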

A Simple Example

Suppose a model is tested on 100 emails and produces this confusion matrix:

	Predicted Spam	Predicted Not Spam
Actually Spam	40 (TP)	10 (FN)
Actually Not Spam	5 (FP)	45 (TN)

Accuracy = (40 + 45) / 100 = 85%
Precision = 40 / (40 + 5) ≈ 89%
Recall = 40 / (40 + 10) = 80%
F1 = 2 × (0.889 × 0.80) / (0.889 + 0.80) ≈ 84%

So the model is fairly precise (few false alarms) but misses some real spam (lower recall).
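The arithmetic for this example can be reproduced in code. The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is one common convention assumed here, not a rule from the article.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts taken from the spam confusion matrix above.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
# accuracy: 85.0%
# precision: 88.9%
# recall: 80.0%
# f1: 84.2%
```

F1 lands between precision and recall but closer to the lower of the two, which is the behaviour expected from a harmonic mean.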

Regression Metrics

For regression (predicting numbers), we measure how far predictions are from the true values:

Metric	Meaning
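MAE (Mean Absolute Error)	Average absolute size of the errors, in the target's units
MSE (Mean Squared Error)	Average of the squared errors; punishes big mistakes more
RMSE (Root Mean Squared Error)	Square root of MSE, which brings it back to the target's units and makes it easier to interpret
R² (R-squared)	Fraction of the variation in the target that the model explains (1.0 = perfect, 0 = no better than always predicting the mean)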
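To make the four definitions concrete, here is a small sketch in plain Python. The function name `regression_metrics` and the four data points are invented for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot  # undefined if every true value is identical
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

r = regression_metrics(y_true, y_pred)
for name, value in r.items():
    print(f"{name}: {value:.3f}")
# mae: 1.000
# mse: 1.500
# rmse: 1.225
# r2: 0.700
```

The single error of 2 contributes half of the absolute error but two thirds of the squared error, which is why MSE and RMSE react more strongly to large mistakes than MAE does.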

Cross-Validation

A single train/test split can be lucky or unlucky depending on how the data happened to divide. K-fold cross-validation reduces this dependence on one particular split:

Split the data into k roughly equal parts (folds).
Train on k − 1 folds and test on the remaining one.
Repeat so every fold is used as the test set once.
Average the k scores.

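This gives a more reliable estimate than a single split, because every data point is used for testing exactly once and the averaged score depends less on how any one split happened to fall. The price is that the model has to be trained k times.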
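The four steps above can be sketched as follows. Everything named here is hypothetical: `k_fold_indices` and `cross_validate` are illustrative helpers, and the "model" simply predicts the mean of its training targets so that the example stays self-contained.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is in a test fold exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Train on k - 1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

# Toy "model": always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

# Score: mean absolute error of that constant prediction (lower is better).
def mae_score(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
mean_mae, fold_maes = cross_validate(xs, ys, fit_mean, mae_score, k=5)
print(f"mean MAE over {len(fold_maes)} folds: {mean_mae:.2f}")
```

The spread of the per-fold scores in `fold_maes` is worth looking at as well as the average, since a wide spread suggests the estimate depends heavily on which rows end up in the test fold.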

Checking for Overfitting

Evaluation is also how we catch overfitting: compare the score on the training set with the score on data the model did not train on (the validation set during development, or the test set for the final check).

Both scores high → good fit ✅
Training high, test low → overfitting ⚠️
Both scores low → underfitting ⚠️
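The three cases above can be turned into a rough rule of thumb. This is only a sketch: the function `diagnose_fit` and both thresholds (0.80 for a "high" score, 0.10 for a worrying gap) are arbitrary assumptions that would need adjusting for any real task and metric.

```python
def diagnose_fit(train_score, heldout_score, good=0.80, max_gap=0.10):
    """Label a train/held-out score pair. Higher scores are assumed to be better.

    The thresholds are illustrative assumptions, not standard values.
    """
    if train_score < good and heldout_score < good:
        return "underfitting"   # both scores low
    if train_score >= good and train_score - heldout_score > max_gap:
        return "overfitting"    # training high, held-out clearly lower
    if train_score >= good and heldout_score >= good:
        return "good fit"       # both scores high and close together
    return "inconclusive"       # e.g. held-out score higher than training score

print(diagnose_fit(0.92, 0.90))  # good fit
print(diagnose_fit(0.99, 0.70))  # overfitting
print(diagnose_fit(0.60, 0.55))  # underfitting
```

In practice the size of the gap that counts as overfitting depends on the problem, so a check like this is a starting point for a closer look rather than a verdict.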

Summary

Model evaluation measures performance on unseen data — never judge a model by its training data alone.
Data is split into training, validation, and test sets, with the test set used only once for a final check.
Classification is measured with accuracy, precision, recall, and F1, all derived from the confusion matrix.
Regression is measured with MAE, MSE, RMSE, and R².
Cross-validation gives a more reliable estimate, and comparing train vs. test scores reveals overfitting or underfitting.