Train-Test-Validation Split in Machine Learning

Last updated: Jun 11, 2026

Author: Christy Harshitha Dakarapu

One of the most important steps in the Machine Learning pipeline is dividing data into separate subsets before training a model. This process is known as Train-Test-Validation Splitting.

A Machine Learning model should not only perform well on the data it has seen during training but also on new, unseen data. If a model is evaluated on the same data used for training, the evaluation results become overly optimistic and misleading.

Train-Test-Validation splitting lets us measure how well a model generalizes to unseen data and detect overfitting before the model is deployed.

Almost every Machine Learning project, whether it involves:

Classification
Regression
Deep Learning
Computer Vision
NLP
Recommendation Systems

uses some form of data splitting before model training.

In this article, we will understand why data splitting is necessary, how training, validation, and test sets work, common splitting strategies, and best practices.

Why Do We Need Data Splitting?

Suppose we have a dataset containing 10,000 student records.

If we train a model and evaluate it on the same 10,000 records:

Training Accuracy:

99%

This looks impressive.

However, the real question is:

How will the model perform on completely new students?

Without separate evaluation data, we cannot answer this question.

Understanding Generalization

The ultimate goal of Machine Learning is:

Learning patterns that generalize to unseen data.

Good Model:

Performs well on training data
Performs well on unseen data

Poor Model:

Memorizes training data
Performs poorly on new data

Train-Test-Validation splitting helps measure generalization.

What is a Dataset?

A dataset usually contains:

Features	Target
Age, Salary	Purchased
Height, Weight	Disease Risk
Pixels	Image Class

Before training, this dataset is divided into separate subsets.

The Three Data Splits

A typical Machine Learning workflow uses:

Training Set
Validation Set
Test Set

Training Set

The Training Set is used to teach the model.

The model learns:

Patterns
Relationships
Parameters
Weights

Example:

If a dataset contains 10,000 samples:

Training Set:

7,000 samples

The model only learns from this data.

Validation Set

The Validation Set is used during model development.

It helps:

Tune hyperparameters
Compare models
Detect overfitting (for example, through early stopping)

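The model's parameters are never fitted on validation data.

Instead, validation data is used to evaluate the model during development. Because tuning decisions are based on these scores, they gradually become optimistic, which is why a separate test set is still needed.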

Test Set

The Test Set is used only after model development is complete.

Purpose:

Final performance evaluation
Simulating real-world unseen data

The Test Set should remain untouched until the very end.

Visualizing the Split

Dataset:

10,000 Samples

Typical split:

Dataset	Percentage
Training	70%
Validation	15%
Test	15%

Example:

Dataset	Samples
Training	7000
Validation	1500
Test	1500
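
The sample counts in the table follow directly from the percentages. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper name used only for illustration:

```python
def split_sizes(n_samples, train_frac=0.70, val_frac=0.15):
    """Return (train, validation, test) sample counts for a dataset."""
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    # The test set takes whatever remains, so the counts always sum to n_samples
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test


sizes = split_sizes(10_000)
print(sizes)  # (7000, 1500, 1500)
```

Giving the remainder to the test set avoids losing or double-counting a sample when the percentages do not divide the dataset evenly.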

Why Can't We Use Only Training Data?

Consider:

Training Accuracy:

99%

Test Accuracy:

72%

This indicates:

Overfitting

The model memorized training data instead of learning general patterns.

Without a test set, we would never detect this problem.

What is Overfitting?

Overfitting occurs when a model learns:

Noise
Random fluctuations
Dataset-specific details

instead of actual patterns.

Example:

Training Accuracy:

99%

Validation Accuracy:

75%

The large gap indicates overfitting.
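
The gap between training and validation accuracy can be reproduced on purpose. The sketch below assumes synthetic data from scikit-learn's `make_classification` with deliberately noisy labels, and uses a decision tree with no depth limit as an example of a model that can memorize its training set. The exact validation accuracy depends on the data and is not meant to match the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration); flip_y adds 20% label noise
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can memorize every training sample, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"Training accuracy:   {train_acc:.2f}")
print(f"Validation accuracy: {val_acc:.2f}")
```

Limiting the tree with `max_depth` would typically narrow the gap, at the cost of a lower training accuracy.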

What is Underfitting?

Underfitting occurs when a model fails to learn meaningful patterns.

Example:

Training Accuracy:

60%

Validation Accuracy:

58%

Both are low.

This usually means the model is too simple, or has not been trained long enough, to capture the underlying patterns.

Role of Validation Data

Validation data helps answer questions such as:

Which algorithm should we choose?
What learning rate should we use?
What value of K should KNN use?
How many layers should a neural network have?

These decisions are called:

Model Selection and Hyperparameter Tuning

Model Parameters vs Hyperparameters

Parameters	Hyperparameters
Learned during training	Set before training
Weights	Learning Rate
Coefficients	Number of Trees
Biases	Batch Size

Validation data helps select optimal hyperparameters.
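
As a rough illustration of tuning on validation data, the sketch below picks K for KNN by comparing validation accuracy across a few candidate values. The synthetic dataset and the candidate list are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=0
)

# Fit one model per candidate K on the training set, score it on the validation set
candidate_ks = [1, 3, 5, 7, 9, 15]
val_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# The test set is used once, only after K has been chosen
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print("Best K:", best_k, "Test accuracy:", round(test_score, 3))
```

The test set plays no part in choosing K, so the test score stays an honest estimate.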

Common Split Ratios

There is no universal rule.

Common choices include:

Training	Validation	Test
60%	20%	20%
70%	15%	15%
80%	10%	10%

The choice depends on dataset size.

Small Dataset Example

Dataset:

1000 samples

Possible split:

Dataset	Samples
Train	700
Validation	150
Test	150

Large Dataset Example

Dataset:

1,000,000 samples

Possible split:

Dataset	Samples
Train	900,000
Validation	50,000
Test	50,000

Large datasets can use smaller validation and test percentages, because even 5% of a million samples is enough to estimate performance reliably.

Train-Test Split in Python

Scikit-learn provides the train_test_split function for this.


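from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Example data (an assumption for illustration); replace with your own features X and target y
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)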

Explanation:

80% Training Data
20% Test Data

Creating a Validation Set


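# First split: keep 70% for training and set aside 30% for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)

# Second split: divide the held-out 30% equally into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp,
    y_temp,
    test_size=0.5,
    random_state=42
)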

Result:

70% Training
15% Validation
15% Test
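
The two-step split can be wrapped in a small reusable function. This is a sketch under stated assumptions: `train_val_test_split` is a hypothetical helper, not a scikit-learn function, and the toy arrays only serve to check the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, random_state=42):
    """Split X and y into train, validation, and test subsets."""
    holdout_frac = val_frac + test_frac
    # First split: training set versus everything that is held out
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=holdout_frac, random_state=random_state
    )
    # Second split: the test set's share of the held-out data
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=test_frac / holdout_frac, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data: 1000 samples with 2 features each
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000) % 2

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Note that the second `test_size` is relative to the held-out portion, not to the full dataset, which is why 0.15 / 0.30 = 0.5 appears in the original two-step code.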

What is Random Splitting?

Random splitting randomly distributes samples into subsets.

Example:

Dataset:

Sample
A
B
C
D

Random split:

Training:

A, C

Testing:

B, D

Most Machine Learning projects use random splitting.
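
Under the hood, a random split is just a shuffle followed by a cut. A minimal sketch using only the standard library, where `random_split` is a hypothetical helper and the seed plays the same role as `random_state` in scikit-learn:

```python
import random


def random_split(samples, test_fraction=0.5, seed=42):
    """Shuffle the samples with a fixed seed, then cut them into train and test."""
    rng = random.Random(seed)   # a private generator keeps the split reproducible
    shuffled = list(samples)    # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = random_split(["A", "B", "C", "D"])
print("Training:", train)
print("Testing: ", test)
```

Which samples land in each subset depends on the seed, but the same seed always gives the same split, which makes experiments repeatable.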

What is Stratified Splitting?

When classes are imbalanced, random splitting may create biased subsets.

Example:

Class	Count
Positive	90
Negative	10

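Random splitting may accidentally produce:

Test Set:

Class	Count
Positive	19
Negative	1

Here the negative class makes up only 5% of the test set instead of 10%, so any metric for that class rests on a single sample.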

Stratified Split

Stratified splitting preserves class proportions.

Python:


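# Stratify on y so both subsets keep the original class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2,
    random_state=42
)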

Benefits:

Better class distribution
More reliable evaluation

Example

Original Dataset:

Class	Percentage
Positive	90%
Negative	10%

After stratification:

Training:

90%-10%

Testing:

90%-10%
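
The preserved proportions can be checked directly. The sketch below rebuilds the 90/10 example with toy labels (the placeholder features are an assumption and carry no meaning) and counts the negative class in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels matching the example: 90 positive (1) and 10 negative (0)
y = np.array([1] * 90 + [0] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_negatives = int((y_train == 0).sum())
test_negatives = int((y_test == 0).sum())
print("Train negatives:", train_negatives, "of", len(y_train))  # 8 of 80
print("Test negatives: ", test_negatives, "of", len(y_test))    # 2 of 20
```

Both subsets keep exactly 10% negatives, regardless of the seed.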

Time Series Data Splitting

Time series datasets require special handling.

Incorrect:

Random splitting

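This is incorrect because shuffling can place later observations in the training set and earlier ones in the test set, so future information leaks into training.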

Correct Time Series Split

Example:

Year	Data Usage
2020	Training
2021	Training
2022	Validation
2023	Testing

Always preserve chronological order.
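
A chronological split amounts to sorting by time and cutting at fixed dates. The sketch below uses made-up yearly records (the `sales` values are purely illustrative) and the same year boundaries as the table above.

```python
# Hypothetical yearly records, deliberately stored out of order
records = [
    {"year": 2022, "sales": 130},
    {"year": 2020, "sales": 100},
    {"year": 2023, "sales": 155},
    {"year": 2021, "sales": 115},
]

# Sort by time first so the split never depends on storage order
records.sort(key=lambda r: r["year"])

# Cut at fixed points in time instead of sampling at random
train = [r for r in records if r["year"] <= 2021]
val = [r for r in records if r["year"] == 2022]
test = [r for r in records if r["year"] >= 2023]

print([r["year"] for r in train])  # [2020, 2021]
print([r["year"] for r in val])    # [2022]
print([r["year"] for r in test])   # [2023]
```

For cross-validation on ordered data, scikit-learn also provides `TimeSeriesSplit`, which always validates on observations that come after the training window.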

Why Time Series Splits Are Different

Future events should never be used to predict past events.

Violating this creates:

Data Leakage

What is Data Leakage?

Data leakage occurs when information unavailable during prediction accidentally enters the training process.

Example:

Using future sales values to predict current sales.

This produces unrealistically high accuracy during evaluation that does not carry over to real predictions.

Common Data Leakage Example

Incorrect Workflow:


Scale entire dataset
Split dataset
Train model

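Because the scaling statistics (such as the mean and standard deviation) are computed on the entire dataset, test data influences how the training data is scaled.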

Correct Workflow:


Split dataset
Fit scaler on training data
Transform training data
Transform test data
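
The correct workflow can be sketched with scikit-learn's `StandardScaler`. The random data here is an assumption for illustration; the point is that `fit` only ever sees the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels (assumptions for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on training data only, and transform the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training statistics on the test data (transform only, no fit)
X_test_scaled = scaler.transform(X_test)

print("Samples seen by the scaler:", int(scaler.n_samples_seen_))
print("Training feature means after scaling:", X_train_scaled.mean(axis=0).round(6))
```

The scaled test features will usually not have a mean of exactly zero, which is expected: they are scaled with statistics the model could legitimately know at prediction time.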

Holdout Validation

The simplest evaluation approach.

Workflow:

Split dataset once
Train model
Evaluate on validation/test set

Advantages:

Fast
Easy

Disadvantages:

Performance depends on one split

Cross-Validation

Cross-validation provides a more reliable evaluation than a single holdout split.

Dataset:

1000 samples

Split into:

5 folds

Workflow:

Train on 4 folds
Validate on 1 fold
Repeat 5 times

Average performance is reported.

K-Fold Cross Validation

Example:

K = 5

Fold 1:

Train: 80%

Validate: 20%

Repeat until every fold serves as validation once.
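
To make the fold rotation concrete, here is a minimal sketch that generates the index sets by hand. `k_fold_indices` is a hypothetical helper written for this example; it does not shuffle, whereas library implementations such as scikit-learn's `KFold` offer shuffling as an option.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(1000, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={len(train_idx)}, validate={len(val_idx)}")
```

With 1000 samples and K = 5, every fold trains on 800 samples and validates on 200, and each sample is used for validation exactly once.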

K-Fold Cross Validation in Python


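from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example data and model (assumptions for illustration); substitute your own
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train and validate on 5 different folds; returns one score per fold
scores = cross_val_score(
    model,
    X,
    y,
    cv=5
)

print(scores)
print(scores.mean())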

Advantages of Cross-Validation

Better performance estimation
Lower variance in the performance estimate
More reliable evaluation

Disadvantages of Cross-Validation

Computationally expensive
Slower on large datasets

Stratified K-Fold

Combines:

Cross-validation
Stratification

Useful for classification problems with imbalanced classes.
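
A short sketch with scikit-learn's `StratifiedKFold`, reusing the 90/10 class balance from the stratified splitting example. The placeholder features are an assumption and carry no meaning.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 90 positive (1) and 10 negative (0)
y = np.array([1] * 90 + [0] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many negative samples end up in each validation fold
negatives_per_fold = []
for train_idx, val_idx in skf.split(X, y):
    negatives_per_fold.append(int((y[val_idx] == 0).sum()))

print(negatives_per_fold)  # [2, 2, 2, 2, 2]
```

Every validation fold contains 20 samples with exactly 2 negatives, so each fold mirrors the 10% minority share of the full dataset.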

Leave-One-Out Cross Validation (LOOCV)

Extreme version of cross-validation.

If dataset contains:

100 samples

Then:

99 samples → Training

1 sample → Validation

Repeated 100 times.

Advantages:

Maximum training data

Disadvantages:

Very slow (requires one model fit per sample)
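
LOOCV is available in scikit-learn as `LeaveOneOut`. The sketch below assumes a synthetic 100-sample dataset and a logistic regression model, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data (an assumption for illustration): 100 samples
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

loo = LeaveOneOut()

# One fit per sample: train on 99 samples, validate on the remaining one
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("Number of fits:", len(scores))  # 100
print("Mean accuracy:", scores.mean())
```

Each individual score is either 0 or 1, because every validation set holds a single sample; only the mean is meaningful.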

Comparing Evaluation Methods

Method	Reliability	Speed
Holdout Split	Moderate	Fast
K-Fold CV	High	Moderate
Stratified K-Fold	Very High	Moderate
LOOCV	High (low bias, but the estimate can vary widely)	Slow

Real-World Example

Suppose a company wants to predict customer churn.

Dataset:

100,000 customers

Split:

Dataset	Samples
Train	70,000
Validation	15,000
Test	15,000

Workflow:

Train model on 70,000 records
Tune hyperparameters using validation data
Evaluate final model using test data

Best Practices

Split data before fitting any preprocessing step (scaling, imputation, encoding)
Keep the test set untouched until the final evaluation
Use stratified splits for classification, especially with imbalanced classes
Use chronological splits for time series
Use cross-validation for more reliable evaluation, particularly on small datasets
Avoid data leakage

Common Mistakes

Using Test Data During Training

Incorrect:


Train → Test → Modify Model → Test Again

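Each time the model is modified in response to test results, the test set indirectly becomes part of the training process, and its score stops being a trustworthy estimate.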

Tuning Hyperparameters on Test Data

Validation data should be used for tuning.

Test data should only be used once.

Random Split for Time Series

Time series data requires chronological splitting.

Random splitting leaks future information into the training set.

Train-Test-Validation Split Workflow

A typical Machine Learning workflow is:

Collect data
Split into Train, Validation, Test
Fit preprocessing on training data, then apply it to validation and test data
Train model
Tune using validation data
Select best model
Evaluate on test data
Deploy model
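
The workflow up to the final evaluation can be sketched end to end. Everything specific here is an assumption for illustration: the synthetic data, the logistic regression model, and the candidate values for its regularization strength `C`. Wrapping the scaler and model in a pipeline means the scaler is fitted on training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data (a synthetic stand-in for a real dataset)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Split into train (70%), validation (15%), and test (15%), stratified by class
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=0
)

# Preprocess, train, and tune: one pipeline per candidate, scored on validation data
best_score, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

# Evaluate the selected model on the test set, once
test_score = best_model.score(X_test, y_test)
print(f"Validation accuracy: {best_score:.3f}")
print(f"Test accuracy:       {test_score:.3f}")
```

Deployment is outside the scope of this sketch; in practice the selected pipeline would be saved and served as a single object so the same preprocessing is applied to new data.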

Why Train-Test-Validation Splitting is Essential

Without proper data splitting, model evaluation becomes unreliable and often overly optimistic. A model that appears highly accurate during training may fail completely when exposed to real-world data.

Train-Test-Validation splitting provides a realistic estimate of model performance, helps detect overfitting, enables hyperparameter tuning, and gives evidence that Machine Learning systems generalize to unseen data. It is one of the most fundamental concepts in the Machine Learning workflow.