One of the most important steps in the Machine Learning pipeline is dividing data into separate subsets before training a model. This process is known as Train-Test-Validation Splitting.

A Machine Learning model should not only perform well on the data it has seen during training but also on new, unseen data. If a model is evaluated on the same data used for training, the evaluation results become overly optimistic and misleading.

Train-Test-Validation splitting helps us measure how well a model generalizes to unseen data and detect overfitting before the model is deployed.

Every Machine Learning project, whether it involves:

  • Classification
  • Regression
  • Deep Learning
  • Computer Vision
  • NLP
  • Recommendation Systems

uses some form of data splitting before model training.

In this article, we will understand why data splitting is necessary, how training, validation, and test sets work, common splitting strategies, and best practices.

Why Do We Need Data Splitting?

Suppose we have a dataset containing 10,000 student records.

If we train a model and evaluate it on the same 10,000 records:

Training Accuracy:

99%

This looks impressive.

However, the real question is:

How will the model perform on completely new students?

Without separate evaluation data, we cannot answer this question.

Understanding Generalization

The ultimate goal of Machine Learning is:

Learning patterns that generalize to unseen data.

Good Model:

  • Performs well on training data
  • Performs well on unseen data

Poor Model:

  • Memorizes training data
  • Performs poorly on new data

Train-Test-Validation splitting helps measure generalization.

What is a Dataset?

A dataset usually contains:

| Features | Target |
| --- | --- |
| Age, Salary | Purchased |
| Height, Weight | Disease Risk |
| Pixels | Image Class |

Before training, this dataset is divided into separate subsets.

The Three Data Splits

A typical Machine Learning workflow uses:

  1. Training Set
  2. Validation Set
  3. Test Set

Training Set

The Training Set is used to teach the model.

The model learns:

  • Patterns
  • Relationships
  • Parameters
  • Weights

Example:

If a dataset contains 10,000 samples:

Training Set:

7,000 samples

The model only learns from this data.

Validation Set

The Validation Set is used during model development.

It helps:

  • Tune hyperparameters
  • Compare models
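  • Detect overfitting early

The model's parameters are never fit on validation data.

Instead, validation data is used to evaluate the model during development. Because hyperparameters are chosen to score well on it, validation scores end up slightly optimistic, which is why a separate test set is still needed.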

Test Set

The Test Set is used only after model development is complete.

Purpose:

  • Final performance evaluation
  • Simulating real-world unseen data

The Test Set should remain untouched until the very end.

Visualizing the Split

Dataset:

10,000 Samples

Typical split:

| Dataset | Percentage |
| --- | --- |
| Training | 70% |
| Validation | 15% |
| Test | 15% |

Example:

| Dataset | Samples |
| --- | --- |
| Training | 7,000 |
| Validation | 1,500 |
| Test | 1,500 |

Why Can't We Use Only Training Data?

Consider:

Training Accuracy:

99%

Test Accuracy:

72%

This indicates:

Overfitting

The model memorized training data instead of learning general patterns.

Without a test set, we would never detect this problem.

What is Overfitting?

Overfitting occurs when a model learns:

  • Noise
  • Random fluctuations
  • Dataset-specific details

instead of actual patterns.

Example:

Training Accuracy:

99%

Validation Accuracy:

75%

The large gap indicates overfitting.
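
A gap of this kind can be reproduced with a small sketch. The synthetic dataset from make_classification, the 20% label noise, and the unconstrained decision tree are all assumptions chosen for illustration; the exact validation accuracy will differ from the 75% above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (illustrative assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can memorize every training sample,
# including the noisy ones
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"Training accuracy:   {train_acc:.2f}")
print(f"Validation accuracy: {val_acc:.2f}")
```

Limiting the tree (for example with max_depth) usually narrows the gap, at the cost of a lower training score.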

What is Underfitting?

Underfitting occurs when a model fails to learn meaningful patterns.

Example:

Training Accuracy:

60%

Validation Accuracy:

58%

Both are low.

The model is likely too simple for the data, or it has not been trained long enough.

Role of Validation Data

Validation data helps answer questions such as:

  • Which algorithm should we choose?
  • What learning rate should we use?
  • What value of K should KNN use?
  • How many layers should a neural network have?

These decisions are called:

Hyperparameter Tuning

Model Parameters vs Hyperparameters

| Parameters | Hyperparameters |
| --- | --- |
| Learned during training | Set before training |
| Weights | Learning Rate |
| Coefficients | Number of Trees |
| Biases | Batch Size |

Validation data helps select optimal hyperparameters.
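
As a rough sketch of the KNN question above, the loop below tries a few values of K and keeps the one with the best validation accuracy. The synthetic data and the candidate list are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real dataset (assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=0
)

candidate_ks = [1, 3, 5, 7, 9]  # hypothetical search grid
val_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = knn.score(X_val, y_val)  # the test set is not touched here

# Pick the K with the highest validation accuracy
best_k = max(val_scores, key=val_scores.get)
print("Validation accuracy per K:", val_scores)
print("Selected K:", best_k)
```

Only after K is fixed would the chosen model be scored on X_test.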

Common Split Ratios

There is no universal rule.

Common choices include:

| Training | Validation | Test |
| --- | --- | --- |
| 60% | 20% | 20% |
| 70% | 15% | 15% |
| 80% | 10% | 10% |

The right choice depends mainly on dataset size: the validation and test sets need enough samples to give stable estimates.

Small Dataset Example

Dataset:

1000 samples

Possible split:

| Dataset | Samples |
| --- | --- |
| Train | 700 |
| Validation | 150 |
| Test | 150 |

Large Dataset Example

Dataset:

1,000,000 samples

Possible split:

| Dataset | Samples |
| --- | --- |
| Train | 900,000 |
| Validation | 50,000 |
| Test | 50,000 |

Large datasets can use smaller validation and test percentages, because even 5% of a million samples is enough for a stable estimate.
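
The arithmetic behind these two examples can be checked with a small helper. The function name split_counts is hypothetical and not part of any library.

```python
def split_counts(n_samples, train_ratio, val_ratio):
    """Return (train, validation, test) sample counts for the given ratios.

    The test set receives whatever remains after train and validation,
    so the three counts always add up to n_samples.
    """
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test


small = split_counts(1_000, 0.70, 0.15)
large = split_counts(1_000_000, 0.90, 0.05)
print(small)  # (700, 150, 150)
print(large)  # (900000, 50000, 50000)
```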

Train-Test Split in Python

Scikit-learn provides a simple function.

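from sklearn.model_selection import train_test_split

# X: feature matrix, y: target vector (assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # hold out 20% of the samples for testing
    random_state=42,  # fixed seed so the split is reproducible
)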

Explanation:

  • 80% Training Data
  • 20% Test Data

Creating a Validation Set

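# Step 1: keep 70% for training, set aside 30% for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
)

# Step 2: split the remaining 30% in half (15% validation, 15% test)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp,
    y_temp,
    test_size=0.5,
    random_state=42,
)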

Result:

  • 70% Training
  • 15% Validation
  • 15% Test
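
To confirm the resulting sizes, here is a self-contained sketch of the same two-step split on 1,000 placeholder samples. The arrays X and y are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 samples with one feature and a binary target
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```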

What is Random Splitting?

Random splitting randomly distributes samples into subsets.

Example:

Dataset:

Sample
A
B
C
D

Random split:

Training:

A, C

Testing:

B, D

Random splitting is the usual default when samples are independent of one another and classes are reasonably balanced.
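
What a random split does can be sketched by hand with NumPy: shuffle the sample indices, then cut the shuffled order in two. Which samples land in each subset depends on the seed, so the output need not match the A, C / B, D example above.

```python
import numpy as np

samples = np.array(["A", "B", "C", "D"])

# Shuffle the indices 0..3 with a fixed seed for reproducibility
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(samples))

# First half goes to training, second half to testing
n_train = len(samples) // 2
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

print("Training:", samples[train_idx])
print("Testing: ", samples[test_idx])
```

train_test_split does essentially this, with extra handling for ratios, multiple arrays, and stratification.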

What is Stratified Splitting?

When classes are imbalanced, random splitting may create biased subsets.

Example:

| Class | Count |
| --- | --- |
| Positive | 90 |
| Negative | 10 |

Random splitting may accidentally produce:

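Test Set:

| Class | Count |
| --- | --- |
| Positive | 19 |
| Negative | 1 |

Negatives now make up 5% of the test set instead of 10%, and any metric for the negative class rests on a single sample.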

Stratified Split

Stratified splitting preserves class proportions.

Python:

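X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,       # keep the class proportions of y in both subsets
    test_size=0.2,
    random_state=42,  # fixed seed so the split is reproducible
)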

Benefits:

  • Better class distribution
  • More reliable evaluation

Example

Original Dataset:

| Class | Percentage |
| --- | --- |
| Positive | 90% |
| Negative | 10% |

After stratification:

Training:

90%-10%

Testing:

90%-10%
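
The preserved proportions can be verified on a toy dataset of 90 positives and 10 negatives. The arrays below are placeholders built only for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90 positive (1) and 10 negative (0) labels; features are placeholders
y = np.array([1] * 90 + [0] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Count positives and negatives in each subset
train_pos, train_neg = int((y_train == 1).sum()), int((y_train == 0).sum())
test_pos, test_neg = int((y_test == 1).sum()), int((y_test == 0).sum())

print("Training:", train_pos, "positive,", train_neg, "negative")  # 72, 8
print("Testing: ", test_pos, "positive,", test_neg, "negative")    # 18, 2
```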

Time Series Data Splitting

Time series datasets require special handling.

Incorrect:

Random splitting

With a random split, observations from the future can end up in the training set, so future information leaks into training.

Correct Time Series Split

Example:

| Year | Data Usage |
| --- | --- |
| 2020 | Training |
| 2021 | Training |
| 2022 | Validation |
| 2023 | Testing |

Always preserve chronological order.
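
A minimal sketch of the year-based split in the table, assuming monthly observations stored in NumPy arrays. The years and values arrays are hypothetical placeholders.

```python
import numpy as np

# One observation per month for 2020-2023 (placeholder data)
years = np.repeat([2020, 2021, 2022, 2023], 12)
values = np.arange(len(years), dtype=float)

# Select by time, never by random shuffle
train_mask = years <= 2021
val_mask = years == 2022
test_mask = years == 2023

train, val, test = values[train_mask], values[val_mask], values[test_mask]
print(len(train), len(val), len(test))  # 24 12 12
```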

Why Time Series Splits Are Different

Future events should never be used to predict past events.

Violating this creates:

Data Leakage
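
For repeated evaluation on ordered data, scikit-learn offers TimeSeriesSplit, which always places the validation block after the training block. The sketch below uses 12 placeholder time steps and only inspects the indices it produces.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered placeholder observations
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, val_idx in splits:
    # Every training index precedes every validation index
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())
```

Each round trains on a longer prefix of the series and validates on the block that immediately follows it.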

What is Data Leakage?

Data leakage occurs when information unavailable during prediction accidentally enters the training process.

Example:

Using future sales values to predict current sales.

This creates unrealistically high accuracy.

Common Data Leakage Example

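Incorrect Workflow:

  1. Scale entire dataset
  2. Split dataset
  3. Train model

Here the test data influences the scaling statistics (such as the mean and standard deviation), so information about the test set reaches the model.

Correct Workflow:

  1. Split dataset
  2. Fit scaler on training data
  3. Transform training data
  4. Transform test data
  5. Train model on the transformed training data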
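
The correct workflow might look like the sketch below. StandardScaler is used as an example scaler and the random data is a placeholder; both are assumptions, since the text does not name a specific scaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on training data only
scaler = StandardScaler().fit(X_train)

# 3-4. Apply the same training statistics to both subsets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaler means (from training data only):", scaler.mean_)
```

Wrapping the scaler and the model in a scikit-learn Pipeline achieves the same separation automatically, including inside cross-validation.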

Holdout Validation

Holdout validation is the simplest evaluation approach.

Workflow:

  1. Split dataset once
  2. Train model
  3. Evaluate on validation/test set

Advantages:

  • Fast
  • Easy

Disadvantages:

  • Performance depends on one split
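
The dependence on a single split can be seen by repeating the holdout procedure with different seeds. The dataset, the model, and the five seeds below are illustrative assumptions; the scores will usually differ from seed to seed, more so on small datasets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumption), where split-to-split noise is visible
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

holdout_scores = []
for seed in range(5):
    # Same data and model, only the random split changes
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    holdout_scores.append(model.score(X_val, y_val))

print("Validation accuracy per split:", holdout_scores)
print("Spread:", max(holdout_scores) - min(holdout_scores))
```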

Cross-Validation

Cross-validation provides more reliable evaluation.

Dataset:

1000 samples

Split into:

5 folds

Workflow:

  • Train on 4 folds
  • Validate on 1 fold
  • Repeat 5 times

Average performance is reported.

K-Fold Cross Validation

Example:

K = 5

Fold 1:

Train: 80%

Validate: 20%

Repeat until every fold serves as validation once.
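
The fold mechanics can be inspected directly with scikit-learn's KFold. The 1,000 placeholder samples below are an assumption that matches the earlier example.

```python
import numpy as np
from sklearn.model_selection import KFold

# 1,000 placeholder samples
X = np.arange(1000).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
validated = []
for train_idx, val_idx in kf.split(X):
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())

print(fold_sizes)  # five (800, 200) pairs: 80% train, 20% validate

# Every sample is used for validation exactly once
print(sorted(validated) == list(range(1000)))  # True
```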

K-Fold Cross Validation in Python

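from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Any scikit-learn estimator works here; LogisticRegression is only an example
model = LogisticRegression(max_iter=1000)

# Train and validate 5 times, once per fold
scores = cross_val_score(
    model,
    X,
    y,
    cv=5,
)

print(scores)         # one score per fold
print(scores.mean())  # average performance across folds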

Advantages of Cross-Validation

  • Better performance estimation
  • Lower variance in the performance estimate
  • More reliable evaluation

Disadvantages of Cross-Validation

  • Computationally expensive
  • Slower on large datasets

Stratified K-Fold

Combines:

  • Cross-validation
  • Stratification

Useful for classification problems with imbalanced classes.
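
A short sketch with scikit-learn's StratifiedKFold, using a placeholder dataset of 90 positives and 10 negatives, shows that every validation fold keeps the 90/10 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 90 positive (1) and 10 negative (0) labels; features are placeholders
y = np.array([1] * 90 + [0] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    y_val = y[val_idx]
    fold_counts.append((int((y_val == 1).sum()), int((y_val == 0).sum())))

print(fold_counts)  # each fold: (18, 2)
```

With plain KFold on the same data, a fold could end up with no negatives at all.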

Leave-One-Out Cross Validation (LOOCV)

LOOCV is the extreme case of K-fold cross-validation, with K equal to the number of samples.

If dataset contains:

100 samples

Then:

99 samples → Training

1 sample → Validation

Repeated 100 times.

Advantages:

  • Maximum training data

Disadvantages:

  • Very slow
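
The number of rounds and the size of each split can be checked with scikit-learn's LeaveOneOut on 100 placeholder samples.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# 100 placeholder samples
X = np.arange(100).reshape(-1, 1)

loo = LeaveOneOut()
n_rounds = loo.get_n_splits(X)

# Collect the distinct (train size, validation size) pairs across all rounds
split_sizes = {(len(train_idx), len(val_idx)) for train_idx, val_idx in loo.split(X)}

print(n_rounds)     # 100
print(split_sizes)  # {(99, 1)}
```

Because the model is refit once per sample, the cost grows linearly with dataset size, which is why LOOCV is rarely practical beyond small datasets.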

Comparing Evaluation Methods

| Method | Reliability | Speed |
| --- | --- | --- |
| Holdout Split | Moderate | Fast |
| K-Fold CV | High | Moderate |
| Stratified K-Fold | Very High | Moderate |
| LOOCV | Very High | Slow |

Real-World Example

Suppose a company wants to predict customer churn.

Dataset:

100,000 customers

Split:

| Dataset | Samples |
| --- | --- |
| Train | 70,000 |
| Validation | 15,000 |
| Test | 15,000 |

Workflow:

  1. Train model on 70,000 records
  2. Tune hyperparameters using validation data
  3. Evaluate final model using test data

Best Practices

  • Always split data before fitting any preprocessing step (scalers, encoders, imputers)
  • Keep the test set untouched until the final evaluation
  • Use stratified splits for classification, especially with imbalanced classes
  • Use chronological splits for time series
  • Use cross-validation for reliable evaluation
  • Avoid data leakage

Common Mistakes

Using Test Data During Training

Incorrect:

Train → Test → Modify Model → Test Again

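Each round of modification fits the model a little more to the test set, so the test set effectively becomes part of the training process and its score stops being an unbiased estimate.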

Tuning Hyperparameters on Test Data

Validation data should be used for tuning.

Test data should only be used once.

Random Split for Time Series

Time series data requires chronological splitting.

Random splitting causes leakage.

Train-Test-Validation Split Workflow

A typical Machine Learning workflow is:

  1. Collect data
  2. Split into Train, Validation, Test
  3. Fit preprocessing on training data, then apply it to all three sets
  4. Train model
  5. Tune using validation data
  6. Select best model
  7. Evaluate on test data
  8. Deploy model
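
Steps 1 to 7 can be put together in one sketch. Everything concrete here is an assumption made for illustration: the synthetic data, the logistic regression model, and the candidate values of its regularization strength C.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect data (synthetic stand-in)
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

# 2. Split into train (70%), validation (15%), test (15%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=0
)

# 3-6. Preprocess, train, tune on validation data, select the best model
candidate_Cs = [0.01, 0.1, 1.0, 10.0]  # hypothetical search grid
best_score, best_C, best_model = -1.0, None, None
for C in candidate_Cs:
    # The pipeline fits the scaler on training data only
    model = make_pipeline(
        StandardScaler(), LogisticRegression(C=C, max_iter=1000)
    )
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_C, best_model = score, C, model

# 7. Evaluate once on the untouched test set
test_acc = best_model.score(X_test, y_test)
print(f"Selected C: {best_C}")
print(f"Validation accuracy: {best_score:.3f}")
print(f"Test accuracy:       {test_acc:.3f}")
```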

Why Train-Test-Validation Splitting is Essential

Without proper data splitting, model evaluation becomes unreliable and often overly optimistic. A model that appears highly accurate during training may fail completely when exposed to real-world data.

Train-Test-Validation splitting provides an unbiased estimate of model performance, helps detect overfitting, enables hyperparameter tuning, and shows whether a Machine Learning system generalizes to unseen data. It is one of the most fundamental concepts in the entire Machine Learning workflow.