One of the most important steps in the Machine Learning pipeline is dividing data into separate subsets before training a model. This process is known as Train-Test-Validation Splitting.
A Machine Learning model should not only perform well on the data it has seen during training but also on new, unseen data. If a model is evaluated on the same data used for training, the evaluation results become overly optimistic and misleading.
Train-Test-Validation splitting helps us measure how well a model generalizes to unseen data and makes overfitting detectable.
Nearly every supervised Machine Learning project, whether it involves:
- Classification
- Regression
- Deep Learning
- Computer Vision
- NLP
- Recommendation Systems
uses some form of data splitting before model training.
In this article, we will understand why data splitting is necessary, how training, validation, and test sets work, common splitting strategies, and best practices.
Why Do We Need Data Splitting?
Suppose we have a dataset containing 10,000 student records.
If we train a model and evaluate it on the same 10,000 records:
Training Accuracy:
99%
This looks impressive.
However, the real question is:
How will the model perform on completely new students?
Without separate evaluation data, we cannot answer this question.
Understanding Generalization
The ultimate goal of Machine Learning is:
Learning patterns that generalize to unseen data.
Good Model:
- Performs well on training data
- Performs well on unseen data
Poor Model:
- Memorizes training data
- Performs poorly on new data
Train-Test-Validation splitting helps measure generalization.
What is a Dataset?
A dataset usually contains:
| Features | Target |
|---|---|
| Age, Salary | Purchased |
| Height, Weight | Disease Risk |
| Pixels | Image Class |
Before training, this dataset is divided into separate subsets.
The Three Data Splits
A typical Machine Learning workflow uses:
- Training Set
- Validation Set
- Test Set
Training Set
The Training Set is used to teach the model.
The model learns:
- Patterns
- Relationships
- Parameters
- Weights
Example:
If a dataset contains 10,000 samples:
Training Set:
7,000 samples
The model only learns from this data.
Validation Set
The Validation Set is used during model development.
It helps:
- Tune hyperparameters
- Compare models
- Detect overfitting (for example, to decide when to stop training)
The model's parameters are never fit on validation data.
Instead, validation data is used to evaluate the model while it is being developed.
Test Set
The Test Set is used only after model development is complete.
Purpose:
- Final performance evaluation
- Simulating real-world unseen data
The Test Set should remain untouched until the very end.
Visualizing the Split
Dataset:
10,000 Samples
Typical split:
| Dataset | Percentage |
|---|---|
| Training | 70% |
| Validation | 15% |
| Test | 15% |
Example:
| Dataset | Samples |
|---|---|
| Training | 7000 |
| Validation | 1500 |
| Test | 1500 |
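The sample counts in the table follow directly from the percentages. As a minimal sketch, the hypothetical helper `split_sizes` below (not part of any library) does the arithmetic with integer percentages and gives the test set whatever remains, so the three counts always add up to the dataset size.

```python
def split_sizes(n_samples, train_pct=70, val_pct=15):
    """Return (train, validation, test) sample counts for integer percentages."""
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    n_test = n_samples - n_train - n_val  # the test set takes the remainder
    return n_train, n_val, n_test


print(split_sizes(10_000))  # (7000, 1500, 1500)
```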
Why Can't We Use Only Training Data?
Consider:
Training Accuracy:
99%
Test Accuracy:
72%
This indicates:
Overfitting
The model memorized training data instead of learning general patterns.
Without a test set, we would never detect this problem.
What is Overfitting?
Overfitting occurs when a model learns:
- Noise
- Random fluctuations
- Dataset-specific details
instead of actual patterns.
Example:
Training Accuracy:
99%
Validation Accuracy:
75%
The large gap indicates overfitting.
What is Underfitting?
Underfitting occurs when a model fails to learn meaningful patterns.
Example:
Training Accuracy:
60%
Validation Accuracy:
58%
Both are low.
This usually means the model is too simple, or the features are not informative enough.
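Both patterns can be reproduced on synthetic data. The sketch below is only an illustration under assumed settings: the dataset comes from `make_classification` with 10% label noise, and a decision tree is limited to depth 1 (likely too simple) or left unconstrained (free to memorize). The exact accuracies depend on the data and are not the figures quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y randomly reassigns 10% of the labels.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for name, depth in [("depth-1 tree", 1), ("unconstrained tree", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, validation accuracy) for each model.
    results[name] = (tree.score(X_train, y_train), tree.score(X_val, y_val))
    print(f"{name}: train={results[name][0]:.2f}, val={results[name][1]:.2f}")
```

The unconstrained tree fits the training data perfectly, noise included, so a gap between its two scores is the overfitting signal. Similar low scores on both sets would point to underfitting.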
Role of Validation Data
Validation data helps answer questions such as:
- Which algorithm should we choose?
- What learning rate should we use?
- What value of K should KNN use?
- How many layers should a neural network have?
Making these decisions is called:
Model Selection and Hyperparameter Tuning
Model Parameters vs Hyperparameters
| Parameters | Hyperparameters |
|---|---|
| Learned during training | Set before training |
| Weights | Learning Rate |
| Coefficients | Number of Trees |
| Biases | Batch Size |
Validation data helps select optimal hyperparameters.
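As a rough sketch of how a validation set drives this choice, the loop below tries several values of K for KNN and keeps the one with the highest validation accuracy. The synthetic dataset and the candidate list `[1, 3, 5, 7, 9]` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

val_scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)                 # parameters come from training data
    val_scores[k] = knn.score(X_val, y_val)   # hyperparameter judged on validation data

best_k = max(val_scores, key=val_scores.get)
print(val_scores)
print("Best K:", best_k)
```

Because `best_k` was chosen using the validation set, its validation score is slightly optimistic. That is why a separate test set is still needed for the final estimate.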
Common Split Ratios
There is no universal rule.
Common choices include:
| Training | Validation | Test |
|---|---|---|
| 60% | 20% | 20% |
| 70% | 15% | 15% |
| 80% | 10% | 10% |
The choice depends on dataset size.
Small Dataset Example
Dataset:
1000 samples
Possible split:
| Dataset | Samples |
|---|---|
| Train | 700 |
| Validation | 150 |
| Test | 150 |
Large Dataset Example
Dataset:
1,000,000 samples
Possible split:
| Dataset | Samples |
|---|---|
| Train | 900,000 |
| Validation | 50,000 |
| Test | 50,000 |
Large datasets can afford smaller validation and test percentages, because even a small fraction still contains enough samples for a stable estimate.
Train-Test Split in Python
Scikit-learn provides a simple function.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # hold out 20% of the samples for testing
    random_state=42   # fix the shuffle so the split is reproducible
)
Explanation:
- 80% Training Data
- 20% Test Data
Creating a Validation Set
X_train, X_temp, y_train, y_temp = train_test_split(
    X,
    y,
    test_size=0.3,    # 70% training, 30% held out
    random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp,
    y_temp,
    test_size=0.5,    # split the held-out 30% evenly: 15% validation, 15% test
    random_state=42
)
Result:
- 70% Training
- 15% Validation
- 15% Test
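To check the resulting proportions, the two-step split can be run on placeholder data. The arrays below are made up for illustration (10,000 rows with unique values, so overlap between subsets is easy to detect).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10,000 rows, 2 features, alternating binary labels.
X = np.arange(20_000).reshape(10_000, 2)
y = np.arange(10_000) % 2

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 7000 1500 1500

# Every row lands in exactly one subset.
train_ids, val_ids, test_ids = set(X_train[:, 0]), set(X_val[:, 0]), set(X_test[:, 0])
print(train_ids.isdisjoint(val_ids), train_ids.isdisjoint(test_ids), val_ids.isdisjoint(test_ids))
```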
What is Random Splitting?
Random splitting assigns samples to subsets at random.
Example:
Dataset:
| Sample |
|---|
| A |
| B |
| C |
| D |
Random split:
Training:
A, C
Testing:
B, D
Most Machine Learning projects use random splitting.
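Under the hood, a random split amounts to shuffling the sample indices and cutting the shuffled list. The sketch below does this by hand with NumPy for the four samples above. The seed is arbitrary, and which samples land in which subset depends on it.

```python
import numpy as np

samples = np.array(["A", "B", "C", "D"])

rng = np.random.default_rng(seed=42)     # seeded for reproducibility
shuffled = rng.permutation(len(samples)) # random ordering of the indices 0..3

n_train = len(samples) // 2              # 50/50 split for this tiny example
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

print("Training:", samples[train_idx])
print("Testing:", samples[test_idx])
```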
What is Stratified Splitting?
When classes are imbalanced, random splitting may create biased subsets.
Example:
| Class | Count |
|---|---|
| Positive | 90 |
| Negative | 10 |
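Random splitting may accidentally produce a test set such as:
| Class | Count |
|---|---|
| Positive | 19 |
| Negative | 1 |
Here the minority class makes up 5% of the test set instead of 10%, so any metric for that class rests on a single sample.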
Stratified Split
Stratified splitting preserves class proportions.
Python:
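X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,       # keep the class proportions of y in both subsets
    test_size=0.2,
    random_state=42   # fix the shuffle so the split is reproducible
)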
Benefits:
- Better class distribution
- More reliable evaluation
Example
Original Dataset:
| Class | Percentage |
|---|---|
| Positive | 90% |
| Negative | 10% |
After stratification:
Training:
90%-10%
Testing:
90%-10%
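The preserved proportions can be verified by counting labels after the split. The example below builds a made-up 100-sample dataset with the same 90/10 class balance and counts each class in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90 positive (1) and 10 negative (0) labels; features are placeholders.
y = np.array([1] * 90 + [0] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = np.bincount(y_train)  # index 0 = negatives, index 1 = positives
test_counts = np.bincount(y_test)

print("Train:", train_counts)  # 8 negatives, 72 positives
print("Test:", test_counts)    # 2 negatives, 18 positives
```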
Time Series Data Splitting
Time series datasets require special handling.
Incorrect:
Random splitting, because it lets future information leak into training.
Correct Time Series Split
Example:
| Year | Data Usage |
|---|---|
| 2020 | Training |
| 2021 | Training |
| 2022 | Validation |
| 2023 | Testing |
Always preserve chronological order.
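A chronological split like the one in the table can be sketched with boolean masks on the time column. The data here is invented for illustration: 12 observations per year for 2020 through 2023, with a placeholder series of values.

```python
import numpy as np

# Hypothetical series: 12 monthly observations for each of four years.
years = np.repeat([2020, 2021, 2022, 2023], 12)
values = np.arange(len(years), dtype=float)  # placeholder measurements

train_mask = years <= 2021  # 2020 and 2021
val_mask = years == 2022
test_mask = years == 2023

train, val, test = values[train_mask], values[val_mask], values[test_mask]
print(len(train), len(val), len(test))  # 24 12 12

# Every training observation precedes validation, which precedes test.
print(years[train_mask].max() < years[val_mask].min() < years[test_mask].min())
```

For cross-validation on ordered data, scikit-learn also provides `TimeSeriesSplit`, which always trains on earlier observations and validates on later ones.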
Why Time Series Splits Are Different
Future events should never be used to predict past events.
Violating this creates:
Data Leakage
What is Data Leakage?
Data leakage occurs when information unavailable during prediction accidentally enters the training process.
Example:
Using future sales values to predict current sales.
This creates unrealistically high accuracy.
Common Data Leakage Example
Incorrect Workflow:
Scale entire dataset
Split dataset
Train model
Test data influences scaling.
Correct Workflow:
Split dataset
Fit scaler on training data
Transform training data
Transform test data
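The correct workflow might look like the following sketch, using `StandardScaler` as the scaler. The random feature matrix is a stand-in for real data. The key point is that `fit` only ever sees the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 1000 rows, 3 features centered around 50.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on training data only, then transform both subsets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately zero for every feature
print(X_test_scaled.mean(axis=0))   # close to zero, but generally not exactly zero
```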
Holdout Validation
The simplest evaluation approach.
Workflow:
- Split dataset once
- Train model
- Evaluate on validation/test set
Advantages:
- Fast
- Easy
Disadvantages:
- Performance depends on one split
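The dependence on a single split can be observed by repeating the holdout procedure with different random seeds. This is a small sketch on an assumed synthetic dataset with logistic regression as an arbitrary model choice. The spread of the scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

holdout_scores = []
for seed in range(5):
    # Same data and model each time; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

print([round(s, 3) for s in holdout_scores])
print("Spread:", round(max(holdout_scores) - min(holdout_scores), 3))
```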
Cross-Validation
Cross-validation provides more reliable evaluation.
Dataset:
1000 samples
Split into:
5 folds
Workflow:
- Train on 4 folds
- Validate on 1 fold
- Repeat 5 times
Average performance is reported.
K-Fold Cross Validation
Example:
K = 5
Fold 1:
Train: 80%
Validate: 20%
Repeat until every fold serves as validation once.
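The fold rotation can be made explicit with scikit-learn's `KFold`. The sketch below only inspects the index splits for a placeholder 1000-sample dataset and fits no model.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # placeholder features

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
validated = []  # collects every index that served as validation
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())
    print(f"Fold {fold}: train={len(train_idx)}, validate={len(val_idx)}")

# Each sample is used for validation exactly once across the 5 folds.
print(sorted(validated) == list(range(1000)))
```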
K-Fold Cross Validation in Python
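from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    model,   # any scikit-learn estimator, assumed to be defined already
    X,
    y,
    cv=5     # 5-fold cross-validation
)
print(scores)          # one score per fold
print(scores.mean())   # the average that is reported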
Advantages of Cross-Validation
- Better performance estimation
- Lower-variance performance estimate than a single split
- More reliable evaluation
Disadvantages of Cross-Validation
- Computationally expensive
- Slower on large datasets
Stratified K-Fold
Combines:
- Cross-validation
- Stratification
Useful for classification problems with imbalanced classes.
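A brief sketch with scikit-learn's `StratifiedKFold` shows the effect. The labels below are made up to match a 90/10 class balance, and the loop counts how many samples of each class fall into every validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 90 positive (1) and 10 negative (0) labels; features are placeholders.
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_counts = []
for train_idx, val_idx in skf.split(X, y):
    counts = np.bincount(y[val_idx], minlength=2)  # (negatives, positives)
    val_counts.append((int(counts[0]), int(counts[1])))

print(val_counts)  # every fold holds 2 negatives and 18 positives
```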
Leave-One-Out Cross Validation (LOOCV)
The extreme case of K-fold cross-validation, where K equals the number of samples.
If dataset contains:
100 samples
Then:
99 samples → Training
1 sample → Validation
Repeated 100 times.
Advantages:
- Maximum training data
Disadvantages:
- Very slow: requires one model fit per sample
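The split counts for the 100-sample case can be confirmed with scikit-learn's `LeaveOneOut`. This sketch only enumerates the splits on placeholder data. In practice, each of them would require a separate model fit.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(100).reshape(-1, 1)  # placeholder dataset with 100 samples

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)

# Collect the distinct (train size, validation size) pairs across all splits.
sizes = {(len(train_idx), len(val_idx)) for train_idx, val_idx in loo.split(X)}

print(n_splits)  # 100
print(sizes)     # {(99, 1)}
```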
Comparing Evaluation Methods
| Method | Reliability | Speed |
|---|---|---|
| Holdout Split | Moderate | Fast |
| K-Fold CV | High | Moderate |
| Stratified K-Fold | Very High | Moderate |
| LOOCV | Very High | Slow |
Real-World Example
Suppose a company wants to predict customer churn.
Dataset:
100,000 customers
Split:
| Dataset | Samples |
|---|---|
| Train | 70,000 |
| Validation | 15,000 |
| Test | 15,000 |
Workflow:
- Train model on 70,000 records
- Tune hyperparameters using validation data
- Evaluate final model using test data
Best Practices
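- Always split data before fitting any preprocessing step (scaling, imputation, encoding)
- Keep the test set untouched until the final evaluation
- Use stratified splits for classification, especially with imbalanced classes
- Use chronological splits for time series
- Use cross-validation for reliable evaluation
- Avoid data leakage
- Evaluate on the test set only once, after model selection is complete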
Common Mistakes
Using Test Data During Training
Incorrect:
Train → Test → Modify Model → Test Again
Each round of changes fits the model a little more to the test set, so the test score no longer reflects truly unseen data.
Tuning Hyperparameters on Test Data
Validation data should be used for tuning.
Test data should only be used once.
Random Split for Time Series
Time series data requires chronological splitting.
Random splitting lets future observations leak into the training set.
Train-Test-Validation Split Workflow
A typical Machine Learning workflow is:
- Collect data
- Split into Train, Validation, Test
- Fit preprocessing on the training data, then apply it to the validation and test data
- Train model
- Tune using validation data
- Select best model
- Evaluate on test data
- Deploy model
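The steps from splitting through final evaluation can be sketched end to end. Everything specific here is an assumption for illustration: the synthetic dataset, logistic regression as the model, and the candidate values for its regularization strength `C`. Wrapping the scaler and model in a pipeline ensures the scaler is fit on training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for collected data.
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

# Split into 70% train, 15% validation, 15% test (stratified).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Train one candidate per hyperparameter value; compare on validation data.
best_C, best_val_acc, best_model = None, -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)  # scaler and classifier see training data only
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc, best_model = C, val_acc, model

# Touch the test set once, with the selected model.
test_acc = best_model.score(X_test, y_test)
print(f"Selected C={best_C}, validation accuracy={best_val_acc:.3f}")
print(f"Final test accuracy={test_acc:.3f}")
```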
Why Train-Test-Validation Splitting is Essential
Without proper data splitting, model evaluation becomes unreliable and often overly optimistic. A model that appears highly accurate during training may fail completely when exposed to real-world data.
Train-Test-Validation splitting provides an unbiased estimate of model performance, helps detect overfitting, enables hyperparameter tuning, and gives evidence that a Machine Learning system generalizes to unseen data. It is one of the most fundamental concepts in the entire Machine Learning workflow.