Overfitting and Underfitting in Machine Learning

Last updated: Jun 13, 2026

Author: Christy Harshitha Dakarapu

One of the biggest challenges in Machine Learning is not building a model that performs well on the training data; it is building a model that performs well on unseen data.

Imagine a student preparing for an exam.

One student memorizes every question from previous papers without understanding the concepts.

Another student studies only a few topics and ignores the rest.

A third student learns the concepts thoroughly and can solve new problems confidently.

These three situations closely resemble:

Overfitting
Underfitting
Good Generalization

Understanding these concepts is crucial because they determine whether a Machine Learning model will succeed in real-world applications.

In this article, we will explore Overfitting and Underfitting in detail, understand their causes, learn how to detect them, and study techniques to build models that generalize well.

Why Machine Learning Models Fail

A Machine Learning model learns patterns from training data.

The goal is:


Training Data
       ↓
Learn Patterns
       ↓
Perform Well on New Data

However, models often fail because they either:

Learn too little
Learn too much

These situations are called:

Underfitting
Overfitting

What is Underfitting?

Underfitting occurs when a model is too simple to capture the underlying patterns in the data.

The model fails to learn important relationships.

As a result:

Poor training performance
Poor testing performance

Student Analogy

Imagine a student who studies only one chapter for an exam that covers the whole semester.

The student lacks sufficient knowledge.

Result:

Poor performance everywhere.

This is Underfitting.

Example of Underfitting

Suppose house prices depend on:

Area
Bedrooms
Location

But the model only uses:

Area

The model may miss critical information.

Predictions become inaccurate.

Characteristics of Underfitting

High training error
High testing error
Poor predictions
Oversimplified model

Visualizing Underfitting

Actual relationship:


      *
   *     *
 *         *
   *     *
      *

Underfit model:


----------------

The model cannot capture the pattern.

Causes of Underfitting

Common causes:

Model too simple
Insufficient features
Excessive regularization
Insufficient training time
Poor feature engineering

Example

Suppose:

Actual relationship:

y = x^2

Model:

y = mx + b

A straight line cannot capture the curve properly.

This leads to underfitting.
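
The mismatch can be checked numerically. The following is a minimal sketch rather than part of the original example: the helper names fit_line and mse and the seven sample points are assumptions chosen for illustration. It fits a straight line to points generated from y = x^2, then repeats the fit with x^2 supplied as the input feature.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = cov_xy / var_x
    b = mean_y - m * mean_x
    return m, b


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Illustrative data: the true relationship is y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Straight line on the raw feature x: too simple for a curve.
m, b = fit_line(xs, ys)
linear_error = mse([m * x + b for x in xs], ys)

# Same fitting routine, but with x^2 supplied as the feature.
squared = [x ** 2 for x in xs]
m2, b2 = fit_line(squared, ys)
quadratic_error = mse([m2 * z + b2 for z in squared], ys)

print(f"straight line MSE: {linear_error:.2f}")      # straight line MSE: 12.00
print(f"with x^2 feature MSE: {quadratic_error:.2f}")  # with x^2 feature MSE: 0.00
```

On this data the best straight line is flat (slope 0), so it misses the curve entirely, while the same routine fits perfectly once it is given the right feature.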

What is Overfitting?

Overfitting occurs when a model learns the training data too well, including:

Noise
Random fluctuations
Outliers

Instead of learning general patterns, the model memorizes the training dataset.

Student Analogy

Imagine a student who memorizes previous exam questions word-for-word.

If the exam changes slightly:

Performance drops dramatically.

This is Overfitting.

Characteristics of Overfitting

Very low training error
High testing error
Excellent training performance
Poor generalization

Visualizing Overfitting

Actual pattern:


   *
 *   *
*     *
 *   *
   *

Overfit model:


   /\
  /  \__
 /      \
/ /\   /\

The model tries to pass through every point.

Why Overfitting Happens

The model starts learning:

Genuine patterns
Random noise

Noise should not be learned because it does not generalize.
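
To make this concrete, here is a small sketch under stated assumptions: the true pattern is taken to be y = 2x + 1, the noise values are made up, and the overfit "model" is the polynomial that passes exactly through every training point (evaluated with Lagrange interpolation). None of these choices come from the article; they only illustrate what memorizing noise looks like.

```python
def lagrange_predict(train_x, train_y, x):
    """Evaluate the polynomial that passes through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        weight = 1.0
        for j, xj in enumerate(train_x):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def true_pattern(x):
    """Assumed genuine relationship for this illustration."""
    return 2.0 * x + 1.0


# Made-up noise values standing in for random measurement error.
noise = [0.5, -0.4, 0.3, -0.6, 0.4, -0.3, 0.5, -0.5]
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [true_pattern(x) + e for x, e in zip(train_x, noise)]

# Unseen points between the training points, scored against the true pattern.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
test_y = [true_pattern(x) for x in test_x]

train_error = mse([lagrange_predict(train_x, train_y, x) for x in train_x], train_y)
test_error = mse([lagrange_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"training MSE: {train_error:.4f}")  # training MSE: 0.0000
print(f"testing MSE:  {test_error:.4f}")
```

The training error is zero because the curve reproduces every noisy point, yet between those points it swings away from the true line, so the testing error is much larger.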

Example

Suppose:

House price data contains one unusual house.

Instead of ignoring it, an overfit model adjusts itself specifically for that house.

This reduces training error but hurts future predictions.

What is Generalization?

Generalization refers to a model's ability to perform well on unseen data.

A good model should:

Learn actual patterns
Ignore noise
Predict accurately on new data

This is the ultimate goal of Machine Learning.

The Ideal Situation


Training Error → Low
Testing Error  → Low

The model captures meaningful patterns without memorization.

Comparing Underfitting and Overfitting

Aspect	Underfitting	Overfitting
Training Error	High	Very Low
Testing Error	High	High
Complexity	Too Simple	Too Complex
Learns Patterns	Poorly	Excessively
Generalization	Poor	Poor

Training Error vs Testing Error

Understanding both errors is critical.

Training Error

Error on training data.

Testing Error

Error on unseen data.

Underfitting Example

Metric	Value
Training Error	High
Testing Error	High

Model cannot learn.

Overfitting Example

Metric	Value
Training Error	Very Low
Testing Error	High

Model memorizes training data.

Good Model Example

Metric	Value
Training Error	Low
Testing Error	Low

Model generalizes well.

Model Complexity

Model complexity strongly influences fitting behavior.

Low Complexity

Example:

Simple Linear Regression

Risk:

Underfitting

High Complexity

Example:

Deep Decision Trees

Risk:

Overfitting

Complexity Visualization


Too Simple
     ↓
Underfitting

Optimal
     ↓
Good Fit

Too Complex
     ↓
Overfitting

Bias and Variance

Overfitting and Underfitting are closely related to:

Bias
Variance

What is Bias?

Bias measures error caused by simplifying assumptions.

High Bias:

Model misses important relationships.

Associated with:

Underfitting

What is Variance?

Variance measures how much a model's predictions change when the training data changes.

High Variance:

Model changes significantly with different datasets.

Associated with:

Overfitting

Bias-Variance Tradeoff

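The bias-variance tradeoff is one of the most important concepts in Machine Learning: making a model more flexible usually lowers bias but raises variance, so the goal is to find the balance between the two.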


High Bias
     ↓
Underfitting

Optimal Balance
     ↓
Best Model

High Variance
     ↓
Overfitting
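
The tradeoff is often written as a decomposition of the expected squared error. The block below states it under assumptions that the article does not spell out: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat{f} is a model trained on a randomly drawn training set, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Underfitting corresponds to a large first term, overfitting to a large second term, and no model can remove the third.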

Learning Curve Intuition

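Training Error:

Usually keeps decreasing as complexity grows, because a more flexible model can fit the training data more closely.

Testing Error:

Initially decreases and later increases once the model starts fitting noise.

Visualization (training error; the testing error curve would turn back upward on the right):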


Error
 ^
 |
 |\
 | \
 |  \
 |   \__
 |       \__
 +------------>
 Model Complexity

Lowest testing error represents the ideal complexity.
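
One way to locate that point is to sweep over complexity and record both errors. The sketch below assumes NumPy is available and uses polynomial degree as a stand-in for model complexity; the synthetic sine data, the noise level, and the range of degrees are illustrative choices, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth curve plus noise (assumed for illustration).
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=x.size)

# Alternate points between the training set and the testing set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

degrees = list(range(1, 10))
train_errors, test_errors = [], []
for degree in degrees:
    # Polynomial degree plays the role of model complexity.
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_errors.append(float(np.mean((model(x_train) - y_train) ** 2)))
    test_errors.append(float(np.mean((model(x_test) - y_test) ** 2)))

best_degree = degrees[int(np.argmin(test_errors))]

for degree, train_err, test_err in zip(degrees, train_errors, test_errors):
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
print("lowest testing error at degree", best_degree)
```

The training column can only go down as the degree rises, so it cannot be used to choose the degree; the testing column is the one that identifies a reasonable complexity.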

Detecting Underfitting

Signs:

Poor training performance
Poor validation performance
Simple model structure

Example:

R² Score:

0.20

Model captures little information.
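
For reference, R² compares the model's squared error with the squared error of always predicting the mean. The helper below is a minimal sketch; the function name r2_score and the five data points are hypothetical and were chosen so that the score comes out at 0.20.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets, and predictions from a model that barely tracks them.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [3.0, 2.0, 3.0, 4.0, 3.0]

print(round(r2_score(y_true, y_pred), 2))  # 0.2
```

A score of 1.0 means perfect predictions and 0.0 means no better than predicting the mean, so 0.20 indicates that most of the variation is left unexplained.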

Detecting Overfitting

Signs:

Training Accuracy:

99%

Test Accuracy:

75%

Large performance gap indicates overfitting.

Example

Dataset	Accuracy
Training	99%
Test	72%

Strong indication of overfitting.
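
The same comparison can be turned into a rough automated check. This is only a rule-of-thumb sketch: the function name diagnose_fit and both thresholds (a minimum acceptable accuracy of 80% and a maximum gap of 10 percentage points) are assumptions for illustration, and sensible values depend on the problem.

```python
def diagnose_fit(train_accuracy, test_accuracy, min_accuracy=0.80, max_gap=0.10):
    """Rough rule of thumb; both thresholds are illustrative assumptions."""
    gap = train_accuracy - test_accuracy
    if train_accuracy < min_accuracy:
        return "underfitting"
    if gap > max_gap:
        return "overfitting"
    return "good fit"


print(diagnose_fit(0.99, 0.72))  # overfitting (gap of 27 percentage points)
print(diagnose_fit(0.60, 0.58))  # underfitting
print(diagnose_fit(0.91, 0.89))  # good fit
```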

How Train-Test Split Helps

Train-test split helps evaluate generalization.

Workflow:


Training Data
       ↓
Train Model
       ↓
Test Data
       ↓
Evaluate

Performance differences reveal fitting problems.
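
A split like this takes only a few lines. The sketch below uses the Python standard library alone; the function name train_test_split, the 20% test fraction, and the (area, price) rows are hypothetical choices made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out the last part for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[:-n_test], shuffled[-n_test:]


# Hypothetical dataset of (area, price) pairs.
rows = [(area, 50 * area + 1000) for area in range(500, 1500, 100)]
train_rows, test_rows = train_test_split(rows, test_fraction=0.2)

print(len(train_rows), "training rows,", len(test_rows), "test rows")  # 8 training rows, 2 test rows
```

The model is then fit on train_rows only, and test_rows stay untouched until the final evaluation. Shuffling before splitting matters when the data is stored in a meaningful order.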

Cross Validation

Cross Validation provides a more reliable evaluation.

Example:

5-Fold Cross Validation


Train
Train
Train
Train
Test

This is repeated five times, with a different fold held out as the test set each time.

Benefits:

Better performance estimates
Detects overfitting more reliably
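
The fold rotation can be sketched as an index generator. This is a simplified illustration: the function name k_fold_indices is hypothetical, and the folds are taken in order without shuffling, whereas in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold}: test on {test_idx}, train on {len(train_idx)} samples")
```

Every sample lands in the test set exactly once, and the final score is the average over the folds.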

Causes of Overfitting

Common causes include:

Too many features
Small dataset
Excessive model complexity
Noise in data
Long training duration

Causes of Underfitting

Common causes include:

Too few features
Simple models
Insufficient training
Excessive regularization

Reducing Underfitting

Strategies:

Add more features
Increase model complexity
Train longer
Reduce regularization
Improve feature engineering

Reducing Overfitting

Strategies:

Collect more data
Remove irrelevant features
Use regularization
Use cross validation to choose model settings
Reduce model complexity
Early stopping
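
Early stopping, the last strategy in the list, can be sketched as a loop over validation losses. The function name early_stopping_epoch, the patience value, and the loss numbers are all hypothetical; in a real training loop the losses would be measured after each epoch rather than supplied up front.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch, stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # further training is no longer helping on unseen data
    return best_epoch


# Hypothetical validation losses: they improve, then rise as overfitting sets in.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(losses))  # 3
```

The model saved at the returned epoch is the one to keep, since later epochs only improve the fit to the training data.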

Feature Engineering and Fitting

Good features often reduce both:

Underfitting
Overfitting

Useful features help models learn meaningful patterns.

Regularization

Regularization discourages overly complex models.

Popular techniques:

Ridge Regression
Lasso Regression
Elastic Net

These will be covered in a later article.
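
As a brief preview of the idea behind Ridge Regression, the sketch below uses the simplest possible case: one feature and no intercept, where the penalized slope has a closed form. The function name ridge_slope, the data, and the alpha values are illustrative assumptions.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)^2) + alpha * w^2 (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for alpha in (0.0, 1.0, 10.0, 100.0):
    print(f"alpha={alpha:>5}: slope={ridge_slope(xs, ys, alpha):.3f}")
```

With alpha set to 0 this is ordinary least squares. As alpha grows, the slope shrinks toward zero, which is how the penalty discourages complexity, and also why too much regularization causes underfitting.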

Overfitting in Decision Trees

Example:

Very deep tree.

The tree memorizes every training example.

Result:

Perfect training accuracy.

Poor test accuracy.
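
The effect of depth can be reproduced with a very small tree. The sketch below is a simplified regression tree on a single feature, written from scratch for illustration (it reports mean squared error instead of accuracy); the function names and the synthetic data are assumptions, and it is not a substitute for a library implementation.

```python
import random


def build_tree(points, max_depth, depth=0):
    """Regression tree on one feature. A leaf is a float; a split is (threshold, left, right)."""
    ys = [y for _, y in points]
    mean_y = sum(ys) / len(ys)
    if depth == max_depth or len(points) == 1:
        return mean_y
    points = sorted(points)
    best_cost, best_i = None, None
    for i in range(1, len(points)):
        # Cost of a split = squared error around the mean on each side.
        cost = 0.0
        for side in (points[:i], points[i:]):
            side_mean = sum(y for _, y in side) / len(side)
            cost += sum((y - side_mean) ** 2 for _, y in side)
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2
    left = build_tree(points[:best_i], max_depth, depth + 1)
    right = build_tree(points[best_i:], max_depth, depth + 1)
    return (threshold, left, right)


def predict(tree, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def mse(tree, points):
    return sum((predict(tree, x) - y) ** 2 for x, y in points) / len(points)


# Synthetic data (an assumption): a linear trend plus random noise.
rng = random.Random(0)
data = [(float(x), 0.5 * x + rng.gauss(0.0, 1.0)) for x in range(20)]
train, test = data[::2], data[1::2]

shallow = build_tree(train, max_depth=1)
deep = build_tree(train, max_depth=len(train))  # deep enough to isolate every point

print(f"shallow tree: train MSE {mse(shallow, train):.3f}, test MSE {mse(shallow, test):.3f}")
print(f"deep tree:    train MSE {mse(deep, train):.3f}, test MSE {mse(deep, test):.3f}")
```

The deep tree ends with one training point per leaf, so its training error is exactly zero, while its error on the held-out points is not. Limiting max_depth is the simplest way to stop a tree from memorizing.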

Overfitting in Neural Networks

Neural Networks with millions of parameters can easily memorize data.

Common solutions:

Dropout
Early Stopping
Data Augmentation
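
Dropout, the first solution in the list, randomly switches off units during training so the network cannot rely on any single one. The sketch below shows the common "inverted dropout" variant on a plain list of activations; the function name, the drop probability, and the values are illustrative, and deep learning libraries provide their own implementations.

```python
import random


def dropout(activations, drop_probability, rng):
    """Inverted dropout: zero each value with the given probability and
    rescale the survivors so the expected sum stays the same."""
    keep_probability = 1.0 - drop_probability
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)
activations = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

print(dropout(activations, drop_probability=0.5, rng=rng))  # training: some values zeroed
print(dropout(activations, drop_probability=0.0, rng=rng))  # no dropout: values unchanged
```

Dropout is applied only during training. At prediction time all units are kept, which corresponds to a drop probability of 0.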

Real-World Example

Suppose a company predicts employee attrition.

Features:

Salary
Experience
Department
Work Hours

Underfit Model:

Uses only salary.

Misses important factors.

Overfit Model:

Learns random employee-specific patterns.

Good Model:

Learns meaningful relationships that generalize.

Common Mistakes

Using Training Accuracy Only

A model with 100% training accuracy may perform poorly in production.

Assuming Complex Models Are Always Better

More complexity often increases overfitting risk.

Ignoring Validation Data

Validation data helps detect fitting issues early.

Best Practices

Always evaluate on unseen data
Use train-validation-test splits
Monitor both training and test performance
Apply cross validation
Use regularization when necessary
Balance model complexity carefully

Overfitting vs Underfitting Summary

Characteristic	Underfitting	Good Fit	Overfitting
Training Error	High	Low	Very Low
Testing Error	High	Low	High
Complexity	Too Low	Balanced	Too High
Generalization	Poor	Good	Poor

Model Development Workflow

A typical workflow is:

Train model
Measure training performance
Measure test performance
Compare results
Detect fitting issues
Adjust complexity
Repeat until balanced

Why Understanding Overfitting and Underfitting is Important

The primary goal of Machine Learning is not to memorize historical data but to make accurate predictions on unseen data. Underfitting prevents a model from learning meaningful patterns, while overfitting causes it to learn too much, including noise.

The most successful Machine Learning models strike a balance between these extremes by capturing genuine relationships while ignoring randomness. Understanding Overfitting and Underfitting is essential because nearly every Machine Learning project involves managing this balance to achieve strong real-world performance.

In the next article, we will study Evaluation Metrics for Regression, which help us quantitatively measure how good or bad a regression model's predictions are.