One of the biggest challenges in Machine Learning is not building a model that performs well on the training data. It is building a model that performs well on unseen data.

Imagine a student preparing for an exam.

One student memorizes every question from previous papers without understanding the concepts.

Another student studies only a few topics and ignores the rest.

A third student learns the concepts thoroughly and can solve new problems confidently.

These three situations closely resemble:

  • Overfitting
  • Underfitting
  • Good Generalization

Understanding these concepts is crucial because they determine whether a Machine Learning model will succeed in real-world applications.

In this article, we will explore Overfitting and Underfitting in detail, understand their causes, learn how to detect them, and study techniques to build models that generalize well.

Why Machine Learning Models Fail

A Machine Learning model learns patterns from training data.

The goal is:

Training Data → Learn Patterns → Perform Well on New Data

However, models often fail because they either:

  • Learn too little
  • Learn too much

These situations are called:

  • Underfitting
  • Overfitting

What is Underfitting?

Underfitting occurs when a model is too simple to capture the underlying patterns in the data.

The model fails to learn important relationships.

As a result:

  • Poor training performance
  • Poor testing performance

Student Analogy

Imagine a student who studies only one chapter for a semester-long exam.

The student lacks sufficient knowledge.

Result:

Poor performance everywhere.

This is Underfitting.

Example of Underfitting

Suppose house prices depend on:

  • Area
  • Bedrooms
  • Location

But the model only uses:

  • Area

The model may miss critical information.

Predictions become inaccurate.

Characteristics of Underfitting

  • High training error
  • High testing error
  • Poor predictions
  • Oversimplified model

Visualizing Underfitting

Actual relationship:

[Figure: scatter of data points following a curved pattern]

Underfit model:

----------------

The model cannot capture the pattern.

Causes of Underfitting

Common causes:

  • Model too simple
  • Insufficient features
  • Excessive regularization
  • Insufficient training time
  • Poor feature engineering

Example

Suppose:

Actual relationship:

y = x^2

Model:

y = mx + b

A straight line cannot capture the curve properly.

This leads to underfitting.
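
The mismatch can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using noise-free samples of the curve on an interval chosen only for illustration. It fits a straight line and a quadratic to the same points and compares their training error.

```python
import numpy as np

# Noise-free samples of the curve y = x^2 on a symmetric interval (illustrative choice).
x = np.linspace(-3, 3, 50)
y = x ** 2

def fit_mse(degree):
    """Fit a polynomial of the given degree by least squares; return the training MSE."""
    coeffs = np.polyfit(x, y, degree)
    predictions = np.polyval(coeffs, x)
    return float(np.mean((y - predictions) ** 2))

mse_line = fit_mse(1)       # model of the form y = mx + b
mse_quadratic = fit_mse(2)  # model of the form y = ax^2 + bx + c

print(f"straight line MSE: {mse_line:.3f}")
print(f"quadratic MSE:     {mse_quadratic:.3e}")
```

The straight line keeps a large error even on the data it was trained on, which is the signature of underfitting. The quadratic drives the error to essentially zero because it has the right shape.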

What is Overfitting?

Overfitting occurs when a model learns the training data too well, including:

  • Noise
  • Random fluctuations
  • Outliers

Instead of learning general patterns, the model memorizes the training dataset.

Student Analogy

Imagine a student who memorizes previous exam questions word-for-word.

If the exam changes slightly:

Performance drops dramatically.

This is Overfitting.

Characteristics of Overfitting

  • Very low training error
  • High testing error
  • Excellent training performance
  • Poor generalization

Visualizing Overfitting

Actual pattern:

[Figure: scatter of data points following a curved pattern]

Overfit model:

[Figure: a jagged line that zigzags through every data point]

The model tries to pass through every point.

Why Overfitting Happens

The model starts learning:

  • Genuine patterns
  • Random noise

Noise should not be learned because it does not generalize.

Example

Suppose:

House price data contains one unusual house.

Instead of ignoring it, an overfit model adjusts itself specifically for that house.

This reduces training error but hurts future predictions.
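
The same effect can be reproduced on synthetic data. The sketch below is an illustration under assumptions that are not part of the example above: a sine curve stands in for the true pattern, the noise level is arbitrary, and polynomial degree stands in for model complexity.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(2 * np.pi * x)

# Ten noisy training points and a dense, noise-free test grid (synthetic data).
x_train = np.linspace(0, 1, 10)
y_train = true_function(x_train) + rng.normal(0, 0.3, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = true_function(x_test)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

# Degree 9 with 10 points: the polynomial can pass through every training point.
overfit = Polynomial.fit(x_train, y_train, deg=9)
# Degree 3 can follow one sine period but cannot chase each noisy point.
moderate = Polynomial.fit(x_train, y_train, deg=3)

for name, model in [("degree 9", overfit), ("degree 3", moderate)]:
    print(f"{name}: train MSE {mse(model, x_train, y_train):.4f}, "
          f"test MSE {mse(model, x_test, y_test):.4f}")
```

The degree 9 model reaches a training error of practically zero because it reproduces the noise in every training point. Its test error is higher than its training error, and on data like this it is typically higher than the test error of the simpler model as well.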

What is Generalization?

Generalization refers to a model's ability to perform well on unseen data.

A good model should:

  • Learn actual patterns
  • Ignore noise
  • Predict accurately on new data

This is the ultimate goal of Machine Learning.

The Ideal Situation

Training Error → Low
Testing Error → Low

The model captures meaningful patterns without memorization.

Comparing Underfitting and Overfitting

Aspect | Underfitting | Overfitting
Training Error | High | Very Low
Testing Error | High | High
Complexity | Too Simple | Too Complex
Learns Patterns | Poorly | Excessively
Generalization | Poor | Poor

Training Error vs Testing Error

Understanding both errors is critical.

Training Error

Error on training data.

Testing Error

Error on unseen data.

Underfitting Example

Metric | Value
Training Error | High
Testing Error | High

Model cannot learn.

Overfitting Example

Metric | Value
Training Error | Very Low
Testing Error | High

Model memorizes training data.

Good Model Example

Metric | Value
Training Error | Low
Testing Error | Low

Model generalizes well.

Model Complexity

Model complexity strongly influences fitting behavior.

Low Complexity

Example:

Simple Linear Regression

Risk:

Underfitting

High Complexity

Example:

Deep Decision Trees

Risk:

Overfitting

Complexity Visualization

Too Simple → Underfitting
Optimal → Good Fit
Too Complex → Overfitting

Bias and Variance

Overfitting and Underfitting are closely related to:

  • Bias
  • Variance

What is Bias?

Bias measures error caused by simplifying assumptions.

High Bias:

Model misses important relationships.

Associated with:

Underfitting

What is Variance?

Variance measures how much the model's predictions change when the training data changes.

High Variance:

Model changes significantly with different datasets.

Associated with:

Overfitting

Bias-Variance Tradeoff

The bias-variance tradeoff is one of the most important concepts in Machine Learning.

High Bias → Underfitting
Optimal Balance → Best Model
High Variance → Overfitting
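
For squared-error loss, the tradeoff can be written as a decomposition of the expected error at a point x. The notation is an assumption introduced here and does not appear elsewhere in this article: f is the true function, \hat{f} is the model fitted on a randomly drawn training set, and \sigma^2 is the variance of the noise in y = f(x) + \varepsilon.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

A very simple model makes the first term large, a very flexible model makes the second term large, and no model can remove the third.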

Learning Curve Intuition

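Training Error:

Usually keeps decreasing as complexity grows, because a more flexible model can fit the training data ever more closely.

Testing Error:

Initially decreases as the model captures real patterns, then increases once the model starts fitting noise.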

Visualization:

[Figure: Error (vertical axis) plotted against Model Complexity (horizontal axis)]

The complexity at which testing error is lowest is the ideal complexity.
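
One way to locate that point is to sweep the complexity and record both errors. The sketch below makes several assumptions for illustration: polynomial degree stands in for model complexity, the data is a synthetic noisy sine wave, and a held-out validation set plays the role of unseen data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def sample(n):
    """Draw n noisy points from one period of a sine wave (synthetic data)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = sample(30)
x_val, y_val = sample(200)

degrees = list(range(1, 13))
train_errors, val_errors = [], []
for degree in degrees:
    # domain=[0, 1] fixes the input scaling so every degree uses the same mapping.
    model = Polynomial.fit(x_train, y_train, deg=degree, domain=[0, 1])
    train_errors.append(float(np.mean((model(x_train) - y_train) ** 2)))
    val_errors.append(float(np.mean((model(x_val) - y_val) ** 2)))

best_degree = degrees[int(np.argmin(val_errors))]
for d, tr, va in zip(degrees, train_errors, val_errors):
    print(f"degree {d:2d}  train MSE {tr:.4f}  validation MSE {va:.4f}")
print("lowest validation error at degree", best_degree)
```

Comparing the two printed columns shows where extra complexity stops paying off on unseen data. The exact degree selected depends on the random seed and the noise level.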

Detecting Underfitting

Signs:

  • Poor training performance
  • Poor validation performance
  • Simple model structure

Example:

R² Score:

0.20

The model explains only about 20% of the variance in the target, so it captures little information.

Detecting Overfitting

Signs:

Training Accuracy:

99%

Test Accuracy:

75%

Large performance gap indicates overfitting.

Example

Dataset | Accuracy
Training | 99%
Test | 72%

Strong indication of overfitting.
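
These checks can be turned into a small helper. This is a rough sketch rather than a standard rule: the function name `diagnose_fit` and both thresholds are hypothetical, and sensible cut-offs depend on the problem.

```python
def diagnose_fit(train_score, test_score, min_score=0.80, max_gap=0.10):
    """Classify a model from its training and test accuracy (both in [0, 1]).

    min_score and max_gap are illustrative thresholds, not standard values.
    """
    gap = train_score - test_score
    if train_score < min_score:
        return "underfitting"   # poor even on data the model has already seen
    if gap > max_gap:
        return "overfitting"    # strong on training data, much weaker on test data
    return "good fit"

print(diagnose_fit(0.99, 0.72))  # overfitting (the numbers from the table above)
print(diagnose_fit(0.60, 0.58))  # underfitting
print(diagnose_fit(0.91, 0.89))  # good fit
```

Checking the training score first matters: a model that is weak on both sets has a small gap, but it is underfitting, not generalizing well.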

How Train-Test Split Helps

Train-test split helps evaluate generalization.

Workflow:

Training Data → Train Model

Test Data → Evaluate

Performance differences reveal fitting problems.
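
A split can be sketched in a few lines of plain Python. The helper name `split_train_test`, the 80/20 ratio and the seed are assumptions made for illustration. In practice a library routine such as scikit-learn's `train_test_split` is the usual choice.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 labelled examples
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Shuffling before splitting avoids accidentally testing on, say, only the most recent records. The test rows must never be used while fitting the model.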

Cross Validation

Cross Validation provides a more reliable evaluation.

Example:

5-Fold Cross Validation

Train | Train | Train | Train | Test

This is repeated five times, so that each fold serves as the test set exactly once.

Benefits:

  • Better performance estimates
  • Detects overfitting more reliably
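
The procedure can be written out by hand to show what happens in each round. The sketch below is a simplified illustration with assumptions of its own: the data is synthetic, the model is a polynomial fit, and inputs are assumed to lie in [0, 1]. Libraries such as scikit-learn provide ready-made versions (for example `KFold` and `cross_val_score`).

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Return the validation MSE of a polynomial fit on each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]  # this fold is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # domain=[0, 1] assumes the inputs lie in that interval.
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree, domain=[0, 1])
        scores.append(float(np.mean((model(x[val_idx]) - y[val_idx]) ** 2)))
    return scores

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 50)

for degree in (1, 3, 9):
    scores = k_fold_mse(x, y, degree)
    print(f"degree {degree}: mean MSE {np.mean(scores):.4f} (std {np.std(scores):.4f})")
```

Averaging over five held-out folds gives a steadier estimate than a single split, and the spread across folds hints at how sensitive the model is to the particular training data.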

Causes of Overfitting

Common causes include:

  • Too many features
  • Small dataset
  • Excessive model complexity
  • Noise in data
  • Long training duration

Causes of Underfitting

Common causes include:

  • Too few features
  • Simple models
  • Insufficient training
  • Excessive regularization

Reducing Underfitting

Strategies:

  • Add more features
  • Increase model complexity
  • Train longer
  • Reduce regularization
  • Improve feature engineering

Reducing Overfitting

Strategies:

  • Collect more data
  • Remove irrelevant features
  • Use regularization
  • Use cross validation to choose model settings
  • Reduce model complexity
  • Early stopping
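
Early stopping is simple enough to sketch directly. The function below is a hypothetical helper, not part of any library: it takes the validation loss recorded after each epoch and reports which epoch gave the best model and when training would have been halted. The loss values and the `patience` setting are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the model saved at best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch

# Hypothetical validation losses: improving at first, then rising as the model overfits.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
print(early_stopping_epoch(losses))  # (2, 5)
```

Training halts at epoch 5 and the weights from epoch 2 are kept, so the epochs in which the model would have drifted further into memorization are never run.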

Feature Engineering and Fitting

Good features often reduce both:

  • Underfitting
  • Overfitting

Useful features help models learn meaningful patterns.

Regularization

Regularization discourages overly complex models.

Popular techniques:

  • Ridge Regression
  • Lasso Regression
  • Elastic Net

These will be covered in a later article.
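
As a brief preview, the core idea behind Ridge Regression can be sketched with NumPy alone. Everything specific here is an assumption for illustration: the synthetic data, the degree 9 polynomial features, the penalty strengths, and the helper name `ridge_fit`. This version penalizes every coefficient, including the intercept, which practical implementations often leave unpenalized.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

# Degree 9 polynomial features: flexible enough to overfit 15 points.
X = np.vander(x, N=10, increasing=True)

alphas = [0.0, 0.1, 10.0]  # alpha = 0 is ordinary least squares
norms = []
for alpha in alphas:
    w = ridge_fit(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha = {alpha:5.1f}  coefficient norm = {norms[-1]:.3f}")
```

A larger penalty shrinks the coefficients, which smooths the fitted curve. Too large a penalty pushes the model toward underfitting, which is why excessive regularization appears among the causes of underfitting.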

Overfitting in Decision Trees

Example:

Very deep tree.

The tree memorizes every training example.

Result:

Perfect training accuracy.

Poor test accuracy.

Overfitting in Neural Networks

Neural Networks with millions of parameters can easily memorize data.

Common solutions:

  • Dropout
  • Early Stopping
  • Data Augmentation

Real-World Example

Suppose a company predicts employee attrition.

Features:

  • Salary
  • Experience
  • Department
  • Work Hours

Underfit Model:

Uses only salary.

Misses important factors.

Overfit Model:

Learns random employee-specific patterns.

Good Model:

Learns meaningful relationships that generalize.

Common Mistakes

Using Training Accuracy Only

A model with 100% training accuracy may perform poorly in production.

Assuming Complex Models Are Always Better

More complexity often increases overfitting risk.

Ignoring Validation Data

Validation data helps detect fitting issues early.

Best Practices

  • Always evaluate on unseen data
  • Use separate training, validation, and test splits
  • Monitor both training and test performance
  • Apply cross validation
  • Use regularization when necessary
  • Balance model complexity carefully

Overfitting vs Underfitting Summary

Characteristic | Underfitting | Good Fit | Overfitting
Training Error | High | Low | Very Low
Testing Error | High | Low | High
Complexity | Too Low | Balanced | Too High
Generalization | Poor | Good | Poor

Model Development Workflow

A typical workflow is:

  1. Train model
  2. Measure training performance
  3. Measure test performance
  4. Compare results
  5. Detect fitting issues
  6. Adjust complexity
  7. Repeat until balanced

Why Understanding Overfitting and Underfitting is Important

The primary goal of Machine Learning is not to memorize historical data but to make accurate predictions on unseen data. Underfitting prevents a model from learning meaningful patterns, while overfitting causes it to learn too much, including noise.

The most successful Machine Learning models strike a balance between these extremes by capturing genuine relationships while ignoring randomness. Understanding Overfitting and Underfitting is essential because nearly every Machine Learning project involves managing this balance to achieve strong real-world performance.

In the next article, we will study Evaluation Metrics for Regression, which help us quantitatively measure how good or bad a regression model's predictions are.