Assumptions of Linear Regression

Last updated: Jun 12, 2026

Author: Christy Harshitha Dakarapu

Linear Regression is one of the simplest and most widely used Machine Learning algorithms. However, it works effectively only when certain assumptions about the data are satisfied.

Many beginners train a Linear Regression model and immediately evaluate its performance using metrics such as R² or RMSE. While this may work for simple projects, professional data scientists also verify whether the underlying assumptions of Linear Regression are satisfied.

These assumptions help ensure that:

Predictions are reliable
Coefficients are meaningful
Statistical conclusions are valid
The model generalizes well

Violating these assumptions does not always make the model unusable, but it can significantly reduce its reliability and interpretability.

In this article, we will build an intuition for each assumption, learn why it matters, see how to detect violations, and review possible solutions.

Why Does Linear Regression Have Assumptions?

Linear Regression attempts to model a relationship between features and the target variable.

Example:

House Price Prediction:

Price=f(Area, Bedrooms, Location)

For the mathematical theory behind Linear Regression to work properly, the data must satisfy certain conditions.

These conditions are called assumptions.
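To make the house price example concrete, a linear version of the function above can be written out term by term. The coefficient symbols and the error term are notation assumed here for illustration, and Location is assumed to be encoded as a number:

```latex
\text{Price} = \beta_0 + \beta_1\,\text{Area} + \beta_2\,\text{Bedrooms} + \beta_3\,\text{Location} + \varepsilon
```

Most of the assumptions discussed below are statements about the error term $\varepsilon$, which is estimated in practice by the residuals.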

Overview of Linear Regression Assumptions

The major assumptions are:

Linearity
Independence of Errors
Homoscedasticity
Normality of Residuals
No Multicollinearity
No Significant Outliers

A common mnemonic for the first four (classical) assumptions is:

LINE

Linearity
Independence
Normality
Equal Variance (Homoscedasticity)

Assumption 1: Linearity

Linear Regression assumes that the relationship between features and the target variable is approximately linear.

What Does Linearity Mean?

A one-unit change in an input feature should produce the same expected change in the target variable, no matter where the feature starts from.

Example:

Experience	Salary
1	3
3	5
5	8
8	12

As experience increases, salary generally increases.

This appears linear.

Linear Relationship Visualization


Salary
 ^
 |
 |        *
 |      *
 |    *
 |  *
 +---------------->
   Experience

A straight line can reasonably represent the relationship.

Non-Linear Relationship Example


Salary
 ^
 |
 |       *
 |    *
 | *
 |      *
 |          *
 +---------------->

A straight line cannot capture this pattern well.

Why Linearity Matters

Linear Regression fits a straight line (or hyperplane).

If the true relationship is highly non-linear:

Predictions become inaccurate
Errors increase
Model performance drops

Detecting Linearity

Scatter plots are commonly used.

Python:


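import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be a pandas DataFrame with "Experience" and "Salary" columns
sns.scatterplot(
    x="Experience",
    y="Salary",
    data=df
)
plt.show()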
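A scatter plot is a visual check. As a rough numeric companion, one option is to compare a straight-line fit with a fit that also has a squared term: if the squared term raises R² a lot, the relationship is probably curved. The sketch below is only an illustration on synthetic data with a single feature; the helper names `r_squared` and `curvature_gain` are hypothetical.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination for predictions y_hat."""
    y = np.asarray(y, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def curvature_gain(x, y):
    """R² gained by adding a squared term to a straight-line fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    line = np.polyval(np.polyfit(x, y, deg=1), x)
    curve = np.polyval(np.polyfit(x, y, deg=2), x)
    return r_squared(y, curve) - r_squared(y, line)


# Synthetic data (illustrative only): one linear and one curved relationship
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y_linear = 2.0 + 1.5 * x + rng.normal(0, 1, x.size)
y_curved = 2.0 + 1.5 * (x - 5) ** 2 + rng.normal(0, 1, x.size)

gain_linear = curvature_gain(x, y_linear)  # small: the line is already enough
gain_curved = curvature_gain(x, y_curved)  # large: the line misses the curve
```

A small gain does not prove linearity, so this check works best alongside the scatter plot rather than in place of it.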

Solutions for Non-Linearity

Polynomial Regression
Feature Engineering
Log Transformations
Decision Trees
Random Forests
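As one example of these fixes, a log transformation can straighten a relationship in which the target grows by a roughly fixed percentage per unit of the feature. The data below is synthetic and the growth rate is an assumption made for the sketch; `line_r2` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 80)
# Hypothetical salary that grows by a fixed percentage per year of experience
y = 3.0 * np.exp(0.25 * x) * np.exp(rng.normal(0, 0.05, x.size))


def line_r2(x, y):
    """R² of a straight-line least squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)


r2_raw = line_r2(x, y)          # straight line fitted to the curved target
r2_log = line_r2(x, np.log(y))  # straight line fitted to log(target)
```

When the log version fits clearly better, predictions are made on the log scale and converted back with the exponential function.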

Assumption 2: Independence of Errors

Residuals should be independent of one another.

What are Residuals?

Residual:

$\text{Residual}=y_{\text{actual}}-y_{\text{predicted}}$

Residuals represent prediction errors.
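The formula can be tried on the Experience and Salary table from the linearity section. This is a minimal sketch that fits the line with NumPy's `polyfit`; any other least squares routine would serve the same purpose.

```python
import numpy as np

# Experience and salary values from the linearity example table
experience = np.array([1, 3, 5, 8], dtype=float)
salary = np.array([3, 5, 8, 12], dtype=float)

# Ordinary least squares fit of a straight line
slope, intercept = np.polyfit(experience, salary, deg=1)
predicted = slope * experience + intercept

# Residual = actual - predicted, one value per observation
residuals = salary - predicted
```

For a least squares line with an intercept, the residuals always sum to zero (up to rounding), so the checks in this article look at their pattern and spread, not their total.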

Understanding Independence

Errors should occur randomly.

Bad Example:


Error
 ^
 |
 | *
 |  *
 |   *
 |    *
 +------------>
 Time

Residuals follow a pattern.

This violates independence.

Why Independence Matters

Dependent errors suggest that the model is missing important information, such as a trend or a seasonal effect, and they make standard errors unreliable.

Common in:

Time Series Data
Financial Data
Sensor Data

Detecting Independence

Two common checks are a plot of the residuals in observation order and the Durbin-Watson test.

Python:


from statsmodels.stats.stattools import durbin_watson

# residuals are assumed to be in observation (for example, time) order
durbin_watson(residuals)

Interpretation (the statistic ranges from 0 to 4):

Value	Meaning
≈ 2	No first-order autocorrelation
< 2 (toward 0)	Positive autocorrelation
> 2 (toward 4)	Negative autocorrelation
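The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between neighbouring residuals divided by the sum of squared residuals. The sketch below writes that ratio out in NumPy and applies it to three synthetic residual series; the function name `durbin_watson_stat` is hypothetical.

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


rng = np.random.default_rng(42)
n = 500

white_noise = rng.normal(0, 1, n)                         # no autocorrelation
trending = np.sin(np.linspace(0, 4 * np.pi, n))           # neighbours are alike
alternating = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)  # sign flips each step

dw_noise = durbin_watson_stat(white_noise)  # close to 2
dw_trend = durbin_watson_stat(trending)     # close to 0
dw_alt = durbin_watson_stat(alternating)    # close to 4
```

Because the ratio only compares each residual with its immediate neighbour, it can miss dependence at longer lags.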

Assumption 3: Homoscedasticity

Homoscedasticity is one of the most important assumptions for valid statistical inference.

What is Homoscedasticity?

The variance of the residuals should remain constant across the whole range of predicted values.

Good Residual Distribution


Residuals
 ^
 |  *  * *
 | * * * *
 |* * * * *
 +---------------->
 Predicted Values

Spread remains roughly constant.

Heteroscedasticity (Violation)


Residuals
 ^
 |              *
 |         *    *
 |    *    *
 | *  *    *    *
 |    *    *
 |         *    *
 |              *
 +---------------->
 Predicted Values

The spread of the errors grows as the predicted value increases, producing a funnel shape.

Why Homoscedasticity Matters

Violations can lead to:

Unreliable coefficient standard errors (the estimates stay unbiased but become less efficient)
Incorrect confidence intervals
Poor statistical inference

Detecting Homoscedasticity

Residual Plot:


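import matplotlib.pyplot as plt

# predictions and residuals are assumed to come from an already fitted model
plt.scatter(
    predictions,
    residuals
)
plt.axhline(0, color="gray")  # reference line at zero error
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()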

Look for:

Constant spread → Good
Funnel shape → Problem

Solutions

Log Transformation
Box-Cox Transformation
Weighted Regression
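Weighted regression gives noisy observations less influence on the fit. The sketch below assumes the error standard deviation is proportional to the feature, so each observation is weighted by 1 / x². That weighting rule, the synthetic data, and the function name `weighted_least_squares` are all assumptions made for illustration.

```python
import numpy as np


def weighted_least_squares(x, y, weights):
    """Fit y = b0 + b1 * x, minimizing sum(weights * squared error)."""
    X = np.column_stack([np.ones_like(x), x])
    root_w = np.sqrt(weights)
    # Scaling each row by sqrt(weight) turns the problem into ordinary least squares
    coef, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return coef  # [intercept, slope]


rng = np.random.default_rng(7)
x = np.linspace(1, 10, 200)
# True line is 2 + 3x, with noise whose spread grows with x
y = 2.0 + 3.0 * x + rng.normal(0, 1, x.size) * x

intercept, slope = weighted_least_squares(x, y, weights=1.0 / x ** 2)
```

In real projects the right weights are rarely known exactly, which is why transformations of the target are often tried first.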

Assumption 4: Normality of Residuals

Residuals should approximately follow a normal distribution.

Important Clarification

The target variable does NOT need to be normally distributed.

The residuals should be approximately normal.

Why Normality Matters

Approximately normal residuals support, especially in small samples:

Reliable hypothesis testing
Reliable confidence intervals
Valid statistical interpretation

Visualizing Normal Residuals


        *
      * * *
    * * * * *
      * * *
        *

Bell-shaped distribution.

Checking Normality

Histogram:


# residuals is assumed to be a pandas Series
residuals.hist()

Q-Q Plot:


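import matplotlib.pyplot as plt
from scipy import stats

# Points close to the reference line suggest approximately normal residuals
stats.probplot(
    residuals,
    dist="norm",
    plot=plt
)
plt.show()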
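The plots can be complemented with two summary numbers: skewness (asymmetry) and excess kurtosis (tail heaviness), both of which are close to 0 for normally distributed data. The sketch below computes them with NumPy on synthetic residuals; the function name is hypothetical, and these numbers are rough indicators and do not amount to a formal normality test.

```python
import numpy as np


def skew_and_excess_kurtosis(residuals):
    """Sample skewness and excess kurtosis (both near 0 for normal data)."""
    e = np.asarray(residuals, dtype=float)
    z = (e - e.mean()) / e.std()
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0


rng = np.random.default_rng(3)
normal_like = rng.normal(0, 1, 2000)
right_skewed = rng.exponential(1.0, 2000) - 1.0  # long right tail

skew_n, kurt_n = skew_and_excess_kurtosis(normal_like)   # both near 0
skew_s, kurt_s = skew_and_excess_kurtosis(right_skewed)  # clearly positive skew
```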

Solutions

Remove extreme outliers
Transform variables
Use robust regression techniques

Assumption 5: No Multicollinearity

Features should not be highly correlated with each other.

What is Multicollinearity?

Example:

Feature 1	Feature 2
Monthly Salary	Annual Salary

These features contain nearly identical information.

Why Multicollinearity is a Problem

Problems include:

Unstable coefficients
Inflated coefficient variance (large standard errors)
Difficult interpretation

Example

Suppose:

Price = 5(Area) + 3(Size)

If Area and Size mean almost the same thing, the model struggles to determine their individual contributions.

Detecting Multicollinearity

Correlation Heatmap:


# numeric_only=True (pandas 1.5+) skips text columns such as a location name
sns.heatmap(
    df.corr(numeric_only=True),
    annot=True
)

As a rule of thumb, pairwise correlations with |r| > 0.8 often indicate problems.
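With many features a heatmap gets crowded, so it can help to list the offending pairs directly. The sketch below applies the 0.8 rule of thumb with NumPy to synthetic columns modelled on the Monthly Salary and Annual Salary example; the function name `high_correlation_pairs` and the data are assumptions made for illustration.

```python
import numpy as np


def high_correlation_pairs(columns, names, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |r| > threshold."""
    corr = np.corrcoef(np.column_stack(columns), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


rng = np.random.default_rng(5)
monthly = rng.normal(5000, 1000, 300)
annual = 12 * monthly + rng.normal(0, 500, 300)  # nearly the same information
age = rng.normal(40, 10, 300)                    # unrelated feature

flagged = high_correlation_pairs(
    [monthly, annual, age],
    ["Monthly Salary", "Annual Salary", "Age"],
)
```

Pairwise correlation only catches redundancy between two features at a time; a feature that is a combination of several others needs VIF.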

Variance Inflation Factor (VIF)

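VIF is the most common method. For each feature j, it uses the R² obtained by regressing that feature on all the other features.

Formula:

$VIF_j=\frac{1}{1-R_j^2}$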

Python:


from statsmodels.stats.outliers_influence import variance_inflation_factor
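To show what the VIF formula does, the sketch below computes it by hand with NumPy: each feature is regressed on the remaining features (plus an intercept), and the resulting R² is plugged into 1 / (1 - R²). The function name `vif` and the synthetic salary data are assumptions made for illustration. The statsmodels function imported above takes a design matrix and a column index, and is expected to give matching values when that matrix includes a constant column.

```python
import numpy as np


def vif(X):
    """VIF for each column of X, regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(11)
monthly = rng.normal(5000, 1000, 200)
annual = 12 * monthly + rng.normal(0, 500, 200)  # redundant with monthly
age = rng.normal(40, 10, 200)                    # unrelated feature

# Order of results: monthly, annual, age
vif_values = vif(np.column_stack([monthly, annual, age]))
```

Here the two salary columns receive very large VIF values while the unrelated feature stays close to 1.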

Interpreting VIF

VIF	Interpretation
< 5	Usually Safe
5-10	Moderate Concern
> 10	Serious Multicollinearity

Solutions

Remove redundant features
Use PCA
Apply Regularization

Assumption 6: No Significant Outliers

Outliers can heavily influence the regression line.

Example

Experience	Salary
1	3
2	4
3	5
20	100

The final observation may distort the model.
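The effect is easy to see by fitting the line twice, once with all four rows of the table and once without the last one. This is a minimal sketch using NumPy's `polyfit`.

```python
import numpy as np

# Values from the table above; the last row is the suspected outlier
experience = np.array([1, 2, 3, 20], dtype=float)
salary = np.array([3, 4, 5, 100], dtype=float)

# Fit with every observation
slope_all, intercept_all = np.polyfit(experience, salary, deg=1)

# Fit again without the final observation
slope_trimmed, intercept_trimmed = np.polyfit(experience[:-1], salary[:-1], deg=1)
```

The first three rows lie exactly on the line Salary = Experience + 2, so the trimmed slope is 1. Including the fourth row pushes the slope to about 5.3, which means a single point changes the model's story about the other three.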

Why Outliers Matter

Outliers can:

Change coefficients
Increase error
Reduce predictive accuracy

Detecting Outliers

Box Plot:


sns.boxplot(
    x=df["Salary"]
)

Z-Score:


from scipy.stats import zscore
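A z-score measures how many standard deviations a value lies from the mean. The sketch below writes the formula out in NumPy so each step is visible; SciPy's `zscore` uses the same population standard deviation by default. The salary values are hypothetical, and the cutoff of 3 is a common convention, not a fixed rule.

```python
import numpy as np

# Hypothetical salary column: thirty ordinary values plus one extreme entry
salaries = np.array(list(range(3, 13)) * 3 + [100], dtype=float)

# z = (value - mean) / standard deviation, using the population std (ddof=0)
z_scores = (salaries - salaries.mean()) / salaries.std()

# Flag observations more than 3 standard deviations from the mean
outlier_mask = np.abs(z_scores) > 3
outliers = salaries[outlier_mask]
```

The method needs a reasonable number of rows to work: with only four observations, as in the small table above, no z-score can reach 3 however extreme the value is.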

IQR Method:


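Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1  # interquartile range
outliers = df[(df["Salary"] < Q1 - 1.5 * IQR) | (df["Salary"] > Q3 + 1.5 * IQR)]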

Solutions

Remove invalid outliers
Cap extreme values
Use robust models

Assumptions Summary

Assumption	Purpose
Linearity	Relationship should be linear
Independence	Errors should be independent
Homoscedasticity	Constant error variance
Normality	Residuals should be normally distributed
No Multicollinearity	Features should not be highly correlated
No Significant Outliers	Extreme values should not dominate

Real-World Example

Suppose we build a house price prediction model.

Features:

Area
Bedrooms
Location Score
Age

Before trusting the model, we check:

Is price linearly related to these features?
Are residuals random?
Is variance constant?
Are residuals normal?
Are features highly correlated?
Are extreme outliers present?

Only after these checks can we confidently interpret the results.

What Happens if Assumptions are Violated?

Not all violations are equally serious.

Assumption	Effect of Violation
Linearity	Poor predictions
Independence	Misleading standard errors and significance tests
Homoscedasticity	Incorrect significance tests
Normality	Unreliable statistical conclusions
Multicollinearity	Unstable coefficients
Outliers	Distorted regression line

Common Misconceptions

Linear Regression Requires Normal Data

Incorrect.

Only residuals need to be approximately normal.

Violating One Assumption Makes the Model Useless

Incorrect.

Some violations have minor effects.

Others can be severe.

High Accuracy Means Assumptions Don't Matter

Incorrect.

A model may have good predictive performance while still producing unreliable interpretations.

Best Practices

Always visualize data first
Examine residual plots
Check correlation matrices
Compute VIF values
Investigate outliers
Validate assumptions before interpreting coefficients

Linear Regression Assumption Checklist

Before deploying a Linear Regression model:

✔ Check linearity

✔ Check residual independence

✔ Check homoscedasticity

✔ Check residual normality

✔ Check multicollinearity

✔ Check outliers

Why Understanding Assumptions is Important

Linear Regression is not just about fitting a line through data. Its mathematical foundation depends on several assumptions that ensure reliable predictions and trustworthy interpretations.

Understanding these assumptions helps Data Scientists diagnose problems, improve model quality, and avoid misleading conclusions. In professional Machine Learning workflows, validating assumptions is often just as important as training the model itself.

In the next article, we will explore Overfitting and Underfitting, two critical concepts that explain why some models perform poorly despite appearing successful during training.