Bivariate Analysis in Machine Learning

Last updated: Jun 11, 2026

Author: Christy Harshitha Dakarapu

After understanding individual variables through Univariate Analysis, the next step in Exploratory Data Analysis (EDA) is to understand how two variables interact with each other. This process is known as Bivariate Analysis.

The word "Bivariate" combines two parts:

Bi = Two
Variate = Variable

Bivariate Analysis studies the relationship between two variables to identify patterns, dependencies, trends, and associations.

It helps answer questions such as:

Does salary increase with experience?
Does age influence purchasing behavior?
Is there a relationship between education level and income?
Are two features highly correlated?
Does a feature influence the target variable?

Bivariate Analysis is one of the most important steps in Machine Learning because it helps identify useful predictors, detect redundant features, and generate insights for feature engineering.

In this article, we will explore Bivariate Analysis in detail, understand different types of variable relationships, learn common statistical techniques and visualizations, and implement practical examples using Python.

What is Bivariate Analysis?

Bivariate Analysis is the process of analyzing the relationship between two variables.

Unlike Univariate Analysis, which studies variables individually, Bivariate Analysis examines whether changes in one variable are associated with changes in another.

Example:

Experience (Years)	Salary
1	30000
3	45000
5	70000
8	100000

Here we want to understand:

Does salary increase as experience increases?

This is a bivariate analysis problem.

Why Bivariate Analysis Matters

Machine Learning models learn relationships between variables.

Bivariate Analysis helps:

Identify important predictors
Detect feature interactions
Discover hidden patterns
Understand target relationships
Detect multicollinearity
Improve feature selection

Without understanding feature relationships, model building becomes difficult and often less effective.

Types of Bivariate Analysis

The analysis technique depends on the types of variables involved.

There are three common scenarios:

Variable 1	Variable 2
Numerical	Numerical
Numerical	Categorical
Categorical	Categorical

Each scenario requires different methods.
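As a small sketch of the first step, the snippet below splits the columns of a DataFrame into numerical and categorical groups using pandas dtypes. The sample data is hypothetical and only serves to illustrate the idea.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "Experience": [1, 3, 5, 8],
    "Salary": [30000, 45000, 70000, 100000],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Purchased": ["Yes", "No", "Yes", "No"],
})

# Columns with a numeric dtype are treated as numerical
numerical_cols = df.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

The dtype is only a first guess. Integer-coded categories (for example, a department ID) look numerical to pandas and should be reassigned by hand.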

Numerical vs Numerical Analysis

Both variables contain numerical values.

Example:

Experience	Salary
2	35000
4	50000
6	80000

Questions:

Is there a relationship?
How strong is the relationship?
Is the relationship positive or negative?

Scatter Plot

Scatter plots are the most common visualization for numerical variables.

Python:


import matplotlib.pyplot as plt

plt.scatter(
    df["Experience"],
    df["Salary"]
)

plt.xlabel("Experience")
plt.ylabel("Salary")

plt.show()

Understanding Scatter Plots

Positive Relationship:


     *
   *
 *
*

As one variable increases, the other also increases.

Negative Relationship:


*
  *
    *
      *

As one variable increases, the other decreases.

No Relationship:

Random pattern with no clear trend.

Correlation

Correlation measures the strength and direction of a relationship.

Formula:

$r=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}$

Range:

$-1 \le r \le 1$
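To make the formula concrete, here is a minimal sketch that applies it step by step to the Experience and Salary values from the first example table. The function name `pearson_r` is chosen for illustration and does not come from a library.

```python
import math

# Values from the Experience/Salary example table
experience = [1, 3, 5, 8]
salary = [30000, 45000, 70000, 100000]


def pearson_r(x, y):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance: average product of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Standard deviations of each variable
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


r = pearson_r(experience, salary)
print(round(r, 3))
```

The result is close to 1, which matches the steady rise in salary with experience in the table. With only four rows this is an illustration, not evidence.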

Correlation Interpretation

Correlation	Meaning
1	Perfect Positive
0.8	Strong Positive
0.5	Moderate Positive
0	No Linear Relationship
-0.5	Moderate Negative
-1	Perfect Negative

Positive Correlation Example

Experience	Salary
1	30000
2	40000
3	50000

As experience increases, salary increases.

Correlation:

$r > 0$

Negative Correlation Example

Age of Car	Price
1	1000000
5	600000
10	250000

As age increases, price decreases.

Correlation:

$r < 0$

Calculating Correlation in Python


df["Experience"].corr(
    df["Salary"]
)
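As a quick check, the same call can be applied to the two small example tables above. The column name `CarAge` is an assumption made to keep the identifier simple.

```python
import pandas as pd

# Positive example: the Experience/Salary rows shown above
pos = pd.DataFrame({"Experience": [1, 2, 3], "Salary": [30000, 40000, 50000]})

# Negative example: the Age of Car/Price rows shown above
neg = pd.DataFrame({"CarAge": [1, 5, 10], "Price": [1000000, 600000, 250000]})

# Series.corr computes the Pearson correlation by default
r_pos = pos["Experience"].corr(pos["Salary"])
r_neg = neg["CarAge"].corr(neg["Price"])

print(f"Experience vs Salary: r = {r_pos:.3f}")
print(f"Car age vs Price:     r = {r_neg:.3f}")
```

The first pair lies on a straight line, so r is 1 up to rounding. The second pair gives a value close to -1.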

Correlation Matrix

When multiple numerical features exist, correlation matrices become useful.

Python:


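# numeric_only=True skips text columns such as Gender,
# which would otherwise raise an error in recent pandas versions
df.corr(numeric_only=True)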

Correlation Heatmap


import seaborn as sns

# Use only numerical columns; annot=True prints each coefficient
sns.heatmap(
    df.corr(numeric_only=True),
    annot=True
)

Heatmaps quickly reveal relationships among many variables.

Covariance

Covariance measures how two variables vary together.

Formula:

$Cov(X,Y)=\frac{\sum(X-\bar X)(Y-\bar Y)}{N}$

Interpretation:

Covariance	Meaning
Positive	Variables tend to move in the same direction
Negative	Variables tend to move in opposite directions
Zero	No linear relationship

Covariance in Python


# Series.cov returns the sample covariance (divides by N - 1, not N)
df["Experience"].cov(
    df["Salary"]
)
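The formula above divides by N, while pandas divides by N - 1 by default, so the two differ on small samples. The sketch below computes both versions in plain Python for the Experience and Salary example. The `ddof` argument of this helper is an illustrative choice, not a pandas parameter.

```python
# Values from the Experience/Salary example table
experience = [1, 3, 5, 8]
salary = [30000, 45000, 70000, 100000]


def covariance(x, y, ddof=0):
    """Sum of cross-deviations divided by (N - ddof)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - ddof)


pop_cov = covariance(experience, salary, ddof=0)     # divides by N
sample_cov = covariance(experience, salary, ddof=1)  # divides by N - 1

print(pop_cov)     # 68437.5
print(sample_cov)  # 91250.0
```

Both values are positive, so the conclusion about direction is the same. Only the magnitude changes.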

Correlation vs Covariance

Correlation	Covariance
Standardized	Not standardized
Range -1 to 1	No fixed range
Easier interpretation	Harder interpretation

Correlation is generally preferred because it is unit-free, so values can be compared across different pairs of features.
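A short sketch of why standardization matters: expressing salary in thousands changes the covariance by a factor of 1000 but leaves the correlation untouched. The column name `SalaryK` is an assumption introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Experience": [1, 3, 5, 8],
    "Salary": [30000, 45000, 70000, 100000],
})
# Same salaries expressed in thousands (a pure change of units)
df["SalaryK"] = df["Salary"] / 1000

cov_raw = df["Experience"].cov(df["Salary"])
cov_k = df["Experience"].cov(df["SalaryK"])
corr_raw = df["Experience"].corr(df["Salary"])
corr_k = df["Experience"].corr(df["SalaryK"])

print(cov_raw, cov_k)    # covariance depends on the units
print(corr_raw, corr_k)  # correlation does not
```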

Numerical vs Categorical Analysis

One variable is numerical while the other is categorical.

Example:

Gender	Salary
Male	50000
Female	45000
Male	60000

Questions:

Does salary differ across genders?
Does age vary across departments?

Box Plot Analysis

Box plots help compare numerical distributions across categories.

Python:


sns.boxplot(
    x="Gender",
    y="Salary",
    data=df
)

This reveals:

Median differences
Distribution spread
Outliers
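The numbers behind a box plot can also be computed directly. The sketch below uses hypothetical salaries and the conventional 1.5 × IQR rule to count outliers per group. The helper name `box_summary` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical salaries, made up for illustration
df = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 6,
    "Salary": [50000, 52000, 55000, 58000, 60000, 120000,
               45000, 46000, 48000, 50000, 52000, 53000],
})


def box_summary(s: pd.Series) -> pd.Series:
    """Quartiles, IQR, and count of points beyond 1.5 * IQR from the box."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = ((s < lower) | (s > upper)).sum()
    return pd.Series({"q1": q1, "median": median, "q3": q3,
                      "iqr": iqr, "outliers": outliers})


# One row of summary statistics per category
summary = pd.DataFrame({
    name: box_summary(group)
    for name, group in df.groupby("Gender")["Salary"]
}).T

print(summary)
```

In this made-up data the 120000 salary is flagged as an outlier in the Male group, which is the kind of point a box plot would draw beyond the whisker.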

Group Statistics

Python:


df.groupby(
    "Gender"
)["Salary"].mean()

Example output (values are illustrative):

Gender	Average Salary
Male	55000
Female	48000
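The mean alone can hide differences in group size and spread. As a sketch, `agg` can return several statistics at once. The data below is just the three example rows from this section, so its means will not match a larger dataset.

```python
import pandas as pd

# The three example rows from this section
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Salary": [50000, 45000, 60000],
})

# Several statistics per group in one call
stats = df.groupby("Gender")["Salary"].agg(["count", "mean", "median", "std"])
print(stats)
```

The Female group has a single row, so its standard deviation is reported as NaN. Checking `count` first helps avoid over-reading a mean that rests on very few observations.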

Violin Plots

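Violin plots combine a box plot with a density curve, so they show both:

Summary statistics (median and quartiles)
The shape of the distribution (density)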

Python:


sns.violinplot(
    x="Department",
    y="Salary",
    data=df
)

ANOVA

ANOVA stands for Analysis of Variance.

Used when:

One variable is numerical
The other variable is categorical, typically with three or more groups

Example:

Compare salaries across departments.

Hypothesis:

$H_0$: All group means are equal.

$H_1$: At least one group mean differs.

Python:


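from scipy.stats import f_oneway

# Build one array of salaries per department
groups = [
    g.to_numpy()
    for _, g in df.groupby("Department")["Salary"]
]

# Returns the F statistic and the p-value
f_stat, p_value = f_oneway(*groups)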
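A self-contained sketch of the test, using hypothetical salaries (in thousands) for three departments. The department names and the 0.05 threshold are assumptions for illustration.

```python
from scipy.stats import f_oneway

# Hypothetical salaries (in thousands) for three departments
sales = [50, 52, 54, 51, 53]
engineering = [60, 62, 61, 63, 64]
marketing = [70, 72, 71, 69, 73]

# One-way ANOVA across the three groups
f_stat, p_value = f_oneway(sales, engineering, marketing)

alpha = 0.05  # conventional significance level, assumed here
if p_value < alpha:
    print(f"F = {f_stat:.1f}, p = {p_value:.2g}: at least one mean differs")
else:
    print(f"F = {f_stat:.1f}, p = {p_value:.2g}: no evidence the means differ")
```

A small p-value says only that the group means are not all equal. It does not say which departments differ, and that would need a follow-up comparison.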

Categorical vs Categorical Analysis

Both variables are categorical.

Example:

Gender	Purchased
Male	Yes
Female	No

Questions:

Is gender related to purchases?
Is department related to promotion?

Contingency Tables

Python:


import pandas as pd

# Rows are Gender values, columns are Purchased values
pd.crosstab(
    df["Gender"],
    df["Purchased"]
)

Output:

Gender	Yes	No
Male	50	20
Female	40	30
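Raw counts are hard to compare when groups differ in size. As a sketch, the code below rebuilds rows that reproduce the counts in the table above and then asks `crosstab` for row proportions.

```python
import pandas as pd

# Rebuild raw rows that reproduce the counts in the table above
df = pd.DataFrame({
    "Gender": ["Male"] * 70 + ["Female"] * 70,
    "Purchased": ["Yes"] * 50 + ["No"] * 20 + ["Yes"] * 40 + ["No"] * 30,
})

counts = pd.crosstab(df["Gender"], df["Purchased"])
# normalize="index" divides each row by its row total
rates = pd.crosstab(df["Gender"], df["Purchased"], normalize="index")

print(counts)
print(rates.round(3))
```

Here about 71% of the Male rows and 57% of the Female rows are purchases, which is easier to read than the raw counts.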

Stacked Bar Charts

Python:


pd.crosstab(
    df["Gender"],
    df["Purchased"]
).plot(
    kind="bar",
    stacked=True
)

Useful for visual comparison.

Chi-Square Test

Used to determine whether two categorical variables are related.

Formula:

$\chi^2=\sum\frac{(Observed-Expected)^2}{Expected}$
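To show where the Expected values come from, here is a plain-Python sketch that applies the formula to the Gender and Purchased counts from the contingency table. Under independence, each expected count is row total × column total / grand total.

```python
# Observed counts from the Gender x Purchased contingency table
observed = {
    "Male": {"Yes": 50, "No": 20},
    "Female": {"Yes": 40, "No": 30},
}

row_totals = {g: sum(cols.values()) for g, cols in observed.items()}
col_totals = {}
for cols in observed.values():
    for c, v in cols.items():
        col_totals[c] = col_totals.get(c, 0) + v
grand_total = sum(row_totals.values())

chi2 = 0.0
for g, cols in observed.items():
    for c, obs in cols.items():
        # Expected count if the two variables were independent
        expected = row_totals[g] * col_totals[c] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 3))  # 3.111
```

Note that `scipy.stats.chi2_contingency` applies a continuity correction to 2×2 tables by default, so it reports a somewhat smaller statistic unless `correction=False` is passed.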

Chi-Square Hypothesis

Null Hypothesis:

$H_0$: Variables are independent.

Alternative Hypothesis:

$H_1$: Variables are associated.

Chi-Square Example

Python:


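import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(
    df["Gender"],
    df["Purchased"]
)

# Returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = chi2_contingency(
    table
)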

Interpreting p-Value

p-value	Interpretation
< 0.05	Reject $H_0$: statistically significant association
≥ 0.05	Fail to reject $H_0$: no statistically significant association
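A sketch of the decision step, using the counts from the contingency table shown earlier. The 0.05 threshold is a convention, not a rule, and the wording of the two outcomes is an assumption for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the Gender x Purchased table (rows: Male, Female; cols: Yes, No)
table = np.array([[50, 20],
                  [40, 30]])

# For 2x2 tables SciPy applies a continuity correction by default
chi2, p, dof, expected = chi2_contingency(table)

alpha = 0.05  # conventional threshold, assumed here
if p < alpha:
    decision = "reject H0: evidence of an association"
else:
    decision = "fail to reject H0: no evidence of an association"

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f} -> {decision}")
```

For these counts the p-value is above 0.05, even though the purchase rates differ between the two groups. A non-significant result means the sample does not provide enough evidence of an association, not that the variables are proven independent.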

Relationship with Target Variable

One of the most important uses of Bivariate Analysis is studying feature-target relationships.

Example:

Target:

Purchased

Features:

Age
Salary
Gender

Questions:

Does age influence purchases?
Does salary influence purchases?

These insights guide feature selection.
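These questions can be sketched in a few lines. The customer records below are hypothetical and exist only to show the mechanics: group means for numerical features and purchase rates for a categorical feature.

```python
import pandas as pd

# Hypothetical customer records, invented for illustration
df = pd.DataFrame({
    "Age": [22, 25, 31, 35, 41, 46, 52, 58],
    "Salary": [28000, 32000, 45000, 52000, 61000, 70000, 82000, 90000],
    "Gender": ["Male", "Female", "Female", "Male",
               "Female", "Male", "Male", "Female"],
    "Purchased": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# Numerical features vs categorical target: compare group means
numeric_view = df.groupby("Purchased")[["Age", "Salary"]].mean()

# Categorical feature vs categorical target: purchase rate per category
gender_view = pd.crosstab(df["Gender"], df["Purchased"], normalize="index")

print(numeric_view)
print(gender_view)
```

A large gap between the "Yes" and "No" rows suggests the feature is worth keeping as a candidate predictor. On real data the gap should be confirmed with a test such as ANOVA or Chi-Square.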

Detecting Multicollinearity

Multicollinearity occurs when two or more input features are highly correlated with each other.

Example:

Feature 1	Feature 2
Salary	Annual Income

Correlation:

0.98

Both features contain nearly identical information.

Problems (especially in linear models):

Unstable coefficients
Poor interpretability
Reduced model performance

Bivariate Analysis helps detect this issue.
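One way to automate the check is to scan the correlation matrix for pairs above a cutoff. This is a minimal sketch: the helper name `high_corr_pairs`, the 0.9 threshold, and the sample data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: AnnualIncome is Salary plus a small bonus
df = pd.DataFrame({
    "Salary": [30000, 45000, 70000, 100000, 120000],
    "AnnualIncome": [31000, 47000, 71500, 103000, 122000],
    "Age": [45, 23, 51, 30, 38],
})


def high_corr_pairs(data: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| above threshold."""
    corr = data.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return [(a, b, r) for (a, b), r in pairs.items() if abs(r) > threshold]


for a, b, r in high_corr_pairs(df):
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Only the Salary and AnnualIncome pair is reported here. A common response is to keep one feature of such a pair and drop the other, although the right choice depends on the model and the domain.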

Pair Plots

Pair plots visualize relationships among multiple numerical variables.

Python:


sns.pairplot(df)

Useful for:

Correlation detection
Cluster identification
Outlier detection

Real-World Example

Suppose a bank wants to predict loan approval.

Features:

Income
Credit Score
Employment Type

Target:

Loan Approved

Bivariate Analysis may reveal:

Income is strongly associated with approval.
Credit score is strongly associated with approval.
Employment type shows a moderate association with approval.

These insights help prioritize important features.

Common Insights Obtained from Bivariate Analysis

Strong predictors
Weak predictors
Feature interactions
Correlations
Class relationships
Multicollinearity
Target associations

Benefits of Bivariate Analysis

Better feature understanding
Improved feature selection
Early detection of multicollinearity
Improved feature engineering
Better model performance
Stronger business insights

Limitations of Bivariate Analysis

Bivariate Analysis studies only two variables at a time.

It cannot fully capture:

Complex interactions
Higher-order relationships
Multivariable dependencies

These require Multivariate Analysis.

Best Practices

Start with Univariate Analysis first
Use scatter plots for pairs of numerical variables
Use box plots for categorical vs numerical variables
Use Chi-Square tests for pairs of categorical variables
Check correlations before model training
Investigate highly correlated features
Study feature-target relationships carefully

Bivariate Analysis Workflow

A typical workflow is:

Identify variable types
Choose appropriate visualization
Calculate statistical measures
Analyze relationships
Identify significant associations
Detect multicollinearity
Document findings
Use insights for feature engineering and selection
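The first steps of this workflow can be sketched as a single helper that picks a statistical measure from the variable types. The function name `bivariate_summary`, the returned dictionary layout, and the sample data are assumptions for illustration, not an established API.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from scipy.stats import chi2_contingency, f_oneway


def bivariate_summary(df: pd.DataFrame, a: str, b: str) -> dict:
    """Pick a measure for columns a and b based on their types."""
    a_num, b_num = is_numeric_dtype(df[a]), is_numeric_dtype(df[b])
    if a_num and b_num:
        # Numerical vs numerical: Pearson correlation
        return {"method": "pearson", "value": df[a].corr(df[b])}
    if a_num != b_num:
        # Numerical vs categorical: one-way ANOVA across categories
        num, cat = (a, b) if a_num else (b, a)
        groups = [g.to_numpy() for _, g in df.groupby(cat)[num]]
        stat, p = f_oneway(*groups)
        return {"method": "anova", "value": stat, "p": p}
    # Categorical vs categorical: Chi-Square test of independence
    chi2, p, _, _ = chi2_contingency(pd.crosstab(df[a], df[b]))
    return {"method": "chi-square", "value": chi2, "p": p}


# Hypothetical records, invented for illustration
df = pd.DataFrame({
    "Age": [22, 25, 31, 35, 41, 46, 52, 58],
    "Salary": [28000, 32000, 45000, 52000, 61000, 70000, 82000, 90000],
    "Gender": ["Male", "Female", "Female", "Male",
               "Female", "Male", "Male", "Female"],
    "Purchased": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

for pair in [("Age", "Salary"), ("Gender", "Salary"), ("Gender", "Purchased")]:
    print(pair, bivariate_summary(df, *pair))
```

A helper like this is only a starting point. The visualization and documentation steps still need human judgment, and eight rows are far too few for the tests to be meaningful.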

Why Bivariate Analysis is Important

Bivariate Analysis is the bridge between understanding individual features and building predictive models. It reveals how variables interact, which features influence the target, and whether relationships exist that can improve model performance.

Many important Machine Learning decisions, including feature selection, feature engineering, multicollinearity handling, and model choice, are guided by insights obtained during Bivariate Analysis. It is one of the most valuable stages of Exploratory Data Analysis and an essential skill for every Data Scientist and Machine Learning Engineer.