After understanding individual variables through Univariate Analysis, the next step in Exploratory Data Analysis (EDA) is to understand how two variables interact with each other. This process is known as Bivariate Analysis.

The word "bivariate" combines "bi" (two) and "variate" (variable).

Bivariate Analysis studies the relationship between two variables to identify patterns, dependencies, trends, and associations.

It helps answer questions such as:

  • Does salary increase with experience?
  • Does age influence purchasing behavior?
  • Is there a relationship between education level and income?
  • Are two features highly correlated?
  • Does a feature influence the target variable?

Bivariate Analysis is an important step in a Machine Learning workflow because it helps identify useful predictors, detect redundant features, and generate ideas for feature engineering.

In this article, we will explore Bivariate Analysis in detail, understand different types of variable relationships, learn common statistical techniques and visualizations, and implement practical examples using Python.

What is Bivariate Analysis?

Bivariate Analysis is the process of analyzing the relationship between two variables.

Unlike Univariate Analysis, which studies variables individually, Bivariate Analysis examines whether changes in one variable are associated with changes in another.

Example:

Experience (Years) | Salary
1                  | 30000
3                  | 45000
5                  | 70000
8                  | 100000

Here we want to understand:

Does salary increase as experience increases?

This is a bivariate analysis problem.
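The Python snippets in this article operate on a pandas DataFrame called df. As a minimal sketch, the table above can be loaded like this; the column names "Experience" and "Salary" are an assumption chosen to match the later snippets.

```python
import pandas as pd

# Values copied from the example table. The column names are an assumption
# chosen to match the names used in the later snippets.
df = pd.DataFrame({
    "Experience": [1, 3, 5, 8],
    "Salary": [30000, 45000, 70000, 100000],
})

# Sort by experience and check whether salary never decreases along the way.
ordered = df.sort_values("Experience")
salary_rises = ordered["Salary"].is_monotonic_increasing

print(ordered)
print("Salary rises with experience:", salary_rises)
```

This only checks the ordering of four rows. It says nothing yet about how strong the relationship is, which is what the techniques in the rest of the article measure.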

Why Bivariate Analysis Matters

Machine Learning models learn relationships between variables.

Bivariate Analysis helps:

  • Identify important predictors
  • Detect feature interactions
  • Discover hidden patterns
  • Understand target relationships
  • Detect multicollinearity
  • Improve feature selection

Without understanding feature relationships, model building becomes difficult and often less effective.

Types of Bivariate Analysis

The analysis technique depends on the types of variables involved.

There are three common scenarios:

Variable 1  | Variable 2
Numerical   | Numerical
Numerical   | Categorical
Categorical | Categorical

Each scenario requires different methods.
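A rough way to decide which scenario applies is to look at the column dtypes. The helper below is a hypothetical sketch, not a library function: pair_type and the people DataFrame are names invented for illustration, and it treats every non-numeric column as categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def pair_type(df, col_a, col_b):
    """Return which of the three scenarios a pair of columns falls into."""
    n_numeric = sum(is_numeric_dtype(df[c]) for c in (col_a, col_b))
    return {
        2: "numerical vs numerical",
        1: "numerical vs categorical",
        0: "categorical vs categorical",
    }[n_numeric]


# Small made-up dataset, used only to exercise the helper.
people = pd.DataFrame({
    "Experience": [2, 4, 6],
    "Salary": [35000, 50000, 80000],
    "Gender": ["Male", "Female", "Male"],
    "Purchased": ["Yes", "No", "Yes"],
})

print(pair_type(people, "Experience", "Salary"))
print(pair_type(people, "Gender", "Salary"))
print(pair_type(people, "Gender", "Purchased"))
```

A dtype check is only a starting point. Integer-coded categories (for example a department ID) look numerical to pandas, so such columns may need to be reviewed by hand.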

Numerical vs Numerical Analysis

Both variables contain numerical values.

Example:

Experience | Salary
2          | 35000
4          | 50000
6          | 80000

Questions:

  • Is there a relationship?
  • How strong is the relationship?
  • Is the relationship positive or negative?

Scatter Plot

Scatter plots are the most common visualization for a pair of numerical variables.

Python:

import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame with "Experience" and "Salary" columns
plt.scatter(
    df["Experience"],
    df["Salary"]
)

plt.xlabel("Experience")
plt.ylabel("Salary")

plt.show()

Understanding Scatter Plots

Positive Relationship:

      *
    *
  *
*

As one variable increases, the other also increases.

Negative Relationship:

*
  *
    *
      *

As one variable increases, the other decreases.

No Relationship:

Random pattern with no clear trend.

Correlation

Correlation (here, the Pearson correlation coefficient) measures the strength and direction of the linear relationship between two numerical variables.

Formula:

$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

Range:

$-1 \le r \le 1$
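To make the formula concrete, the sketch below computes r by hand from the covariance and the two standard deviations, then compares the result with the pandas corr method. It reuses the experience and salary values from the first example table.

```python
import pandas as pd

experience = pd.Series([1, 3, 5, 8])
salary = pd.Series([30000, 45000, 70000, 100000])

# Covariance: the average product of the deviations from each mean (divide by N).
cov_xy = ((experience - experience.mean()) * (salary - salary.mean())).mean()

# Standard deviations with the same divisor (ddof=0 means divide by N).
r_manual = cov_xy / (experience.std(ddof=0) * salary.std(ddof=0))

# pandas computes the Pearson coefficient by default.
r_pandas = experience.corr(salary)

print(r_manual, r_pandas)
```

The two values agree. Whether N or N - 1 is used as the divisor does not change r, as long as the same choice is made for the covariance and both standard deviations.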

Correlation Interpretation

Correlation | Meaning
1           | Perfect Positive
0.8         | Strong Positive
0.5         | Moderate Positive
0           | No Linear Relationship
-0.5        | Moderate Negative
-1          | Perfect Negative
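One caveat when reading a correlation of 0: it rules out a linear trend, not every kind of relationship. The sketch below uses made-up data in which y is completely determined by x, yet the correlation is zero because the pattern is a symmetric curve.

```python
import pandas as pd

x = pd.Series(range(10))            # 0, 1, ..., 9
y_linear = 2 * x + 1                # straight line
y_curved = (x - x.mean()) ** 2      # symmetric U-shape around the mean of x

r_linear = x.corr(y_linear)
r_curved = x.corr(y_curved)

print("linear:", r_linear)
print("curved:", r_curved)
```

This is why a scatter plot is worth checking alongside the coefficient: the curve is obvious in a plot even though the number suggests nothing is there.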

Positive Correlation Example

Experience | Salary
1          | 30000
2          | 40000
3          | 50000

As experience increases, salary increases.

Correlation:

$r > 0$

Negative Correlation Example

Age of Car | Price
1          | 1000000
5          | 600000
10         | 250000

As age increases, price decreases.

Correlation:

$r < 0$
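The signs in both examples can be checked directly. This sketch loads the two small tables above (the DataFrame and column names are assumptions) and computes the Pearson coefficient for each.

```python
import pandas as pd

jobs = pd.DataFrame({"Experience": [1, 2, 3], "Salary": [30000, 40000, 50000]})
cars = pd.DataFrame({"Age": [1, 5, 10], "Price": [1000000, 600000, 250000]})

r_jobs = jobs["Experience"].corr(jobs["Salary"])   # positive example
r_cars = cars["Age"].corr(cars["Price"])           # negative example

print("Experience vs Salary:", r_jobs)
print("Age of car vs Price:", r_cars)
```

With only three rows each, these coefficients illustrate the sign convention and little else. On real data, a correlation based on so few points would not be reliable.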

Calculating Correlation in Python

# Pearson correlation coefficient (the pandas default method)
df["Experience"].corr(
    df["Salary"]
)

Correlation Matrix

When multiple numerical features exist, correlation matrices become useful.

Python:

# numeric_only=True skips text columns such as "Gender"
df.corr(numeric_only=True)

Correlation Heatmap

import seaborn as sns

sns.heatmap(
    df.corr(numeric_only=True),
    annot=True  # print each coefficient inside its cell
)

Heatmaps quickly reveal relationships among many variables.

Covariance

Covariance measures how two variables vary together.

Formula:

$\mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$

Interpretation:

Covariance | Meaning
Positive   | Variables move together
Negative   | Variables move opposite
Zero       | No linear relationship
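The formula above divides by N, which gives the population covariance. Many libraries, pandas included, divide by N - 1 instead and report the sample covariance. The sketch below computes both by hand for the experience and salary values from the first example table.

```python
import pandas as pd

experience = pd.Series([1, 3, 5, 8])
salary = pd.Series([30000, 45000, 70000, 100000])

# Product of the deviations from each mean, one value per row.
products = (experience - experience.mean()) * (salary - salary.mean())

cov_population = products.sum() / len(products)        # divide by N
cov_sample = products.sum() / (len(products) - 1)      # divide by N - 1

# pandas uses the N - 1 version by default.
cov_pandas = experience.cov(salary)

print(cov_population, cov_sample, cov_pandas)
```

Both versions are positive here, so the interpretation (the variables move together) is the same. The difference between the two divisors shrinks as the number of rows grows.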

Covariance in Python

# pandas returns the sample covariance (divides by N - 1, not N)
df["Experience"].cov(
    df["Salary"]
)

Correlation vs Covariance

Correlation           | Covariance
Standardized          | Not standardized
Range -1 to 1         | No fixed range
Easier interpretation | Harder interpretation

Correlation is generally preferred for comparing relationships because it does not depend on the units of the variables.
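The practical difference shows up when units change. In this sketch the same salaries are expressed once in currency units and once in thousands (salary_k is a name introduced here for illustration). The covariance changes by a factor of 1000 while the correlation stays the same.

```python
import pandas as pd

experience = pd.Series([1, 3, 5, 8])
salary = pd.Series([30000, 45000, 70000, 100000])
salary_k = salary / 1000   # the same salaries, expressed in thousands

cov_units = experience.cov(salary)
cov_k = experience.cov(salary_k)

r_units = experience.corr(salary)
r_k = experience.corr(salary_k)

print("covariance:", cov_units, "vs", cov_k)
print("correlation:", r_units, "vs", r_k)
```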

Numerical vs Categorical Analysis

One variable is numerical while the other is categorical.

Example:

Gender | Salary
Male   | 50000
Female | 45000
Male   | 60000

Questions:

  • Does salary differ across genders?
  • Does age vary across departments?

Box Plot Analysis

Box plots help compare numerical distributions across categories.

Python:

sns.boxplot(
    x="Gender",
    y="Salary",
    data=df
)

This reveals:

  • Median differences
  • Distribution spread
  • Outliers
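The same three quantities can be computed numerically. The sketch below uses made-up salaries (the staff DataFrame and the box_summary helper are hypothetical) and flags outliers with the 1.5 × IQR rule that box plots conventionally use for their whiskers.

```python
import pandas as pd

# Made-up salaries; one deliberately extreme value in the "Male" group.
staff = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 6,
    "Salary": [50000, 52000, 55000, 58000, 60000, 120000,
               45000, 46000, 48000, 50000, 51000, 52000],
})


def box_summary(values):
    """Median, interquartile range, and outlier count for one group."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return pd.Series({"median": median, "iqr": iqr, "n_outliers": len(outliers)})


# One row of summary statistics per gender.
summary = pd.DataFrame({
    gender: box_summary(salaries)
    for gender, salaries in staff.groupby("Gender")["Salary"]
}).T

print(summary)
```

The single 120000 salary is flagged as an outlier in its group, which is the same point a box plot would draw beyond the whisker.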

Group Statistics

Python:

# Mean salary within each gender group
df.groupby(
    "Gender"
)["Salary"].mean()

Output:

Gender | Average Salary
Male   | 55000
Female | 48000

Violin Plots

Violin plots combine two views:

  • A box-plot-style summary of the distribution
  • A density curve showing where values concentrate

Python:

sns.violinplot(
    x="Department",
    y="Salary",
    data=df
)

ANOVA

ANOVA stands for Analysis of Variance.

Used when:

  • One variable is numerical
  • The other variable is categorical with several groups (typically three or more)

Example:

Compare salaries across departments.

Hypotheses:

$H_0$: All group means are equal.

$H_1$: At least one group mean differs.

Python:

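from scipy.stats import f_oneway

# One array of salaries per department (assumes a "Department" column in df)
groups = [
    g["Salary"].to_numpy()
    for _, g in df.groupby("Department")
]

f_stat, p_value = f_oneway(*groups)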
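As a self-contained illustration, the sketch below runs a one-way ANOVA on made-up salaries for three departments. The department names and values are invented for the example, and the group means were chosen to be far apart.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up salaries for three departments.
salaries = pd.DataFrame({
    "Department": ["Sales"] * 4 + ["IT"] * 4 + ["HR"] * 4,
    "Salary": [40000, 42000, 41000, 43000,
               60000, 62000, 61000, 63000,
               45000, 47000, 46000, 48000],
})

# f_oneway expects one array of values per group.
groups = [g["Salary"].to_numpy() for _, g in salaries.groupby("Department")]
f_stat, p_value = f_oneway(*groups)

print("F statistic:", f_stat)
print("p-value:", p_value)
```

A large F statistic with a small p-value is evidence against equal group means. ANOVA does not say which departments differ; that would need a follow-up comparison between pairs of groups.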

Categorical vs Categorical Analysis

Both variables are categorical.

Example:

Gender | Purchased
Male   | Yes
Female | No

Questions:

  • Is gender related to purchases?
  • Is department related to promotion?

Contingency Tables

Python:

import pandas as pd

pd.crosstab(
    df["Gender"],
    df["Purchased"]
)

Output:

Gender | Yes | No
Male   | 50  | 20
Female | 40  | 30
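Raw counts are easier to compare once they are turned into proportions within each row. The sketch below rebuilds a dataset with the counts from the table above (the customers DataFrame is an assumption made for illustration) and uses the normalize="index" option of pd.crosstab.

```python
import pandas as pd

# Rebuild one row per customer from the counts in the table:
# 50 + 20 male customers, 40 + 30 female customers.
customers = pd.DataFrame({
    "Gender": ["Male"] * 70 + ["Female"] * 70,
    "Purchased": ["Yes"] * 50 + ["No"] * 20 + ["Yes"] * 40 + ["No"] * 30,
})

counts = pd.crosstab(customers["Gender"], customers["Purchased"])

# normalize="index" divides each row by its row total.
rates = pd.crosstab(
    customers["Gender"], customers["Purchased"], normalize="index"
)

print(counts)
print(rates)
```

Here about 71% of male customers purchased (50 of 70) compared with about 57% of female customers (40 of 70). Whether that gap is larger than chance would explain is a question for a statistical test.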

Stacked Bar Charts

Python:

pd.crosstab(
    df["Gender"],
    df["Purchased"]
).plot(
    kind="bar",
    stacked=True
)

Useful for visual comparison.

Chi-Square Test

Used to determine whether two categorical variables are related.

Formula:

$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
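The formula can be worked through by hand on the Gender by Purchased counts from the contingency table shown earlier (50, 20, 40, 30). The expected count for each cell is its row total times its column total, divided by the grand total. This sketch then compares the result with scipy, with the continuity correction switched off so that both follow the plain formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Male, Female. Columns: Yes, No.
observed = np.array([[50, 20],
                     [40, 30]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected counts if gender and purchase were independent.
expected = row_totals @ col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables the Yates continuity correction that scipy
# applies to 2x2 tables by default.
chi2_scipy, p, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(expected)
print(chi2_manual, chi2_scipy)
```

The expected counts come out as 45 and 25 in each row, and the statistic is 28/9, roughly 3.11, from both calculations.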

Chi-Square Hypothesis

Null Hypothesis ($H_0$): Variables are independent.

Alternative Hypothesis ($H_1$): Variables are associated.

Chi-Square Example

Python:

from scipy.stats import chi2_contingency

table = pd.crosstab(
    df["Gender"],
    df["Purchased"]
)

# Returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p, _, _ = chi2_contingency(
    table
)

Interpreting p-Value

p-value | Interpretation (at a 0.05 significance level)
< 0.05  | Reject H0: evidence of an association
≥ 0.05  | Fail to reject H0: no significant evidence of an association
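Applied to the Gender by Purchased counts from the contingency table shown earlier, the decision rule looks like the sketch below. The threshold of 0.05 is a common convention rather than a fixed rule, and the variable names are chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Male, Female. Columns: Yes, No.
observed = np.array([[50, 20],
                     [40, 30]])

alpha = 0.05  # conventional significance level (an assumption, not a rule)

# For a 2x2 table scipy applies the Yates continuity correction by default.
chi2, p_value, dof, _ = chi2_contingency(observed)

decision = "reject H0" if p_value < alpha else "fail to reject H0"

print("chi-square:", chi2)
print("p-value:", p_value)
print("decision:", decision)
```

For these counts the p-value is above 0.05, so the data do not provide significant evidence that gender and purchasing are associated. Failing to reject the null hypothesis is not proof of independence; it only means this sample cannot rule it out.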

Relationship with Target Variable

One of the most important uses of Bivariate Analysis is studying feature-target relationships.

Example:

Target:

Purchased

Features:

  • Age
  • Salary
  • Gender

Questions:

  • Does age influence purchases?
  • Does salary influence purchases?

These insights guide feature selection.
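A minimal sketch of such a feature-target check is shown below. The shoppers DataFrame is made up for illustration. Numerical features are compared through group means per target class, and the categorical feature through a row-normalized contingency table.

```python
import pandas as pd

# Made-up data with the features and target named in the example above.
shoppers = pd.DataFrame({
    "Age": [22, 25, 28, 35, 41, 45, 52, 58],
    "Salary": [25000, 32000, 30000, 52000, 61000, 58000, 75000, 80000],
    "Gender": ["Male", "Female", "Female", "Male",
               "Female", "Male", "Male", "Female"],
    "Purchased": ["No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Numerical features: average value within each target class.
numeric_by_target = shoppers.groupby("Purchased")[["Age", "Salary"]].mean()

# Categorical feature: share of each target class within each gender.
gender_by_target = pd.crosstab(
    shoppers["Gender"], shoppers["Purchased"], normalize="index"
)

print(numeric_by_target)
print(gender_by_target)
```

In this toy data, buyers are older and better paid on average, while the purchase rate is the same for both genders. On that evidence Age and Salary would be candidates to keep, and Gender would look uninformative on its own.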

Detecting Multicollinearity

Multicollinearity occurs when two or more predictor features are highly correlated with each other.

Example:

Feature 1 | Feature 2
Salary    | Annual Income

Correlation:

0.98

Both features contain nearly identical information.

Problems:

  • Unstable coefficients
  • Poor interpretability
  • Redundant features that add complexity without adding information

Bivariate Analysis helps detect this issue.
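One simple way to screen for this is to scan the correlation matrix for pairs above a chosen cutoff. The sketch below uses made-up data, and the cutoff of 0.9 is an assumption for illustration; a suitable value depends on the problem.

```python
import pandas as pd

# Made-up data: AnnualIncome is roughly 12 times the monthly Salary.
features = pd.DataFrame({
    "Salary": [3000, 3500, 4200, 5000, 6100, 7500],
    "AnnualIncome": [36500, 41800, 50900, 59700, 73600, 89800],
    "Age": [41, 23, 35, 52, 29, 38],
})

threshold = 0.9  # assumed cutoff for "highly correlated"
corr = features.corr()
cols = corr.columns

# Visit each pair once (upper triangle of the matrix, excluding the diagonal).
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

for first, second, r in high_pairs:
    print(first, second, round(r, 3))
```

Only the Salary and AnnualIncome pair is flagged. A common response is to keep one feature of such a pair and drop the other.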

Pair Plots

Pair plots visualize relationships among multiple numerical variables.

Python:

sns.pairplot(df)

Useful for:

  • Correlation detection
  • Cluster identification
  • Outlier detection

Real-World Example

Suppose a bank wants to predict loan approval.

Features:

  • Income
  • Credit Score
  • Employment Type

Target:

Loan Approved

Bivariate Analysis may reveal:

  • Income is strongly associated with approval.
  • Credit score is strongly associated with approval.
  • Employment type shows a moderate association with approval.

These insights help prioritize important features.

Common Insights Obtained from Bivariate Analysis

  • Strong predictors
  • Weak predictors
  • Feature interactions
  • Correlations
  • Class relationships
  • Multicollinearity
  • Target associations

Benefits of Bivariate Analysis

  • Better feature understanding
  • Improved feature selection
  • Early detection of multicollinearity
  • Improved feature engineering
  • Better model performance
  • Stronger business insights

Limitations of Bivariate Analysis

Bivariate Analysis studies only two variables at a time.

It cannot fully capture:

  • Complex interactions
  • Higher-order relationships
  • Multivariable dependencies

These require Multivariate Analysis.
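A small constructed example makes the limitation concrete. In the sketch below the target is 1 only when exactly one of two binary features is 1 (an XOR pattern). Each feature on its own shows zero correlation with the target, yet the two together determine it completely. The data and column names are invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "x1": [0, 0, 1, 1],
    "x2": [0, 1, 0, 1],
})

# Target is 1 only when exactly one of the two inputs is 1.
data["target"] = data["x1"] ^ data["x2"]

# Bivariate view: each feature against the target, one at a time.
r_x1 = data["x1"].corr(data["target"])
r_x2 = data["x2"].corr(data["target"])

# Looking at both features together reveals the pattern.
joint = data.groupby(["x1", "x2"])["target"].mean()

print("x1 vs target:", r_x1)
print("x2 vs target:", r_x2)
print(joint)
```

A feature screened out by a pairwise check can therefore still matter in combination with others, which is one reason not to rely on bivariate results alone.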

Best Practices

  • Start with Univariate Analysis first
  • Use scatter plots for numerical vs numerical pairs
  • Use box plots for categorical vs numerical pairs
  • Use Chi-Square tests for categorical vs categorical pairs
  • Check correlations before model training
  • Investigate highly correlated features
  • Study feature-target relationships carefully

Bivariate Analysis Workflow

A typical workflow is:

  1. Identify variable types
  2. Choose appropriate visualization
  3. Calculate statistical measures
  4. Analyze relationships
  5. Identify significant associations
  6. Detect multicollinearity
  7. Document findings
  8. Use insights for feature engineering and selection
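Steps 1 and 3 of the workflow above can be sketched as a single helper that picks a statistical measure from the column types. This is a hypothetical illustration, not a standard API: bivariate_summary and the shoppers data are invented here, and every non-numeric column is treated as categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from scipy.stats import chi2_contingency, f_oneway, pearsonr


def bivariate_summary(df, col_a, col_b):
    """Pick a test from the column types and return its statistic and p-value."""
    a_num, b_num = is_numeric_dtype(df[col_a]), is_numeric_dtype(df[col_b])

    if a_num and b_num:
        stat, p = pearsonr(df[col_a], df[col_b])
        method = "Pearson correlation"
    elif a_num or b_num:
        num, cat = (col_a, col_b) if a_num else (col_b, col_a)
        groups = [g.to_numpy() for _, g in df.groupby(cat)[num]]
        stat, p = f_oneway(*groups)
        method = "one-way ANOVA"
    else:
        table = pd.crosstab(df[col_a], df[col_b])
        stat, p, _, _ = chi2_contingency(table)
        method = "chi-square test"

    return {"method": method, "statistic": stat, "p_value": p}


# Made-up data used only to exercise the helper.
shoppers = pd.DataFrame({
    "Age": [22, 25, 28, 35, 41, 45, 52, 58],
    "Salary": [25000, 32000, 30000, 52000, 61000, 58000, 75000, 80000],
    "Gender": ["Male", "Female", "Female", "Male",
               "Female", "Male", "Male", "Female"],
    "Purchased": ["No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
})

pairs = [("Age", "Salary"), ("Purchased", "Age"), ("Gender", "Purchased")]
results = {pair: bivariate_summary(shoppers, *pair) for pair in pairs}

for pair, result in results.items():
    print(pair, result["method"], round(result["p_value"], 4))
```

With only eight rows the p-values here mean very little, and the chi-square test in particular is unreliable when expected counts are this small. The sketch shows the dispatch logic, and the remaining workflow steps (plots, interpretation, documentation) still need human judgment.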

Why Bivariate Analysis is Important

Bivariate Analysis is the bridge between understanding individual features and building predictive models. It reveals how variables interact, which features influence the target, and whether relationships exist that can improve model performance.

Many important Machine Learning decisions, including feature selection, feature engineering, multicollinearity handling, and model choice, are guided by insights obtained during Bivariate Analysis. It is one of the most valuable stages of Exploratory Data Analysis and a core skill for Data Scientists and Machine Learning Engineers.