After understanding individual variables through Univariate Analysis, the next step in Exploratory Data Analysis (EDA) is to understand how two variables interact with each other. This process is known as Bivariate Analysis.
The word "Bivariate" means:
Bi = Two
Variate = Variable
Bivariate Analysis studies the relationship between two variables to identify patterns, dependencies, trends, and associations.
It helps answer questions such as:
- Does salary increase with experience?
- Does age influence purchasing behavior?
- Is there a relationship between education level and income?
- Are two features highly correlated?
- Does a feature influence the target variable?
Bivariate Analysis is one of the most important steps in Machine Learning because it helps identify useful predictors, detect redundant features, and generate insights for feature engineering.
In this article, we will explore Bivariate Analysis in detail, understand different types of variable relationships, learn common statistical techniques and visualizations, and implement practical examples using Python.
What is Bivariate Analysis?
Bivariate Analysis is the process of analyzing the relationship between two variables.
Unlike Univariate Analysis, which studies variables individually, Bivariate Analysis examines whether changes in one variable are associated with changes in another.
Example:
| Experience (Years) | Salary |
|---|---|
| 1 | 30000 |
| 3 | 45000 |
| 5 | 70000 |
| 8 | 100000 |
Here we want to understand:
Does salary increase as experience increases?
This is a bivariate analysis problem.
Why Bivariate Analysis Matters
Machine Learning models learn relationships between variables.
Bivariate Analysis helps:
- Identify important predictors
- Detect feature interactions
- Discover hidden patterns
- Understand target relationships
- Detect multicollinearity
- Improve feature selection
Without understanding feature relationships, model building becomes difficult and often less effective.
Types of Bivariate Analysis
The analysis technique depends on the types of variables involved.
There are three common scenarios:
| Variable 1 | Variable 2 |
|---|---|
| Numerical | Numerical |
| Numerical | Categorical |
| Categorical | Categorical |
Each scenario requires different methods.
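To make the table above concrete, here is a minimal sketch that picks a technique from the types of the two variables. The function names and the `TECHNIQUES` lookup are hypothetical helpers written for this illustration, and the type check is deliberately naive: it treats every column of plain numbers as numerical, so numerically coded categories (for example 0/1 flags) would need manual handling.

```python
from numbers import Number

def variable_kind(values):
    """Classify a column as 'numerical' or 'categorical' from its values."""
    # bool is a subclass of int, so exclude it: True/False act as labels here.
    def is_number(v):
        return isinstance(v, Number) and not isinstance(v, bool)
    return "numerical" if all(is_number(v) for v in values) else "categorical"

# Hypothetical lookup: one suggested plot and one statistic per scenario.
TECHNIQUES = {
    ("numerical", "numerical"): ("scatter plot", "correlation"),
    ("categorical", "numerical"): ("box plot", "ANOVA"),
    ("categorical", "categorical"): ("stacked bar chart", "chi-square test"),
}

def suggest_technique(col_a, col_b):
    """Return (plot, statistic) for a pair of columns, in either order."""
    kinds = tuple(sorted((variable_kind(col_a), variable_kind(col_b))))
    return TECHNIQUES[kinds]

experience = [2, 4, 6]
salary = [35000, 50000, 80000]
gender = ["Male", "Female", "Male"]
purchased = ["Yes", "No", "Yes"]

print(suggest_technique(experience, salary))
print(suggest_technique(salary, gender))
print(suggest_technique(gender, purchased))
```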
Numerical vs Numerical Analysis
Both variables contain numerical values.
Example:
| Experience | Salary |
|---|---|
| 2 | 35000 |
| 4 | 50000 |
| 6 | 80000 |
Questions:
- Is there a relationship?
- How strong is the relationship?
- Is the relationship positive or negative?
Scatter Plot
Scatter plots are the most common visualization for numerical variables.
Python:
import matplotlib.pyplot as plt
# df is assumed to be a pandas DataFrame with Experience and Salary columns
plt.scatter(df["Experience"], df["Salary"])
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.show()
Understanding Scatter Plots
Positive Relationship:
The points rise from the lower left to the upper right.
As one variable increases, the other also increases.
Negative Relationship:
The points fall from the upper left to the lower right.
As one variable increases, the other decreases.
No Relationship:
Random pattern with no clear trend.
Correlation
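Correlation measures the strength and direction of a linear relationship between two numerical variables. The most common measure is the Pearson correlation coefficient.
Formula:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
Range:
$$-1 \le r \le 1$$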
Correlation Interpretation
| Correlation | Meaning |
|---|---|
| 1 | Perfect Positive |
| 0.8 | Strong Positive |
| 0.5 | Moderate Positive |
| 0 | No Linear Relationship |
| -0.5 | Moderate Negative |
| -1 | Perfect Negative |
Positive Correlation Example
| Experience | Salary |
|---|---|
| 1 | 30000 |
| 2 | 40000 |
| 3 | 50000 |
As experience increases, salary increases.
Correlation: $r = 1$, because the three points lie exactly on a straight line.
Negative Correlation Example
| Age of Car | Price |
|---|---|
| 1 | 1000000 |
| 5 | 600000 |
| 10 | 250000 |
As age increases, price decreases.
Correlation: $r \approx -0.99$, a strong negative relationship.
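As a check on the two examples above, the sketch below computes the Pearson correlation coefficient directly from its definition, using only the standard library. The helper name `pearson_r` is chosen for this illustration; the data is copied from the two small tables.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed directly from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of summed squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Experience vs Salary (positive example)
r_salary = pearson_r([1, 2, 3], [30000, 40000, 50000])
# Age of Car vs Price (negative example)
r_car = pearson_r([1, 5, 10], [1000000, 600000, 250000])

print(round(r_salary, 3), round(r_car, 3))  # 1.0 -0.995
```

With only three rows per table these values are illustrative; on real data the same calculation is what `Series.corr` performs by default.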
Calculating Correlation in Python
df["Experience"].corr(df["Salary"])
Correlation Matrix
When multiple numerical features exist, correlation matrices become useful.
Python:
# numeric_only=True skips text columns such as Gender
df.corr(numeric_only=True)
Correlation Heatmap
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
Heatmaps quickly reveal relationships among many variables.
Covariance
Covariance measures how two variables vary together.
Formula:
$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
Interpretation:
| Covariance | Meaning |
|---|---|
| Positive | Variables move together |
| Negative | Variables move in opposite directions |
| Zero | No linear relationship |
Covariance in Python
df["Experience"].cov(df["Salary"])
Correlation vs Covariance
| Correlation | Covariance |
|---|---|
| Standardized | Not standardized |
| Range -1 to 1 | No fixed range |
| Easier interpretation | Harder interpretation |
Correlation is generally preferred because it is unit-free, so values can be compared across different pairs of variables.
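The difference between the two measures is easiest to see by changing units. The following sketch uses the Experience and Salary rows from the first example table and recomputes both measures after expressing salary in thousands. The helper names are chosen for this illustration, and the covariance uses the n - 1 denominator, which is also the pandas default.

```python
import math

def sample_cov(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Correlation: covariance divided by the product of standard deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

experience = [1, 3, 5, 8]
salary = [30000, 45000, 70000, 100000]
salary_k = [s / 1000 for s in salary]  # same salaries, expressed in thousands

cov_raw = sample_cov(experience, salary)
cov_k = sample_cov(experience, salary_k)
r_raw = pearson_r(experience, salary)
r_k = pearson_r(experience, salary_k)

print(cov_raw)   # 91250.0
print(cov_k)     # 91.25
print(round(r_raw, 4), round(r_k, 4))  # 0.9969 0.9969
```

The covariance shrinks by a factor of 1000 when the unit changes, while the correlation stays the same.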
Numerical vs Categorical Analysis
One variable is numerical while the other is categorical.
Example:
| Gender | Salary |
|---|---|
| Male | 50000 |
| Female | 45000 |
| Male | 60000 |
Questions:
- Does salary differ across genders?
- Does age vary across departments?
Box Plot Analysis
Box plots help compare numerical distributions across categories.
Python:
sns.boxplot(x="Gender", y="Salary", data=df)
This reveals:
- Median differences
- Distribution spread
- Outliers
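The quantities a box plot draws can also be computed directly. The sketch below uses made-up salaries (not taken from the article's data) and one common convention: quartiles by linear interpolation, and outliers defined as points more than 1.5 times the interquartile range beyond the quartiles. The helper name `box_summary` is hypothetical.

```python
from statistics import quantiles

def box_summary(values):
    """Quartiles plus the points beyond the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical salaries per category, for illustration only.
salaries = {
    "Male": [50000, 60000, 52000, 58000, 120000],
    "Female": [45000, 47000, 49000, 51000, 48000],
}

summary = {group: box_summary(vals) for group, vals in salaries.items()}
for group, stats in summary.items():
    print(group, stats)
```

Here the Male group has a higher median and a wider box, and its 120000 value falls outside the upper fence, so a box plot would draw it as an individual outlier point.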
Group Statistics
Python:
df.groupby("Gender")["Salary"].mean()
Output (illustrative values):
| Gender | Average Salary |
|---|---|
| Male | 55000 |
| Female | 48000 |
Violin Plots
Violin plots show both:
- The distribution summary (median and quartiles)
- The density, meaning where values are concentrated
Python:
sns.violinplot(x="Department", y="Salary", data=df)
ANOVA
ANOVA stands for:
Analysis of Variance
Used when:
- Numerical target
- Multiple categories
Example:
Compare salaries across departments.
Null Hypothesis:
All group means are equal.
Python:
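from scipy.stats import f_oneway
# One array of salaries per department
groups = [g["Salary"].to_numpy() for _, g in df.groupby("Department")]
f_stat, p_value = f_oneway(*groups)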
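To see what the F-statistic measures, it can be computed by hand as the ratio of between-group variance to within-group variance. This is a minimal sketch on made-up salaries for three hypothetical departments. It reproduces only the statistic; the p-value still comes from `f_oneway`.

```python
def anova_f(*groups):
    """One-way ANOVA F-statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical salaries (in thousands) for three departments.
sales = [50, 55, 60]
engineering = [60, 65, 70]
finance = [70, 75, 80]

f_stat = anova_f(sales, engineering, finance)
print(f_stat)  # 12.0
```

A large F value means the group means differ by much more than the spread inside each group would explain.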
Categorical vs Categorical Analysis
Both variables are categorical.
Example:
| Gender | Purchased |
|---|---|
| Male | Yes |
| Female | No |
Questions:
- Is gender related to purchases?
- Is department related to promotion?
Contingency Tables
Python:
import pandas as pd
pd.crosstab(df["Gender"], df["Purchased"])
Output:
| Gender | Yes | No |
|---|---|---|
| Male | 50 | 20 |
| Female | 40 | 30 |
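Raw counts are easier to compare once each row is converted to proportions. The sketch below does this in plain Python with the counts from the table above; the dictionary layout is an assumption made for this illustration. In pandas, `pd.crosstab(..., normalize="index")` should give the same row shares.

```python
# Counts copied from the contingency table above.
counts = {
    "Male": {"Yes": 50, "No": 20},
    "Female": {"Yes": 40, "No": 30},
}

# Divide each cell by its row total so groups of different sizes are comparable.
row_share = {
    gender: {answer: n / sum(row.values()) for answer, n in row.items()}
    for gender, row in counts.items()
}

for gender, row in row_share.items():
    print(gender, {answer: round(share, 3) for answer, share in row.items()})
# Male {'Yes': 0.714, 'No': 0.286}
# Female {'Yes': 0.571, 'No': 0.429}
```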
Stacked Bar Charts
Python:
pd.crosstab(df["Gender"], df["Purchased"]).plot(kind="bar", stacked=True)
Useful for visual comparison.
Chi-Square Test
Used to determine whether two categorical variables are related.
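Formula:
$$\chi^2 = \sum_{i}\frac{(O_i - E_i)^2}{E_i}$$
Here $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected if the two variables were independent.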
Chi-Square Hypothesis
Null Hypothesis:
Variables are independent.
Alternative Hypothesis:
Variables are associated.
Chi-Square Example
Python:
from scipy.stats import chi2_contingency
table = pd.crosstab(df["Gender"], df["Purchased"])
chi2, p, dof, expected = chi2_contingency(table)
Interpreting p-Value
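| p-value | Interpretation |
|---|---|
| < 0.05 | Reject the null hypothesis: the variables are significantly associated |
| ≥ 0.05 | Fail to reject the null hypothesis: no significant association detected |
The 0.05 threshold is a common convention, not a fixed rule.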
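As a worked example, the sketch below computes the Chi-Square statistic by hand for the Gender vs Purchased counts shown in the contingency table earlier (50/20 and 40/30). It computes the statistic both with and without Yates' continuity correction, because `chi2_contingency` applies that correction by default to 2x2 tables. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom, and the helper name is chosen for this illustration.

```python
import math

observed = [[50, 20],   # Male:   Yes, No
            [40, 30]]   # Female: Yes, No

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(obs, exp, yates=False):
    """Sum of (O - E)^2 / E, optionally with Yates' continuity correction."""
    shift = 0.5 if yates else 0.0
    return sum(
        max(abs(o - e) - shift, 0.0) ** 2 / e
        for o_row, e_row in zip(obs, exp)
        for o, e in zip(o_row, e_row)
    )

chi2_plain = chi_square(observed, expected)
chi2_yates = chi_square(observed, expected, yates=True)

# A 2x2 table has 1 degree of freedom, where the p-value has a closed form.
p_plain = math.erfc(math.sqrt(chi2_plain / 2))
p_yates = math.erfc(math.sqrt(chi2_yates / 2))

print(round(chi2_plain, 3), round(p_plain, 3))
print(round(chi2_yates, 3), round(p_yates, 3))
```

Both p-values come out above 0.05, so for these counts the null hypothesis of independence is not rejected, even though the purchase rates differ between the two groups.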
Relationship with Target Variable
One of the most important uses of Bivariate Analysis is studying feature-target relationships.
Example:
Target:
Purchased
Features:
- Age
- Salary
- Gender
Questions:
- Does age influence purchases?
- Does salary influence purchases?
These insights guide feature selection.
Detecting Multicollinearity
Multicollinearity occurs when features are highly correlated.
Example:
| Feature 1 | Feature 2 |
|---|---|
| Salary | Annual Income |
Correlation: close to 1, because both features contain nearly identical information.
Problems (most visible in linear models such as linear and logistic regression):
- Unstable coefficients
- Poor interpretability
- Redundant inputs that can reduce model performance
Bivariate Analysis helps detect this issue.
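One simple way to apply this is to compute the correlation for every pair of numerical features and flag pairs above a cutoff. The sketch below is a standard-library illustration on made-up data; the feature values, the 0.9 threshold, and the helper name are all assumptions, not a fixed rule.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: AnnualIncome is roughly 12 x Salary, Age is only weakly related.
features = {
    "Salary": [30000, 45000, 70000, 100000],
    "AnnualIncome": [365000, 538000, 845000, 1195000],
    "Age": [40, 25, 31, 29],
}

THRESHOLD = 0.9  # assumed cutoff; choose one that suits the model and domain

# Check every unordered pair of features once.
flagged = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson_r(features[a], features[b])) > THRESHOLD
]
print(flagged)  # [('Salary', 'AnnualIncome')]
```

For a flagged pair, a common choice is to keep only one of the two features or to combine them.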
Pair Plots
Pair plots visualize relationships among multiple numerical variables.
Python:
sns.pairplot(df)
Useful for:
- Correlation detection
- Cluster identification
- Outlier detection
Real-World Example
Suppose a bank wants to predict loan approval.
Features:
- Income
- Credit Score
- Employment Type
Target:
Loan Approved
Bivariate Analysis may reveal:
- Income is strongly associated with approval.
- Credit score is strongly associated with approval.
- Employment type is moderately associated with approval.
These insights help prioritize important features.
Common Insights Obtained from Bivariate Analysis
- Strong predictors
- Weak predictors
- Feature interactions
- Correlations
- Class relationships
- Multicollinearity
- Target associations
Benefits of Bivariate Analysis
- Better feature understanding
- Improved feature selection
- Early detection of multicollinearity
- Improved feature engineering
- Better model performance
- Stronger business insights
Limitations of Bivariate Analysis
Bivariate Analysis studies only two variables at a time.
It cannot fully capture:
- Complex interactions
- Higher-order relationships
- Multivariable dependencies
These require Multivariate Analysis.
Best Practices
- Start with Univariate Analysis first
- Use scatter plots for numerical vs numerical variables
- Use box plots for categorical vs numerical variables
- Use Chi-Square tests for categorical vs categorical variables
- Check correlations before model training
- Investigate highly correlated features
- Study feature-target relationships carefully
Bivariate Analysis Workflow
A typical workflow is:
- Identify variable types
- Choose appropriate visualization
- Calculate statistical measures
- Analyze relationships
- Identify significant associations
- Detect multicollinearity
- Document findings
- Use insights for feature engineering and selection
Why Bivariate Analysis is Important
Bivariate Analysis is the bridge between understanding individual features and building predictive models. It reveals how variables interact, which features influence the target, and whether relationships exist that can improve model performance.
Many important Machine Learning decisions, including feature selection, feature engineering, multicollinearity handling, and model choice, are guided by insights obtained during Bivariate Analysis. It is one of the most valuable stages of Exploratory Data Analysis and an essential skill for Data Scientists and Machine Learning Engineers.