Before building Machine Learning models, it is essential to understand the data thoroughly. One of the first and most important steps in Exploratory Data Analysis (EDA) is Univariate Analysis.
As the name suggests, "uni" means one and "variate" refers to a variable.
Univariate Analysis involves analyzing a single variable at a time to understand its characteristics, distribution, spread, and patterns.
It helps answer questions such as:
- What values does the feature contain?
- What is the average value?
- Is the feature normally distributed?
- Are there outliers?
- Is the data skewed?
- Does the feature require transformation?
Univariate Analysis forms the foundation of EDA because understanding individual features is necessary before studying relationships between multiple features.
In this article, we will explore Univariate Analysis in detail, understand its importance, learn common statistical measures and visualizations, and implement practical examples using Python.
What is Univariate Analysis?
Univariate Analysis is the process of analyzing a single variable independently without considering relationships with other variables.
Example:
Dataset:
| Age |
|---|
| 21 |
| 25 |
| 30 |
| 35 |
| 40 |
Analyzing only the Age column is Univariate Analysis.
Why Univariate Analysis Matters
Before building Machine Learning models, we must understand:
- Data quality
- Distribution
- Missing values
- Outliers
- Variability
Univariate Analysis provides these insights quickly.
Benefits include:
- Better data understanding
- Easier outlier detection
- Improved feature engineering
- Better preprocessing decisions
- Improved model performance
Types of Variables in Univariate Analysis
Variables are generally categorized into:
| Type | Examples |
|---|---|
| Numerical | Age, Salary, Height |
| Categorical | Gender, Country, Department |
The analysis approach differs for each type.
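As a rough illustration of the split above, pandas can separate columns by dtype. The DataFrame below is a made-up toy dataset (its values are assumptions, not data from this article), and the dtype-based rule is only a first pass.

```python
import pandas as pd

# Hypothetical toy dataset; column names follow the examples in the table above.
df = pd.DataFrame({
    "Age": [21, 25, 30, 35, 40],
    "Salary": [30000, 35000, 40000, 45000, 50000],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

# Columns with a numeric dtype (int, float) are treated as numerical features.
numerical_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else (here: strings) is treated as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Keep in mind that integer-coded categories (for example, postal codes or class labels stored as numbers) will show up as numerical under this rule, so the result is worth a manual check.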
Univariate Analysis for Numerical Variables
Numerical features can be analyzed using:
- Mean
- Median
- Mode
- Variance
- Standard Deviation
- Range
- Skewness
- Kurtosis
Measures of Central Tendency
Central tendency describes the center of the data.
Common measures include:
- Mean
- Median
- Mode
Mean
Mean is the average value.
Formula:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Example:
Values:
10, 20, 30, 40
Mean:
$$\bar{x} = \frac{10 + 20 + 30 + 40}{4} = 25$$
Python:
df["Age"].mean()
Median
Median is the middle value after sorting.
Example:
Values:
10, 20, 30, 40, 50
Median:
30
Python:
df["Age"].median()
Why Median Matters
Median is less sensitive to outliers.
Example:
Values:
10, 20, 30, 40, 1000
Mean:
220
Median:
30
Median better represents the typical value.
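The effect of a single extreme value can be checked directly. This is a small sketch using the example values above; the variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 1000])

mean_value = values.mean()      # pulled upward by the value 1000
median_value = values.median()  # middle value of the sorted data

# Drop the extreme value to see how much each statistic moves.
without_outlier = values[values < 1000]

print("With outlier:    mean =", mean_value, " median =", median_value)
print("Without outlier: mean =", without_outlier.mean(),
      " median =", without_outlier.median())
```

Removing one value moves the mean from 220 to 25, while the median only moves from 30 to 25.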
Mode
Mode is the most frequently occurring value.
Example:
Values:
10, 20, 20, 30
Mode:
20
Python:
df["Age"].mode()
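One detail worth knowing is that `mode()` in pandas returns a Series rather than a single number, because a feature can have several modes. The sketch below uses the example values above plus a second, made-up series with a tie.

```python
import pandas as pd

single_mode = pd.Series([10, 20, 20, 30])
tied_modes = pd.Series([10, 10, 20, 20, 30])  # hypothetical values with a tie

# mode() returns every value that shares the highest frequency, sorted.
modes_single = single_mode.mode().tolist()
modes_tied = tied_modes.mode().tolist()

# Take the first entry when a single representative value is needed.
first_mode = single_mode.mode().iloc[0]

print(modes_single, modes_tied, first_mode)
```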
Measures of Dispersion
Dispersion measures how spread out the data is.
Common measures include:
- Range
- Variance
- Standard Deviation
- Interquartile Range (IQR)
Range
Formula:
$$\text{Range} = \text{Maximum} - \text{Minimum}$$
Example:
Values:
10, 20, 30, 40
Range:
$$40 - 10 = 30$$
Python:
df["Age"].max() - df["Age"].min()
Variance
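Variance measures the average squared deviation from the mean.
Formula:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Higher variance indicates greater spread.
Python:
df["Age"].var()
Note that pandas computes the sample variance by default, dividing by n - 1 instead of n. Pass ddof=0 to divide by n as in the formula.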
Standard Deviation
Standard deviation is the square root of variance.
Formula:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Interpretation:
- Low standard deviation → values close to mean
- High standard deviation → values spread out
Python:
df["Age"].std()
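The choice of divisor matters for small samples. The sketch below compares the pandas default (`ddof=1`, sample statistics) with `ddof=0` (population statistics) on four illustrative values.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

# Squared deviations from the mean (25) sum to 500.
sample_var = values.var()            # 500 / (4 - 1), the pandas default
population_var = values.var(ddof=0)  # 500 / 4

# Standard deviation is the square root of the matching variance.
sample_std = values.std()
population_std = values.std(ddof=0)

print("Sample:     var =", round(sample_var, 2), " std =", round(sample_std, 2))
print("Population: var =", population_var, " std =", round(population_std, 2))
```

On large datasets the two versions are nearly identical, so the distinction mostly matters when a feature has only a handful of observations.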
Interquartile Range (IQR)
IQR measures the spread of the middle 50% of the data.
Formula:
$$\text{IQR} = Q_3 - Q_1$$
Where:
- Q1 = 25th percentile
- Q3 = 75th percentile
Python:
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
Understanding Data Distribution
One of the main goals of Univariate Analysis is understanding distributions.
Common distributions include:
- Normal Distribution
- Right-Skewed Distribution
- Left-Skewed Distribution
- Uniform Distribution
Normal Distribution
A Normal Distribution is symmetric around the mean.
Characteristics:
- Bell-shaped curve
- Mean ≈ Median ≈ Mode
Example:
Human heights often follow a normal distribution.
Visualizing Normal Distribution
Histogram:
df["Height"].hist()
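A roughly symmetric, bell-shaped histogram suggests the feature may be approximately normal. It is not proof, since the apparent shape depends on the number of bins and the sample size.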
Skewness
Skewness measures asymmetry.
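Formula (one common moment-based definition):
$$\text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\sigma^3}$$
Here $\bar{x}$ is the mean, $\sigma$ is the standard deviation, and $n$ is the number of values. Skewness is 0 for a perfectly symmetric distribution.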
Right Skew (Positive Skew)
Characteristics:
- Long tail on the right
- Mean > Median
Examples:
- Income
- House prices
- Revenue
Python:
df["Income"].skew()
Left Skew (Negative Skew)
Characteristics:
- Long tail on the left
- Mean < Median
Examples:
- Retirement age
- Exam scores in easy exams
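The mean/median relationships listed for right and left skew can be checked on small samples. Both series below are made-up values chosen only to have one long tail each.

```python
import pandas as pd

# Hypothetical samples: one long right tail, one long left tail.
right_skewed = pd.Series([20, 25, 30, 35, 40, 200])
left_skewed = pd.Series([5, 80, 85, 90, 95, 100])

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(f"{name}: skew={s.skew():.2f}, mean={s.mean():.1f}, median={s.median():.1f}")
```

The right-skewed sample has positive skewness with the mean above the median, and the left-skewed sample shows the reverse. Note that `skew()` in pandas applies a small-sample bias correction, so its value can differ slightly from the plain moment-based definition.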
Interpreting Skewness
| Skewness Value | Interpretation |
|---|---|
| 0 | Symmetric |
| > 0 | Right Skewed |
| < 0 | Left Skewed |
Why Skewness Matters
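Some Machine Learning algorithms, such as linear models and distance-based methods, tend to perform better when features are roughly symmetric. Tree-based models are far less sensitive to skew.
Skewed data often requires:
- Log Transformation
- Square Root Transformation
- Box-Cox Transformation (requires strictly positive values)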
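To see how a transformation changes skewness, the sketch below generates a synthetic right-skewed "Income" feature (a log-normal sample with a fixed seed, which is an assumption for illustration) and compares skewness before and after. It uses `np.log1p`, which computes log(1 + x) and therefore also tolerates zeros.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data; the distribution and seed are illustrative choices.
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="Income")

log_income = np.log1p(income)   # log(1 + x)
sqrt_income = np.sqrt(income)   # milder transformation

skew_before = income.skew()
skew_sqrt = sqrt_income.skew()
skew_log = log_income.skew()

print(f"raw:  {skew_before:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

Which transformation fits best depends on the feature, so it helps to compare skewness and a histogram before and after applying one.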
Kurtosis
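Kurtosis measures the heaviness of the tails of a distribution.
Formula (excess kurtosis, one common moment-based definition):
$$\text{Excess Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{\sigma^4} - 3$$
It describes the likelihood of extreme values. Subtracting 3 sets the value for a normal distribution to 0.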
Python:
df["Salary"].kurt()
Interpreting Kurtosis
| Excess Kurtosis (as returned by kurt()) | Interpretation |
|---|---|
| 0 | Normal-like |
| > 0 | Heavy Tails |
| < 0 | Light Tails |
Higher kurtosis often indicates more outliers.
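The interpretation table can be illustrated with synthetic samples. The three distributions below (normal, Laplace, uniform) and the seed are assumptions chosen to represent normal-like, heavy-tailed, and light-tailed data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

normal_like = pd.Series(rng.normal(size=5000))    # excess kurtosis near 0
heavy_tailed = pd.Series(rng.laplace(size=5000))  # more extreme values
light_tailed = pd.Series(rng.uniform(size=5000))  # no long tails

for name, s in [("normal", normal_like),
                ("laplace", heavy_tailed),
                ("uniform", light_tailed)]:
    # kurt() reports excess kurtosis, so a normal sample lands close to 0.
    print(f"{name}: {s.kurt():.2f}")
```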
Histograms
Histograms are among the most useful univariate visualizations.
Python:
import matplotlib.pyplot as plt
df["Age"].hist(bins=20)
plt.show()
Histograms help identify:
- Distribution shape
- Skewness
- Outliers
- Gaps
Density Plots
Density plots provide smoother distribution visualization.
Python:
# kind="density" relies on SciPy being installed
df["Age"].plot(kind="density")
Box Plots
Box plots help identify outliers.
Python:
import seaborn as sns
sns.boxplot(x=df["Salary"])
Understanding Box Plots
Components:
| Component | Meaning |
|---|---|
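| Box | Middle 50% of the data (Q1 to Q3) |
| Median Line | Central value (50th percentile) |
| Whiskers | By default, extend to the most extreme points within 1.5 × IQR of the box |
| Dots | Points beyond the whiskers, flagged as potential outliers |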
Detecting Outliers
Example:
| Salary |
|---|
| 30000 |
| 35000 |
| 40000 |
| 5000000 |
The final value appears as an outlier.
Outliers often require:
- Investigation
- Transformation
- Removal
- Capping
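One widely used convention for flagging outliers is the 1.5 × IQR rule, the same rule box plots use by default. The sketch below applies it to the salary example above and shows capping with `clip`. The 1.5 multiplier is a convention rather than something derived from this data, and with only four values the result is purely illustrative.

```python
import pandas as pd

salary = pd.Series([30000, 35000, 40000, 5000000], name="Salary")

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [lower, upper] are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]

# Capping replaces values beyond the fences with the fence value itself.
capped = salary.clip(lower=lower, upper=upper)

print("Outliers:", outliers.tolist())
print("Capped:  ", capped.tolist())
```

Whether to cap, transform, or remove a flagged value depends on whether it is a data error or a genuine observation, which is why investigation comes first in the list above.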
Summary Statistics
Python:
df.describe()
Output includes:
- Count
- Mean
- Standard Deviation
- Minimum
- Maximum
- Quartiles
This provides a quick overview of the feature.
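For a single feature, `describe()` can be called on the column itself, and individual statistics can be read from the result by label. The sketch below reuses the Age values from the earlier example.

```python
import pandas as pd

df = pd.DataFrame({"Age": [21, 25, 30, 35, 40]})

# describe() on one column returns a Series indexed by statistic name.
summary = df["Age"].describe()
print(summary)

# Individual entries are available by label; "50%" is the median.
median_age = summary["50%"]
age_range = summary["max"] - summary["min"]
print("Median:", median_age, "Range:", age_range)
```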
Univariate Analysis for Categorical Variables
Categorical variables require different analysis techniques.
Examples:
- Gender
- Country
- Department
Frequency Distribution
Example:
| Gender |
|---|
| Male |
| Female |
| Male |
Frequency table:
| Category | Count |
|---|---|
| Male | 2 |
| Female | 1 |
Python:
df["Gender"].value_counts()
Percentage Distribution
Python:
df["Gender"].value_counts(normalize=True)
Output (value_counts returns proportions between 0 and 1; shown here as percentages):
| Category | Percentage |
|---|---|
| Male | 66.7% |
| Female | 33.3% |
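The count and percentage tables above can be combined into a single frequency table. This sketch uses the three-row Gender example; the column labels "Count" and "Percentage" are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

counts = df["Gender"].value_counts()

# normalize=True yields proportions; multiply by 100 for percentages.
percentages = (df["Gender"].value_counts(normalize=True) * 100).round(1)

# Both Series are indexed by category, so they align row by row.
freq_table = pd.DataFrame({"Count": counts, "Percentage": percentages})
print(freq_table)
```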
Bar Charts
Bar charts are commonly used for categorical analysis.
Python:
sns.countplot(x="Gender", data=df)
Pie Charts
Pie charts display category proportions.
Python:
df["Gender"].value_counts().plot(kind="pie")
Missing Value Analysis
Univariate Analysis also includes checking missing values.
Python:
df.isnull().sum()
Understanding missingness helps determine:
- Imputation strategy
- Data quality issues
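Raw counts are easier to judge alongside the share of rows they represent. The sketch below builds a small missing-value report; the DataFrame and its gaps are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame({
    "Age": [21, np.nan, 30, 35, np.nan],
    "Gender": ["Male", "Female", None, "Male", "Female"],
})

missing_count = df.isnull().sum()
# The mean of a boolean mask is the fraction of True values.
missing_pct = df.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report)
```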
Real-World Example
Customer Dataset:
| Feature |
|---|
| Age |
| Income |
| Gender |
Univariate Analysis may reveal:
- Age is normally distributed
- Income is highly skewed
- Gender is balanced
- Income contains outliers
These insights guide preprocessing decisions.
Common Insights Obtained from Univariate Analysis
- Missing values
- Outliers
- Distribution shape
- Feature spread
- Data quality issues
- Class imbalance
- Need for transformations
Benefits of Univariate Analysis
- Simple to perform
- Easy interpretation
- Detects data quality issues early
- Improves preprocessing decisions
- Supports feature engineering
- Provides foundation for advanced EDA
Limitations of Univariate Analysis
Univariate Analysis examines only one variable.
It cannot reveal:
- Relationships between features
- Correlations
- Dependencies
- Interactions
For these insights, we need:
- Bivariate Analysis
- Multivariate Analysis
Best Practices
- Start EDA with univariate analysis
- Examine every feature individually
- Check missing values
- Analyze distributions
- Detect outliers
- Compare mean and median
- Visualize numerical and categorical variables
Univariate Analysis Workflow
A typical workflow is:
- Identify feature type
- Calculate summary statistics
- Check missing values
- Visualize distribution
- Detect outliers
- Measure skewness
- Identify transformation needs
- Document observations
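The workflow above can be sketched as one helper that summarizes a single column. `univariate_summary` is a hypothetical function written for this illustration, not a pandas API. It skips the visualization step and uses the 1.5 × IQR convention as its outlier rule, which is an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def univariate_summary(series: pd.Series) -> dict:
    """Hypothetical helper: summarize one column following the workflow steps."""
    summary = {"name": series.name, "missing": int(series.isnull().sum())}
    values = series.dropna()

    if is_numeric_dtype(series):
        summary["type"] = "numerical"
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        # Outlier rule assumed here: the 1.5 x IQR convention.
        is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        summary.update({
            "mean": values.mean(),
            "median": values.median(),
            "std": values.std(),
            "skew": values.skew(),
            "n_outliers": int(is_outlier.sum()),
        })
    else:
        summary["type"] = "categorical"
        summary["counts"] = values.value_counts().to_dict()
    return summary


# Made-up customer data for demonstration.
df = pd.DataFrame({
    "Income": [30000, 35000, 40000, 45000, 50000, 55000, 60000, 5000000],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", None],
})

report = {col: univariate_summary(df[col]) for col in df.columns}
for col, info in report.items():
    print(col, info)
```

Returning a plain dictionary per column makes it easy to collect the results into a table or a notebook cell when documenting observations.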
Why Univariate Analysis is Important
Univariate Analysis is the first step toward understanding a dataset. Before studying relationships between variables, we must understand each feature individually. It helps uncover hidden issues, identify preprocessing requirements, detect outliers, and understand data distributions.
A thorough Univariate Analysis often reveals critical insights that significantly improve feature engineering, model selection, and overall Machine Learning performance. It forms the foundation upon which all subsequent data analysis and modeling decisions are built.