Univariate Analysis in Machine Learning

Last updated: Jun 11, 2026

Author: Christy Harshitha Dakarapu

Before building Machine Learning models, it is essential to understand the data thoroughly. One of the first and most important steps in Exploratory Data Analysis (EDA) is Univariate Analysis.

As the name suggests:

Uni = One

Univariate Analysis involves analyzing a single variable at a time to understand its characteristics, distribution, spread, and patterns.

It helps answer questions such as:

What values does the feature contain?
What is the average value?
Is the feature normally distributed?
Are there outliers?
Is the data skewed?
Does the feature require transformation?

Univariate Analysis forms the foundation of EDA because understanding individual features is necessary before studying relationships between multiple features.

In this article, we will explore Univariate Analysis in detail, understand its importance, learn common statistical measures and visualizations, and implement practical examples using Python.

What is Univariate Analysis?

Univariate Analysis is the process of analyzing a single variable independently without considering relationships with other variables.

Example:

Dataset:

Age
21
25
30
35
40

Analyzing only the Age column is Univariate Analysis.

Why Univariate Analysis Matters

Before building Machine Learning models, we must understand:

Data quality
Distribution
Missing values
Outliers
Variability

Univariate Analysis provides these insights quickly.

Benefits include:

Better data understanding
Easier outlier detection
Improved feature engineering
Better preprocessing decisions
Improved model performance

Types of Variables in Univariate Analysis

Variables are generally categorized into:

Type	Examples
Numerical	Age, Salary, Height
Categorical	Gender, Country, Department

The analysis approach differs for each type.

Univariate Analysis for Numerical Variables

Numerical features can be analyzed using:

Mean
Median
Mode
Variance
Standard Deviation
Range
Skewness
Kurtosis

Measures of Central Tendency

Central tendency describes the center of the data.

Common measures include:

Mean
Median
Mode

Mean

Mean is the average value.

Formula:

$\mu=\frac{\sum x_i}{N}$

Example:

Values:

10, 20, 30, 40

Mean:

$\frac{10+20+30+40}{4} = 25$

Python:


df["Age"].mean()

Median

Median is the middle value after sorting.

Example:

Values:

10, 20, 30, 40, 50

Median:

30

Python:


df["Age"].median()

Why Median Matters

Median is less sensitive to outliers.

Example:

Values:

10, 20, 30, 40, 1000

Mean:

220

Median:

30

The single extreme value (1000) pulls the mean up to 220, while the median stays at 30, so the median better represents the typical value.
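The comparison above can be reproduced with a minimal sketch. The Series name values is an illustrative choice, not something from a real dataset.

```python
import pandas as pd

# Same values as the example above; 1000 is the extreme value.
values = pd.Series([10, 20, 30, 40, 1000])

mean_value = values.mean()      # pulled upward by the extreme value
median_value = values.median()  # middle value of the sorted data

print(f"Mean: {mean_value}, Median: {median_value}")
```

A large gap between the two numbers is a quick hint that the feature may contain outliers or be skewed.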

Mode

Mode is the most frequently occurring value.

Example:

Values:

10, 20, 20, 30

Mode:

20

Python:


df["Age"].mode()
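One detail worth knowing is that mode() in pandas returns a Series rather than a single number, because several values can be tied. The sketch below uses small made-up Series (ages and tied are illustrative names) to show both cases.

```python
import pandas as pd

# Single most frequent value, as in the example above.
ages = pd.Series([10, 20, 20, 30])
modes = ages.mode()              # Series holding every most-frequent value
print(modes.tolist())

# Two values tied for most frequent: both are returned, sorted ascending.
tied = pd.Series([10, 10, 20, 20, 30])
tied_modes = tied.mode()
print(tied_modes.tolist())

# Take the first entry when a single representative value is needed.
first_mode = tied_modes.iloc[0]
```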

Measures of Dispersion

Dispersion measures how spread out the data is.

Common measures include:

Range
Variance
Standard Deviation
Interquartile Range (IQR)

Range

Formula:

Range = Maximum - Minimum

Example:

Values:

10, 20, 30, 40

Range:

40-10=30

Python:


df["Age"].max() - df["Age"].min()

Variance

Variance measures average squared deviation from the mean.

Formula:

$Variance=\frac{\sum(x_i-\mu)^2}{N}$

Higher variance indicates greater spread.

Python:


df["Age"].var()  # sample variance by default (divides by N - 1); pass ddof=0 to match the formula above
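The formula above divides by N (population variance), while var() in pandas divides by N - 1 by default. The following sketch, using the illustrative values 10, 20, 30, 40, compares the two and checks the population result against a manual calculation.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

# Manual calculation following the formula: mean of squared deviations.
manual_var = ((values - values.mean()) ** 2).sum() / len(values)

population_var = values.var(ddof=0)  # divides by N, matches the formula
sample_var = values.var()            # default ddof=1, divides by N - 1

print(manual_var, population_var, sample_var)
```

For large datasets the two results are almost identical; the difference matters mainly for small samples.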

Standard Deviation

Standard deviation is the square root of variance.

Formula:

$\sigma=\sqrt{Variance}$

Interpretation:

Low standard deviation → values close to mean
High standard deviation → values spread out

Python:


df["Age"].std()  # sample standard deviation by default (ddof=1); pass ddof=0 for the population value

Interquartile Range (IQR)

IQR measures spread of the middle 50% of data.

Formula:

$IQR=Q3-Q1$

Where:

Q1 = 25th percentile
Q3 = 75th percentile

Python:


Q1 = df["Age"].quantile(0.25)

Q3 = df["Age"].quantile(0.75)

IQR = Q3 - Q1

Understanding Data Distribution

One of the main goals of Univariate Analysis is understanding distributions.

Common distributions include:

Normal Distribution
Right-Skewed Distribution
Left-Skewed Distribution
Uniform Distribution

Normal Distribution

A Normal Distribution is symmetric around the mean.

Characteristics:

Bell-shaped curve
Mean ≈ Median ≈ Mode

Example:

Human heights often follow a normal distribution.

Visualizing Normal Distribution

Histogram:


df["Height"].hist()

A roughly bell-shaped histogram suggests the data is approximately normal, although a histogram alone does not confirm normality.

Skewness

Skewness measures asymmetry.

Skewness = 0 for a perfectly symmetric distribution. Positive values indicate a longer right tail, and negative values indicate a longer left tail.

Right Skew (Positive Skew)

Characteristics:

Long tail on the right
Mean > Median

Examples:

Income
House prices
Revenue

Python:


df["Income"].skew()

Left Skew (Negative Skew)

Characteristics:

Long tail on the left
Mean < Median

Examples:

Retirement age
Exam scores in easy exams

Interpreting Skewness

Skewness Value	Interpretation
0	Symmetric
> 0	Right Skewed
< 0	Left Skewed

Why Skewness Matters

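Some Machine Learning algorithms, such as linear models and distance-based methods, often perform better when features are roughly symmetric rather than heavily skewed. Tree-based models are generally much less sensitive to skew.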

Skewed data often requires:

Log Transformation
Square Root Transformation
Box-Cox Transformation (for strictly positive values)
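As a rough illustration of the first option, the sketch below applies a log transformation to a made-up, right-skewed income Series and compares skewness before and after. The data and variable names are assumptions for demonstration only, and np.log1p (log of 1 + x) is used so that zero values would not cause errors.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes that double at each step, giving a long right tail.
income = pd.Series([1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000])

skew_before = income.skew()    # clearly positive (right-skewed)

log_income = np.log1p(income)  # log(1 + x), safe for zeros
skew_after = log_income.skew() # close to 0 for this data

print(f"Before: {skew_before:.2f}, After: {skew_after:.2f}")
```

Real data will rarely become perfectly symmetric, so it is worth re-checking the skewness and the histogram after any transformation.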

Kurtosis

Kurtosis measures the heaviness of tails.

Kurtosis describes the likelihood of extreme values. The kurt() method in pandas returns excess kurtosis, which is 0 for a normal distribution.

Python:


df["Salary"].kurt()

Interpreting Kurtosis

Value	Interpretation
0	Normal-like
> 0	Heavy Tails
< 0	Light Tails

Higher kurtosis often indicates more outliers.
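To make the sign of the value concrete, here is a small sketch with two made-up Series: one evenly spread with no extreme values, and one tightly clustered with a single extreme value. The names flat and heavy are illustrative.

```python
import pandas as pd

# Evenly spread values with no extremes: light tails.
flat = pd.Series(range(1, 11))

# Tightly clustered values plus one extreme value: heavy tail.
heavy = pd.Series([50, 51, 49, 50, 52, 48, 50, 51, 49, 500])

flat_kurt = flat.kurt()    # negative excess kurtosis
heavy_kurt = heavy.kurt()  # positive excess kurtosis

print(f"Flat: {flat_kurt:.2f}, Heavy: {heavy_kurt:.2f}")
```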

Histograms

Histograms are among the most useful univariate visualizations.

Python:


import matplotlib.pyplot as plt

df["Age"].hist(
    bins=20
)

plt.show()

Histograms help identify:

Distribution shape
Skewness
Outliers
Gaps
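A histogram is essentially a count of values per bin. The sketch below computes those counts directly with NumPy for the small Age example from earlier; the bin edges 20, 30, 40, 50 are an arbitrary choice for illustration.

```python
import numpy as np

ages = [21, 25, 30, 35, 40]

# Each bin includes its left edge; only the last bin includes its right edge.
counts, edges = np.histogram(ages, bins=[20, 30, 40, 50])

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Changing the number or width of bins can noticeably change the apparent shape, so it helps to try a few settings before drawing conclusions.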

Density Plots

Density plots provide smoother distribution visualization.

Python:


df["Age"].plot(
    kind="density"  # requires SciPy for the kernel density estimate
)

Box Plots

Box plots help identify outliers.

Python:


import seaborn as sns

sns.boxplot(
    x=df["Salary"]
)

Understanding Box Plots

Components:

Component	Meaning
Box	Middle 50% (Q1 to Q3)
Median Line	Central value
Whiskers	Range of non-outlier values (by default up to 1.5 × IQR beyond the box)
Dots	Outliers

Detecting Outliers

Example:

Salary
30000
35000
40000
5000000

The final value appears as an outlier.

Outliers often require:

Investigation
Transformation
Removal
Capping
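One common way to flag outliers numerically is the 1.5 × IQR rule, the same rule box plots use by default. The sketch below applies it to the Salary example above and also shows capping; it is one possible approach under that assumption, not the only valid definition of an outlier.

```python
import pandas as pd

salary = pd.Series([30000, 35000, 40000, 5000000])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salary[(salary < lower) | (salary > upper)]
print(outliers.tolist())

# Capping: pull extreme values back to the nearest fence.
capped = salary.clip(lower=lower, upper=upper)
```

With only four values the quartiles are rough estimates, so treat this as a demonstration of the mechanics rather than a reliable test on such a small sample.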

Summary Statistics

Python:


df.describe()

Output includes:

Count
Mean
Standard Deviation
Minimum
Maximum
Quartiles

This provides a quick overview of every numerical column in the DataFrame. To summarize a single feature, call describe() on that column.
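As a small illustration, the sketch below builds a DataFrame from the Age example used earlier and summarizes that one column. The DataFrame here is constructed only for demonstration.

```python
import pandas as pd

df = pd.DataFrame({"Age": [21, 25, 30, 35, 40]})

# describe() returns a Series indexed by statistic name.
summary = df["Age"].describe()
print(summary)

# Individual statistics can be read by label; "50%" is the median.
median_age = summary["50%"]
```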

Univariate Analysis for Categorical Variables

Categorical variables require different analysis techniques.

Examples:

Gender
Country
Department

Frequency Distribution

Example:

Gender
Male
Female
Male

Frequency table:

Category	Count
Male	2
Female	1

Python:


df["Gender"].value_counts()

Percentage Distribution

Python:


df["Gender"].value_counts(
    normalize=True
)

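Output (value_counts(normalize=True) returns fractions such as 0.667; multiply by 100 to get the percentages shown here):

Category	Percentage
Male	66.7%
Female	33.3%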
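The frequency and percentage tables above can be reproduced with a short sketch. The three-row DataFrame is built here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# Absolute counts per category.
counts = df["Gender"].value_counts()

# normalize=True gives fractions; scale and round to get percentages.
percentages = (df["Gender"].value_counts(normalize=True) * 100).round(1)

print(counts)
print(percentages)
```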

Bar Charts

Bar charts are commonly used for categorical analysis.

Python:


sns.countplot(
    x="Gender",
    data=df
)

Pie Charts

Pie charts display category proportions.

Python:


df["Gender"].value_counts().plot(
    kind="pie"
)

Missing Value Analysis

Univariate Analysis also includes checking missing values.

Python:


df.isnull().sum()

Understanding missingness helps determine:

Imputation strategy
Data quality issues
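Counts alone can be hard to judge, so it often helps to look at the share of missing values per column as well. The sketch below uses a small made-up DataFrame with a few missing entries.

```python
import pandas as pd

# Hypothetical data with missing entries in both columns.
df = pd.DataFrame({
    "Age": [21, None, 30, 35],
    "Gender": ["Male", "Female", None, None],
})

missing_count = df.isnull().sum()           # missing values per column
missing_percent = df.isnull().mean() * 100  # share of rows that are missing

print(missing_count)
print(missing_percent)
```

A column with only a few missing values may be a candidate for simple imputation, while a mostly empty column may point to a data collection problem.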

Real-World Example

Customer Dataset:

Feature
Age
Income
Gender

Univariate Analysis may reveal:

Age is normally distributed
Income is highly skewed
Gender is balanced
Income contains outliers

These insights guide preprocessing decisions.

Common Insights Obtained from Univariate Analysis

Missing values
Outliers
Distribution shape
Feature spread
Data quality issues
Class imbalance
Need for transformations

Benefits of Univariate Analysis

Simple to perform
Easy interpretation
Detects data quality issues early
Improves preprocessing decisions
Supports feature engineering
Provides foundation for advanced EDA

Limitations of Univariate Analysis

Univariate Analysis examines only one variable.

It cannot reveal:

Relationships between features
Correlations
Dependencies
Interactions

For these insights, we need:

Bivariate Analysis
Multivariate Analysis

Best Practices

Start EDA with univariate analysis
Examine every feature individually
Check missing values
Analyze distributions
Detect outliers
Compare mean and median
Visualize numerical and categorical variables

Univariate Analysis Workflow

A typical workflow is:

Identify feature type
Calculate summary statistics
Check missing values
Visualize distribution
Detect outliers
Measure skewness
Identify transformation needs
Document observations
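The workflow above could be sketched as a small helper that reports on one column at a time. The function name summarize_column, the report keys, and the sample data are all hypothetical, and the 1.5 × IQR rule is just one possible choice for flagging outliers.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_column(series: pd.Series) -> dict:
    """Collect basic univariate facts about a single column."""
    report = {
        "type": "numerical" if is_numeric_dtype(series) else "categorical",
        "missing": int(series.isnull().sum()),
    }
    if report["type"] == "numerical":
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        # Flag values outside the 1.5 * IQR fences.
        is_outlier = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
        report.update(
            mean=series.mean(),
            median=series.median(),
            skew=series.skew(),
            outliers=int(is_outlier.sum()),
        )
    else:
        # Missing values are excluded from the category counts.
        report["counts"] = series.value_counts().to_dict()
    return report


# Hypothetical customer data for demonstration.
df = pd.DataFrame({
    "Income": [30000, 35000, 40000, 45000, 5000000],
    "Gender": ["Male", "Female", "Male", None, "Female"],
})

reports = {name: summarize_column(df[name]) for name in df.columns}
for name, report in reports.items():
    print(name, report)
```

Visual checks such as histograms and box plots are left out of this sketch and would still be done alongside it.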

Why Univariate Analysis is Important

Univariate Analysis is the first step toward understanding a dataset. Before studying relationships between variables, we must understand each feature individually. It helps uncover hidden issues, identify preprocessing requirements, detect outliers, and understand data distributions.

A thorough Univariate Analysis often reveals insights that inform feature engineering, preprocessing, and model selection, which in turn can improve Machine Learning performance. It forms the foundation for subsequent data analysis and modeling decisions.