Before building Machine Learning models, it is essential to understand the data thoroughly. One of the first and most important steps in Exploratory Data Analysis (EDA) is Univariate Analysis.

As the name suggests:

Uni = One

Univariate Analysis involves analyzing a single variable at a time to understand its characteristics, distribution, spread, and patterns.

It helps answer questions such as:

  • What values does the feature contain?
  • What is the average value?
  • Is the feature normally distributed?
  • Are there outliers?
  • Is the data skewed?
  • Does the feature require transformation?

Univariate Analysis forms the foundation of EDA because understanding individual features is necessary before studying relationships between multiple features.

In this article, we will explore Univariate Analysis in detail, understand its importance, learn common statistical measures and visualizations, and implement practical examples using Python.

What is Univariate Analysis?

Univariate Analysis is the process of analyzing a single variable independently without considering relationships with other variables.

Example:

Dataset:

Age
21
25
30
35
40

Analyzing only the Age column is Univariate Analysis.

Why Univariate Analysis Matters

Before building Machine Learning models, we must understand:

  • Data quality
  • Distribution
  • Missing values
  • Outliers
  • Variability

Univariate Analysis provides these insights quickly.

Benefits include:

  • Better data understanding
  • Easier outlier detection
  • Improved feature engineering
  • Better preprocessing decisions
  • Improved model performance

Types of Variables in Univariate Analysis

Variables are generally categorized into:

Type | Examples
Numerical | Age, Salary, Height
Categorical | Gender, Country, Department

The analysis approach differs for each type.

Univariate Analysis for Numerical Variables

Numerical features can be analyzed using:

  • Mean
  • Median
  • Mode
  • Variance
  • Standard Deviation
  • Range
  • Skewness
  • Kurtosis

Measures of Central Tendency

Central tendency describes the center of the data.

Common measures include:

  • Mean
  • Median
  • Mode

Mean

Mean is the average value.

Formula:

$$\mu = \frac{\sum x_i}{N}$$

Example:

Values:

10, 20, 30, 40

Mean:

$$\frac{10+20+30+40}{4} = 25$$

Python:

df["Age"].mean()

Median

Median is the middle value after sorting.

Example:

Values:

10, 20, 30, 40, 50

Median:

30

Python:

df["Age"].median()

Why Median Matters

Median is less sensitive to outliers.

Example:

Values:

10, 20, 30, 40, 1000

Mean:

220

Median:

30

Median better represents the typical value.
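The comparison above can be reproduced in a few lines. This is a minimal sketch, assuming pandas is installed; the variable names are only illustrative.

```python
import pandas as pd

# Same values as the example above, including the extreme value 1000
values = pd.Series([10, 20, 30, 40, 1000])

mean_value = values.mean()      # pulled upward by the single extreme value
median_value = values.median()  # middle value after sorting, unaffected by it

print("Mean:", mean_value)
print("Median:", median_value)
```

A large gap between the two numbers is a quick hint that the feature contains extreme values or is skewed.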

Mode

Mode is the most frequently occurring value.

Example:

Values:

10, 20, 20, 30

Mode:

20

Python:

df["Age"].mode()
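One detail worth knowing is that pandas returns the mode as a Series rather than a single number, because several values can tie for the highest frequency. The sketch below illustrates this with made-up values.

```python
import pandas as pd

# One clear winner: 20 appears twice
single_mode = pd.Series([10, 20, 20, 30]).mode()

# A tie: 10 and 20 both appear twice, so both are returned (sorted)
tied_modes = pd.Series([10, 10, 20, 20, 30]).mode()

print(single_mode.tolist())
print(tied_modes.tolist())
```

If a single value is needed, taking the first element with `.iloc[0]` is a common choice, though it silently ignores ties.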

Measures of Dispersion

Dispersion measures how spread out the data is.

Common measures include:

  • Range
  • Variance
  • Standard Deviation
  • Interquartile Range (IQR)

Range

Formula:

$$\text{Range} = \text{Maximum} - \text{Minimum}$$

Example:

Values:

10, 20, 30, 40

Range:

$$40 - 10 = 30$$

Python:

df["Age"].max() - df["Age"].min()

Variance

Variance measures average squared deviation from the mean.

Formula:

$$\text{Variance} = \frac{\sum (x_i - \mu)^2}{N}$$

Higher variance indicates greater spread.

Python:

df["Age"].var()  # pandas default is the sample variance (divides by N - 1); pass ddof=0 to match the formula above

Standard Deviation

Standard deviation is the square root of variance.

Formula:

$$\sigma = \sqrt{\text{Variance}}$$

Interpretation:

  • Low standard deviation → values close to mean
  • High standard deviation → values spread out

Python:

df["Age"].std()
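The variance formula above divides by N (the population form), while pandas divides by N - 1 by default (the sample form). The following sketch, using the values 10, 20, 30, 40 from the earlier examples, shows how the `ddof` argument switches between the two.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

# Population form: divide the summed squared deviations by N
population_var = values.var(ddof=0)
population_std = values.std(ddof=0)

# Sample form (pandas default, ddof=1): divide by N - 1
sample_var = values.var()
sample_std = values.std()

print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("Population std:", population_std)
print("Sample std:", sample_std)
```

For large datasets the difference is negligible, but on small samples like this one it is clearly visible.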

Interquartile Range (IQR)

IQR measures spread of the middle 50% of data.

Formula:

$$\text{IQR} = Q3 - Q1$$

Where:

  • Q1 = 25th percentile
  • Q3 = 75th percentile

Python:

Q1 = df["Age"].quantile(0.25)

Q3 = df["Age"].quantile(0.75)

IQR = Q3 - Q1

Understanding Data Distribution

One of the main goals of Univariate Analysis is understanding distributions.

Common distributions include:

  • Normal Distribution
  • Right-Skewed Distribution
  • Left-Skewed Distribution
  • Uniform Distribution

Normal Distribution

A Normal Distribution is symmetric around the mean.

Characteristics:

  • Bell-shaped curve
  • Mean ≈ Median ≈ Mode

Example:

Human heights often follow a normal distribution.

Visualizing Normal Distribution

Histogram:

df["Height"].hist()

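A roughly bell-shaped, symmetric histogram suggests the feature is approximately normal, although a histogram alone is not a formal test of normality.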

Skewness

Skewness measures asymmetry.

Reference value:

Skewness = 0

for a perfectly symmetric distribution.

Right Skew (Positive Skew)

Characteristics:

  • Long tail on the right
  • Mean > Median

Examples:

  • Income
  • House prices
  • Revenue

Python:

df["Income"].skew()

Left Skew (Negative Skew)

Characteristics:

  • Long tail on the left
  • Mean < Median

Examples:

  • Retirement age
  • Exam scores in easy exams

Interpreting Skewness

Skewness Value | Interpretation
0 | Symmetric
> 0 | Right Skewed
< 0 | Left Skewed

Why Skewness Matters

Many Machine Learning algorithms, particularly linear and distance-based models, often perform better when features are roughly symmetric rather than heavily skewed.

Skewed data often benefits from:

  • Log Transformation (requires positive values)
  • Square Root Transformation (requires non-negative values)
  • Box-Cox Transformation (requires strictly positive values)
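As a rough illustration of how a transformation reduces skew, the sketch below applies a log transformation to a made-up income column. The values are invented for the example, and `np.log1p` (log of 1 + x) is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one very large value creating a long right tail
income = pd.Series(
    [20000, 25000, 30000, 35000, 40000, 50000, 60000, 80000, 120000, 500000]
)

skew_before = income.skew()

# log1p computes log(1 + x), which also handles zeros safely
log_income = np.log1p(income)
skew_after = log_income.skew()

print("Skewness before:", skew_before)
print("Skewness after log transform:", skew_after)
```

The transformed feature is still right-skewed here, but much less so. Whether that is enough depends on the model and should be checked on the real data.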

Kurtosis

Kurtosis measures the heaviness of tails.

Kurtosis describes the likelihood of extreme values.

Python:

df["Salary"].kurt()

Interpreting Kurtosis

Excess Kurtosis Value (as returned by pandas) | Interpretation
0 | Normal-like
> 0 | Heavy Tails
< 0 | Light Tails

Higher kurtosis often indicates more outliers.
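To see the sign of kurtosis change, the sketch below compares two small made-up series: one evenly spread, and one where a single value is replaced by an extreme one. Note that pandas reports excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

# Evenly spread values: no extreme observations, light tails
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Same values, but the last one is replaced by an extreme observation
with_extreme = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

flat_kurt = flat.kurt()             # negative: lighter tails than a normal
extreme_kurt = with_extreme.kurt()  # positive: heavy tail from the extreme value

print("Flat series kurtosis:", flat_kurt)
print("Series with extreme value kurtosis:", extreme_kurt)
```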

Histograms

Histograms are among the most useful univariate visualizations.

Python:

import matplotlib.pyplot as plt

df["Age"].hist(bins=20)

plt.show()

Histograms help identify:

  • Distribution shape
  • Skewness
  • Outliers
  • Gaps
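A histogram is simply a count of values per bin, so the numbers behind the plot can be inspected without drawing anything. The sketch below uses `numpy.histogram` on the small Age example from earlier; the choice of 4 bins is arbitrary.

```python
import numpy as np

ages = [21, 25, 30, 35, 40]

# Split the range 21..40 into 4 equal-width bins and count values in each
counts, bin_edges = np.histogram(ages, bins=4)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```

Changing the number of bins can change the apparent shape considerably, so it is worth trying a few values before drawing conclusions.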

Density Plots

Density plots provide smoother distribution visualization.

Python:

df["Age"].plot(kind="density")  # kernel density estimate; requires SciPy

Box Plots

Box plots help identify outliers.

Python:

import seaborn as sns

sns.boxplot(x=df["Salary"])

Understanding Box Plots

Components:

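Component | Meaning
Box | Middle 50% of the data (Q1 to Q3)
Median Line | Central value
Whiskers | Spread of the data outside the box (by default, up to 1.5 × IQR beyond the quartiles)
Dots | Potential outliers beyond the whiskers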

Detecting Outliers

Example:

Salary
30000
35000
40000
5000000

The final value appears as an outlier.

Outliers often require:

  • Investigation
  • Transformation
  • Removal
  • Capping
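One common way to flag outliers numerically is the 1.5 × IQR rule, which is also what box plots use by default. The sketch below applies it to the Salary example. The 1.5 multiplier is a convention rather than something prescribed here, and capping is shown only as one of the possible treatments.

```python
import pandas as pd

salary = pd.Series([30000, 35000, 40000, 5000000])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are treated as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salary[(salary < lower_fence) | (salary > upper_fence)]
print("Outliers:", outliers.tolist())

# Capping: pull extreme values back to the nearest fence
capped = salary.clip(lower=lower_fence, upper=upper_fence)
print("Capped values:", capped.tolist())
```

With only four rows the quartiles are themselves affected by the extreme value, so on real data the rule is more reliable with larger samples. Flagged values should be investigated before they are removed or capped.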

Summary Statistics

Python:

df.describe()

Output includes:

  • Count
  • Mean
  • Standard Deviation
  • Minimum
  • Maximum
  • Quartiles

This provides a quick overview of the feature.
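For a single feature, `describe()` can be called on the column itself, and individual statistics can be read from the result by label. A small sketch using the Age values from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({"Age": [21, 25, 30, 35, 40]})

# describe() on one column returns a Series indexed by statistic name
summary = df["Age"].describe()
print(summary)

# Individual entries are available by label; "50%" is the median
median_age = summary["50%"]
age_range = summary["max"] - summary["min"]
print("Median:", median_age, "Range:", age_range)
```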

Univariate Analysis for Categorical Variables

Categorical variables require different analysis techniques.

Examples:

  • Gender
  • Country
  • Department

Frequency Distribution

Example:

Gender
Male
Female
Male

Frequency table:

Category | Count
Male | 2
Female | 1

Python:

df["Gender"].value_counts()

Percentage Distribution

Python:

df["Gender"].value_counts(normalize=True)

Output (pandas returns proportions between 0 and 1, shown here as percentages):

Category | Percentage
Male | 66.7%
Female | 33.3%
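The frequency and percentage tables above can be produced together. In this sketch the Gender values come from the example, and multiplying by 100 and rounding is just one way to present the proportions.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Male"], name="Gender")

# Absolute frequency of each category
counts = gender.value_counts()

# normalize=True gives proportions; scale to percentages for readability
percentages = (gender.value_counts(normalize=True) * 100).round(1)

print(counts)
print(percentages)
```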

Bar Charts

Bar charts are commonly used for categorical analysis.

Python:

sns.countplot(x="Gender", data=df)

Pie Charts

Pie charts display category proportions.

Python:

df["Gender"].value_counts().plot(kind="pie")

Missing Value Analysis

Univariate Analysis also includes checking missing values.

Python:

df.isnull().sum()

Understanding missingness helps determine:

  • Imputation strategy
  • Data quality issues
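Counts of missing values are easier to judge next to the share of rows they represent. The sketch below builds a small made-up DataFrame with gaps and reports both.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in both columns
df = pd.DataFrame({
    "Age": [21, np.nan, 30, 35, np.nan],
    "Gender": ["Male", "Female", None, "Male", "Female"],
})

# Number of missing values per column
missing_counts = df.isnull().sum()

# Share of missing values per column (mean of a True/False mask)
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column missing 40% of its values usually calls for a different strategy than one missing a handful of rows.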

Real-World Example

Customer Dataset:

Feature
Age
Income
Gender

Univariate Analysis may reveal:

  • Age is normally distributed
  • Income is highly skewed
  • Gender is balanced
  • Income contains outliers

These insights guide preprocessing decisions.

Common Insights Obtained from Univariate Analysis

  • Missing values
  • Outliers
  • Distribution shape
  • Feature spread
  • Data quality issues
  • Class imbalance
  • Need for transformations

Benefits of Univariate Analysis

  • Simple to perform
  • Easy interpretation
  • Detects data quality issues early
  • Improves preprocessing decisions
  • Supports feature engineering
  • Provides foundation for advanced EDA

Limitations of Univariate Analysis

Univariate Analysis examines only one variable.

It cannot reveal:

  • Relationships between features
  • Correlations
  • Dependencies
  • Interactions

For these insights, we need:

  • Bivariate Analysis
  • Multivariate Analysis

Best Practices

  • Start EDA with univariate analysis
  • Examine every feature individually
  • Check missing values
  • Analyze distributions
  • Detect outliers
  • Compare mean and median
  • Visualize numerical and categorical variables

Univariate Analysis Workflow

A typical workflow is:

  1. Identify feature type
  2. Calculate summary statistics
  3. Check missing values
  4. Visualize distribution
  5. Detect outliers
  6. Measure skewness
  7. Identify transformation needs
  8. Document observations
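The workflow above can be condensed into a small helper that is run once per column. This is only a sketch: `univariate_summary` is a hypothetical function name, the 1.5 × IQR outlier rule is an assumed convention, and the sample data is invented.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def univariate_summary(series: pd.Series) -> dict:
    """Collect basic univariate facts about one column (hypothetical helper)."""
    summary = {
        "name": series.name,
        "count": int(series.count()),           # non-missing values
        "missing": int(series.isnull().sum()),  # missing values
    }
    if is_numeric_dtype(series):
        # Numerical feature: center, spread, shape, and IQR-based outliers
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        outlier_mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
        summary.update({
            "type": "numerical",
            "mean": series.mean(),
            "median": series.median(),
            "std": series.std(),
            "skew": series.skew(),
            "outliers": int(outlier_mask.sum()),
        })
    else:
        # Categorical feature: number of categories and their shares
        summary.update({
            "type": "categorical",
            "n_categories": int(series.nunique()),
            "shares": series.value_counts(normalize=True).to_dict(),
        })
    return summary


df = pd.DataFrame({
    "Salary": [30000, 35000, 40000, 5000000],
    "Gender": ["Male", "Female", "Male", None],
})

salary_summary = univariate_summary(df["Salary"])
gender_summary = univariate_summary(df["Gender"])
print(salary_summary)
print(gender_summary)
```

The returned dictionaries cover the numeric steps of the workflow. Visual checks and written observations still need to be done separately for each feature.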

Why Univariate Analysis is Important

Univariate Analysis is the first step toward understanding a dataset. Before studying relationships between variables, we must understand each feature individually. It helps uncover hidden issues, identify preprocessing requirements, detect outliers, and understand data distributions.

A thorough Univariate Analysis often reveals critical insights that significantly improve feature engineering, model selection, and overall Machine Learning performance. It forms the foundation upon which all subsequent data analysis and modeling decisions are built.