Introduction

Clustering is one of the most important techniques in Unsupervised Learning. The primary objective of clustering is to group similar data points together while ensuring that points belonging to different groups remain as distinct as possible.

Many clustering algorithms exist, with K-Means being one of the most widely used. However, K-Means has certain limitations. It requires the number of clusters to be specified beforehand and assumes that clusters are roughly spherical in shape.

In many real-world scenarios, the true number of clusters is unknown. Additionally, data may contain nested relationships where some groups can be further divided into smaller subgroups. In such situations, a more flexible clustering technique is required.

This is where Hierarchical Clustering becomes useful.

Hierarchical Clustering is a clustering technique that builds a hierarchy of clusters using a tree-like structure. Instead of creating a single partition of the data, it generates multiple levels of clustering, allowing analysts to explore relationships at different levels of granularity.

Hierarchical Clustering is widely used in customer segmentation, biological taxonomy, document clustering, image analysis, and gene expression studies.

In this article, we will explore Hierarchical Clustering in detail, understand its working principles, examine dendrograms, learn about different linkage methods, and discuss its advantages and limitations.

What is Hierarchical Clustering?

Hierarchical Clustering is an unsupervised learning algorithm that groups data points into a hierarchy of clusters.

Unlike K-Means, which directly creates a predefined number of clusters, Hierarchical Clustering builds clusters gradually.

The algorithm organizes data into a tree-like structure called a:

Dendrogram

The dendrogram shows how clusters are formed and merged throughout the clustering process.

The primary advantage of Hierarchical Clustering is that it does not require the number of clusters to be specified in advance.

Why is it Called Hierarchical Clustering?

The term "hierarchical" comes from the fact that clusters are arranged in a hierarchy.

Consider the following example:

Living Organisms

Animals

Mammals

Dogs

Each level represents a more detailed grouping.

Similarly, Hierarchical Clustering organizes data into multiple levels of clusters rather than producing a single flat grouping.

Types of Hierarchical Clustering

Hierarchical Clustering can be divided into two main categories.

Agglomerative Hierarchical Clustering

Agglomerative clustering follows a:

Bottom-Up Approach

Initially:

Each Data Point
Is Its Own Cluster

Clusters are gradually merged until all observations belong to a single cluster.

This is the most commonly used form of Hierarchical Clustering.

Divisive Hierarchical Clustering

Divisive clustering follows a:

Top-Down Approach

Initially:

All Data Points
Belong To One Cluster

The algorithm repeatedly splits clusters until individual points remain.

Although conceptually important, divisive clustering is less commonly used because of its higher computational complexity.

Agglomerative Hierarchical Clustering

Agglomerative clustering starts with every data point treated as an individual cluster.

Suppose we have five observations:

A

B

C

D

E

Initially:

{A}

{B}

{C}

{D}

{E}

Each observation forms its own cluster.

The algorithm then repeatedly merges the closest clusters.

Working of Agglomerative Clustering

The clustering process follows several steps.

Step 1: Compute Distance Matrix

The distance between every pair of observations is calculated.

Example:

ABCD
A02610
B2059
C6504
D10940

This matrix represents pairwise distances.

Step 2: Merge Closest Clusters

The smallest distance is identified.

Suppose:

Distance(A,B)=2

Since A and B are closest, they are merged.

New cluster:

{A,B}

Step 3: Update Distance Matrix

Distances between the newly formed cluster and all remaining clusters are recalculated.

Step 4: Repeat

The algorithm continues merging the closest clusters until only one cluster remains.

The sequence of merges forms a hierarchy.

Understanding the Dendrogram

The most important output of Hierarchical Clustering is the dendrogram.

A dendrogram is a tree diagram that visualizes how clusters are merged.

Example:

      ───────
| |
──── ────
| | | |
A B C D

The vertical axis represents the distance at which clusters are merged.

The horizontal axis represents the observations.

Dendrograms help determine the appropriate number of clusters.

How to Read a Dendrogram

In a dendrogram:

  • Lower merges indicate highly similar observations.
  • Higher merges indicate less similar observations.

Consider:

A and B Merge At Distance 2

C and D Merge At Distance 3

Clusters Merge At Distance 10

This suggests:

  • A and B are very similar.
  • C and D are very similar.
  • The two groups are relatively different.

Determining the Number of Clusters

One major advantage of Hierarchical Clustering is that the number of clusters does not need to be specified beforehand.

The dendrogram helps determine the optimal number of clusters.

The process involves drawing a horizontal line through the dendrogram.

Example:

Horizontal Cut

The number of branches intersected by the line corresponds to the number of clusters.

This approach provides flexibility that is not available in K-Means.

Distance Measures in Hierarchical Clustering

Distance metrics determine how similarity is measured.

Common choices include:

Euclidean Distance

Most commonly used.

Measures straight-line distance.

Manhattan Distance

Measures distance along grid-like paths.

Cosine Distance

Often used for text data and embeddings.

Measures similarity based on vector orientation.

The choice of distance metric can significantly affect clustering results.

Linkage Criteria

When clusters contain multiple points, determining the distance between clusters becomes challenging.

Different linkage methods solve this problem differently.

Single Linkage

Single linkage uses the minimum distance between observations from different clusters.

Closest Pair Of Points

determines cluster distance.

Advantages:

  • Can discover irregular cluster shapes.

Disadvantages:

  • Sensitive to noise.
  • May produce chaining effects.

Complete Linkage

Complete linkage uses the maximum distance between observations from different clusters.

Farthest Pair Of Points

determines cluster distance.

Advantages:

  • Produces compact clusters.

Disadvantages:

  • Sensitive to outliers.

Average Linkage

Average linkage computes the average distance between all observations in two clusters.

Advantages:

  • Balanced clustering.
  • Less sensitive to extreme values.

This is one of the most commonly used linkage methods.

Centroid Linkage

Distance is measured between cluster centroids.

Cluster Center

is used for comparisons.

Often produces intuitive clusters.

Ward's Linkage

Ward's method minimizes the increase in within-cluster variance after merging.

Advantages:

  • Produces compact clusters.
  • Often generates high-quality results.

Ward's method is widely used in practice.

Example of Hierarchical Clustering

Suppose a company wants to segment customers based on:

  • Age
  • Income
  • Spending Score

Initially:

Each Customer
Forms A Separate Cluster

Customers with similar purchasing behavior gradually merge into groups.

The final dendrogram may reveal:

Budget Customers

Regular Customers

Premium Customers

This segmentation can support targeted marketing campaigns.

Hierarchical Clustering vs K-Means

Although both are clustering techniques, they differ significantly.

FeatureHierarchical ClusteringK-Means
Number of Clusters Required InitiallyNoYes
OutputDendrogramCluster Labels
Cluster ShapeFlexibleMostly Spherical
Computational ComplexityHigherLower
ScalabilityLimitedBetter
InterpretabilityHighModerate

Hierarchical Clustering provides greater interpretability but is less scalable.

Advantages of Hierarchical Clustering

No Need to Specify K

The number of clusters can be chosen after examining the dendrogram.

Highly Interpretable

The dendrogram provides valuable insight into cluster relationships.

Flexible Cluster Structures

Can discover clusters of different shapes and sizes.

Works Well for Small and Medium Datasets

Often produces meaningful results for moderate-sized datasets.

Reveals Nested Relationships

Captures hierarchical structures naturally.

Limitations of Hierarchical Clustering

Computationally Expensive

Time complexity increases rapidly with dataset size.

Memory Intensive

Requires storing a distance matrix.

Sensitive to Noise

Outliers can significantly influence clustering results.

Difficult to Correct Mistakes

Once clusters are merged, the decision cannot be reversed.

Limited Scalability

Large datasets can become impractical to process.

Applications of Hierarchical Clustering

Hierarchical Clustering is widely used across many domains.

Customer Segmentation

Grouping customers based on behavior and demographics.

Biology

Classifying organisms and species.

Genomics

Analyzing gene expression patterns.

Document Clustering

Grouping similar documents.

Image Analysis

Identifying similar image groups.

Social Network Analysis

Discovering communities within networks.

Market Research

Understanding consumer behavior.

Real-World Example: Gene Expression Analysis

In bioinformatics, researchers analyze thousands of genes simultaneously.

Genes with similar expression patterns may perform related biological functions.

Hierarchical Clustering helps group these genes into meaningful clusters.

The dendrogram allows scientists to visualize relationships among genes and identify biological pathways.