Introduction
Clustering is one of the most important techniques in Unsupervised Learning. The primary objective of clustering is to group similar data points together while ensuring that points belonging to different groups remain as distinct as possible.
Many clustering algorithms exist, with K-Means being one of the most widely used. However, K-Means has certain limitations. It requires the number of clusters to be specified beforehand and assumes that clusters are roughly spherical in shape.
In many real-world scenarios, the true number of clusters is unknown. Additionally, data may contain nested relationships where some groups can be further divided into smaller subgroups. In such situations, a more flexible clustering technique is required.
This is where Hierarchical Clustering becomes useful.
Hierarchical Clustering is a clustering technique that builds a hierarchy of clusters using a tree-like structure. Instead of creating a single partition of the data, it generates multiple levels of clustering, allowing analysts to explore relationships at different levels of granularity.
Hierarchical Clustering is widely used in customer segmentation, biological taxonomy, document clustering, image analysis, and gene expression studies.
In this article, we will explore Hierarchical Clustering in detail, understand its working principles, examine dendrograms, learn about different linkage methods, and discuss its advantages and limitations.
What is Hierarchical Clustering?
Hierarchical Clustering is an unsupervised learning algorithm that groups data points into a hierarchy of clusters.
Unlike K-Means, which directly creates a predefined number of clusters, Hierarchical Clustering builds clusters gradually.
The algorithm organizes data into a tree-like structure called a:
Dendrogram
The dendrogram shows how clusters are formed and merged throughout the clustering process.
The primary advantage of Hierarchical Clustering is that it does not require the number of clusters to be specified in advance.
Why is it Called Hierarchical Clustering?
The term "hierarchical" comes from the fact that clusters are arranged in a hierarchy.
Consider the following example:
Living Organisms
↓
Animals
↓
Mammals
↓
Dogs
Each level represents a more detailed grouping.
Similarly, Hierarchical Clustering organizes data into multiple levels of clusters rather than producing a single flat grouping.
Types of Hierarchical Clustering
Hierarchical Clustering can be divided into two main categories.
Agglomerative Hierarchical Clustering
Agglomerative clustering follows a:
Bottom-Up Approach
Initially:
Each Data Point
Is Its Own Cluster
Clusters are gradually merged until all observations belong to a single cluster.
This is the most commonly used form of Hierarchical Clustering.
Divisive Hierarchical Clustering
Divisive clustering follows a:
Top-Down Approach
Initially:
All Data Points
Belong To One Cluster
The algorithm repeatedly splits clusters until individual points remain.
Although conceptually important, divisive clustering is less commonly used because of its higher computational complexity.
Agglomerative Hierarchical Clustering
Agglomerative clustering starts with every data point treated as an individual cluster.
Suppose we have five observations:
A
B
C
D
E
Initially:
{A}
{B}
{C}
{D}
{E}
Each observation forms its own cluster.
The algorithm then repeatedly merges the closest clusters.
Working of Agglomerative Clustering
The clustering process follows several steps.
Step 1: Compute Distance Matrix
The distance between every pair of observations is calculated.
Example:
| A | B | C | D | |
|---|---|---|---|---|
| A | 0 | 2 | 6 | 10 |
| B | 2 | 0 | 5 | 9 |
| C | 6 | 5 | 0 | 4 |
| D | 10 | 9 | 4 | 0 |
This matrix represents pairwise distances.
Step 2: Merge Closest Clusters
The smallest distance is identified.
Suppose:
Distance(A,B)=2
Since A and B are closest, they are merged.
New cluster:
{A,B}
Step 3: Update Distance Matrix
Distances between the newly formed cluster and all remaining clusters are recalculated.
Step 4: Repeat
The algorithm continues merging the closest clusters until only one cluster remains.
The sequence of merges forms a hierarchy.
Understanding the Dendrogram
The most important output of Hierarchical Clustering is the dendrogram.
A dendrogram is a tree diagram that visualizes how clusters are merged.
Example:
───────
| |
──── ────
| | | |
A B C D
The vertical axis represents the distance at which clusters are merged.
The horizontal axis represents the observations.
Dendrograms help determine the appropriate number of clusters.
How to Read a Dendrogram
In a dendrogram:
- Lower merges indicate highly similar observations.
- Higher merges indicate less similar observations.
Consider:
A and B Merge At Distance 2
C and D Merge At Distance 3
Clusters Merge At Distance 10
This suggests:
- A and B are very similar.
- C and D are very similar.
- The two groups are relatively different.
Determining the Number of Clusters
One major advantage of Hierarchical Clustering is that the number of clusters does not need to be specified beforehand.
The dendrogram helps determine the optimal number of clusters.
The process involves drawing a horizontal line through the dendrogram.
Example:
Horizontal Cut
The number of branches intersected by the line corresponds to the number of clusters.
This approach provides flexibility that is not available in K-Means.
Distance Measures in Hierarchical Clustering
Distance metrics determine how similarity is measured.
Common choices include:
Euclidean Distance
Most commonly used.
Measures straight-line distance.
Manhattan Distance
Measures distance along grid-like paths.
Cosine Distance
Often used for text data and embeddings.
Measures similarity based on vector orientation.
The choice of distance metric can significantly affect clustering results.
Linkage Criteria
When clusters contain multiple points, determining the distance between clusters becomes challenging.
Different linkage methods solve this problem differently.
Single Linkage
Single linkage uses the minimum distance between observations from different clusters.
Closest Pair Of Points
determines cluster distance.
Advantages:
- Can discover irregular cluster shapes.
Disadvantages:
- Sensitive to noise.
- May produce chaining effects.
Complete Linkage
Complete linkage uses the maximum distance between observations from different clusters.
Farthest Pair Of Points
determines cluster distance.
Advantages:
- Produces compact clusters.
Disadvantages:
- Sensitive to outliers.
Average Linkage
Average linkage computes the average distance between all observations in two clusters.
Advantages:
- Balanced clustering.
- Less sensitive to extreme values.
This is one of the most commonly used linkage methods.
Centroid Linkage
Distance is measured between cluster centroids.
Cluster Center
is used for comparisons.
Often produces intuitive clusters.
Ward's Linkage
Ward's method minimizes the increase in within-cluster variance after merging.
Advantages:
- Produces compact clusters.
- Often generates high-quality results.
Ward's method is widely used in practice.
Example of Hierarchical Clustering
Suppose a company wants to segment customers based on:
- Age
- Income
- Spending Score
Initially:
Each Customer
Forms A Separate Cluster
Customers with similar purchasing behavior gradually merge into groups.
The final dendrogram may reveal:
Budget Customers
Regular Customers
Premium Customers
This segmentation can support targeted marketing campaigns.
Hierarchical Clustering vs K-Means
Although both are clustering techniques, they differ significantly.
| Feature | Hierarchical Clustering | K-Means |
|---|---|---|
| Number of Clusters Required Initially | No | Yes |
| Output | Dendrogram | Cluster Labels |
| Cluster Shape | Flexible | Mostly Spherical |
| Computational Complexity | Higher | Lower |
| Scalability | Limited | Better |
| Interpretability | High | Moderate |
Hierarchical Clustering provides greater interpretability but is less scalable.
Advantages of Hierarchical Clustering
No Need to Specify K
The number of clusters can be chosen after examining the dendrogram.
Highly Interpretable
The dendrogram provides valuable insight into cluster relationships.
Flexible Cluster Structures
Can discover clusters of different shapes and sizes.
Works Well for Small and Medium Datasets
Often produces meaningful results for moderate-sized datasets.
Reveals Nested Relationships
Captures hierarchical structures naturally.
Limitations of Hierarchical Clustering
Computationally Expensive
Time complexity increases rapidly with dataset size.
Memory Intensive
Requires storing a distance matrix.
Sensitive to Noise
Outliers can significantly influence clustering results.
Difficult to Correct Mistakes
Once clusters are merged, the decision cannot be reversed.
Limited Scalability
Large datasets can become impractical to process.
Applications of Hierarchical Clustering
Hierarchical Clustering is widely used across many domains.
Customer Segmentation
Grouping customers based on behavior and demographics.
Biology
Classifying organisms and species.
Genomics
Analyzing gene expression patterns.
Document Clustering
Grouping similar documents.
Image Analysis
Identifying similar image groups.
Social Network Analysis
Discovering communities within networks.
Market Research
Understanding consumer behavior.
Real-World Example: Gene Expression Analysis
In bioinformatics, researchers analyze thousands of genes simultaneously.
Genes with similar expression patterns may perform related biological functions.
Hierarchical Clustering helps group these genes into meaningful clusters.
The dendrogram allows scientists to visualize relationships among genes and identify biological pathways.