In the previous article, we learned that K-Nearest Neighbors (KNN) makes predictions by looking at nearby data points.
This naturally raises an important question:
How does KNN determine which points are nearest?
Consider the following points:
Point A = (2,3)
Point B = (4,5)
Point C = (20,25)
Intuitively, Point B appears much closer to Point A than Point C does.
But computers cannot rely on intuition.
They need a mathematical way to measure closeness.
This measurement is called a Distance Metric.
Distance Metrics are among the most important concepts in Machine Learning because they are used in:
- K-Nearest Neighbors (KNN)
- K-Means Clustering
- Recommendation Systems
- Image Recognition
- Anomaly Detection
- Information Retrieval
- Natural Language Processing
In this article, we will explore distance metrics from first principles, understand the most commonly used distance formulas, and learn when to use each one.
What is a Distance Metric?
A distance metric is a mathematical method used to measure how far apart two data points are.
The basic intuition is:
Small Distance
↓
More Similar
Large Distance
↓
Less Similar
KNN assumes:
Nearby points
↓
Likely belong to the same class
Therefore, measuring distance accurately is crucial.
Why Distance Matters in KNN
Suppose we want to classify a new student.
Dataset:
| Study Hours | Result |
|---|---|
| 2 | Fail |
| 3 | Fail |
| 8 | Pass |
| 9 | Pass |
New Student:
Study Hours = 4
Which examples should influence the prediction?
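Clearly, the students who studied 2 and 3 hours (distances of 2 and 1 from the new student) are closer than those who studied 8 and 9 hours (distances of 4 and 5).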
Distance metrics help KNN determine this mathematically.
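As a small sketch of this idea, the table above can be ranked by distance to the new student. The variable names here are illustrative and not part of any library.

```python
# Study-hours dataset from the table above: (hours, result)
training = [(2, "Fail"), (3, "Fail"), (8, "Pass"), (9, "Pass")]
new_hours = 4

# With a single feature, distance is the absolute difference in hours.
ranked = sorted(training, key=lambda row: abs(row[0] - new_hours))

for hours, result in ranked:
    print(hours, result, "distance:", abs(hours - new_hours))
```

The two "Fail" examples come first in the ranking, which matches the intuition that they should influence the prediction most.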
Distance in One Dimension
Consider:
Point A = 5
Point B = 8
Distance:
$$d(A, B) = |8 - 5| = 3$$
Simple subtraction works, as long as we take the absolute value so the distance is never negative.
However, real-world data usually contains multiple features.
Distance in Two Dimensions
Suppose:
A = (2,3)
B = (5,7)
How do we measure distance now?
The most common answer is:
Euclidean Distance
Euclidean Distance
Euclidean Distance is the straight-line distance between two points.
It is the distance we normally think about in geometry.
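Formula, for two points $A = (a_1, a_2)$ and $B = (b_1, b_2)$:
$$d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$$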
Visualizing Euclidean Distance
              * B (5,7)
             /|
            / |
           /  |
          /   |
         /    |
        *-----*
    A (2,3)   (5,3)
The diagonal line represents Euclidean Distance.
Example Calculation
Points:
A = (2,3)
B = (5,7)
Distance:
$$d(A, B) = \sqrt{(2 - 5)^2 + (3 - 7)^2} = \sqrt{9 + 16} = 5$$
Euclidean Distance for Multiple Features
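For n dimensions, with $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$:
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$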
This is the most commonly used distance metric in KNN.
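A minimal pure-Python sketch of Euclidean Distance for any number of features might look like this. The function name `euclidean_distance` is chosen here for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of features."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([2, 3], [5, 7]))        # 5.0
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```

The second call uses three features and shows that the same code covers the n-dimensional case.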
Properties of Euclidean Distance
- Easy to understand
- Works well for continuous numerical data
- Most common choice in KNN
- Sensitive to feature scaling
Why Feature Scaling Matters
Suppose:
| Feature | Value |
|---|---|
| Age | 25 |
| Salary | 500000 |
Salary dominates distance calculations because its values are much larger.
Therefore:
Feature scaling is often necessary before applying KNN.
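The effect can be sketched with a few made-up (age, salary) records. The data and the `min_max_scale` helper below are assumptions for illustration, and min-max scaling is only one of several common scaling choices.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1].

    Assumes each feature takes at least two distinct values.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


# Hypothetical (age, salary) records; the first row is the query person.
people = [
    (25, 500_000),  # query
    (60, 501_000),  # very different age, almost the same salary
    (26, 510_000),  # almost the same age, somewhat different salary
    (40, 600_000),  # included only to widen the salary range
]

raw_to_second = euclidean(people[0], people[1])
raw_to_third = euclidean(people[0], people[2])

scaled = min_max_scale(people)
scaled_to_second = euclidean(scaled[0], scaled[1])
scaled_to_third = euclidean(scaled[0], scaled[2])

print(raw_to_second < raw_to_third)        # True: salary dominates the raw distance
print(scaled_to_second > scaled_to_third)  # True: after scaling, age counts as well
```

On the raw values, the 60-year-old looks like the nearest neighbor of the 25-year-old only because the salaries are close. After scaling, the 26-year-old becomes the nearer one.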
Manhattan Distance
Imagine traveling through city streets arranged in a grid.
You cannot move diagonally.
You must travel horizontally and vertically.
This leads to:
Manhattan Distance
Formula, for $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$:
$$d(A, B) = \sum_{i=1}^{n} |a_i - b_i|$$
Why the Name Manhattan?
The streets of Manhattan form a grid-like structure.
Travel occurs along blocks rather than diagonally.
Visualizing Manhattan Distance
    Start *
          |
          |
          |
          *----* End
Movement occurs along edges.
Example Calculation
Points:
A = (2,3)
B = (5,7)
Distance:
$$d(A, B) = |2 - 5| + |3 - 7| = 3 + 4 = 7$$
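As a rough sketch, Manhattan Distance takes only a few lines of Python. The function name `manhattan_distance` is illustrative.

```python
def manhattan_distance(a, b):
    """Sum of absolute per-feature differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return sum(abs(x - y) for x, y in zip(a, b))


print(manhattan_distance([2, 3], [5, 7]))  # 7
```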
Euclidean vs Manhattan
For the same points:
| Metric | Distance |
|---|---|
| Euclidean | 5 |
| Manhattan | 7 |
Different metrics produce different results.
When Manhattan Distance Works Well
Useful when:
- Features represent independent dimensions
- Grid-based movement exists
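- Outlier influence should be reduced (differences are not squared, so one very large difference counts for less than it does in Euclidean Distance)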
Minkowski Distance
Euclidean and Manhattan Distance are actually special cases of a more general metric called:
Minkowski Distance
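Formula, for $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$:
$$d(A, B) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}$$
Where:
$p \geq 1$ controls the distance type.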
Special Cases
If $p = 1$:
Manhattan Distance.
If $p = 2$:
Euclidean Distance.
Thus:
Minkowski generalizes both metrics.
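A short sketch can check this relationship numerically. The function name `minkowski_distance` is illustrative, and the points are the same (2,3) and (5,7) used earlier.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = [2, 3], [5, 7]
print(minkowski_distance(a, b, p=1))  # 7.0, the Manhattan result
print(minkowski_distance(a, b, p=2))  # 5.0, the Euclidean result
```

Values of p between or beyond these two are also allowed, so p can be treated as one more setting to tune.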
Hamming Distance
Used for categorical or binary data, given as two sequences of equal length.
Measures:
Number of positions where values differ.
Example
Strings:
101110
100100
Differences:
Position 3
Position 5
Hamming Distance: 2
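The same count can be sketched in plain Python. The helper names below are chosen for illustration, and positions are numbered from 1 to match the example above.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def differing_positions(a, b):
    """1-based positions where the two sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


print(differing_positions("101110", "100100"))  # [3, 5]
print(hamming_distance("101110", "100100"))     # 2
```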
Applications
- DNA Analysis
- Binary Data
- Error Detection
- Text Processing
Cosine Similarity
Sometimes direction matters more than distance.
Example:
Text Documents.
Two documents may have very different lengths but discuss the same topic.
In such cases, Cosine Similarity is preferred.
Formula, for two vectors $A$ and $B$:
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Intuition
Measures:
Angle Between Vectors
rather than physical distance.
Interpretation
| Cosine Similarity | Meaning |
|---|---|
| 1 | Identical Direction |
| 0 | Orthogonal (Unrelated) |
| -1 | Opposite Direction |
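The three rows of the table can be reproduced with a small pure-Python sketch. The function name `cosine_sim` and the word-count vectors are made up for illustration.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


short_doc = [1, 2, 0]   # hypothetical word counts
long_doc = [10, 20, 0]  # same proportions, ten times longer

print(round(cosine_sim(short_doc, long_doc), 6))  # 1.0: same direction
print(cosine_sim([1, 0], [0, 1]))                 # 0.0: orthogonal
print(cosine_sim([1, 0], [-1, 0]))                # -1.0: opposite direction
```

The two documents differ greatly in length, yet their similarity is 1 because only the direction of the vectors matters.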
Applications
- Search Engines
- NLP
- Recommendation Systems
- Text Similarity
Choosing the Right Distance Metric
There is no universally best metric.
Choice depends on:
- Data type
- Problem domain
- Feature characteristics
Numerical Data
Preferred:
Euclidean Distance
Grid-Based Problems
Preferred:
Manhattan Distance
Binary Data
Preferred:
Hamming Distance
Text Data
Preferred:
Cosine Similarity
Example: KNN Classification
Suppose:
Training Points: A, A, B, B
New Point: ? (class unknown)
KNN:
- Calculates distance to all points.
- Sorts distances.
- Selects nearest neighbors.
- Performs voting.
The distance metric directly determines which neighbors are selected.
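The four steps above can be sketched as a tiny classifier. This is a simplified illustration rather than a production implementation: the function `knn_predict`, the coordinates, and the choice of k are all assumptions, and `math.dist` (Python 3.8+) supplies Euclidean Distance.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict a label for `query` from (point, label) training pairs."""
    # 1. Calculate the distance from the query to every training point.
    distances = [(math.dist(point, query), label) for point, label in training]
    # 2. Sort by distance, nearest first.
    distances.sort(key=lambda pair: pair[0])
    # 3. Select the k nearest neighbors.
    nearest = distances[:k]
    # 4. Vote: the most common label among the neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical coordinates for two A points and two B points.
training = [((1, 1), "A"), ((2, 1), "A"), ((6, 5), "B"), ((7, 6), "B")]

print(knn_predict(training, (2, 2), k=3))  # A
```

Swapping `math.dist` for another distance function is the only change needed to try a different metric.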
Effect of Different Metrics
The same dataset can produce different predictions depending on the chosen metric.
Example:
Euclidean → Class A
Manhattan → Class B
This is why metric selection is important.
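A concrete case of such a disagreement can be built with two training points and a 1-nearest-neighbor rule. The coordinates below are invented purely to show the effect.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical training points: (coordinates, class label)
training = [((3, 3), "A"), ((0, 5), "B")]
query = (0, 0)


def nearest_label(metric):
    """Label of the single nearest training point under the given metric."""
    return min(training, key=lambda row: metric(row[0], query))[1]


print(nearest_label(euclidean))  # A  (about 4.24 to A vs 5.0 to B)
print(nearest_label(manhattan))  # B  (6 to A vs 5 to B)
```

The diagonal point is closer in a straight line, but it costs more when movement is restricted to the axes.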
Python Example: Euclidean Distance
from scipy.spatial.distance import euclidean
# Straight-line distance between (2, 3) and (5, 7)
distance = euclidean([2, 3], [5, 7])
print(distance)  # 5.0
Python Example: Manhattan Distance
from scipy.spatial.distance import cityblock
# Manhattan distance: |2 - 5| + |3 - 7| = 7
distance = cityblock([2, 3], [5, 7])
print(distance)
Python Example: Hamming Distance
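from scipy.spatial.distance import hamming
# SciPy returns the fraction of positions that differ, not the count
distance = hamming([1, 0, 1, 1], [1, 0, 0, 1])
print(distance)      # 0.25 (1 of 4 positions differs)
print(distance * 4)  # 1.0, the number of differing positions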
Python Example: Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
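The import above might be used along these lines. The word-count vectors are made up for illustration. Note that `cosine_similarity` expects 2-D inputs (one row per vector) and returns a matrix of pairwise similarities.

```python
from sklearn.metrics.pairwise import cosine_similarity

# One row per document; the values are hypothetical word counts.
doc_a = [[1, 2, 0]]
doc_b = [[10, 20, 0]]

similarity = cosine_similarity(doc_a, doc_b)

# The result is a 1x1 matrix here, so index into it for the single value.
print(similarity[0][0])
```

The value should be very close to 1 because both vectors point in the same direction.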
Real-World Applications
Recommendation Systems
Finding users with similar preferences.
Face Recognition
Comparing facial feature vectors.
Search Engines
Finding relevant documents.
Fraud Detection
Identifying transactions similar to known frauds.
Medical Diagnosis
Finding patients with similar characteristics.
Common Mistakes
Ignoring Feature Scaling
Different feature scales distort distance calculations.
Using Euclidean Distance Everywhere
Other metrics may be more suitable, such as Hamming Distance for binary data or Cosine Similarity for text.
Mixing Numerical and Categorical Data
Distance calculations become unreliable.
Best Practices
- Scale numerical features
- Understand data characteristics
- Experiment with multiple metrics
- Use cross-validation
- Consider domain knowledge
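Several of these practices can be combined in one short experiment. The sketch below assumes scikit-learn is available and uses its bundled Iris dataset only as a stand-in. The choice of dataset, k = 5, and 5 folds are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scores = {}
for metric in ("euclidean", "manhattan"):
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    scores[metric] = cross_val_score(model, X, y, cv=5).mean()

for metric, score in scores.items():
    print(f"{metric}: mean accuracy {score:.3f}")
```

Which metric scores higher depends on the dataset, which is why the comparison is worth rerunning on your own data.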
Distance Metrics Summary
| Metric | Best For |
|---|---|
| Euclidean | Continuous Numerical Data |
| Manhattan | Grid-Based Problems |
| Minkowski | Generalized Distance |
| Hamming | Binary/Categorical Data |
| Cosine Similarity | Text and Vector Similarity |
Why Distance Metrics are Important
Distance metrics form the foundation of similarity-based Machine Learning algorithms. KNN, clustering methods, recommendation systems, and many modern AI applications depend heavily on how similarity is measured.
Choosing the right distance metric can dramatically affect model performance because it determines which observations are considered neighbors. Understanding these metrics helps practitioners select appropriate algorithms, preprocess data correctly, and build more reliable Machine Learning systems.
In the next article, we will study Choosing K in K-Nearest Neighbors, one of the most important decisions in KNN that directly affects model accuracy, overfitting, and underfitting.