Distance Metrics in Machine Learning
- Author: Tails Azimuth
Table of Contents
- Information Theory: Marginal and Joint Entropy
- Conditional Entropy
- Divergence Measures: Kullback-Leibler and Cross-Entropy
- Mutual Information
- Conclusion
- Variation of Information: A Simplified Guide
- Understanding Variation of Information
- Normalized VI
- Calculating Basic Measures: Entropy and Mutual Information
- Continuous Variables and Discretization
- Calculating Variation of Information with Optimal Binning
- Understanding Partitions in Data Sets
- Metrics for Comparing Partitions
- Applications in Machine Learning
- Experimental Results Summarized
- References
Distance Metrics in Machine Learning
Correlation measures only linear codependence, which can be misleading when variables are related nonlinearly. It is also not a metric: it can be negative, and it does not satisfy the triangle inequality. A metric can nevertheless be formed from the correlation coefficient $\rho[X,Y]$:
$$d_\rho[X,Y] = \sqrt{\tfrac{1}{2}\left(1 - \rho[X,Y]\right)}$$
This distance is proportional to the Euclidean distance between the z-standardized vectors, so it inherits the properties of Euclidean distance, making it a 'true metric'.
Another normalized correlation-based distance metric, $d_{|\rho|}[X,Y] = \sqrt{1 - |\rho[X,Y]|}$, can also be defined. This metric is especially useful when an application should treat strongly negative correlations as similar to strongly positive ones.
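As a rough illustration of the two correlation-based distances, here is a minimal Python sketch. The function names `corr_distance` and `abs_corr_distance` are hypothetical, and the formulas $\sqrt{(1-\rho)/2}$ and $\sqrt{1-|\rho|}$ are assumed to be the ones intended above.

```python
import numpy as np

def corr_distance(x, y):
    """sqrt(0.5 * (1 - rho)): 0 when rho = 1, 1 when rho = -1."""
    rho = np.corrcoef(x, y)[0, 1]
    # max(..., 0.0) guards against tiny negative values from rounding
    return float(np.sqrt(max(0.5 * (1.0 - rho), 0.0)))

def abs_corr_distance(x, y):
    """sqrt(1 - |rho|): rho = 1 and rho = -1 are both at distance 0."""
    rho = np.corrcoef(x, y)[0, 1]
    return float(np.sqrt(max(1.0 - abs(rho), 0.0)))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(corr_distance(x, 2 * x + 1))   # close to 0: perfectly correlated
print(corr_distance(x, -x))          # close to 1: perfectly anti-correlated
print(abs_corr_distance(x, -x))      # close to 0: sign of rho is ignored
```

The second function is the one to reach for when a strong negative relationship should count as "close".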
Information Theory: Marginal and Joint Entropy
Correlation has limitations: it neglects nonlinear relationships, is sensitive to outliers, and is mostly meaningful for normally distributed variables. To address this, we can use Shannon's entropy, defined for a discrete random variable $X$ that takes values $x \in S_X$ with probability $p[x]$:
$$H[X] = -\sum_{x \in S_X} p[x] \log p[x]$$
This entropy measures the amount of uncertainty or 'surprise' associated with $X$.
For two discrete random variables $X$ and $Y$, the joint entropy is:
$$H[X,Y] = -\sum_{(x,y) \in S_X \times S_Y} p[x,y] \log p[x,y]$$
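The two definitions can be sketched in a few lines of Python. This is only an illustration: probabilities are estimated from observed frequencies, the logarithm is taken in base 2 (so results are in bits, an assumption since the text does not fix a base), and the helper names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def joint_entropy(xs, ys):
    """Joint entropy H[X, Y]: entropy of the paired observations."""
    return entropy(list(zip(xs, ys)))

x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
print(entropy(x))            # 1.0 bit: two equally likely values
print(joint_entropy(x, y))   # 2.0 bits: four equally likely pairs
```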
Conditional Entropy
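Conditional entropy measures the remaining uncertainty in $X$ when $Y$ is known:
$$H[X|Y] = H[X,Y] - H[Y] = -\sum_{y \in S_Y} p[y] \sum_{x \in S_X} p[x|Y=y] \log p[x|Y=y]$$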
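A small sketch, under the same empirical-frequency and base-2 assumptions, computes the conditional entropy directly as a weighted average of within-group entropies and checks that it agrees with the chain rule $H[X|Y] = H[X,Y] - H[Y]$. The function names are illustrative only.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H[X|Y] = sum over y of p[y] * H[X | Y = y], from paired samples."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)          # collect the x values seen with each y
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

x = [0, 0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 1, 0]
h_direct = conditional_entropy(x, y)
h_chain = entropy(list(zip(x, y))) - entropy(y)
print(h_direct, h_chain)             # the two values agree up to rounding
```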
Divergence Measures: Kullback-Leibler and Cross-Entropy
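Kullback-Leibler (KL) divergence quantifies how one probability distribution $p$ diverges from another distribution $q$ defined over the same values:
$$D_{KL}[p\|q] = -\sum_{x \in S_X} p[x] \log \frac{q[x]}{p[x]}$$
It is nonnegative and equals zero only when $p = q$, but it is not symmetric, so it is not a metric.
Cross-entropy measures the information content when a wrong distribution $q$ is used rather than the true distribution $p$:
$$H_C[p\|q] = -\sum_{x \in S_X} p[x] \log q[x] = H[X] + D_{KL}[p\|q]$$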
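The relationship between the two quantities can be verified numerically. The following is a minimal sketch, assuming both distributions are given as probability lists over the same outcomes and that `q` is positive wherever `p` is; logs are base 2 by assumption.

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL[p || q] in bits; outcomes with p[x] = 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Expected code length when q is used in place of the true p."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]      # "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]  # "wrong" distribution
print(kl_divergence(p, q), kl_divergence(q, p))   # not symmetric
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # equal
```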
Mutual Information
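The mutual information measures the amount of information $X$ and $Y$ share:
$$I[X,Y] = H[X] - H[X|Y] = H[X] + H[Y] - H[X,Y] = \sum_{x \in S_X} \sum_{y \in S_Y} p[x,y] \log \frac{p[x,y]}{p[x]\,p[y]}$$
Mutual information is not itself a metric, since it does not satisfy the triangle inequality. It can be normalized to the range $[0,1]$, for example as $NMI[X,Y] = I[X,Y] / \min\{H[X], H[Y]\}$, and it is the building block of the variation of information, which does fulfill the metric properties.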
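Here is a short sketch of mutual information and one normalized variant. Dividing by the smaller of the two marginal entropies is a common normalization and is assumed here; probabilities are again empirical frequencies, logs are base 2, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    """I[X, Y] = H[X] + H[Y] - H[X, Y]."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def normalized_mi(xs, ys):
    """I[X, Y] / min(H[X], H[Y]); assumes both entropies are positive."""
    return mutual_information(xs, ys) / min(entropy(xs), entropy(ys))

x = [0, 0, 1, 1, 2, 2]
y_same = [5, 5, 7, 7, 9, 9]    # a relabelled copy of x
y_indep = [0, 1, 0, 1, 0, 1]   # every (x, y) pair occurs exactly once
print(normalized_mi(x, y_same))    # close to 1: y determines x
print(normalized_mi(x, y_indep))   # close to 0: nothing shared
```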
Conclusion
Both correlation and entropy-based measures have their places in modern applications. Correlation-based measures are computationally less demanding and have a long history in statistics. In contrast, entropy-based measures also capture nonlinear relationships between variables, at the cost of estimating probability distributions. Implementing both can enhance your analytics and decision-making processes.
Variation of Information: A Simplified Guide
Understanding Variation of Information
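Variation of Information (VI) measures the uncertainty that remains in one variable once the value of the other is known. It has two terms:
- Uncertainty in $X$ given $Y$: $H[X|Y]$
- Uncertainty in $Y$ given $X$: $H[Y|X]$
So, the formula becomes:
$$VI[X,Y] = H[X|Y] + H[Y|X]$$
We can also express it using other measures like Mutual Information ($I[X,Y]$) and joint entropy ($H[X,Y]$):
$$VI[X,Y] = H[X] + H[Y] - 2\,I[X,Y]$$
or
$$VI[X,Y] = 2\,H[X,Y] - H[X] - H[Y]$$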
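The equivalence of the three expressions for VI can be checked numerically. This is a sketch under the usual assumptions (empirical frequencies, base-2 logs, hypothetical function names), where the three forms are $H[X|Y] + H[Y|X]$, $H[X] + H[Y] - 2I[X,Y]$ and $2H[X,Y] - H[X] - H[Y]$.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def vi_forms(xs, ys):
    """Return VI computed three ways; all three should coincide."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))
    mi = h_x + h_y - h_xy
    conditional = (h_xy - h_y) + (h_xy - h_x)   # H[X|Y] + H[Y|X]
    via_mi = h_x + h_y - 2 * mi
    via_joint = 2 * h_xy - h_x - h_y
    return conditional, via_mi, via_joint

def variation_of_information(xs, ys):
    return vi_forms(xs, ys)[0]

x = [0, 0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 1, 0]
z = [0, 1, 2, 0, 1, 2]
print(vi_forms(x, y))
```

Because VI is a metric, it is zero for identical variables, symmetric, and obeys the triangle inequality, which the sample data above can be used to spot-check.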
Normalized VI
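The upper bound of VI depends on the population size, so to compare VI across varying population sizes, we can normalize it by the joint entropy:
$$\widetilde{VI}[X,Y] = \frac{VI[X,Y]}{H[X,Y]}$$
An alternative normalized metric is:
$$\widetilde{\widetilde{VI}}[X,Y] = \frac{\max\{H[X|Y], H[Y|X]\}}{\max\{H[X], H[Y]\}} = 1 - \frac{I[X,Y]}{\max\{H[X], H[Y]\}}$$
Both are bounded between 0 and 1.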
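A brief sketch of the two normalizations follows. It assumes the forms $VI[X,Y]/H[X,Y]$ and $1 - I[X,Y]/\max\{H[X], H[Y]\}$, which are the standard choices but are not spelled out in the surviving text; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def normalized_vi(xs, ys):
    """VI[X, Y] / H[X, Y]; assumes the joint entropy is positive."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))
    return (2 * h_xy - h_x - h_y) / h_xy

def normalized_vi_alt(xs, ys):
    """1 - I[X, Y] / max(H[X], H[Y]); assumes a positive marginal entropy."""
    h_x, h_y = entropy(xs), entropy(ys)
    mi = h_x + h_y - entropy(list(zip(xs, ys)))
    return 1 - mi / max(h_x, h_y)

x = [0, 0, 1, 1]
y_indep = [0, 1, 0, 1]
print(normalized_vi(x, x), normalized_vi_alt(x, x))              # both near 0
print(normalized_vi(x, y_indep), normalized_vi_alt(x, y_indep))  # both near 1
```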
Calculating Basic Measures: Entropy and Mutual Information
Functions for calculating entropy, mutual information, and the related measures are available in both Python and Julia in the RiskLabAI library.
Continuous Variables and Discretization
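For continuous variables, the entropy is defined by an integral over the probability density, $H[X] = -\int f_X[x] \log f_X[x] \, dx$. For a Gaussian variable with variance $\sigma^2$, this has a closed form:
$$H[X] = \frac{1}{2} \log\left[2 \pi e \sigma^2\right]$$
In practice, the density is unknown, so we discretize the continuous data into $B_X$ bins of equal width $\Delta_X$ to approximate the entropy. With $N_i$ of the $N$ observations falling in bin $i$, the entropy can be estimated as:
$$\hat{H}[X] = -\sum_{i=1}^{B_X} \frac{N_i}{N} \log \frac{N_i}{N} + \log \Delta_X$$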
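To see how well the discretized estimate tracks the closed form, the sketch below compares the two on simulated Gaussian data. It assumes the histogram estimator $-\sum_i (N_i/N) \log (N_i/N) + \log \Delta_X$ with equal-width bins and natural logarithms; the sample size, seed, and bin count are arbitrary illustrative choices.

```python
import numpy as np

def binned_entropy(x, bins):
    """Histogram estimate of differential entropy, in nats."""
    counts, edges = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()   # empty bins contribute nothing
    width = edges[1] - edges[0]             # equal-width bins
    return float(-(p * np.log(p)).sum() + np.log(width))

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)

exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
estimate = binned_entropy(x, bins=50)
print(exact, estimate)   # the estimate should be close to the exact value
```

The quality of the estimate depends on the number of bins, which is the motivation for choosing that number systematically.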
For optimal binning, we can use formulas derived by Hacine-Gharbi and others. These vary depending on whether you are looking at the marginal entropy or the joint entropy.
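The bin-count rules can be sketched as follows. The formulas used are the ones commonly attributed to Hacine-Gharbi and co-authors as reproduced in De Prado (2020); they are stated here as an assumption because the text does not spell them out, and `optimal_bins` is a hypothetical name.

```python
from math import sqrt

def optimal_bins(n_obs, corr=None):
    """Number of equal-width bins for entropy estimation.

    corr=None -> marginal entropy (univariate rule)
    corr=rho  -> joint entropy (bivariate rule); requires |rho| < 1
    """
    if corr is None:
        zeta = (8 + 324 * n_obs + 12 * sqrt(36 * n_obs + 729 * n_obs**2)) ** (1 / 3)
        bins = round(zeta / 6 + 2 / (3 * zeta) + 1 / 3)
    else:
        bins = round(sqrt(1 + sqrt(1 + 24 * n_obs / (1 - corr**2))) / sqrt(2))
    return int(bins)

print(optimal_bins(1000))             # marginal case
print(optimal_bins(1000, corr=0.0))   # joint case, uncorrelated
print(optimal_bins(1000, corr=0.9))   # joint case, strongly correlated
```

Under the bivariate rule, a stronger correlation calls for more bins per axis, since the joint distribution concentrates along a narrow band.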
Calculating Variation of Information with Optimal Binning
The Variation of Information with optimal binning can be calculated in both Python and Julia. If you set norm=True, you'll get the normalized value.
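The following is a hedged sketch of how such a calculation might look with NumPy; it is not the RiskLabAI implementation. It assumes the bivariate bin-count rule from Hacine-Gharbi and co-authors, a normalization by the joint entropy when `norm=True`, and input series that are not perfectly correlated. The helper names are hypothetical.

```python
import numpy as np

def optimal_bins(n_obs, corr):
    """Bivariate bin-count rule (assumed); requires |corr| < 1."""
    return int(round(np.sqrt(1 + np.sqrt(1 + 24 * n_obs / (1 - corr**2))) / np.sqrt(2)))

def entropy_from_counts(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def variation_of_information(x, y, norm=False):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    bins = optimal_bins(len(x), np.corrcoef(x, y)[0, 1])
    joint = np.histogram2d(x, y, bins=bins)[0]      # joint bin counts
    h_x = entropy_from_counts(joint.sum(axis=1))    # marginal of x
    h_y = entropy_from_counts(joint.sum(axis=0))    # marginal of y
    h_xy = entropy_from_counts(joint.ravel())
    vi = 2 * h_xy - h_x - h_y
    return vi / h_xy if norm else vi

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_dep = x + 0.1 * rng.normal(size=1000)   # strongly related to x
y_ind = rng.normal(size=1000)             # unrelated to x
print(variation_of_information(x, y_dep, norm=True))
print(variation_of_information(x, y_ind, norm=True))
```

The normalized value for the unrelated pair should be noticeably closer to 1 than the value for the related pair.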
Understanding Partitions in Data Sets
A partition, denoted as $P = \{D_k\}_{k=1,\ldots,K}$, is a way to divide a dataset, $D$, into non-overlapping subsets. Mathematically, these subsets follow three main properties:
- Every subset contains at least one element, i.e., $\|D_k\| > 0$ for all $k$.
- Subsets do not overlap, i.e., $D_k \cap D_l = \emptyset$, for $k \neq l$.
- Together, the subsets cover the entire dataset, i.e., $\bigcup_{k=1}^{K} D_k = D$.
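The three properties translate directly into a small check. This is only an illustrative sketch using Python sets; `is_partition` is a hypothetical helper name.

```python
def is_partition(subsets, dataset):
    """Check non-emptiness, disjointness, and coverage of `dataset`."""
    subsets = [set(s) for s in subsets]
    dataset = set(dataset)
    union = set().union(*subsets)
    non_empty = all(len(s) > 0 for s in subsets)
    # if no element is shared, the sizes add up to the size of the union
    disjoint = sum(len(s) for s in subsets) == len(union)
    covering = union == dataset
    return non_empty and disjoint and covering

data = range(1, 6)
print(is_partition([{1, 2}, {3}, {4, 5}], data))      # valid partition
print(is_partition([{1, 2}, {2, 3}, {4, 5}], data))   # subsets overlap
print(is_partition([{1, 2}, {3}], data))              # 4 and 5 not covered
```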
Metrics for Comparing Partitions
We define the uncertainty or randomness associated with a partition $P$ in terms of entropy, given by:
$$H[P] = -\sum_{k=1}^{K} p[k] \log p[k]$$
Here, $p[k] = \|D_k\| / \|D\|$ is the probability of a randomly chosen element from $D$ belonging to the subset $D_k$.
If we have another partition $P' = \{D'_{k'}\}_{k'=1,\ldots,K'}$ of the same dataset, we can define several metrics like joint entropy, conditional entropy, mutual information, and variation of information to compare the two partitions, using $p[k,k'] = \|D_k \cap D'_{k'}\| / \|D\|$ as the joint probability. These metrics provide a way to measure the similarity or dissimilarity between two different divisions of the same dataset.
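As a sketch of how two partitions might be compared, the code below computes partition entropies from subset sizes and the joint entropy from the sizes of pairwise intersections, then combines them into a variation of information. The function names are hypothetical, logs are base 2 by assumption, and both inputs are assumed to be valid partitions of the same dataset.

```python
from math import log2

def partition_entropy(partition, n):
    """H[P], with p[k] = |D_k| / n."""
    return -sum(len(s) / n * log2(len(s) / n) for s in partition if s)

def joint_partition_entropy(p1, p2, n):
    """H[P, P'], with p[k, k'] = |D_k & D'_k'| / n."""
    sizes = [len(a & b) for a in p1 for b in p2]
    return -sum(m / n * log2(m / n) for m in sizes if m > 0)

def partition_vi(p1, p2, n):
    h1, h2 = partition_entropy(p1, n), partition_entropy(p2, n)
    return 2 * joint_partition_entropy(p1, p2, n) - h1 - h2

n = 8
P = [{0, 1, 2, 3}, {4, 5, 6, 7}]
Q = [{0, 1, 4, 5}, {2, 3, 6, 7}]   # cuts across P
print(partition_vi(P, P, n))   # 0.0: identical partitions
print(partition_vi(P, Q, n))   # 2.0: knowing one says nothing about the other
```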
Applications in Machine Learning
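Variation of information is particularly useful in unsupervised learning for comparing the partitions produced by different clustering algorithms, or by the same algorithm under different settings. Its normalized form is bounded between 0 and 1, which makes the comparison meaningful across datasets of different sizes.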
Experimental Results Summarized
- No Relationship: When there's no relation between two variables, both correlation and normalized mutual information are close to zero.
- Linear Relationship: For a strong linear relationship, both metrics are high but the mutual information is slightly less than 1 due to some uncertainty.
- Nonlinear Relationship: In this case, correlation fails to capture the relationship, but normalized mutual information reveals a substantial amount of shared information between the variables.
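The three cases can be reproduced in spirit with a small simulation. This is a sketch, not the original experiment: it assumes Gaussian inputs, the relationships $y = \varepsilon$, $y = 100x + \varepsilon$ and $y = 100|x| + \varepsilon$, a fixed number of bins, and a normalization of mutual information by the smaller marginal entropy.

```python
import numpy as np

def normalized_mi(x, y, bins=10):
    """Binned I[X, Y] / min(H[X], H[Y]); the bin count is an arbitrary choice."""
    counts = np.histogram2d(x, y, bins=bins)[0]

    def h(c):
        p = c[c > 0] / c.sum()
        return -(p * np.log(p)).sum()

    h_x, h_y = h(counts.sum(axis=1)), h(counts.sum(axis=0))
    h_xy = h(counts.ravel())
    return float((h_x + h_y - h_xy) / min(h_x, h_y))

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
noise = rng.normal(size=n)
cases = {
    "none": noise,
    "linear": 100 * x + noise,
    "nonlinear": 100 * np.abs(x) + noise,
}
results = {}
for name, y in cases.items():
    results[name] = (float(np.corrcoef(x, y)[0, 1]), normalized_mi(x, y))
    print(name, results[name])
```

With these settings, the nonlinear case should show a correlation near zero alongside a clearly positive normalized mutual information.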
References
- De Prado, M. L. (2018). Advances in financial machine learning. John Wiley & Sons.
- De Prado, M. L. (2020). Machine learning for asset managers. Cambridge University Press.