Distance Metrics in Machine Learning

Correlation measures only linear codependence, which can be misleading, and it does not satisfy the properties of a metric, such as nonnegativity and the triangle inequality. A proper metric can nonetheless be built from correlation: $d_{\rho}[X, Y]=\sqrt{\frac{1}{2}(1-\rho[X, Y])}$. Because this quantity is proportional to the Euclidean distance between the z-standardized versions of $X$ and $Y$, it inherits the properties of Euclidean distance and is a true metric.

Another normalized, correlation-based distance, $d_{|\rho|}[X, Y]=\sqrt{1-|\rho[X, Y]|}$, can also be defined. It is useful in applications where a strongly negative correlation should be treated as just as informative as a strongly positive one.

With $T$ denoting the number of observations, the unnormalized forms of these metrics are:

$$d_{\rho}[X, Y] = \sqrt{2T(1-\rho[X, Y])}$$

$$d_{|\rho|}[X, Y] = \sqrt{2T(1-|\rho[X, Y]|)}$$
Python

Julia
def calc_rho_metric(
    X: np.ndarray,
    Y: np.ndarray
) -> float:

function calc_rho_metric(
    X::Array{Float64,1},
    Y::Array{Float64,1}
)::Float64

View More: Python | Julia
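As a quick illustration only (not the RiskLabAI implementation; the function name correlation_distances is made up for this example), both distances can be computed directly with NumPy:

import numpy as np

def correlation_distances(X: np.ndarray, Y: np.ndarray) -> tuple:
    # Pearson correlation between the two series
    rho = np.corrcoef(X, Y)[0, 1]
    # d_rho: metric based on the signed correlation
    d_rho = np.sqrt(0.5 * (1.0 - rho))
    # d_|rho|: treats strong negative correlation as "close"
    d_abs_rho = np.sqrt(1.0 - abs(rho))
    return d_rho, d_abs_rho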

Information Theory: Marginal and Joint Entropy

Correlation has limitations: it neglects nonlinear relationships, is sensitive to outliers, and is most meaningful for normally distributed variables. To address this, we can use Shannon's entropy, defined for a discrete random variable $X$:

$$H[X] = -\sum_{x \in S_{X}} p[x] \log [p[x]]$$

The entropy $H[X]$ measures the amount of uncertainty, or 'surprise', associated with $X$.

For two discrete random variables $X, Y$, the joint entropy is:

$$H[X, Y] = -\sum_{x, y \in S_{X} \times S_{Y}} p[x, y] \log [p[x, y]]$$

Conditional Entropy

Conditional entropy $H[X \mid Y]$ measures the uncertainty remaining in $X$ once $Y$ is known:

$$H[X \mid Y] = H[X, Y] - H[Y]$$
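To make these definitions concrete, here is a minimal sketch for discrete (or already binned) data. The helper names are illustrative only, not part of the article's library:

import numpy as np

def marginal_entropy(x: np.ndarray) -> float:
    # H[X]: plug-in estimate from relative frequencies
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def joint_entropy(x: np.ndarray, y: np.ndarray) -> float:
    # H[X, Y]: the same estimate applied to the observed pairs (x_i, y_i)
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def conditional_entropy(x: np.ndarray, y: np.ndarray) -> float:
    # H[X | Y] = H[X, Y] - H[Y]
    return joint_entropy(x, y) - marginal_entropy(y)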

Divergence Measures: Kullback-Leibler and Cross-Entropy

Kullback-Leibler (KL) divergence quantifies how one probability distribution $p$ diverges from another distribution $q$ defined on the same support:

$$D_{KL}[p \| q] = \sum_{x \in S_{X}} p[x] \log \left[\frac{p[x]}{q[x]}\right]$$

Cross-entropy $H_C[p \| q]$ measures the information content of $X$ when the wrong distribution $q$ is used in place of the true distribution $p$:

$$H_{C}[p \| q] = H[X] + D_{KL}[p \| q]$$

where $H[X]$ is the entropy of $X$ computed under the true distribution $p$.

Python

Julia

def calc_kl_divergence(
    p: np.ndarray,
    q: np.ndarray
) -> float:

function calc_kl_divergence(
    p::Array{Float64,1},
    q::Array{Float64,1}
)::Float64

View More: Python | Julia
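For illustration (again, not the library's code), KL divergence and cross-entropy for two discrete distributions over the same support can be sketched as:

import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    # D_KL[p || q] = sum p log(p / q); terms with p[x] == 0 contribute nothing
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    # H_C[p || q] = H[p] + D_KL[p || q] = -sum p log q
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))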

Mutual Information

The mutual information $I[X; Y]$ measures the amount of information $X$ and $Y$ share:

$$I[X; Y] = H[X] + H[Y] - H[X, Y]$$

Mutual information is not bounded, but it can be rescaled to the unit interval as the normalized mutual information (NMI), which makes values comparable across variable pairs:

$$NMI[X, Y] = \frac{I[X; Y]}{\sqrt{H[X]\, H[Y]}}$$
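A rough sketch of these two quantities, reusing the illustrative marginal_entropy and joint_entropy helpers defined earlier:

import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, normalized: bool = False) -> float:
    # I[X; Y] = H[X] + H[Y] - H[X, Y]
    h_x = marginal_entropy(x)
    h_y = marginal_entropy(y)
    mi = h_x + h_y - joint_entropy(x, y)
    if normalized:
        # NMI rescales by the geometric mean of the marginal entropies
        mi /= np.sqrt(h_x * h_y)
    return mi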

Conclusion

Both correlation-based and entropy-based measures have their place in modern applications. Correlation-based measures are computationally cheaper and have a long history in statistics, but they capture only linear codependence. Entropy-based measures also capture nonlinear relationships, at the cost of having to estimate the underlying distributions. Implementing both families of measures can enhance your analytics and decision-making processes.

Python

Julia

def calc_mutual_info(
    X: np.ndarray,
    Y: np.ndarray
) -> float:

function calc_mutual_info(
    X::Array{Float64,1},
    Y::Array{Float64,1}
)::Float64

View More: Python | Julia

Variation of Information: A Simplified Guide

Understanding Variation of Information

Variation of information (VI) measures how much uncertainty about one variable remains once the other is known, in both directions. It is the sum of two terms:

  • the uncertainty in $X$ given $Y$: $H[X \mid Y]$
  • the uncertainty in $Y$ given $X$: $H[Y \mid X]$

So, the formula becomes:

$$VI[X, Y] = H[X \mid Y] + H[Y \mid X]$$

We can also express it in terms of the mutual information $I[X; Y]$ and the joint entropy $H[X, Y]$:

$$VI[X, Y] = 2 H[X, Y] - H[X] - H[Y]$$

or

$$VI[X, Y] = H[X, Y] - I[X; Y]$$

Normalized VI

To compare VI across varying population sizes, we can normalize it:

$$\widetilde{VI}[X, Y] = \frac{VI[X, Y]}{H[X, Y]}$$

An alternative normalized metric is:

$$\widetilde{\widetilde{VI}}[X, Y] = 1-\frac{I[X; Y]}{\max\{H[X], H[Y]\}}$$
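A minimal sketch of VI and its first normalized variant, again reusing the illustrative entropy helpers from earlier (the alternative normalization would instead divide the mutual information by the larger of the two marginal entropies):

import numpy as np

def variation_of_information(x: np.ndarray, y: np.ndarray, norm: bool = False) -> float:
    # VI[X, Y] = 2 H[X, Y] - H[X] - H[Y]
    h_x = marginal_entropy(x)
    h_y = marginal_entropy(y)
    h_xy = joint_entropy(x, y)
    vi = 2.0 * h_xy - h_x - h_y
    if norm:
        # normalized VI: divide by the joint entropy
        vi /= h_xy
    return vi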

Calculating Basic Measures: Entropy and Mutual Information

Python

Julia

def basic_measures(
    x: np.ndarray,
    y: np.ndarray,
    bins: int
) -> float:

function basic_measures(
    x::Array{Float64,1},
    y::Array{Float64,1},
    bins::Int
)::Float64

View More: Python | Julia

These functionalities are available in both Python and Julia in the RiskLabAI library.

Continuous Variables and Discretization

For continuous variables, entropy is defined through an integral (the differential entropy). In practice, we discretize the continuous data into bins and approximate the entropy from the binned counts. For a Gaussian variable:

$$H[X] = \frac{1}{2} \log \left[2 \pi e \sigma^{2}\right]$$

The binned estimator of the entropy is:

$$\hat{H}[X]=-\sum_{i=1}^{B_{X}} \frac{N_{i}}{N} \log \left[\frac{N_{i}}{N}\right]+\log \left[\Delta_{X}\right]$$

where $B_X$ is the number of bins, $N_i$ is the number of observations in bin $i$, $N$ is the total number of observations, and $\Delta_X$ is the bin width.

For the number of bins, we can use the optimal-binning formulas derived by Hacine-Gharbi and co-authors. The formula differs depending on whether we are estimating a marginal entropy or a joint entropy; a sketch is given below.
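As an illustrative sketch of these binning rules (following the formulas as reported in de Prado, 2020; the exact constants should be checked against the original papers), the bin count can be chosen as follows:

def optimal_bins(n_obs: int, corr: float = None) -> int:
    # marginal case (Hacine-Gharbi et al.): bins depend only on the sample size
    if corr is None:
        z = (8 + 324 * n_obs + 12 * (36 * n_obs + 729 * n_obs ** 2) ** 0.5) ** (1 / 3)
        bins = round(z / 6 + 2 / (3 * z) + 1 / 3)
    # joint case (Hacine-Gharbi and Ravier): bins also depend on the correlation
    else:
        bins = round(2 ** -0.5 * (1 + (1 + 24 * n_obs / (1 - corr ** 2)) ** 0.5) ** 0.5)
    return int(bins)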

Calculating Variation of Information with Optimal Binning

Python

Julia

def variationsInformation(
    x: np.ndarray,
    y: np.ndarray,
    bins: int,
    norm: bool = False
) -> float:

function variationsInformation(
    x::Array{Float64,1},
    y::Array{Float64,1},
    numberOfBins::Int;
    norm::Bool = false
)::Float64

View More: Python | Julia

The signatures above correspond to the library functions that compute the variation of information with optimal binning in Python and Julia. Setting norm=True (norm=true in Julia) returns the normalized value. A rough end-to-end illustration using the sketches from earlier sections follows.
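Putting the earlier sketches together for continuous data (an assumed workflow for illustration, not the library's internals): bin each series with the optimal bin count, then apply the discrete VI estimator.

import numpy as np

# toy example data (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.5, size=1000)

# discretize with the optimal marginal bin count, then compute normalized VI
bins = optimal_bins(len(x))
x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
print(variation_of_information(x_binned, y_binned, norm=True))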

Understanding Partitions in Data Sets

A partition, denoted $P$, is a way of dividing a dataset $D$ into non-overlapping subsets. Mathematically, the subsets $D_k$ satisfy three properties:

  • Every subset contains at least one element, i.e., $\|D_k\| > 0$.
  • Subsets do not overlap, i.e., $D_k \cap D_l = \varnothing$ for $k \neq l$.
  • Together, the subsets cover the entire dataset, i.e., $\bigcup_{k=1}^{K} D_k = D$.

Metrics for Comparing Partitions

We define the uncertainty or randomness associated with a partition $P$ in terms of entropy, given by:

$$H[P] = -\sum_{k=1}^{K} p[k] \log [p[k]]$$

Here, $p[k]$ is the probability that a randomly chosen element of $D$ belongs to the subset $D_k$.

If we have another partition $P'$ of the same dataset, we can define joint entropy, conditional entropy, mutual information, and variation of information between $P$ and $P'$, exactly as above. These quantities measure how similar or dissimilar two different divisions of the same dataset are; a small sketch follows.
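As an illustrative sketch (the label arrays below are hypothetical and the helper is the one sketched earlier, not the article's code), two clusterings of the same dataset can be compared by encoding each as a label array and reusing the VI estimator:

import numpy as np

# hypothetical cluster labels from two different clustering runs on the same eight points
labels_a = np.array([0, 0, 1, 1, 2, 2, 2, 0])
labels_b = np.array([1, 1, 0, 0, 0, 2, 2, 1])

# each label array defines a partition; VI quantifies how differently they split the data
print(variation_of_information(labels_a, labels_b, norm=True))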

Applications in Machine Learning

Variation of information is particularly useful in unsupervised learning to compare the output from clustering algorithms. It offers a normalized way to compare partitioning methods across various datasets.

Experimental Results Summarized

  • No Relationship: when the two variables are unrelated, both correlation and normalized mutual information are close to zero.
  • Linear Relationship: for a strong linear relationship, both measures are high, though normalized mutual information stays slightly below 1 because the relationship is not perfectly deterministic.
  • Nonlinear Relationship: here correlation fails to capture the dependence, while normalized mutual information still reveals a substantial amount of shared information between the variables.

Python

Julia

def numberBins(
    numberObservations: int,
    correlation = None
):

def variationsInformationExtended(
    x: list,
    y: list,
    norm: bool = False
):

def mutualInformation(
    x: list,
    y: list,
    norm: bool = False
):

function numberBins(
    numberObservations::Int,
    correlation = nothing
)

function variationsInformationExtended(
    x::Array,
    y::Array;
    norm::Bool = false
)

function mutualInformation(
    x::Array,
    y::Array;
    norm::Bool = false
)

View More: Python | Julia


References

  1. López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
  2. López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press.