Distance

Euclidean Distance

For points $x=(x_1,x_2,\dots,x_n)$ and $y=(y_1,y_2,\dots,y_n)$, the Euclidean distance is

$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
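
A minimal NumPy sketch (the helper name `euclidean_distance` is just for illustration):

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (the 3-4-5 triangle)
```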

Manhattan Distance

The name comes from driving in Manhattan: to get from one building to another, we must follow the street grid (the x and y axes) rather than take a straight line between the two points.

It sums the projections of the segment between the two points onto the coordinate axes. For points $a=(x_{11},x_{12},\dots,x_{1n})$ and $b=(x_{21},x_{22},\dots,x_{2n})$, it is

$$d_{ab}=\sum_{i=1}^{n}|x_{1i}-x_{2i}|$$
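
A quick sketch in the same style, summing absolute differences with NumPy:

```python
import numpy as np

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (the L1 norm of a - b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b))

print(manhattan_distance([0, 0], [3, 4]))  # 7.0 = |0-3| + |0-4|
```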

Mahalanobis Distance

It is a very useful way of determining the “similarity” of a set of values from an “unknown”
sample to a collection of “known” samples. The Mahalanobis distance between a point $x$ and a distribution with mean $\mu$ and covariance matrix $S$ is

$$D_{x}=\sqrt{(x-\mu)^T S^{-1}(x-\mu)}$$

Compared with the Euclidean distance: “Euclidean distance only makes sense when all the dimensions have the same units (like meters), since it involves adding the squared value of them. When you are dealing with probabilities, a lot of times the features have different units. For example: I have a model for men and women, based on their weight [Kg] and height [m]. I know the mean and covariance for each. Now I get a new measurement set of weight and height and I try to decide if it’s a man or a woman. I can use the Mahalanobis distance from the models of both men and women to decide which is closer, meaning which is more probable.”
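
A minimal sketch of that idea, assuming we estimate $\mu$ and $S$ from a sample with NumPy (the sample values below are made up for illustration):

```python
import numpy as np

def mahalanobis_distance(x, mu, cov):
    """Distance from point x to a distribution with mean mu and covariance cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

# Hypothetical (weight [kg], height [m]) samples for one class
samples = np.array([[80.0, 1.80], [75.0, 1.75], [85.0, 1.82], [78.0, 1.78]])
mu = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)  # rows are observations, columns are features

print(mahalanobis_distance([77.0, 1.76], mu, cov))
```

Computing this distance to both class models and choosing the smaller one is exactly the man/woman decision rule described in the quote.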

Chebyshev Distance

Defined as

$$D(p,q)=\max_{i}|p_i-q_i|$$

where $p$ and $q$ are the coordinates of the two points. It can also be written as

$$\lim_{k\to\infty}\left(\sum_{i=1}^{n}|p_i-q_i|^k\right)^{1/k}$$

so it is also called the $L_\infty$ distance.
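
In code it is simply the largest coordinate difference; a short NumPy sketch:

```python
import numpy as np

def chebyshev_distance(p, q):
    """Largest absolute coordinate difference (the L-infinity norm of p - q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.max(np.abs(p - q))

print(chebyshev_distance([1, 2], [4, 3]))  # 3.0 = max(|1-4|, |2-3|)
```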

Minkowski Distance

A general family of distances between two points, defined as

$$d_{12}=\sqrt[p]{\sum_{k=1}^{n}|x_{1k}-x_{2k}|^p}$$

Where

$p=1$: it is the Manhattan distance

$p=2$: it is the Euclidean distance

$p \to \infty$: it is the Chebyshev distance, as the sketch below illustrates
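
A sketch that checks the three special cases numerically, using a large finite $p$ to approximate the limit:

```python
import numpy as np

def minkowski_distance(a, b, p):
    """p-th root of the sum of absolute coordinate differences raised to p."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, 1))    # 7.0  -- Manhattan
print(minkowski_distance(a, b, 2))    # 5.0  -- Euclidean
print(minkowski_distance(a, b, 100))  # ~4.0 -- approaches Chebyshev
```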

Standardized Euclidean Distance
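
The Euclidean distance with each dimension rescaled by its standard deviation, so that dimensions with different spreads (or units) contribute comparably:

$$d_{12}=\sqrt{\sum_{k=1}^{n}\left(\frac{x_{1k}-x_{2k}}{s_k}\right)^2}$$

where $s_k$ is the standard deviation of the $k$-th dimension over the data set.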

Bhattacharyya Distance
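
It measures the similarity of two probability distributions. For discrete distributions $p$ and $q$ over the same domain $X$,

$$D_B(p,q)=-\ln\left(\sum_{x\in X}\sqrt{p(x)q(x)}\right)$$

The sum inside the logarithm is the Bhattacharyya coefficient: it equals 1 for identical distributions and 0 for distributions with disjoint support.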

Cosine

For two vectors in $n$-dimensional space, the cosine similarity is defined as

$$\cos(\theta)=\frac{a\cdot b}{|a||b|}$$

If $a=(x_{11},x_{12},\dots,x_{1n})$ and $b=(x_{21},x_{22},\dots,x_{2n})$, then

$$\cos(\theta)=\frac{\sum_{k=1}^{n}x_{1k}x_{2k}}{\sqrt{\sum_{k=1}^{n}x_{1k}^2}\sqrt{\sum_{k=1}^{n}x_{2k}^2}}$$
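
A direct NumPy translation of the formula:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2], [2, 4]))  # 1.0 (parallel vectors)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
```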

Jaccard Similarity Coefficient

Defined as

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
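
In plain Python, a small sketch using built-in sets:

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5: 2 shared of 4 distinct
```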

Pearson Correlation Coefficient

For two random variables $x$ and $y$, it measures their linear correlation:

$$\rho_{xy}=\frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}=\frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x \sigma_y}$$
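
A sample-based NumPy sketch, centering both variables and then normalizing:

```python
import numpy as np

def pearson_correlation(x, y):
    """Sample covariance of x and y over the product of their standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_correlation(x, 2 * x + 1))  # 1.0  (perfect positive linear relation)
print(pearson_correlation(x, -x))         # -1.0 (perfect negative linear relation)
```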

Author: shixuan liu
Link: http://tedlsx.github.io/2019/08/30/distance/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.