Summary: This tutorial provides implementations of four key similarity metrics (Jaccard, Euclidean, Cosine, and Dot Product) with a focus on PyTorch-based implementations. It covers practical code examples for calculating text and embedding similarities, including optimized implementations using torch.nn.functional and matrix operations. The tutorial helps with tasks like comparing text similarity, working with embedding vectors, and choosing appropriate similarity metrics. Key features include numerically stable implementations, GPU-compatible code, best practices for each metric, and integration with embedding models like BGE, with special attention to normalization considerations and performance optimization.

# Similarity

In this section, we will introduce several different ways to measure similarity.

## 1. Jaccard Similarity

Before computing the similarity between embedding vectors directly, let's first look at a basic method for measuring how similar two sentences are: Jaccard similarity.

**Definition:** For sets $A$ and $B$, the Jaccard index (also called the Jaccard similarity coefficient) is the size of their intersection divided by the size of their union:
$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

The value of $J(A,B)$ falls in the range of $[0, 1]$.


```python
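def jaccard_similarity(sentence1, sentence2):
    # Treat each sentence as a set of whitespace-separated words
    set1 = set(sentence1.split())
    set2 = set(sentence2.split())
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    # Two empty sentences have an empty union; return 0.0 instead of dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)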
```


```python
s1 = "Hawaii is a wonderful place for holiday"
s2 = "Peter's favorite place to spend his holiday is Hawaii"
s3 = "Anna enjoys baking during her holiday"
```


```python
jaccard_similarity(s1, s2)
```


```python
jaccard_similarity(s1, s3)
```

We can see that sentences 1 and 2 share four words: 'Hawaii', 'is', 'place', and 'holiday'. Their union contains 12 distinct words, so the score is 4/12 ≈ 0.333. Sentences 1 and 3 share only 'holiday', which gives a lower score of 1/12 ≈ 0.083.
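To see where these scores come from, the following sketch returns the shared words and the union size alongside the score. The helper name `jaccard_breakdown` is hypothetical and introduced only for illustration; it uses the same whitespace tokenization as the function above.

```python
def jaccard_breakdown(sentence1, sentence2):
    """Return the sorted shared words, the union size, and the Jaccard score."""
    set1 = set(sentence1.split())
    set2 = set(sentence2.split())
    shared = set1 & set2
    union = set1 | set2
    # Guard against an empty union (both sentences empty)
    score = len(shared) / len(union) if union else 0.0
    return sorted(shared), len(union), score

s1 = "Hawaii is a wonderful place for holiday"
s2 = "Peter's favorite place to spend his holiday is Hawaii"
s3 = "Anna enjoys baking during her holiday"

print(jaccard_breakdown(s1, s2))
print(jaccard_breakdown(s1, s3))
```

Note that this tokenization is case-sensitive and keeps punctuation attached to words, so "Holiday" and "holiday." would not count as matches for "holiday".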

## 2. Euclidean Distance


```python
import torch

A = torch.randint(1, 7, (1, 4), dtype=torch.float32)
B = torch.randint(1, 7, (1, 4), dtype=torch.float32)
print(A, B)
```

**Definition:** For vectors $A$ and $B$, the Euclidean distance or L2 distance between them is defined as:
$$d(A, B) = \|A-B\|_2 = \sqrt{\sum_{i=1}^n (A_i-B_i)^2}$$

The value of $d(A, B)$ falls in the range $[0, +\infty)$. Since this measures distance, the closer the value is to 0, the more similar the two vectors are; the larger the value, the more dissimilar they are.

You can calculate the Euclidean distance step by step or call *torch.cdist()* directly:


```python
dist = torch.sqrt(torch.sum(torch.pow(torch.subtract(A, B), 2), dim=-1))
dist.item()
```


```python
torch.cdist(A, B, p=2).item()
```
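*torch.cdist()* is not limited to a single pair: given two batches of row vectors, it returns the full matrix of pairwise distances. A minimal sketch, using small hand-picked tensors `X` and `Y` that are assumptions for illustration and not part of the example above:

```python
import torch

# Hypothetical data: 3 vectors in X and 2 vectors in Y, all of dimension 4
X = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [4.0, 3.0, 2.0, 1.0]])
Y = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [1.0, 1.0, 1.0, 1.0]])

# pairwise[i, j] is the Euclidean distance between X[i] and Y[j]
pairwise = torch.cdist(X, Y, p=2)

# The same result from the definition, using broadcasting:
# (3, 1, 4) - (1, 2, 4) -> (3, 2, 4), then sum over the last dimension
manual = torch.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(dim=-1))

print(pairwise.shape)
print(pairwise)
```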

### (Maximum inner-product search)

## 3. Cosine Similarity

For vectors $A$ and $B$, their cosine similarity is defined as:
$$\cos(\theta)=\frac{A\cdot B}{\|A\|\|B\|}$$

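The value of $\cos(\theta)$ falls in the range $[-1, 1]$. Unlike Euclidean distance, a larger value means more similar: a value close to +1 means the vectors point in nearly the same direction, a value near 0 means they are nearly orthogonal, and a value close to -1 means they point in nearly opposite directions.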
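The three reference points of this range can be checked with a small sketch. The vectors below are hand-picked assumptions for illustration; `orth` is chosen so that its dot product with `v` is exactly zero.

```python
import torch
import torch.nn.functional as F

v = torch.tensor([[1.0, 2.0, 3.0]])
same = 2 * v                              # same direction, different length
orth = torch.tensor([[3.0, 0.0, -1.0]])   # v . orth = 3 + 0 - 3 = 0
opp = -v                                  # opposite direction

cos_same = F.cosine_similarity(v, same).item()
cos_orth = F.cosine_similarity(v, orth).item()
cos_opp = F.cosine_similarity(v, opp).item()

print(cos_same, cos_orth, cos_opp)
```

Up to floating-point rounding, the three values should be 1, 0, and -1.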

### 3.1 Naive Approach

The naive approach is just expanding the expression:
$$\frac{A\cdot B}{\|A\|\|B\|}=\frac{\sum_{i=1}^{n}A_i B_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\cdot\sqrt{\sum_{i=1}^{n}B_i^2}}$$


```python
# Compute the dot product of A and B
dot_prod = sum(a*b for a, b in zip(A[0], B[0]))

# Compute the magnitude of A and B
A_norm = torch.sqrt(sum(a*a for a in A[0]))
B_norm = torch.sqrt(sum(b*b for b in B[0]))
```


```python
cos_1 = dot_prod / (A_norm * B_norm)
print(cos_1.item())
```

### 3.2 PyTorch Implementation

The naive approach has a few issues:
- Precision can be lost when computing the numerator and the denominator separately
- That loss of precision may push the computed cosine similarity slightly above 1.0

To avoid this, PyTorch normalizes each vector first and then takes the dot product:

$$
\frac{A\cdot B}{\|A\|\|B\|}=\frac{A}{\|A\|}\cdot\frac{B}{\|B\|}
$$


```python
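# keepdim=True keeps the norms as column vectors, so the division also works for batches of rows
res = torch.mm(A / A.norm(dim=1, keepdim=True), (B / B.norm(dim=1, keepdim=True)).T)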
print(res.item())
```

### 3.3 PyTorch Function Call

In practice, the most convenient way is to call *cosine_similarity()* from torch.nn.functional directly:


```python
import torch.nn.functional as F

F.cosine_similarity(A, B).item()
```
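Another reason to prefer the built-in function is its `eps` argument, which guards against division by zero. The sketch below compares it with the expanded formula on a zero vector; the tensors `A_demo` and `Z` are assumptions chosen to trigger the edge case.

```python
import torch
import torch.nn.functional as F

A_demo = torch.tensor([[1.0, 2.0, 3.0]])
Z = torch.zeros(1, 3)   # a zero vector has norm 0

# Expanded formula: 0 / (norm * 0) = 0 / 0, which evaluates to nan
naive = (A_demo * Z).sum(dim=1) / (A_demo.norm(dim=1) * Z.norm(dim=1))

# Built-in function: eps bounds the denominator away from zero
safe = F.cosine_similarity(A_demo, Z, dim=1, eps=1e-8)

print(naive.item(), safe.item())
```

Whether 0 is a meaningful similarity for a zero vector depends on the application, but it avoids propagating `nan` through later computations.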

## 4. Inner Product/Dot Product

Coordinate definition:
$$A\cdot B = \sum_{i=1}^{n}A_i B_i$$

Geometric definition:
$$A\cdot B = \|A\|\|B\|\cos(\theta)$$


```python
dot_prod = A @ B.T
dot_prod.item()
```

### Relationship with Cosine similarity

When measuring the similarity between two vectors, dot product and cosine similarity are closely related. Cosine similarity depends only on the angle between the vectors (because it is normalized by the product of the two vectors' magnitudes), while dot product takes both magnitude and angle into account. So the two metrics are preferred in different use cases.

The BGE series models already normalize their output embedding vectors to have a magnitude of 1. Thus dot product and cosine similarity give the same result for these embeddings.
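The difference is easy to see by rescaling one of the vectors. In this sketch (with hand-picked vectors `P` and `Q` as assumptions), scaling `Q` by 3 triples the dot product but leaves the cosine similarity unchanged:

```python
import torch
import torch.nn.functional as F

P = torch.tensor([[1.0, 2.0, 2.0]])
Q = torch.tensor([[2.0, 0.0, 1.0]])
Q_scaled = 3 * Q   # same direction as Q, three times the magnitude

dot = (P @ Q.T).item()
dot_scaled = (P @ Q_scaled.T).item()

cos = F.cosine_similarity(P, Q).item()
cos_scaled = F.cosine_similarity(P, Q_scaled).item()

print(dot, dot_scaled)   # dot product grows with the magnitude
print(cos, cos_scaled)   # cosine similarity does not
```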


```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)
```


```python
sentence = "I am very interested in natural language processing"
embedding = torch.tensor(model.encode(sentence))
torch.norm(embedding).item()
```
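The equivalence can be checked without downloading a model. The sketch below uses random vectors normalized with `F.normalize` as a stand-in for BGE embeddings (an assumption for illustration). It also checks a related identity for unit vectors, $\|u-v\|_2^2 = 2 - 2\,u\cdot v$, which implies that Euclidean distance and cosine similarity rank unit vectors in the same order.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for two embeddings: random vectors scaled to unit length
unit = F.normalize(torch.randn(2, 8), p=2, dim=1)
u, v = unit[0:1], unit[1:2]

dot = (u @ v.T).item()
cos = F.cosine_similarity(u, v).item()
sq_dist = torch.cdist(u, v, p=2).pow(2).item()

print(dot, cos)                 # equal up to rounding for unit vectors
print(sq_dist, 2 - 2 * cos)     # squared distance matches 2 - 2*cos
```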

## 5. Examples

Now that we've learned how the different similarity measures work, let's look at a real example.


```python
sentence_1 = "I will watch a show tonight"
sentence_2 = "I will show you my watch tonight"
sentence_3 = "I'm going to enjoy a performance this evening"
```

It's clear to us that in sentence 1, 'watch' is a verb and 'show' is a noun.

But in sentence 2, 'show' is a verb and 'watch' is a noun, which gives the two sentences different meanings.

Sentence 3, on the other hand, has a meaning very similar to that of sentence 1.

Now let's see what the different similarity metrics tell us about the relationship between these sentences.


```python
print(jaccard_similarity(sentence_1, sentence_2))
print(jaccard_similarity(sentence_1, sentence_3))
```

The results suggest that sentences 1 and 2 (0.625) are far more similar than sentences 1 and 3 (0.077), which is the opposite of the conclusion we reached above.

Next, let's get the embeddings of these sentences.


```python
embeddings = torch.from_numpy(model.encode([sentence_1, sentence_2, sentence_3]))
embedding_1 = embeddings[0].view(1, -1)
embedding_2 = embeddings[1].view(1, -1)
embedding_3 = embeddings[2].view(1, -1)

print(embedding_1.shape)
```

Then let's compute the Euclidean distance:


```python
euc_dist1_2 = torch.cdist(embedding_1, embedding_2, p=2).item()
euc_dist1_3 = torch.cdist(embedding_1, embedding_3, p=2).item()
print(euc_dist1_2)
print(euc_dist1_3)
```

Then, let's see the cosine similarity:


```python
# These are similarities, not distances: higher means more similar
cos_sim1_2 = F.cosine_similarity(embedding_1, embedding_2).item()
cos_sim1_3 = F.cosine_similarity(embedding_1, embedding_3).item()
print(cos_sim1_2)
print(cos_sim1_3)
```

Using embeddings, we get the correct result, unlike with Jaccard similarity: sentences 1 and 3 are more similar than sentences 1 and 2, whether we use Euclidean distance or cosine similarity as the metric.
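When there are more than a few sentences, the pairwise comparisons can be computed in one step with matrix operations. The sketch below uses random unit vectors as a stand-in for the output of `model.encode(...)`, so it runs without the model; the dimension 16 and the random values are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for the sentence embeddings: 3 random vectors, L2-normalized
embeddings = F.normalize(torch.randn(3, 16), p=2, dim=1)

# sim_matrix[i, j] is the cosine similarity between sentence i and sentence j
# (for unit vectors, the dot product equals the cosine similarity)
sim_matrix = embeddings @ embeddings.T

# dist_matrix[i, j] is the Euclidean distance between sentence i and sentence j
dist_matrix = torch.cdist(embeddings, embeddings, p=2)

print(sim_matrix)
print(dist_matrix)
```

With real embeddings, row 0 of each matrix would contain the comparisons of sentence 1 against all three sentences.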
