# Condensed: Similarity

Summary: This tutorial provides implementations of four key similarity metrics (Jaccard, Euclidean, Cosine, and Dot Product) with a focus on PyTorch-based implementations. It covers practical code examples for calculating text and embedding similarities, including optimized implementations using torch.nn.functional and matrix operations. The tutorial helps with tasks like comparing text similarity, working with embedding vectors, and choosing appropriate similarity metrics. Key features include numerically stable implementations, GPU-compatible code, best practices for each metric, and integration with embedding models like BGE, with special attention to normalization considerations and performance optimization.

*This is a condensed version that preserves essential implementation details and context.*

Here's the condensed tutorial focusing on key implementation details and concepts:

# Similarity Metrics Implementation Guide

## 1. Jaccard Similarity
```python
def jaccard_similarity(sentence1, sentence2):
    set1 = set(sentence1.split(" "))
    set2 = set(sentence2.split(" "))
    return len(set1.intersection(set2))/len(set1.union(set2))
```
- Range: [0, 1]
- Based on word overlap
- Simple but limited semantic understanding

## 2. Euclidean Distance
```python
# Method 1: Manual calculation
dist = torch.sqrt(torch.sum(torch.pow(torch.subtract(A, B), 2), dim=-1))

# Method 2: Using torch.cdist
dist = torch.cdist(A, B, p=2)
```
- Range: [0, ∞)
- Lower values indicate higher similarity
- Formula: $d(A, B) = \sqrt{\sum_{i=1}^n (A_i-B_i)^2}$

## 3. Cosine Similarity

### Best Practice Implementation (PyTorch)
```python
# Recommended: Using torch.nn.functional
import torch.nn.functional as F
similarity = F.cosine_similarity(A, B)

# Alternative: Manual normalization
res = torch.mm(A / A.norm(dim=1), B.T / B.norm(dim=1))
```

**Important Notes:**
- Range: [-1, 1]
- Higher values indicate higher similarity
- Formula: $\cos(\theta)=\frac{A\cdot B}{\|A\|\|B\|}$
- Prefer PyTorch's implementation over naive approach for numerical stability

## 4. Inner Product/Dot Product
```python
dot_prod = A @ B.T
```

**Key Considerations:**
- For normalized vectors (e.g., BGE embeddings), dot product equals cosine similarity
- Takes both magnitude and angle into account
- Formula: $A\cdot B = \|A\|\|B\|\cos(\theta)$

## Implementation Example with Embeddings
```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# Get embeddings
embeddings = torch.from_numpy(model.encode([sentence_1, sentence_2, sentence_3]))

# Calculate similarity
cos_similarity = F.cosine_similarity(embedding_1, embedding_2)
euc_distance = torch.cdist(embedding_1, embedding_2, p=2)
```

**Best Practices:**
1. Use embedding-based similarity over Jaccard for semantic understanding
2. Choose cosine similarity for normalized vectors
3. Consider using GPU acceleration for large-scale computations
4. Verify vector normalization when comparing dot product vs cosine similarity