# Condensed: Evaluation Metrics

Summary: This tutorial implements five standard evaluation metrics for embedding and information retrieval models: Recall, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (nDCG), Precision, and Mean Average Precision (MAP). These metrics apply to ranking evaluation, similarity search assessment, and recommendation systems. Each metric is given as short NumPy-based Python code together with its formula and is evaluated at several cutoffs (for example @1, @5, and @10); nDCG is computed with scikit-learn's `ndcg_score`. The code is useful for evaluating search results, embedding quality, and ranking algorithms.

*This is a condensed version that preserves essential implementation details and context.*

# Evaluation Metrics for Embedding Models

## Key Metrics Implementation

### Setup
```python
import numpy as np
from sklearn.metrics import ndcg_score

# Example data structure
ground_truth = [[11, 1, 7, 17, 21], [4, 16, 1], [26, 10, 22, 8]]
results = [[11, 1, 17, 7, 21, 8, 0, 28, 9, 20], 
           [16, 1, 6, 18, 3, 4, 25, 19, 8, 14],
           [24, 10, 26, 2, 8, 28, 4, 23, 13, 21]]
cutoffs = [1, 5, 10]
```

### 1. Recall
```python
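def calc_recall(preds, truths, cutoffs):
    """Mean capped recall at each cutoff: hits / min(cutoff, number of relevant)."""
    recalls = np.zeros(len(cutoffs))
    for pred, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # Relevant IDs that appear among the top-c predictions
            hits = np.intersect1d(truth, pred[:c])
            # max(..., 1) guards against an empty ground-truth list
            recalls[i] += len(hits) / max(min(c, len(truth)), 1)
    return recalls / len(preds)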
```
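
The function above caps the denominator at the cutoff, so a query with five relevant documents can still reach a recall of 1.0 at `c = 1`. For comparison, here is a minimal sketch of the uncapped textbook form, which divides by the total number of relevant documents. The name `recall_at_k` and the choice to score an unlabeled query as 0 are assumptions for illustration, not part of the original tutorial.

```python
def recall_at_k(pred, truth, k):
    """Uncapped recall@k: fraction of all relevant IDs found in the top k."""
    relevant = set(truth)
    if not relevant:
        return 0.0  # assumption: a query with no relevant documents scores 0
    hits = relevant.intersection(pred[:k])
    return len(hits) / len(relevant)


# First query from the setup data: 5 relevant documents
truth = [11, 1, 7, 17, 21]
pred = [11, 1, 17, 7, 21, 8, 0, 28, 9, 20]

print(recall_at_k(pred, truth, 1))  # 0.2 (1 of 5 relevant found)
print(recall_at_k(pred, truth, 5))  # 1.0 (all 5 found)
```

With the capped version, the same query scores 1.0 at both cutoffs, so the two definitions should not be mixed when comparing results.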

### 2. Mean Reciprocal Rank (MRR)
```python
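def calc_MRR(preds, truths, cutoffs):
    """Mean reciprocal rank of the first relevant result within each cutoff."""
    mrr = [0.0] * len(cutoffs)
    for pred, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # Only the top-c predictions count; stop at the first relevant one
            for j, p in enumerate(pred[:c]):
                if p in truth:
                    mrr[i] += 1 / (j + 1)
                    break
    return [total / len(preds) for total in mrr]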
```
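
To make the per-query contribution concrete, the sketch below isolates the step that finds the rank of the first relevant result. The helper name `first_relevant_rank` is hypothetical; it mirrors the inner loop of the MRR function rather than replacing it.

```python
def first_relevant_rank(pred, truth, k):
    """Return the 1-based rank of the first relevant ID in the top k, or None."""
    relevant = set(truth)
    for rank, doc_id in enumerate(pred[:k], start=1):
        if doc_id in relevant:
            return rank
    return None


# Third query from the setup data: the first relevant ID (10) sits at rank 2
truth = [26, 10, 22, 8]
pred = [24, 10, 26, 2, 8, 28, 4, 23, 13, 21]

for k in (1, 5, 10):
    rank = first_relevant_rank(pred, truth, k)
    rr = 0.0 if rank is None else 1 / rank
    print(f"RR@{k}: {rr}")
# RR@1: 0.0
# RR@5: 0.5
# RR@10: 0.5
```

A query with no relevant result inside the cutoff contributes 0 to the mean, which is why MRR@1 for this query is 0.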

### 3. Normalized Discounted Cumulative Gain (nDCG)
```python
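# Binary relevance of each retrieved item (1 if it is in the ground truth)
pred_hard_encodings = [list(np.isin(pred, label).astype(int)) 
                      for pred, label in zip(results, ground_truth)]

# ndcg_score expects (y_true, y_score); document IDs are not scores, so derive
# descending scores from rank position (the first result gets the highest score)
rank_scores = [list(range(len(pred), 0, -1)) for pred in results]

# Calculate nDCG at each cutoff using sklearn
for c in cutoffs:
    nDCG = ndcg_score(pred_hard_encodings, rank_scores, k=c)
    print(f"nDCG@{c}: {nDCG}")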
```
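
The nDCG ratio can also be written out directly in NumPy, which makes the discount and the ideal ranking explicit. The following is a minimal sketch under stated assumptions: relevance is binary, the discount is 1 / log2(rank + 1), and the ideal ranking is built from all relevant documents, including ones that were never retrieved. The names `dcg` and `ndcg_at_k` are hypothetical.

```python
import numpy as np


def dcg(relevance):
    """DCG of a relevance list, using the common 1 / log2(rank + 1) discount."""
    relevance = np.asarray(relevance, dtype=float)
    ranks = np.arange(1, len(relevance) + 1)
    return float(np.sum(relevance / np.log2(ranks + 1)))


def ndcg_at_k(pred, truth, k):
    """Binary-relevance nDCG@k with the ideal ranking built from all relevant IDs."""
    gains = np.isin(pred[:k], truth).astype(float)
    # Assumption: the ideal list places min(k, |truth|) relevant items first
    ideal = np.ones(min(k, len(truth)))
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0


print(ndcg_at_k([11, 1, 17, 7, 21], [11, 1, 7, 17, 21], 5))  # 1.0 (perfect ranking)
```

Note that `ndcg_score` builds its ideal ranking only from the relevance values passed in `y_true`, that is, from the retrieved items. When some relevant documents are missing from the result list, this sketch can therefore return a lower value than scikit-learn does.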

### 4. Precision
```python
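def calc_precision(preds, truths, cutoffs):
    """Mean precision at each cutoff: hits in the top c divided by c."""
    prec = np.zeros(len(cutoffs))
    for pred, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # Relevant IDs that appear among the top-c predictions
            hits = np.intersect1d(truth, pred[:c])
            prec[i] += len(hits) / c
    return prec / len(preds)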
```

### 5. Mean Average Precision (MAP)
```python
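def calc_AP(encoding):
    """Average precision of one binary relevance list (1 = relevant hit)."""
    rel = 0
    precs = 0.0
    for k, hit in enumerate(encoding, start=1):
        if hit == 1:
            rel += 1
            precs += rel / k  # precision at the rank of this hit
    # Normalized by the number of hits found, not by all relevant documents
    return 0.0 if rel == 0 else precs / rel

def calc_MAP(encodings, cutoffs):
    """Mean of AP over all queries, with each encoding truncated at each cutoff."""
    return [sum(calc_AP(enc[:c]) for enc in encodings) / len(encodings)
            for c in cutoffs]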
```
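
`calc_AP` divides by the number of relevant items found inside the cutoff, so relevant documents that were never retrieved do not lower the score. A common alternative divides by the total number of relevant documents instead. The sketch below illustrates that variant; the name `average_precision_at_k` and the zero score for unlabeled queries are assumptions for illustration.

```python
def average_precision_at_k(pred, truth, k):
    """AP@k normalized by the total number of relevant IDs (a common variant)."""
    relevant = set(truth)
    if not relevant:
        return 0.0  # assumption: no relevant documents means a score of 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(pred[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / len(relevant)


# Third query from the setup data: hits at ranks 2, 3 and 5; ID 22 is never retrieved
truth = [26, 10, 22, 8]
pred = [24, 10, 26, 2, 8, 28, 4, 23, 13, 21]
print(round(average_precision_at_k(pred, truth, 10), 4))  # 0.4417
```

For the same query, `calc_AP` on the binary encoding gives about 0.589, because it divides the same precision sum by 3 hits rather than 4 relevant documents.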

## Key Formulas

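- Recall@k = |Relevant ∩ Retrieved| / min(|Retrieved|, |Relevant|), where Retrieved is the top-k result list, so |Retrieved| = k. This is a capped variant; textbook recall divides by |Relevant| alone.
- MRR = (1/|Q|) ∑(1/rank_i), where rank_i is the rank of the first relevant result for query i; a query with no relevant result within the cutoff contributes 0.
- nDCG_p = DCG_p/IDCG_p, where DCG_p = ∑_{i=1..p} rel_i / log2(i + 1) and IDCG_p is the DCG_p of the ideal ordering.
- Precision@k = |Relevant ∩ Retrieved| / |Retrieved|, with |Retrieved| = k.
- MAP = (1/N) ∑AP_i, where AP_i is the average precision of query i and N is the number of queries.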

## Best Practices
1. Always evaluate at multiple cutoff points (e.g., @1, @5, @10)
2. Use multiple metrics for comprehensive evaluation
3. Consider both precision and recall-based metrics
4. Normalize scores when comparing across different queries
5. Handle edge cases (empty results, no relevant documents)
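
As one possible way to combine several of these practices, the sketch below reports precision and capped recall at multiple cutoffs while tolerating empty result lists and unlabeled queries. The function name `evaluate` and the policy of skipping queries without relevant documents are assumptions, not part of the original tutorial; other policies (such as scoring those queries as 0) are equally defensible.

```python
def evaluate(preds, truths, cutoffs):
    """Return {"metric@k": mean} for precision and recall, skipping queries without labels."""
    # Assumption: queries with no relevant documents are excluded from the averages
    pairs = [(p, set(t)) for p, t in zip(preds, truths) if len(t) > 0]
    report = {}
    for k in cutoffs:
        precisions, recalls = [], []
        for pred, relevant in pairs:
            hits = len(relevant.intersection(pred[:k]))
            precisions.append(hits / k)
            recalls.append(hits / min(k, len(relevant)))
        n = max(len(pairs), 1)  # avoid dividing by zero when every query was skipped
        report[f"precision@{k}"] = sum(precisions) / n
        report[f"recall@{k}"] = sum(recalls) / n
    return report


preds = [[11, 1, 17], [], [24, 10, 26]]
truths = [[11, 1, 7], [4, 16], []]  # second query: empty result list; third: no labels
report = evaluate(preds, truths, cutoffs=[1, 3])
print(report)
```

Here the third query is dropped from the averages, while the second query stays in and contributes 0 to every metric because nothing was retrieved for it.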