Summary: This tutorial provides implementations of five fundamental evaluation metrics for embedding and information retrieval models: Recall, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (nDCG), Precision, and Mean Average Precision (MAP). It helps with tasks involving ranking evaluation, similarity search assessment, and recommendation system metrics. The tutorial includes numpy-based implementations, mathematical formulas, and handles edge cases. Key features include multi-cutoff evaluation support, vectorized calculations for efficiency, proper normalization techniques, and integration with scikit-learn for nDCG computation. The code is particularly useful for evaluating search results, embeddings quality, and ranking algorithms.

# Evaluation Metrics

In this tutorial, we'll cover a list of metrics that are widely used for evaluating embedding model's performance.

## 0. Preparation


```python
%pip install numpy scikit-learn
```

Suppose we have a corpus with document ids from 0 - 30. 
- `ground_truth` contains the actual relevant document ids to each query.
- `results` contains the search results of each query by some retrieval system.


```python
import numpy as np

ground_truth = [
    [11,  1,  7, 17, 21],
    [ 4, 16,  1],
    [26, 10, 22,  8],
]

results = [
    [11,  1, 17,  7, 21,  8,  0, 28,  9, 20],
    [16,  1,  6, 18,  3,  4, 25, 19,  8, 14],
    [24, 10, 26,  2,  8, 28,  4, 23, 13, 21],
]
```


```python
np.intersect1d(ground_truth, results)
```


```python
np.isin(ground_truth, results).astype(int)
```

And we are interested in the following cutoffs:


```python
cutoffs = [1, 5, 10]
```

In this tutorial, we will use the above small example to show how different metrics evaluate the retrieval system's quality.

## 1. Recall

Recall represents the model's capability of correctly predicting positive instances from all the actual positive samples in the dataset.

$$\textbf{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

to write it in the form of information retrieval, which is the ratio of relevant documents retrieved to the total number of relevant documents in the corpus. In practice, we usually make the denominator to be the minimum between the current cutoff (usually 1, 5, 10, 100, etc) and the total number of relevant documents in the corpus:

$$\textbf{Recall}=\frac{|\text{\{Relevant docs\}}\cap\text{\{Retrieved docs\}}|}{\text{min}(|\text{\{Retrieved docs\}}|, |\text{\{Relevant docs\}}|)}$$


```python
def calc_recall(preds, truths, cutoffs):
    recalls = np.zeros(len(cutoffs))
    for text, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            hits = np.intersect1d(truth, text[:c])
            recalls[i] += len(hits) / max(min(c, len(truth)), 1)
    recalls /= len(preds)
    return recalls
```


```python
recalls = calc_recall(results, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"recall@{c}: {recalls[i]}")
```

## 2. MRR

Mean Reciprocal Rank ([MRR](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.

$$MRR=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$

where 
- $|Q|$ is the total number of queries.
- $rank_i$ is the rank position of the first relevant document of the i-th query.


```python
def calc_MRR(preds, truth, cutoffs):
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, t in zip(preds, truth):
        for i, c in enumerate(cutoffs):
            for j, p in enumerate(pred):
                if j < c and p in t:
                    mrr[i] += 1/(j+1)
                    break
    mrr = [k/len(preds) for k in mrr]
    return mrr
```


```python
mrr = calc_MRR(results, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"MRR@{c}: {mrr[i]}")
```

## 3. nDCG

Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the position of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:

1. Discounted cumulative gain (DCG) measures the ranking quality in retrieval tasks.

$$DCG_p=\sum_{i=1}^p\frac{2^{rel_i}-1}{\log_2(i+1)}$$

2. Normalized by ideal DCG to make it comparable across queries.
$$nDCG_p=\frac{DCG_p}{IDCG_p}$$
where $IDCG$ is the maximum possible DCG for a given set of documents, assuming they are perfectly ranked in order of relevance.


```python
pred_hard_encodings = []
for pred, label in zip(results, ground_truth):
    pred_hard_encoding = list(np.isin(pred, label).astype(int))
    pred_hard_encodings.append(pred_hard_encoding)
```


```python
from sklearn.metrics import ndcg_score

for i, c in enumerate(cutoffs):
    nDCG = ndcg_score(pred_hard_encodings, results, k=c)
    print(f"nDCG@{c}: {nDCG}")
```

## 4. Precision

Precision 

$$\textbf{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positive}}$$

in information retrieval, it's the ratio of relevant documents retrieved to the totoal number of documents retrieved:

$$\textbf{Recall}=\frac{|\text{\{Relevant docs\}}\cap\text{\{Retrieved docs\}}|}{|\text{\{Retrieved docs\}}|}$$


```python
def calc_precision(preds, truths, cutoffs):
    prec = np.zeros(len(cutoffs))
    for text, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            hits = np.intersect1d(truth, text[:c])
            prec[i] += len(hits) / c
    prec /= len(preds)
    return prec
```


```python
precisions = calc_precision(results, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"precision@{c}: {precisions[i]}")
```

## 5. MAP

Mean Average Precision (MAP) measures the effectiveness of a system at returning relevant documents across multiple queries. 

First, Average Precision (AP) evals how well relevant documents are ranked within the retrieved documents. It's computed by averaging the precision values for each position of relevant document in the ranking of all the retrieved documents:

$$\textbf{AP}=\frac{\sum_{k=1}^{M}\text{Relevance}(k) \times \text{Precision}(k)}{|\{\text{Relevant Docs}\}|}$$

where 
- $M$ is the total number of documents retrieved.
- $\text{Relevance}(k)$ is a binary value, indicating whether document at position $k$ is relevant (=1) or not (=0).
- $\text{Precision}(k)$ is the precision when considering only top $k$ retrieved items.

Then calculate the average AP across multiple queries to get the MAP:

$$\textbf{MAP}=\frac{1}{N}\sum_{i=1}^{N}\text{AP}_i$$

where
- $N$ is the total number of queries.
- $\text{AP}_i$ is the average precision of the $i^{th}$ query.


```python
def calc_AP(encoding):
    rel = 0
    precs = 0.0
    for k, hit in enumerate(encoding, start=1):
        if hit == 1:
            rel += 1
            precs += rel/k

    return 0 if rel == 0 else precs/rel
```


```python
def calc_MAP(encodings, cutoffs):
    res = []
    for c in cutoffs:
        ap_sum = 0.0
        for encoding in encodings:
            ap_sum += calc_AP(encoding[:c])
        res.append(ap_sum/len(encodings))
        
    return res
```


```python
maps = calc_MAP(pred_hard_encodings, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"MAP@{c}: {maps[i]}")
```
