Summary: This tutorial demonstrates how to build and evaluate a text embedding and retrieval system using FlagEmbedding and FAISS. It covers implementation techniques for generating embeddings with the BGE model, creating FAISS indexes for efficient similarity search, and evaluating retrieval performance using multiple metrics (Recall, MRR, nDCG). The tutorial helps with tasks like vector similarity search, batch processing of queries, and implementing evaluation metrics for information retrieval systems. Key features include FP16 optimization, FAISS index management, batch processing for large datasets, and comprehensive evaluation metric implementations with configurable cut-off points.

# Evaluation

Evaluation is a crucial part of any machine learning task. In this notebook, we will walk through the full pipeline of evaluating the performance of an embedding model on [MS MARCO](https://microsoft.github.io/msmarco/), and use three metrics (Recall, MRR, and nDCG) to report its performance.

## Step 0: Setup

Install the dependencies in the environment.


```python
%pip install -U FlagEmbedding faiss-cpu
```

## Step 1: Load Dataset

First, download the queries and the MS MARCO dataset from Hugging Face Datasets:


```python
from datasets import load_dataset
import numpy as np

data = load_dataset("namespace-Pt/msmarco", split="dev")
```

To keep the running time reasonable, we will use a truncated dataset in this tutorial: `queries` contains the first 100 queries from the dataset, and `corpus` is formed by the positive passages of the first 5,000 queries.


```python
queries = np.array(data[:100]["query"])
corpus = sum(data[:5000]["positive"], [])
```
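
Each entry in the dataset contains a query and its list of positive (relevant) passages. If you want to get a feel for the data, you can peek at the first entry; this is a quick sanity check and not required for the rest of the tutorial.


```python
# inspect one row: a query string and its list of relevant passages
sample = data[0]
print("query:      ", sample["query"])
print("positives:  ", len(sample["positive"]), "passage(s)")
print("corpus size:", len(corpus))
```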

If you have a GPU and would like to try the full evaluation of MS MARCO, uncomment and run the following cell:


```python
# data = load_dataset("namespace-Pt/msmarco", split="dev")
# queries = np.array(data["query"])

# corpus = load_dataset("namespace-Pt/msmarco-corpus", split="train")
```

## Step 2: Embedding

Choose the embedding model we would like to evaluate, and encode the corpus into embeddings.


```python
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the corpus
corpus_embeddings = model.encode(corpus)

print("shape of the corpus embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)
```
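
If encoding runs out of GPU memory on your machine, `FlagModel.encode()` also accepts `batch_size` and `max_length` arguments; the exact defaults depend on your installed FlagEmbedding version, so treat the values below as a sketch and uncomment only if needed.


```python
# optionally encode with a smaller batch size and shorter truncation length
# to reduce memory usage (the values here are illustrative)
# corpus_embeddings = model.encode(corpus, batch_size=128, max_length=512)
```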

## Step 3: Indexing

We use the `index_factory()` function to create the Faiss index we want:

- The first argument `dim` is the dimension of the vector space, which is 768 if you're using bge-base-en-v1.5.

- The second argument `'Flat'` makes the index do an exhaustive search.

- The third argument `faiss.METRIC_INNER_PRODUCT` tells the index to use inner product as the distance metric.


```python
import faiss

# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
corpus_embeddings = corpus_embeddings.astype(np.float32)
# train and add the embeddings to the index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")
```
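
Optionally, if you installed `faiss-gpu` instead of `faiss-cpu`, you can move the flat index to GPU to speed up the search step. This is a sketch and not required for this tutorial.


```python
# move the index to all available GPUs (requires the faiss-gpu package)
# index = faiss.index_cpu_to_all_gpus(index)
# convert back with faiss.index_gpu_to_cpu(index) before saving it to disk
```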

Since the embedding process is time consuming, it's a good idea to save the index for reproducibility or other experiments.

Uncomment the following lines to save the index.


```python
# path = "./index.bin"
# faiss.write_index(index, path)
```

If you already have a stored index in your local directory, you can load it with:


```python
# index = faiss.read_index("./index.bin")
```

## Step 4: Retrieval

Encode all the queries into embeddings, and collect their corresponding ground truth answers for evaluation.


```python
query_embeddings = model.encode_queries(queries)
# ground truth passages for each query; the zip() calls in the metric
# functions below truncate this list to the queries we actually evaluate
ground_truths = [d["positive"] for d in data]
corpus = np.asarray(corpus)
```

Use the Faiss index to search for the top $k$ answers for each query.


```python
from tqdm import tqdm

res_scores, res_ids, res_text = [], [], []
query_size = len(query_embeddings)
batch_size = 256
# the cutoffs we will use during evaluation; set k to the maximum cutoff
cut_offs = [1, 10]
k = max(cut_offs)

for i in tqdm(range(0, query_size, batch_size), desc="Searching"):
    q_embedding = query_embeddings[i: min(i+batch_size, query_size)].astype(np.float32)
    # search the top k answers for each of the queries
    score, idx = index.search(q_embedding, k=k)
    res_scores += list(score)
    res_ids += list(idx)
    res_text += list(corpus[idx])
```
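
Before computing the metrics, it can be helpful to sanity-check the retrieval by looking at the top hit for the first query:


```python
# peek at the best match of the first query
print("query:  ", queries[0])
print("top hit:", res_text[0][0])
print("score:  ", res_scores[0][0])
```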

## Step 5: Evaluate

### 5.1 Recall

Recall measures the model's ability to retrieve the relevant (positive) instances out of all the actual positives in the dataset.

$$\textbf{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

Recall is useful when the cost of false negatives is high. In other words, we are trying to find all objects of the positive class, even if this results in some false positives. This attribute makes recall a useful metric for text retrieval tasks.


```python
def calc_recall(preds, truths, cutoffs):
    recalls = np.zeros(len(cutoffs))
    for pred, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # relevant passages found within the top c results
            hits = np.intersect1d(truth, pred[:c])
            # cap the denominator at c so that recall@c can reach 1
            # even when a query has more than c relevant passages
            recalls[i] += len(hits) / max(min(c, len(truth)), 1)
    recalls /= len(preds)
    return recalls

recalls = calc_recall(res_text, ground_truths, cut_offs)
for i, c in enumerate(cut_offs):
    print(f"recall@{c}: {recalls[i]}")
```

### 5.2 MRR

Mean Reciprocal Rank ([MRR](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.

$$MRR=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$

where 
- $|Q|$ is the total number of queries.
- $rank_i$ is the rank position of the first relevant document of the i-th query.


```python
def MRR(preds, truth, cutoffs):
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, t in zip(preds, truth):
        for i, c in enumerate(cutoffs):
            for j, p in enumerate(pred):
                # only look within the cutoff, and stop at the first relevant result
                if j < c and p in t:
                    # add the reciprocal rank of that first relevant result
                    mrr[i] += 1/(j+1)
                    break
    mrr = [k/len(preds) for k in mrr]
    return mrr
```


```python
mrr = MRR(res_text, ground_truths, cut_offs)
for i, c in enumerate(cut_offs):
    print(f"MRR@{c}: {mrr[i]}")
```

### 5.3 nDCG

Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the positions of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:

1. Discounted Cumulative Gain (DCG) measures the ranking quality in retrieval tasks:

$$DCG_p=\sum_{i=1}^p\frac{2^{rel_i}-1}{\log_2(i+1)}$$

2. Normalize DCG by the ideal DCG (IDCG) to make scores comparable across queries:
$$nDCG_p=\frac{DCG_p}{IDCG_p}$$
where $IDCG_p$ is the maximum possible DCG for the given set of documents, assuming they are perfectly ranked in order of relevance.
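
For intuition, here is a minimal sketch of DCG/nDCG for a single ranked list of binary relevance labels. The `dcg` and `ndcg` helpers below are only for illustration and are not used in the evaluation.


```python
import numpy as np

def dcg(relevance, k):
    """DCG@k for one ranked list of relevance scores."""
    rel = np.asarray(relevance, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i+1) for i = 1..k
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg(relevance, k):
    """nDCG@k: DCG normalized by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevance, reverse=True), k)
    return dcg(relevance, k) / ideal if ideal > 0 else 0.0

# e.g. the 1st and 3rd retrieved passages are relevant
print(ndcg([1, 0, 1, 0], k=4))
```

The evaluation itself first converts each retrieved list into binary relevance labels and then calls scikit-learn's `ndcg_score`: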


```python
pred_hard_encodings = []
for pred, label in zip(res_text, ground_truths):
    # binary relevance labels: 1 if the retrieved passage is a ground-truth positive
    pred_hard_encoding = list(np.isin(pred, label).astype(int))
    pred_hard_encodings.append(pred_hard_encoding)
```


```python
from sklearn.metrics import ndcg_score

for i, c in enumerate(cut_offs):
    nDCG = ndcg_score(pred_hard_encodings, res_scores, k=c)
    print(f"nDCG@{c}: {nDCG}")
```

Congrats! You have walked through a full pipeline of evaluating an embedding model. Feel free to play with different datasets and models!
