Summary: This tutorial demonstrates how to implement and evaluate dense retrieval on the MLDR dataset using FlagEmbedding. It covers essential techniques for text embedding generation using BAAI/bge-base models, efficient similarity search with FAISS indexing, and evaluation using pytrec_eval. Key functionalities include dataset loading, batch processing of embeddings, vector similarity search, and computing standard IR metrics (NDCG, Recall). The tutorial helps with tasks like setting up a dense retrieval pipeline, implementing efficient search mechanisms, and evaluating retrieval performance. It also provides a simplified alternative using FlagEmbedding's built-in evaluation tools, making it valuable for both detailed custom implementations and quick evaluations.

# Evaluate on MLDR

[MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a Multilingual Long-Document Retrieval dataset built on Wikipeida, Wudao and mC4, covering 13 typologically diverse languages. Specifically, we sample lengthy articles from Wikipedia, Wudao and mC4 datasets and randomly choose paragraphs from them. Then we use GPT-3.5 to generate questions based on these paragraphs. The generated question and the sampled article constitute a new text pair to the dataset.

## 0. Installation

First install the libraries we are using:


```python
% pip install FlagEmbedding pytrec_eval
```

## 1. Dataset

Download the dataset of 13 different languages from [Hugging Face](https://huggingface.co/datasets/Shitao/MLDR).

| Language Code |  Language  |      Source      | #train  | #dev  | #test | #corpus | Avg. Length of Docs |
| :-----------: | :--------: | :--------------: | :-----: | :---: | :---: | :-----: | :-----------------: |
|      ar       |   Arabic   |    Wikipedia     |  1,817  |  200  |  200  |  7,607  |        9,428        |
|      de       |   German   |  Wikipedia, mC4  |  1,847  |  200  |  200  | 10,000  |        9,039        |
|      en       |  English   |    Wikipedia     | 10,000 |  200  |  800  | 200,000 |        3,308        |
|      es       |  Spanish   |  Wikipedia, mc4  |  2,254  |  200  |  200  |  9,551  |        8,771        |
|      fr       |   French   |    Wikipedia     |  1,608  |  200  |  200  | 10,000  |        9,659        |
|      hi       |   Hindi    |    Wikipedia     |  1,618  |  200  |  200  |  3,806  |        5,555        |
|      it       |  Italian   |    Wikipedia     |  2,151  |  200  |  200  | 10,000  |        9,195        |
|      ja       |  Japanese  |    Wikipedia     |  2,262  |  200  |  200  | 10,000  |        9,297        |
|      ko       |   Korean   |    Wikipedia     |  2,198  |  200  |  200  |  6,176  |        7,832        |
|      pt       | Portuguese |    Wikipedia     |  1,845  |  200  |  200  |  6,569  |        7,922        |
|      ru       |  Russian   |    Wikipedia     |  1,864  |  200  |  200  | 10,000  |        9,723        |
|      th       |    Thai    |       mC4        |  1,970  |  200  |  200  | 10,000  |        8,089        |
|      zh       |  Chinese   | Wikipedia, Wudao | 10,000  |  200  |  800  | 200,000 |        4,249        |
|     Total     |     -      |        -         | 41,434  | 2,600 | 3,800 | 493,709 |        4,737        |

First download the queries and corresponding qrels:


```python
from datasets import load_dataset

lang = "en"
dataset = load_dataset('Shitao/MLDR', lang, trust_remote_code=True)
```

Each item has four parts: `query_id`, `query`, `positive_passages`, and `negative_passages`. `query_id` and `query` correspond to the id and text content of the qeury. `positive_passages` and `negative_passages` are list of passages with their corresponding `docid` and `text`. 


```python
dataset['dev'][0]
```

Each passage in the corpus has two parts: `docid` and `text`. `docid` has the form of `doc-<language>-<id>`


```python
corpus = load_dataset('Shitao/MLDR', f"corpus-{lang}", trust_remote_code=True)['corpus']
```


```python
corpus[0]
```

Then we process the ids and text of queries and corpus for preparation of embedding and searching.


```python
corpus_ids = corpus['docid']
corpus_text = corpus['text']

queries_ids = dataset['dev']['query_id']
queries_text = dataset['dev']['query']
```

## 2. Evaluate from scratch

### 2.1 Embedding

In the demo we use bge-base-en-v1.5, feel free to change to the model you prefer.


```python
import os 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
```


```python
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',)
                #   query_instruction_for_retrieval="Represent this sentence for searching relevant passages:")

# get the embedding of the queries and corpus
queries_embeddings = model.encode_queries(queries_text)
corpus_embeddings = model.encode_corpus(corpus_text)

print("shape of the embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)
```

### 2.2 Indexing

Create a Faiss index to store the embeddings.


```python
import faiss
import numpy as np

# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
corpus_embeddings = corpus_embeddings.astype(np.float32)
# train and add the embeddings to the index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")
```

### 2.3 Searching

Use the Faiss index to search answers for each query.


```python
from tqdm import tqdm

query_size = len(queries_embeddings)

all_scores = []
all_indices = []

for i in tqdm(range(0, query_size, 32), desc="Searching"):
    j = min(i + 32, query_size)
    query_embedding = queries_embeddings[i: j]
    score, indice = index.search(query_embedding.astype(np.float32), k=100)
    all_scores.append(score)
    all_indices.append(indice)

all_scores = np.concatenate(all_scores, axis=0)
all_indices = np.concatenate(all_indices, axis=0)
```


```python
results = {}
for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    for score, index in zip(scores, indices):
        if index != -1:
            results[queries_ids[idx]][corpus_ids[index]] = float(score)
```

### 2.4 Evaluating

Process the qrels into a dictionary with qid-docid pairs.


```python
qrels_dict = {}
for data in dataset['dev']:
    qid = str(data["query_id"])
    if qid not in qrels_dict:
        qrels_dict[qid] = {}
    for doc in data["positive_passages"]:
        docid = str(doc["docid"])
        qrels_dict[qid][docid] = 1
    for doc in data["negative_passages"]:
        docid = str(doc["docid"])
        qrels_dict[qid][docid] = 0
```

Finally, use [pytrec_eval](https://github.com/cvangysel/pytrec_eval) library to help us calculate the scores of selected metrics:


```python
import pytrec_eval
from collections import defaultdict

ndcg_string = "ndcg_cut." + ",".join([str(k) for k in [10,100]])
recall_string = "recall." + ",".join([str(k) for k in [10,100]])

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels_dict, {ndcg_string, recall_string}
)
scores = evaluator.evaluate(results)

all_ndcgs, all_recalls = defaultdict(list), defaultdict(list)
for query_id in scores.keys():
    for k in [10,100]:
        all_ndcgs[f"NDCG@{k}"].append(scores[query_id]["ndcg_cut_" + str(k)])
        all_recalls[f"Recall@{k}"].append(scores[query_id]["recall_" + str(k)])

ndcg, recall = (
    all_ndcgs.copy(),
    all_recalls.copy(),
)

for k in [10,100]:
    ndcg[f"NDCG@{k}"] = round(sum(ndcg[f"NDCG@{k}"]) / len(scores), 5)
    recall[f"Recall@{k}"] = round(sum(recall[f"Recall@{k}"]) / len(scores), 5)

print(ndcg)
print(recall)
```

## 3. Evaluate using FlagEmbedding

We provide independent evaluation for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in [example](../../examples/evaluation/mldr/eval_mldr.sh) folder.


```python
import sys

arguments = """- \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names en \
    --splits dev \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-base-en-v1.5 \
    --devices cuda:0 cuda:1 \
    --embedder_batch_size 1024
""".replace('\n','')

sys.argv = arguments.split()
```


```python
from transformers import HfArgumentParser

from FlagEmbedding.evaluation.mldr import (
    MLDREvalArgs, MLDREvalModelArgs,
    MLDREvalRunner
)


parser = HfArgumentParser((
    MLDREvalArgs,
    MLDREvalModelArgs
))

eval_args, model_args = parser.parse_args_into_dataclasses()
eval_args: MLDREvalArgs
model_args: MLDREvalModelArgs

runner = MLDREvalRunner(
    eval_args=eval_args,
    model_args=model_args
)

runner.run()
```


```python
with open('mldr/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:
    print(content_file.read())
```
