Summary: This tutorial demonstrates the implementation of a neural information retrieval system using the MIRACL dataset, focusing on dense retrieval techniques. It covers essential code for embedding-based search, including data loading from MIRACL corpus, generating embeddings using FlagModel (BGE), efficient similarity search with FAISS indexing, and evaluation using pytrec_eval. Key functionalities include batch processing of embeddings, vector similarity search, and computing standard IR metrics (NDCG, Recall). The tutorial is particularly useful for tasks involving document retrieval, semantic search implementation, and IR system evaluation, with specific guidance on handling large-scale datasets and optimizing performance through batching and caching strategies.

# Evaluate on MIRACL

[MIRACL](https://project-miracl.github.io/) (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages. The organizers released a multilingual retrieval dataset containing train and dev sets for 16 “known languages” and only a dev set for 2 “surprise languages”. The topics are written by native speakers of each language, who also label the relevance between the topics and a given document list. You can find the dataset on Hugging Face.

Note: We highly recommend running the MIRACL evaluation on GPU. For reference, the whole process takes about an hour on an 8xA100 40G node.

## 0. Installation

First install the libraries we are using:


```python
%pip install FlagEmbedding pytrec_eval
```

## 1. Dataset

With a large number of passages and articles across 18 languages, MIRACL is a rich dataset for training or evaluating multilingual models. The table below lists the corpus sizes for 16 of these languages. The data can be downloaded from [Hugging Face](https://huggingface.co/datasets/miracl/miracl).

| Language        | # of Passages | # of Articles |
|:----------------|--------------:|--------------:|
| Arabic (ar)     |     2,061,414 |       656,982 |
| Bengali (bn)    |       297,265 |        63,762 |
| English (en)    |    32,893,221 |     5,758,285 |
| Spanish (es)    |    10,373,953 |     1,669,181 |
| Persian (fa)    |     2,207,172 |       857,827 |
| Finnish (fi)    |     1,883,509 |       447,815 |
| French (fr)     |    14,636,953 |     2,325,608 |
| Hindi (hi)      |       506,264 |       148,107 |
| Indonesian (id) |     1,446,315 |       446,330 |
| Japanese (ja)   |     6,953,614 |     1,133,444 |
| Korean (ko)     |     1,486,752 |       437,373 |
| Russian (ru)    |     9,543,918 |     1,476,045 |
| Swahili (sw)    |       131,924 |        47,793 |
| Telugu (te)     |       518,079 |        66,353 |
| Thai (th)       |       542,166 |       128,179 |
| Chinese (zh)    |     4,934,368 |     1,246,389 |


```python
from datasets import load_dataset

lang = "en"
corpus = load_dataset("miracl/miracl-corpus", lang, trust_remote_code=True)['train']
```

Each passage in the corpus has three fields: `docid`, `title`, and `text`. A `docid` has the form `x#y`, where `x` is the id of the Wikipedia article and `y` is the index of the passage within that article. The `title` is the name of the article with id `x` that the passage belongs to, and `text` is the body of the passage.


```python
corpus[0]
```
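Because the `docid` encodes both the article and the passage position, it can be split back into its two parts. The helper below is a minimal sketch: `split_docid` and the sample id are hypothetical and not part of the MIRACL tooling.

```python
def split_docid(docid: str) -> tuple[str, int]:
    """Split a docid of the form 'x#y' into (article_id, passage_index)."""
    # rpartition splits at the last '#', so the passage index is always the final part
    article_id, sep, passage_idx = docid.rpartition("#")
    if not sep:
        raise ValueError(f"docid without '#': {docid!r}")
    return article_id, int(passage_idx)


# hypothetical docid, used only for illustration
article_id, passage_idx = split_docid("12#3")
print(article_id, passage_idx)
```

Grouping passages by `article_id` in this way can be handy when inspecting whether several retrieved passages come from the same article.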

The dev set, which pairs each query with its judged passages, has the following form:


```python
dev = load_dataset('miracl/miracl', lang, trust_remote_code=True)['dev']
```


```python
dev[0]
```

Each item has four fields: `query_id`, `query`, `positive_passages`, and `negative_passages`. Here, `query_id` and `query` are the id and text of the query. `positive_passages` and `negative_passages` are lists of passages, each with its `docid`, `title`, and `text`.

This structure is the same in the `train`, `dev`, `testA` and `testB` sets.

Then we collect the ids and text of the queries and the corpus. The qrels of the dev set are read later, in the evaluation step.


```python
corpus_ids = corpus['docid']
corpus_text = []
for doc in corpus:
    # prepend the article title to the passage text
    corpus_text.append(f"{doc['title']} {doc['text']}".strip())

queries_ids = dev['query_id']
queries_text = dev['query']
```

## 2. Evaluate from scratch

### 2.1 Embedding

In this demo we use bge-base-en-v1.5. Feel free to switch to the model you prefer.


```python
import os 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['SETUPTOOLS_USE_DISTUTILS'] = ''
```


```python
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5')

# get the embedding of the queries and corpus
queries_embeddings = model.encode_queries(queries_text)
corpus_embeddings = model.encode_corpus(corpus_text)

print("shape of the embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)
```
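Before encoding a full corpus it helps to estimate how much memory the embeddings will need. The back-of-the-envelope calculation below uses the English passage count from the table in section 1 and the 768-dimensional output of bge-base-en-v1.5. The helper name `embedding_size_gb` is hypothetical.

```python
def embedding_size_gb(num_vectors: int, dim: int, bytes_per_value: int) -> float:
    """Size of a dense embedding matrix in gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


num_passages = 32_893_221  # English (en) passages, from the table in section 1
embed_dim = 768            # embedding dimension of bge-base-en-v1.5

fp32_gb = embedding_size_gb(num_passages, embed_dim, 4)  # float32: 4 bytes per value
fp16_gb = embedding_size_gb(num_passages, embed_dim, 2)  # float16: 2 bytes per value
print(f"float32: {fp32_gb:.1f} GB, float16: {fp16_gb:.1f} GB")
```

At roughly 100 GB in float32, the English embeddings may not fit in memory on a typical machine, so it can be worth trying the pipeline first on a smaller language from the table, such as Swahili (`sw`).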

### 2.2 Indexing

Create a Faiss index to store the embeddings.


```python
import faiss
import numpy as np

# dimension of the embedding vectors (768 for bge-base-en-v1.5)
dim = corpus_embeddings.shape[-1]

# create a Flat (exhaustive search) index that ranks by inner product
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
# Faiss expects float32 input
corpus_embeddings = corpus_embeddings.astype(np.float32)
# a Flat index needs no training, so train() is a no-op here; it is kept so that
# the code still works if the factory string is changed to a trainable index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")
```
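A `Flat` index with `METRIC_INNER_PRODUCT` performs exhaustive search: it scores every stored vector against the query by inner product and returns the top `k`. The NumPy sketch below mimics that behaviour on random toy vectors to show what the index computes. It is an illustration only, not how Faiss is implemented, and `brute_force_search` is a hypothetical helper.

```python
import numpy as np


def brute_force_search(queries: np.ndarray, corpus: np.ndarray, k: int):
    """Exhaustive inner-product search; returns (scores, indices), each of shape (n_queries, k)."""
    sims = queries @ corpus.T                 # (n_queries, n_corpus) inner products
    top = np.argsort(-sims, axis=1)[:, :k]    # indices of the k largest scores per query
    return np.take_along_axis(sims, top, axis=1), top


rng = np.random.default_rng(0)
toy_corpus = rng.standard_normal((1000, 8)).astype(np.float32)
toy_corpus /= np.linalg.norm(toy_corpus, axis=1, keepdims=True)  # L2-normalize each row
toy_queries = toy_corpus[:3]                  # query with the first three corpus vectors

toy_scores, toy_indices = brute_force_search(toy_queries, toy_corpus, k=5)
print(toy_indices[:, 0])  # each query's nearest neighbour is itself
```

With L2-normalized vectors the inner product equals cosine similarity, which is why each toy query retrieves itself first.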

### 2.3 Searching

Use the Faiss index to search for each query.


```python
from tqdm import tqdm

query_size = len(queries_embeddings)

all_scores = []
all_indices = []

for i in tqdm(range(0, query_size, 32), desc="Searching"):
    j = min(i + 32, query_size)
    query_embedding = queries_embeddings[i: j]
    score, indice = index.search(query_embedding.astype(np.float32), k=100)
    all_scores.append(score)
    all_indices.append(indice)

all_scores = np.concatenate(all_scores, axis=0)
all_indices = np.concatenate(all_indices, axis=0)
```

Then map the search results back to the indices in the dataset.


```python
results = {}
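for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    # use `doc_idx` rather than `index` so the Faiss index object is not overwritten
    for score, doc_idx in zip(scores, indices):
        # Faiss returns -1 when fewer than k neighbours are found
        if doc_idx != -1:
            results[queries_ids[idx]][corpus_ids[doc_idx]] = float(score)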
```
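Faiss pads a result row with `-1` when it finds fewer than `k` neighbours, which is why the mapping skips that value. The toy example below, with made-up ids and scores, shows the nested `{query_id: {docid: score}}` dictionary that the mapping produces.

```python
# made-up ids and search output for two queries with k=3
demo_query_ids = ["q1", "q2"]
demo_corpus_ids = ["0#0", "0#1", "1#0", "2#0"]
demo_scores = [[0.9, 0.7, 0.2], [0.8, 0.5, 0.0]]
demo_indices = [[2, 0, 3], [1, 3, -1]]  # -1 marks an empty slot

demo_results = {}
for qid, scores_row, indices_row in zip(demo_query_ids, demo_scores, demo_indices):
    demo_results[qid] = {
        demo_corpus_ids[doc_idx]: float(score)
        for score, doc_idx in zip(scores_row, indices_row)
        if doc_idx != -1
    }

print(demo_results)
```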

### 2.4 Evaluating

Download the qrels file for evaluation:


```python
# the qrels file is hosted in the MIRACL dataset repository on Hugging Face
endpoint = os.getenv('HF_ENDPOINT', 'https://huggingface.co')
file_name = "qrels.miracl-v1.0-en-dev.tsv"
wget_cmd = f"wget {endpoint}/datasets/miracl/miracl/resolve/main/miracl-v1.0-en/qrels/{file_name}"

os.system(wget_cmd)
```

Read the qrels from the file:


```python
qrels_dict = {}
with open(file_name, "r", encoding="utf-8") as f:
    for line in f:
        # TREC qrels columns: query id, iteration (unused), doc id, relevance
        qid, _, docid, rel = line.strip().split("\t")
        qid, docid, rel = str(qid), str(docid), int(rel)
        if qid not in qrels_dict:
            qrels_dict[qid] = {}
        qrels_dict[qid][docid] = rel
```
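The parsing code above relies on the standard TREC qrels layout: four tab-separated columns holding the query id, an unused iteration field, the document id, and the relevance label. The sketch below parses a few made-up lines from an in-memory string to show the resulting structure.

```python
import io
from collections import defaultdict

# made-up qrels lines in TREC format: qid <tab> iteration <tab> docid <tab> relevance
sample_qrels = "q1\tQ0\t12#0\t1\nq1\tQ0\t12#1\t0\nq2\tQ0\t7#3\t1\n"

toy_qrels = defaultdict(dict)
for line in io.StringIO(sample_qrels):
    qid, _, docid, rel = line.rstrip("\n").split("\t")
    toy_qrels[qid][docid] = int(rel)

print(dict(toy_qrels))
```

Both the qrels and the search results end up keyed by query id and then by document id, which is the shape `pytrec_eval` expects.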

Finally, use the [pytrec_eval](https://github.com/cvangysel/pytrec_eval) library to compute the selected metrics:


```python
import pytrec_eval
from collections import defaultdict

ndcg_string = "ndcg_cut." + ",".join([str(k) for k in [10,100]])
recall_string = "recall." + ",".join([str(k) for k in [10,100]])

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels_dict, {ndcg_string, recall_string}
)
scores = evaluator.evaluate(results)

all_ndcgs, all_recalls = defaultdict(list), defaultdict(list)
for query_id in scores.keys():
    for k in [10,100]:
        all_ndcgs[f"NDCG@{k}"].append(scores[query_id]["ndcg_cut_" + str(k)])
        all_recalls[f"Recall@{k}"].append(scores[query_id]["recall_" + str(k)])

ndcg, recall = (
    all_ndcgs.copy(),
    all_recalls.copy(),
)

for k in [10,100]:
    ndcg[f"NDCG@{k}"] = round(sum(ndcg[f"NDCG@{k}"]) / len(scores), 5)
    recall[f"Recall@{k}"] = round(sum(recall[f"Recall@{k}"]) / len(scores), 5)

print(ndcg)
print(recall)
```
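To make the two metrics concrete, the sketch below computes Recall@k and NDCG@k for a single query in plain Python. It assumes the common formulation with linear gain and a log2 rank discount; `pytrec_eval` may differ in details such as tie-breaking, so treat this as an illustration rather than a replacement. The qrels and scores are made up.

```python
import math


def recall_at_k(qrel: dict, run: dict, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = {d for d, rel in qrel.items() if rel > 0}
    if not relevant:
        return 0.0
    top_k = sorted(run, key=run.get, reverse=True)[:k]
    return len(relevant.intersection(top_k)) / len(relevant)


def ndcg_at_k(qrel: dict, run: dict, k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ranking."""
    top_k = sorted(run, key=run.get, reverse=True)[:k]
    # rank is 0-based, so the discount for the first position is log2(2) = 1
    dcg = sum(qrel.get(d, 0) / math.log2(rank + 2) for rank, d in enumerate(top_k))
    ideal = sorted(qrel.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


toy_qrel = {"d1": 1, "d2": 1, "d3": 0}                   # d1 and d2 are relevant
toy_run = {"d3": 0.9, "d1": 0.8, "d4": 0.7, "d2": 0.1}   # ranking: d3, d1, d4, d2

print(recall_at_k(toy_qrel, toy_run, 2))
print(round(ndcg_at_k(toy_qrel, toy_run, 2), 5))
```

In the toy ranking only one of the two relevant documents is in the top 2, and it sits at rank 2, so both metrics are well below 1 at `k=2` while Recall reaches 1 at `k=4`.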

## 3. Evaluate using FlagEmbedding

FlagEmbedding provides standalone evaluation pipelines for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in the [example](../../examples/evaluation/miracl/eval_miracl.sh) folder.


```python
import sys

arguments = """- \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names en \
    --splits dev \
    --corpus_embd_save_dir ./miracl/corpus_embd \
    --output_dir ./miracl/search_results \
    --search_top_k 100 \
    --cache_path ./cache/data \
    --overwrite True \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./miracl/miracl_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-base-en-v1.5 \
    --devices cuda:0 cuda:1 \
    --embedder_batch_size 1024
""".replace('\n','')

sys.argv = arguments.split()
```
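The leading `-` in the string stands in for the program name: argument parsers read options from `sys.argv[1:]`, so position 0 needs a placeholder. The backslash at the end of each line is a line continuation inside the string literal, so the text collapses to a single line before `split()` tokenizes it. The sketch below reproduces this with the standard-library `argparse` on a shortened, illustrative argument string. It is not the FlagEmbedding parser, and the `demo_*` names are hypothetical.

```python
import argparse

demo_arguments = """- \
    --eval_name miracl \
    --k_values 10 100 \
    --search_top_k 100
""".replace('\n', '')

demo_argv = demo_arguments.split()

demo_parser = argparse.ArgumentParser()
demo_parser.add_argument("--eval_name", type=str)
demo_parser.add_argument("--k_values", type=int, nargs="+")
demo_parser.add_argument("--search_top_k", type=int)

# skip position 0, which plays the role of the program name
demo_args = demo_parser.parse_args(demo_argv[1:])
print(demo_argv[0], demo_args.eval_name, demo_args.k_values, demo_args.search_top_k)
```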


```python
from transformers import HfArgumentParser

from FlagEmbedding.evaluation.miracl import (
    MIRACLEvalArgs, MIRACLEvalModelArgs,
    MIRACLEvalRunner
)


parser = HfArgumentParser((
    MIRACLEvalArgs,
    MIRACLEvalModelArgs
))

eval_args, model_args = parser.parse_args_into_dataclasses()
eval_args: MIRACLEvalArgs
model_args: MIRACLEvalModelArgs

runner = MIRACLEvalRunner(
    eval_args=eval_args,
    model_args=model_args
)

runner.run()
```


```python
with open('miracl/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:
    print(content_file.read())
```
