Summary: This tutorial demonstrates two approaches for implementing BEIR (Benchmarking IR) evaluations: direct BEIR evaluation and FlagEmbedding evaluation. It provides code for dataset loading, model initialization, and retrieval evaluation using dense retrieval methods. Key functionalities include setting up logging, downloading and processing BEIR datasets, configuring evaluation parameters, and computing metrics like NDCG, MAP, recall, and precision. The tutorial is particularly useful for tasks involving information retrieval evaluation, dense passage retrieval, and benchmark testing of embedding models, with specific implementation details for both BEIR's native evaluation pipeline and the FlagEmbedding framework.

# Evaluate on BEIR

[BEIR](https://github.com/beir-cellar/beir) (Benchmarking-IR) is a heterogeneous evaluation benchmark for information retrieval. 
It is designed for evaluating the performance of NLP-based retrieval models and widely used by research of modern embedding models.

## 0. Installation

First install the libraries we are using:


```python
% pip install beir FlagEmbedding
```

## 1. Evaluate using BEIR

BEIR contains 18 datasets which can be downloaded from the [link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/), while 4 of them are private datasets that need appropriate licences. If you want to access to those 4 datasets, take a look at their [wiki](https://github.com/beir-cellar/beir/wiki/Datasets-available) for more information. Information collected and codes adapted from BEIR GitHub [repo](https://github.com/beir-cellar/beir).

| Dataset Name | Type     |  Queries  | Documents | Avg. Docs/Q | Public | 
| ---------| :-----------: | ---------| --------- | ------| :------------:| 
| ``msmarco`` | `Train` `Dev` `Test` | 6,980   |  8.84M     |    1.1 | Yes |  
| ``trec-covid``| `Test` | 50|  171K| 493.5 | Yes | 
| ``nfcorpus``  | `Train` `Dev` `Test` |  323     |  3.6K     |  38.2 | Yes |
| ``bioasq``| `Train` `Test` |    500    |  14.91M    |  8.05 | No | 
| ``nq``| `Train` `Test`   |  3,452   |  2.68M  |  1.2 | Yes | 
| ``hotpotqa``| `Train` `Dev` `Test`   |  7,405   |  5.23M  |  2.0 | Yes |
| ``fiqa``    | `Train` `Dev` `Test`     |  648     |  57K    |  2.6 | Yes | 
| ``signal1m`` | `Test`     |   97   |  2.86M  |  19.6 | No |
| ``trec-news``    | `Test`     |   57    |  595K    |  19.6 | No |
| ``arguana`` | `Test`       |  1,406     |  8.67K    |  1.0 | Yes |
| ``webis-touche2020``| `Test` |   49     |  382K    |  49.2 |  Yes |
| ``cqadupstack``| `Test`      |   13,145 |  457K  |  1.4 |  Yes |
| ``quora``| `Dev` `Test`  |   10,000     |  523K    |  1.6 |  Yes | 
| ``dbpedia-entity``| `Dev` `Test` |   400    |  4.63M    |  38.2 |  Yes | 
| ``scidocs``| `Test` |    1,000     |  25K    |  4.9 |  Yes | 
| ``fever``| `Train` `Dev` `Test`     |   6,666     |  5.42M    |  1.2|  Yes | 
| ``climate-fever``| `Test` |  1,535     |  5.42M |  3.0 |  Yes |
| ``scifact``| `Train` `Test` |  300     |  5K    |  1.1 |  Yes |

### 1.1 Load Dataset

First prepare the logging setup.


```python
import logging
from beir import LoggingHandler

logging.basicConfig(format='%(message)s',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
```

In this demo, we choose the `arguana` dataset for a quick demonstration.


```python
import os
from beir import util

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip"
out_dir = os.path.join(os.getcwd(), "data")
data_path = util.download_and_unzip(url, out_dir)
print(f"Dataset is stored at: {data_path}")
```


```python
from beir.datasets.data_loader import GenericDataLoader

corpus, queries, qrels = GenericDataLoader("data/arguana").load(split="test")
```

### 1.2 Evaluation

Then we load `bge-base-en-v1.5` from huggingface and evaluate its performance on arguana.


```python
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES


# Load bge model using Sentence Transformers
model = DRES(models.SentenceBERT("BAAI/bge-base-en-v1.5"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

# Get the searching results
results = retriever.retrieve(corpus, queries)
```


```python
logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```

## 2. Evaluate using FlagEmbedding

We provide independent evaluation for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in [example](../../examples/evaluation/beir/eval_beir.sh) folder.

Load the arguments:


```python
import sys

arguments = """-
    --eval_name beir 
    --dataset_dir ./beir/data 
    --dataset_names arguana
    --splits test dev 
    --corpus_embd_save_dir ./beir/corpus_embd 
    --output_dir ./beir/search_results 
    --search_top_k 1000 
    --rerank_top_k 100 
    --cache_path /root/.cache/huggingface/hub 
    --overwrite True 
    --k_values 10 100 
    --eval_output_method markdown 
    --eval_output_path ./beir/beir_eval_results.md 
    --eval_metrics ndcg_at_10 recall_at_100 
    --ignore_identical_ids True 
    --embedder_name_or_path BAAI/bge-base-en-v1.5 
    --embedder_batch_size 1024
    --devices cuda:4
""".replace('\n','')

sys.argv = arguments.split()
```

Then pass the arguments to HFArgumentParser and run the evaluation.


```python
from transformers import HfArgumentParser

from FlagEmbedding.evaluation.beir import (
    BEIREvalArgs, BEIREvalModelArgs,
    BEIREvalRunner
)


parser = HfArgumentParser((
    BEIREvalArgs,
    BEIREvalModelArgs
))

eval_args, model_args = parser.parse_args_into_dataclasses()
eval_args: BEIREvalArgs
model_args: BEIREvalModelArgs

runner = BEIREvalRunner(
    eval_args=eval_args,
    model_args=model_args
)

runner.run()
```

Take a look at the results and choose the way you prefer!


```python
with open('beir/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:
    print(content_file.read())
```
