Summary: This tutorial demonstrates how to implement and evaluate reranking models using the FlagEmbedding library, specifically focusing on BGE rerankers. It covers the technical setup and configuration of reranking evaluation pipelines, including crucial parameters like search_top_k, rerank_top_k, and batch sizes. The tutorial helps with tasks related to semantic search optimization, particularly the reranking of embedding-based search results. Key features include multi-GPU support, batch processing configuration, handling different model variants (BGE Reranker Large and V2 M3), and performance evaluation using metrics like NDCG and recall at various k values.

# Evaluate Reranker

Reranker usually better captures the latent semantic meanings between sentences. But comparing to using an embedding model, it will take quadratic $O(N^2)$ running time for the whole dataset. Thus the most common use cases of rerankers in information retrieval or RAG is reranking the top k answers retrieved according to the embedding similarities.

The evaluation of reranker has the similar idea. We compare how much better the rerankers can rerank the candidates searched by a same embedder. In this tutorial, we will evaluate two rerankers' performances on BEIR benchmark, with bge-large-en-v1.5 as the base embedding model.

Note: We highly recommend to run this notebook with GPU. The whole pipeline is very time consuming. For simplicity, we only use a single task FiQA in BEIR.

## 0. Installation

First install the required dependency


```python
%pip install FlagEmbedding
```

## 1. bge-reranker-large

The first model is bge-reranker-large, a BERT like reranker with about 560M parameters.

We can use the evaluation pipeline of FlagEmbedding to directly run the whole process:


```bash
%%bash
python -m FlagEmbedding.evaluation.beir \
--eval_name beir \
--dataset_dir ./beir/data \
--dataset_names fiqa \
--splits test dev \
--corpus_embd_save_dir ./beir/corpus_embd \
--output_dir ./beir/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path /root/.cache/huggingface/hub \
--overwrite True \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./beir/beir_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--ignore_identical_ids True \
--embedder_name_or_path BAAI/bge-large-en-v1.5 \
--reranker_name_or_path BAAI/bge-reranker-large \
--embedder_batch_size 1024 \
--reranker_batch_size 1024 \
--devices cuda:0 \
```

## 2. bge-reranker-v2-gemma

The second model is bge-reranker-v2-m3


```bash
%%bash
python -m FlagEmbedding.evaluation.beir \
--eval_name beir \
--dataset_dir ./beir/data \
--dataset_names fiqa \
--splits test dev \
--corpus_embd_save_dir ./beir/corpus_embd \
--output_dir ./beir/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path /root/.cache/huggingface/hub \
--overwrite True \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./beir/beir_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--ignore_identical_ids True \
--embedder_name_or_path BAAI/bge-large-en-v1.5 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--embedder_batch_size 1024 \
--reranker_batch_size 1024 \
--devices cuda:0 cuda:1 cuda:2 cuda:3 \
--reranker_max_length 1024 \
```

## 3. Comparison


```python
import json

with open('beir/search_results/bge-large-en-v1.5/bge-reranker-large/EVAL/eval_results.json') as f:
    results_1 = json.load(f)
    print(results_1)
    
with open('beir/search_results/bge-large-en-v1.5/bge-reranker-v2-m3/EVAL/eval_results.json') as f:
    results_2 = json.load(f)
    print(results_2)
```

From the above results we can see that bge-reranker-v2-m3 has advantage on almost all the metrics.
