Summary: This tutorial demonstrates how to implement and evaluate embedding models using the MTEB (Massive Text Embedding Benchmark) framework. It covers implementation techniques for running comprehensive evaluations across 7 task types (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, and Summarization) using sentence-transformers. The tutorial helps with tasks like model evaluation, leaderboard submission, and partial task-specific testing. Key features include full and selective dataset evaluation, automated model card generation, and detailed metrics calculation for each task type. It's particularly useful for benchmarking embedding models and preparing submissions for the MTEB leaderboard.

# MTEB Leaderboard

In the last tutorial we show how to evaluate an embedding model on an dataset supported by MTEB. In this tutorial, we will go through how to do a full evaluation and compare the results with MTEB English leaderboard.

Caution: Evaluation on the full Eng MTEB is very time consuming even with GPU. So we encourage you to go through the notebook to have an idea. And run the experiment when you have enough computing resource and time.

## 0. Installation

Install the packages we will use in your environment:


```python
%%capture
%pip install sentence_transformers mteb
```

## 1. Run the Evaluation

The MTEB English leaderboard contains 56 datasets on 7 tasks:
1. **Classification**: Use the embeddings to train a logistic regression on the train set and is scored on the test set. F1 is the main metric.
2. **Clustering**: Train a mini-batch k-means model with batch size 32 and k equals to the number of different labels. Then score using v-measure.
3. **Pair Classification**: A pair of text inputs is provided and a label which is a binary variable needs to be assigned. The main metric is average precision score.
4. **Reranking**: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.
5. **Retrieval**: Each dataset comprises corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.
6. **Semantic Textual Similarity (STS)**: Determine the similarity between each sentence pair. Spearman correlation based on cosine
similarity serves as the main metric.
7. **Summarization**: Only 1 dataset is used in this task. Score the machine-generated summaries to human-written summaries by computing distances of their embeddings. The main metric is also Spearman correlation based on cosine similarity.

The benchmark is widely accepted by researchers and engineers to fairly evaluate and compare the performance of the models they train. Now let's take a look at the whole evaluation pipeline

Import the `MTEB_MAIN_EN` to check the all 56 datasets.


```python
import mteb
from mteb.benchmarks import MTEB_MAIN_EN

print(MTEB_MAIN_EN.tasks)
```

Load the model we want to evaluate:


```python
from sentence_transformers import SentenceTransformer

model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name)
```

Alternatively, MTEB provides popular models on their leaderboard in order to reproduce their results.


```python
model_name = "BAAI/bge-base-en-v1.5"
model = mteb.get_model(model_name)
```

Then start to evaluate on each dataset:


```python
for task in MTEB_MAIN_EN.tasks:
    # get the test set to evaluate on
    eval_splits = ["dev"] if task == "MSMARCO" else ["test"]
    evaluation = mteb.MTEB(
        tasks=[task], task_langs=["en"]
    )  # Remove "en" to run all available languages
    evaluation.run(
        model, output_folder="results", eval_splits=eval_splits
    )
```

## 2. Submit to MTEB Leaderboard

After the evaluation is done, all the evaluation results should be stored in `results/{model_name}/{model_revision}`.

Then run the following shell command to create the model_card.md. Change {model_name} and {model_revision} to your path.


```python
!mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
```

For the case that the readme of that model already exists:


```python
# !mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md 
```

Copy and paste the contents of model_card.md to the top of README.md of your model on HF Hub. Now relax and wait for the daily refresh of leaderboard. Your model will show up soon!

## 3. Partially Evaluate

Note that you don't need to finish all the tasks to get on to the leaderboard.

For example you fine-tune a model's ability on clustering. And you only care about how your model performs with respoect to clustering, but not the other tasks. Then you can just test its performance on the clustering tasks of MTEB and submit to the leaderboard.


```python
TASK_LIST_CLUSTERING = [
    "ArxivClusteringP2P",
    "ArxivClusteringS2S",
    "BiorxivClusteringP2P",
    "BiorxivClusteringS2S",
    "MedrxivClusteringP2P",
    "MedrxivClusteringS2S",
    "RedditClustering",
    "RedditClusteringP2P",
    "StackExchangeClustering",
    "StackExchangeClusteringP2P",
    "TwentyNewsgroupsClustering",
]
```

Run the evaluation with only clustering tasks:


```python
evaluation = mteb.MTEB(tasks=TASK_LIST_CLUSTERING)

results = evaluation.run(model, output_folder="results")
```

Then repeat Step 2 to submit your model. After the leaderboard refresh, you can find your model in the "Clustering" section of the leaderboard.

## 4. Future Work

MTEB is working on a new version of English benchmark. It contains updated and concise tasks and will make the evaluation process faster.

Please check out their [GitHub](https://github.com/embeddings-benchmark/mteb) page for future updates and releases.
