Summary: This tutorial demonstrates how to implement MTEB (Massive Text Embedding Benchmark) evaluation for embedding models using sentence-transformers. It provides code for setting up MTEB evaluations, loading pre-trained models, selecting specific NLP tasks (like retrieval tasks), and running benchmarks with detailed metrics. The tutorial helps with tasks involving model evaluation, embedding quality assessment, and performance benchmarking, covering key functionalities such as MAP, MRR, NDCG scoring, and multi-task evaluation. It's particularly useful for implementing systematic evaluation pipelines for text embedding models and understanding their performance across different retrieval tasks.

# MTEB

For evaluation of embedding models, MTEB is one of the most well-known benchmark. In this tutorial, we'll introduce MTEB, its basic usage, and evaluate how your model performs on the MTEB leaderboard.

## 0. Installation

Install the packages we will use in your environment:


```python
%%capture
%pip install sentence_transformers mteb
```

## 1. Intro

The [Massive Text Embedding Benchmark (MTEB)](https://github.com/embeddings-benchmark/mteb) is a large-scale evaluation framework designed to assess the performance of text embedding models across a wide variety of natural language processing (NLP) tasks. Introduced to standardize and improve the evaluation of text embeddings, MTEB is crucial for assessing how well these models generalize across various real-world applications. It contains a wide range of datasets in eight main NLP tasks and different languages, and provides an easy pipeline for evaluation.

MTEB is also well known for the MTEB leaderboard, which contains a ranking of the latest first-class embedding models. We'll cover that in the next tutorial. Now let's have a look on how to use MTEB to do evaluation easily.


```python
import mteb
from sentence_transformers import SentenceTransformer
```

Now let's take a look at how to use MTEB to do a quick evaluation.

First we load the model that we would like to evaluate on:


```python
model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name)
```

Below is the list of datasets of retrieval used by MTEB's English leaderboard.

MTEB directly use the open source benchmark BEIR in its retrieval part, which contains 15 datasets (note there are 12 subsets of CQADupstack).


```python
retrieval_tasks = [
    "ArguAna",
    "ClimateFEVER",
    "CQADupstackAndroidRetrieval",
    "CQADupstackEnglishRetrieval",
    "CQADupstackGamingRetrieval",
    "CQADupstackGisRetrieval",
    "CQADupstackMathematicaRetrieval",
    "CQADupstackPhysicsRetrieval",
    "CQADupstackProgrammersRetrieval",
    "CQADupstackStatsRetrieval",
    "CQADupstackTexRetrieval",
    "CQADupstackUnixRetrieval",
    "CQADupstackWebmastersRetrieval",
    "CQADupstackWordpressRetrieval",
    "DBPedia",
    "FEVER",
    "FiQA2018",
    "HotpotQA",
    "MSMARCO",
    "NFCorpus",
    "NQ",
    "QuoraRetrieval",
    "SCIDOCS",
    "SciFact",
    "Touche2020",
    "TRECCOVID",
]
```

For demonstration, let's just run the first one, "ArguAna".

For a full list of tasks and languages that MTEB supports, check the [page](https://github.com/embeddings-benchmark/mteb/blob/18662380f0f476db3d170d0926892045aa9f74ee/docs/tasks.md).


```python
tasks = mteb.get_tasks(tasks=retrieval_tasks[:1])
```

Then, create and initialize an MTEB instance with our chosen tasks, and run the evaluation process.


```python
# use the tasks we chose to initialize the MTEB instance
evaluation = mteb.MTEB(tasks=tasks)

# call run() with the model and output_folder
results = evaluation.run(model, output_folder="results")
```

The results should be stored in `{output_folder}/{model_name}/{model_revision}/{task_name}.json`.

Openning the json file you should see contents as below, which are the evaluation results on "ArguAna" with different metrics on cutoffs from 1 to 1000.

```python
{
  "dataset_revision": "c22ab2a51041ffd869aaddef7af8d8215647e41a",
  "evaluation_time": 260.14976954460144,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.17",
  "scores": {
    "test": [
      {
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.63616,
        "map_at_1": 0.40754,
        "map_at_10": 0.55773,
        "map_at_100": 0.56344,
        "map_at_1000": 0.56347,
        "map_at_20": 0.56202,
        "map_at_3": 0.51932,
        "map_at_5": 0.54023,
        "mrr_at_1": 0.4139402560455192,
        "mrr_at_10": 0.5603739077423295,
        "mrr_at_100": 0.5660817425350153,
        "mrr_at_1000": 0.5661121884705748,
        "mrr_at_20": 0.564661930998293,
        "mrr_at_3": 0.5208629682313899,
        "mrr_at_5": 0.5429113323850182,
        "nauc_map_at_1000_diff1": 0.15930478114759905,
        "nauc_map_at_1000_max": -0.06396189194646361,
        "nauc_map_at_1000_std": -0.13168797291549253,
        "nauc_map_at_100_diff1": 0.15934819555197366,
        "nauc_map_at_100_max": -0.06389635013430676,
        "nauc_map_at_100_std": -0.13164524259533786,
        "nauc_map_at_10_diff1": 0.16057318234658585,
        "nauc_map_at_10_max": -0.060962623117325254,
        "nauc_map_at_10_std": -0.1300413865104607,
        "nauc_map_at_1_diff1": 0.17346152653542332,
        "nauc_map_at_1_max": -0.09705499215630589,
        "nauc_map_at_1_std": -0.14726476953035533,
        "nauc_map_at_20_diff1": 0.15956349246366208,
        "nauc_map_at_20_max": -0.06259296677860492,
        "nauc_map_at_20_std": -0.13097093150054095,
        "nauc_map_at_3_diff1": 0.15620049317363813,
        "nauc_map_at_3_max": -0.06690213479396273,
        "nauc_map_at_3_std": -0.13440904793529648,
        "nauc_map_at_5_diff1": 0.1557795701081579,
        "nauc_map_at_5_max": -0.06255283252590663,
        "nauc_map_at_5_std": -0.1355361594910923,
        "nauc_mrr_at_1000_diff1": 0.1378988612808882,
        "nauc_mrr_at_1000_max": -0.07507962333910836,
        "nauc_mrr_at_1000_std": -0.12969109830101241,
        "nauc_mrr_at_100_diff1": 0.13794450668758515,
        "nauc_mrr_at_100_max": -0.07501290390362861,
        "nauc_mrr_at_100_std": -0.12964855554504057,
        "nauc_mrr_at_10_diff1": 0.1396047981645623,
        "nauc_mrr_at_10_max": -0.07185174301688693,
        "nauc_mrr_at_10_std": -0.12807325096717753,
        "nauc_mrr_at_1_diff1": 0.15610387932529113,
        "nauc_mrr_at_1_max": -0.09824591983546396,
        "nauc_mrr_at_1_std": -0.13914318784294258,
        "nauc_mrr_at_20_diff1": 0.1382786098284509,
        "nauc_mrr_at_20_max": -0.07364476417961506,
        "nauc_mrr_at_20_std": -0.12898192060943495,
        "nauc_mrr_at_3_diff1": 0.13118224861025093,
        "nauc_mrr_at_3_max": -0.08164985279853691,
        "nauc_mrr_at_3_std": -0.13241573571401533,
        "nauc_mrr_at_5_diff1": 0.1346130730317385,
        "nauc_mrr_at_5_max": -0.07404093236468848,
        "nauc_mrr_at_5_std": -0.1340775377068567,
        "nauc_ndcg_at_1000_diff1": 0.15919987960292029,
        "nauc_ndcg_at_1000_max": -0.05457945565481172,
        "nauc_ndcg_at_1000_std": -0.12457339152558143,
        "nauc_ndcg_at_100_diff1": 0.1604091882521101,
        "nauc_ndcg_at_100_max": -0.05281549383775287,
        "nauc_ndcg_at_100_std": -0.12347288098914058,
        "nauc_ndcg_at_10_diff1": 0.1657018523692905,
        "nauc_ndcg_at_10_max": -0.036222943297402846,
        "nauc_ndcg_at_10_std": -0.11284619565817842,
        "nauc_ndcg_at_1_diff1": 0.17346152653542332,
        "nauc_ndcg_at_1_max": -0.09705499215630589,
        "nauc_ndcg_at_1_std": -0.14726476953035533,
        "nauc_ndcg_at_20_diff1": 0.16231721725673165,
        "nauc_ndcg_at_20_max": -0.04147115653921931,
        "nauc_ndcg_at_20_std": -0.11598700704312062,
        "nauc_ndcg_at_3_diff1": 0.15256475371124711,
        "nauc_ndcg_at_3_max": -0.05432154580979357,
        "nauc_ndcg_at_3_std": -0.12841084787822227,
        "nauc_ndcg_at_5_diff1": 0.15236205846534961,
        "nauc_ndcg_at_5_max": -0.04356123278888682,
        "nauc_ndcg_at_5_std": -0.12942556865700913,
        "nauc_precision_at_1000_diff1": -0.038790629929866066,
        "nauc_precision_at_1000_max": 0.3630826341915611,
        "nauc_precision_at_1000_std": 0.4772189839676386,
        "nauc_precision_at_100_diff1": 0.32118609204433185,
        "nauc_precision_at_100_max": 0.4740132817600036,
        "nauc_precision_at_100_std": 0.3456396169952022,
        "nauc_precision_at_10_diff1": 0.22279659689895104,
        "nauc_precision_at_10_max": 0.16823918613191954,
        "nauc_precision_at_10_std": 0.0377209694331257,
        "nauc_precision_at_1_diff1": 0.17346152653542332,
        "nauc_precision_at_1_max": -0.09705499215630589,
        "nauc_precision_at_1_std": -0.14726476953035533,
        "nauc_precision_at_20_diff1": 0.23025740175221762,
        "nauc_precision_at_20_max": 0.2892313928157665,
        "nauc_precision_at_20_std": 0.13522755012490692,
        "nauc_precision_at_3_diff1": 0.1410889527057097,
        "nauc_precision_at_3_max": -0.010771302313530132,
        "nauc_precision_at_3_std": -0.10744937823276193,
        "nauc_precision_at_5_diff1": 0.14012953903010988,
        "nauc_precision_at_5_max": 0.03977485677045894,
        "nauc_precision_at_5_std": -0.10292184602358977,
        "nauc_recall_at_1000_diff1": -0.03879062992990034,
        "nauc_recall_at_1000_max": 0.36308263419153386,
        "nauc_recall_at_1000_std": 0.47721898396760526,
        "nauc_recall_at_100_diff1": 0.3211860920443005,
        "nauc_recall_at_100_max": 0.4740132817599919,
        "nauc_recall_at_100_std": 0.345639616995194,
        "nauc_recall_at_10_diff1": 0.22279659689895054,
        "nauc_recall_at_10_max": 0.16823918613192046,
        "nauc_recall_at_10_std": 0.037720969433127145,
        "nauc_recall_at_1_diff1": 0.17346152653542332,
        "nauc_recall_at_1_max": -0.09705499215630589,
        "nauc_recall_at_1_std": -0.14726476953035533,
        "nauc_recall_at_20_diff1": 0.23025740175221865,
        "nauc_recall_at_20_max": 0.2892313928157675,
        "nauc_recall_at_20_std": 0.13522755012490456,
        "nauc_recall_at_3_diff1": 0.14108895270570979,
        "nauc_recall_at_3_max": -0.010771302313529425,
        "nauc_recall_at_3_std": -0.10744937823276134,
        "nauc_recall_at_5_diff1": 0.14012953903010958,
        "nauc_recall_at_5_max": 0.039774856770459645,
        "nauc_recall_at_5_std": -0.10292184602358935,
        "ndcg_at_1": 0.40754,
        "ndcg_at_10": 0.63616,
        "ndcg_at_100": 0.66063,
        "ndcg_at_1000": 0.6613,
        "ndcg_at_20": 0.65131,
        "ndcg_at_3": 0.55717,
        "ndcg_at_5": 0.59461,
        "precision_at_1": 0.40754,
        "precision_at_10": 0.08841,
        "precision_at_100": 0.00991,
        "precision_at_1000": 0.001,
        "precision_at_20": 0.04716,
        "precision_at_3": 0.22238,
        "precision_at_5": 0.15149,
        "recall_at_1": 0.40754,
        "recall_at_10": 0.88407,
        "recall_at_100": 0.99147,
        "recall_at_1000": 0.99644,
        "recall_at_20": 0.9431,
        "recall_at_3": 0.66714,
        "recall_at_5": 0.75747
      }
    ]
  },
  "task_name": "ArguAna"
}
```

Now we've successfully run the evaluation using mteb! In the next tutorial, we'll show how to evaluate your model on the whole 56 tasks of English MTEB and compete with models on the leaderboard.
