# Condensed: MTEB Leaderboard

Summary: This tutorial demonstrates how to implement and evaluate embedding models using the MTEB (Massive Text Embedding Benchmark) framework. It covers implementation techniques for running comprehensive evaluations across 7 task types (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, and Summarization) using sentence-transformers. The tutorial helps with tasks like model evaluation, leaderboard submission, and partial task-specific testing. Key features include full and selective dataset evaluation, automated model card generation, and detailed metrics calculation for each task type. It's particularly useful for benchmarking embedding models and preparing submissions for the MTEB leaderboard.

*This is a condensed version that preserves essential implementation details and context.*

Here's the condensed tutorial focusing on essential implementation details:

# MTEB Leaderboard Implementation Guide

## Key Setup
```python
%pip install sentence_transformers mteb
```

## Core Implementation Details

### 1. Full Evaluation

MTEB English leaderboard evaluates 7 task types across 56 datasets:
- Classification (Logistic regression + F1 score)
- Clustering (k-means, v-measure scoring)
- Pair Classification (Binary classification + AP score)
- Reranking (MRR@k and MAP metrics)
- Retrieval (nDCG@k metric, BEIR-based)
- STS (Spearman correlation with cosine similarity)
- Summarization (Embedding distance + Spearman correlation)

```python
# Load model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# OR
model = mteb.get_model("BAAI/bge-base-en-v1.5")

# Run evaluation
for task in MTEB_MAIN_EN.tasks:
    eval_splits = ["dev"] if task == "MSMARCO" else ["test"]
    evaluation = mteb.MTEB(tasks=[task], task_langs=["en"])
    evaluation.run(model, output_folder="results", eval_splits=eval_splits)
```

### 2. Leaderboard Submission
```python
# Generate model card
!mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md

# For existing readme
!mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
```

### 3. Partial Evaluation
Example for clustering-specific evaluation:
```python
TASK_LIST_CLUSTERING = [
    "ArxivClusteringP2P", "ArxivClusteringS2S",
    "BiorxivClusteringP2P", "BiorxivClusteringS2S",
    "MedrxivClusteringP2P", "MedrxivClusteringS2S",
    "RedditClustering", "RedditClusteringP2P",
    "StackExchangeClustering", "StackExchangeClusteringP2P",
    "TwentyNewsgroupsClustering"
]

evaluation = mteb.MTEB(tasks=TASK_LIST_CLUSTERING)
results = evaluation.run(model, output_folder="results")
```

## Important Notes
- ⚠️ Full MTEB evaluation is computationally intensive and time-consuming, even with GPU
- Partial evaluation is supported for specific task types
- Results are stored in `results/{model_name}/{model_revision}`
- Leaderboard updates daily after submission
- Check [MTEB GitHub](https://github.com/embeddings-benchmark/mteb) for updates and new releases