Summary: This tutorial provides implementation guidance for C-MTEB (Chinese Multi-Task Embedding Benchmark), covering model integration and evaluation across multiple NLP tasks. It demonstrates how to implement embedding models using FlagDRESModel or SentenceTransformer, with specific focus on encoding methods for various tasks like classification, clustering, and retrieval. The tutorial details task-specific metrics (F1, v-measure, MRR@k, etc.), evaluation procedures, and leaderboard submission process. Key functionalities include batch processing, GPU acceleration, and handling different Chinese NLP tasks with appropriate metrics and evaluation approaches. This knowledge is particularly useful for implementing and evaluating embedding models on Chinese language tasks.

# C-MTEB

C-MTEB is the largest benchmark for Chinese text embeddings, similar to MTEB. In this tutorial, we will go through how to evaluate an embedding model's ability on Chinese tasks in C-MTEB.

## 0. Installation

First install dependent packages:


```python
%pip install FlagEmbedding mteb
```

## 1. Datasets

C-MTEB uses similar task splits and metrics as English MTEB. It contains 35 datasets in 6 different tasks: Classification, Clustering, Pair Classification, Reranking, Retrieval, and Semantic Textual Similarity (STS). 

1. **Classification**: Use the embeddings to train a logistic regression on the train set and is scored on the test set. F1 is the main metric.
2. **Clustering**: Train a mini-batch k-means model with batch size 32 and k equals to the number of different labels. Then score using v-measure.
3. **Pair Classification**: A pair of text inputs is provided and a label which is a binary variable needs to be assigned. The main metric is average precision score.
4. **Reranking**: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.
5. **Retrieval**: Each dataset comprises corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.
6. **Semantic Textual Similarity (STS)**: Determine the similarity between each sentence pair. Spearman correlation based on cosine
similarity serves as the main metric.


Check the [HF page](https://huggingface.co/C-MTEB) for the details of each dataset.


```python
ChineseTaskList = [
    'TNews', 'IFlyTek', 'MultilingualSentiment', 'JDReview', 'OnlineShopping', 'Waimai',
    'CLSClusteringS2S.v2', 'CLSClusteringP2P.v2', 'ThuNewsClusteringS2S.v2', 'ThuNewsClusteringP2P.v2',
    'Ocnli', 'Cmnli',
    'T2Reranking', 'MMarcoReranking', 'CMedQAv1-reranking', 'CMedQAv2-reranking',
    'T2Retrieval', 'MMarcoRetrieval', 'DuRetrieval', 'CovidRetrieval', 'CmedqaRetrieval', 'EcomRetrieval', 'MedicalRetrieval', 'VideoRetrieval',
    'ATEC', 'BQ', 'LCQMC', 'PAWSX', 'STSB', 'AFQMC', 'QBQTC'
]
```

## 2. Model

First, load the model for evaluation. Note that the instruction here is used for retreival tasks.


```python
from ...C_MTEB.flag_dres_model import FlagDRESModel

instruction = "为这个句子生成表示以用于检索相关文章："
model_name = "BAAI/bge-base-zh-v1.5"

model = FlagDRESModel(model_name_or_path="BAAI/bge-base-zh-v1.5",
                      query_instruction_for_retrieval=instruction,
                      pooling_method="cls")
```

Otherwise, you can load a model using sentence_transformers:


```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PATH_TO_MODEL")
```

Or implement a class following the structure below:

```python
class MyModel():
    def __init__(self):
        """initialize the tokenizer and model"""
        pass

    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
```

## 3. Evaluate

After we've prepared the dataset and model, we can start the evaluation. For time efficiency, we highly recommend to use GPU for evaluation.


```python
import mteb
from mteb import MTEB

tasks = mteb.get_tasks(ChineseTaskList)

for task in tasks:
    evaluation = MTEB(tasks=[task])
    evaluation.run(model, output_folder=f"zh_results/{model_name.split('/')[-1]}")
```

## 4. Submit to MTEB Leaderboard

After the evaluation is done, all the evaluation results should be stored in `zh_results/{model_name}/`.

Then run the following shell command to create the model_card.md. Change {model_name} and its following to your path.


```python
!!mteb create_meta --results_folder results/{model_name}/ --output_path model_card.md
```

Copy and paste the contents of model_card.md to the top of README.md of your model on HF Hub. Then goto the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and choose the Chinese leaderboard to find your model! It will appear soon after the website's daily refresh.
