---
license: mit
configs:
- config_name: M-IMO
  data_files:
  - split: test
    path: m-imo.parquet
- config_name: MT-MATH100
  data_files:
  - split: test
    path: mt-math100.parquet
- config_name: MT-AIME2024
  data_files:
  - split: test
    path: mt-aime2024.parquet
---

# Multilingual Competition Level Math (MCLM)

Paper: [Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning](https://arxiv.org/abs/2502.17407)

**Overview:**  
MCLM is a benchmark designed to evaluate advanced mathematical reasoning in a multilingual context. It features competition-level math problems across 55 languages, moving beyond standard word problems to challenge even state-of-the-art large language models.

---

## Dataset Composition

MCLM is constructed from two main types of reasoning problems:

- **Machine-translated Reasoning:**  
  - Derived from established benchmarks like MATH-500 and AIME 2024.
  - Questions are translated into 55 languages using GPT-4o, with verification to ensure answer consistency.

- **Human-annotated Reasoning:**  
  - Comprises official translations of International Mathematical Olympiad (IMO) problems (2006–2024) in 38 languages.
  - Includes additional problems from domestic and regional math olympiads in 11 languages.
  
---

## Benchmark Subsets

| **Subset**    | **Source Benchmark**        | **Languages** | **Samples per Language** | **Evaluation Method**     |
|---------------|-----------------------------|---------------|--------------------------|---------------------------|
| MT-MATH100    | MATH-500                    | 55            | 100                      | Rule-based verifier       |
| MT-AIME2024   | AIME 2024                   | 55            | 30                       | Rule-based verifier       |
| M-IMO         | IMO (2006–2024)             | 38            | 22–27                    | LLM-as-a-Judge            |
| M-MO          | Domestic/Regional Olympiads | 11            | 28–31                    | LLM-as-a-Judge            |
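
The subsets that ship with this card can be loaded by config name. The snippet below is a minimal sketch using the Hugging Face `datasets` library. The config names and the single `test` split come from the YAML front matter of this card. The helper name `load_subset` and the `repo_id` argument are illustrative assumptions, since the card does not state the Hub repository ID.

```python
# Config names and parquet files as declared in this card's YAML front matter.
CONFIG_FILES = {
    "M-IMO": "m-imo.parquet",
    "MT-MATH100": "mt-math100.parquet",
    "MT-AIME2024": "mt-aime2024.parquet",
}


def load_subset(repo_id: str, config_name: str):
    """Load the `test` split of one MCLM config.

    `repo_id` is a placeholder for the Hub repository that hosts this
    dataset; it is an assumption and is not given in this card.
    """
    if config_name not in CONFIG_FILES:
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split="test")
```

A call would look like `load_subset("<namespace>/<dataset-name>", "MT-MATH100")`. Note that M-MO appears in the table above but has no config in the front matter, so the helper rejects it. Column names are not documented here, so inspecting `column_names` on the returned dataset is a sensible first step.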

---

## Model Performance on MCLM

| **Model**                                           | **MT-MATH100** | **MT-AIME2024** | **M-IMO** | **M-MO** | **Average** |
|-----------------------------------------------------|----------------|-----------------|-----------|----------|-------------|
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B           | 49.40          | 17.21           | 21.94     | 26.77    | 28.83       |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B             | 62.64          | 26.55           | 28.48     | 38.95    | 39.15       |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-32B            | 70.65          | 31.03           | 31.71     | 43.22    | 44.15       |
| o3-mini                                             | 84.89          | 45.33           | 29.75     | 51.42    | 52.85       |
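
The **Average** column is consistent with an unweighted mean of the four subset scores. This is inferred from the numbers rather than stated in the card, and the short check below recomputes it from the table values. The names `rows` and `mean_score` are illustrative.

```python
# Scores copied from the table above, in the order
# (MT-MATH100, MT-AIME2024, M-IMO, M-MO), followed by the reported average.
rows = {
    "DeepSeek-R1-Distill-Qwen-1.5B": ((49.40, 17.21, 21.94, 26.77), 28.83),
    "DeepSeek-R1-Distill-Qwen-7B": ((62.64, 26.55, 28.48, 38.95), 39.15),
    "DeepSeek-R1-Distill-Qwen-32B": ((70.65, 31.03, 31.71, 43.22), 44.15),
    "o3-mini": ((84.89, 45.33, 29.75, 51.42), 52.85),
}


def mean_score(scores):
    """Unweighted mean across subsets (every subset counts equally)."""
    return sum(scores) / len(scores)


for model, (scores, reported) in rows.items():
    computed = mean_score(scores)
    # Allow for rounding or truncation to two decimals in the reported value.
    status = "ok" if abs(computed - reported) < 0.01 else "mismatch"
    print(f"{model}: computed {computed:.4f}, reported {reported:.2f} ({status})")
```

Because the mean is unweighted, the smaller subsets (MT-AIME2024, M-IMO, M-MO) carry the same weight as MT-MATH100 despite having fewer samples per language.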

---


## Citation

```
@article{son2025linguistic,
  title={Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning},
  author={Son, Guijin and Hong, Jiwoo and Ko, Hyunwoo and Thorne, James},
  journal={arXiv preprint arXiv:2502.17407},
  year={2025}
}
```

## Contact

```
spthsrbwls123@yonsei.ac.kr
```
