Keywords: Mathematical retrieval, Mathematical comprehension, Large language models
TL;DR: A large-scale, multimodal, multilingual dataset of math problems for evaluating LLMs on equivalence retrieval and reasoning
Abstract: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce *MathNet*, a large-scale, high-quality, multilingual, and multimodal dataset of Olympiad-level problems. MathNet spans 40 countries, 10 languages, and two decades of competitions, comprising 17,512 **expert-authored problems with solutions** across diverse domains.
*MathNet* supports three tasks: (i) *mathematical comprehension*, (ii) *mathematical retrieval*, an underexplored but essential capability, and (iii) *Math RAG*, which evaluates how retrieval-augmented generation improves problem solving. For retrieval, we construct 39K pairs of mathematically equivalent problems to enable equivalence-based evaluation, in addition to 70 expert-curated pairs from real competitions. Experimental results show that MathNet challenges even state-of-the-art reasoning models (GPT-5 scores 76.8% and Claude 4.5 Opus 46.8%), while embedding models struggle to retrieve equivalent problems. Finally, we show that LLM performance in RAG-based math problem solving is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark.
*MathNet* provides the largest high-quality Olympiad dataset and the first retrieval benchmark for problem equivalence. We publicly release both the dataset and benchmark at http://mathnet.netlify.app/.
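To make the equivalence-based retrieval evaluation concrete, below is a minimal sketch of how such a protocol might be scored, assuming problems are plain-text strings, equivalent pairs are given as index tuples, and a generic sentence-embedding model is used; the model name, function names, and metric choice (Recall@k) are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch: Recall@k for equivalence-based retrieval over problem pairs.
# Assumption: `pairs` holds (query_index, equivalent_index) tuples; the
# embedding model below is a placeholder, not the one used in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer


def recall_at_k(problems: list[str], pairs: list[tuple[int, int]], k: int = 10) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedder choice
    emb = model.encode(problems, normalize_embeddings=True)
    sims = emb @ emb.T                # cosine similarity (embeddings are unit-normalized)
    np.fill_diagonal(sims, -np.inf)   # never retrieve the query problem itself
    hits = 0
    for query, target in pairs:       # each pair: query problem and its equivalent
        top_k = np.argsort(-sims[query])[:k]
        hits += int(target in top_k)
    return hits / len(pairs)


# Toy usage:
# problems = ["Prove that ...", "Show that ...", "Find all n such that ..."]
# print(recall_at_k(problems, [(0, 1)], k=1))
```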
Primary Area: datasets and benchmarks
Submission Number: 6594