MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Shaden Alshammari; Kevin Wen; Abrar Zainal; Mark Hamilton; Navid Safaei; Sultan Albarakati; William T. Freeman; Antonio Torralba

Published: 26 Jan 2026, Last Modified: 11 Feb 2026. Venue: ICLR 2026 Poster. License: CC BY 4.0

Keywords: Mathematical retrieval, Mathematical comprehension, Large language models

TL;DR: A large-scale, multimodal, multilingual dataset of math problems for evaluating LLMs on equivalence retrieval and reasoning

Abstract: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce *MathNet*, a large-scale, high-quality, multilingual, and multimodal dataset of Olympiad-level problems. *MathNet* spans 40 countries, 10 languages, and two decades of competitions, comprising 17,512 **expert-authored problems with solutions** across diverse domains. *MathNet* supports three tasks: (i) *mathematical comprehension*; (ii) *mathematical retrieval*, an underexplored but essential capability; and (iii) *Math RAG*, which evaluates how retrieval-augmented generation improves problem solving. For retrieval, we construct 39K pairs of mathematically equivalent problems to enable equivalence-based evaluation, in addition to 70 expert-curated pairs from real competitions. Experimental results show that even state-of-the-art reasoning models are challenged (76.8% for GPT-5 and 46.8% for Claude 4.5 Opus), while embedding models struggle to retrieve equivalent problems. Finally, we show that LLM performance in RAG-based math problem solving is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. *MathNet* provides the largest high-quality Olympiad dataset and the first retrieval benchmark for problem equivalence. We publicly release both the dataset and the benchmark at http://mathnet.netlify.app/.

Primary Area: datasets and benchmarks

Submission Number: 6594
