MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Published: 17 Oct 2025, Last Modified: 21 Nov 2025 · MATH-AI 2025 Poster · CC BY 4.0
Keywords: Mathematical retrieval, Mathematical comprehension, Large language models
TL;DR: A large-scale, multimodal, multilingual dataset of math problems for evaluating LLMs on equivalence retrieval and reasoning
Abstract: Mathematical problem solving remains a demanding test of reasoning for large language and multimodal models, yet existing benchmarks are small, monolingual, and limited in scope. We present *MathNet*, the first large-scale, multilingual, and multimodal dataset of Olympiad-level problems. Spanning 40 countries, 10 languages, and two decades of competitions, MathNet contains 13,026 expert-authored problems with solutions across diverse domains. MathNet supports two tasks: (i) mathematical comprehension and (ii) mathematical retrieval, an underexplored but essential capability. For retrieval, we construct 39K pairs of mathematically equivalent problems to enable equivalence-based evaluation. Experimental results show that even state-of-the-art reasoning models are challenged (GPT-5 and Gemini 2.5 Pro reach 72% and 66% accuracy, respectively), while embedding models exhibit substantial difficulty in retrieving equivalent problems. MathNet provides the largest multilingual Olympiad dataset and the first retrieval benchmark for mathematical equivalence, both of which we will publicly release.
Submission Number: 206