Keywords: Benchmark, LLM, math, reasoning
TL;DR: We introduce RIMO, a new benchmark using difficult Olympiad problems with easy-to-grade integer answers to show that even top LLMs still struggle with high-level mathematical reasoning.
Abstract: As large language models reach high scores on benchmarks like GSM8K and MATH, researchers have turned to Olympiad problems for new evaluations. However, grading these problems is difficult because of inconsistent answer formats and unreliable reference solutions. We present \textbf{RIMO}, a benchmark that preserves the difficulty of Olympiad problems while ensuring clear and consistent evaluation. RIMO has two tracks: \textbf{RIMO-N}, which includes 335 problems rewritten to have single-integer answers for straightforward grading, and \textbf{RIMO-P}, which features 456 proof problems with expert-checked solutions and an automated grading system. Our results show that even state-of-the-art LLMs struggle on RIMO despite performing well on earlier benchmarks. RIMO reveals a significant gap in current models' reasoning abilities and offers a precise tool for future research.
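Because every RIMO-N problem has a single-integer answer, grading can reduce to extracting an integer from the model's output and checking exact match. A minimal sketch of such a grader is below; the abstract does not describe the actual grading script, so the final-integer extraction heuristic here is an assumption.

```python
import re

def grade_integer_answer(model_output: str, reference: int) -> bool:
    """Exact-match grading for a single-integer answer.

    Hypothetical sketch: assumes the last integer appearing in the
    model's output is its final answer, after stripping thousands
    separators (e.g. "1,024" -> "1024").
    """
    matches = re.findall(r"-?\d+", model_output.replace(",", ""))
    if not matches:
        return False  # no integer found: counted as incorrect
    return int(matches[-1]) == reference

# Example: a response whose final stated value is 42
print(grade_integer_answer("... so the total count is 42.", 42))  # True
```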
Submission Number: 111