Keywords: Benchmark, LLM, math, reasoning
TL;DR: We introduce RIMO, a new benchmark using difficult Olympiad problems with easy-to-grade integer answers to show that even top LLMs still struggle with high-level mathematical reasoning.
Abstract: As large language models reach high scores on benchmarks like GSM8K and MATH, researchers have turned to Olympiad problems for new evaluations. However, grading these problems is difficult because of inconsistent answer formats and unreliable reference solutions. We present \textbf{RIMO}, a benchmark that preserves the difficulty of Olympiad problems while ensuring clear and consistent evaluation. RIMO has two tracks: \textbf{RIMO-N}, which includes 335 problems rewritten to have single-integer answers for straightforward grading, and \textbf{RIMO-P}, which features 456 proof problems with expert-checked solutions and an automated grading system. Our results show that even state-of-the-art LLMs struggle on RIMO despite performing well on earlier benchmarks. RIMO reveals a significant gap in current models' reasoning abilities and offers a precise tool for future research.
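Because every RIMO-N problem has a single-integer answer, grading can reduce to extracting an integer from the model's output and checking exact match. A minimal sketch of such a grader is below; the abstract does not describe the actual grading script, so the final-integer extraction heuristic here is an assumption.

```python
import re

def grade_integer_answer(model_output: str, reference: int) -> bool:
    """Exact-match grading for a single-integer answer.

    Hypothetical sketch: assumes the last integer appearing in the
    model's output is its final answer, after stripping thousands
    separators (e.g. "1,024" -> "1024").
    """
    matches = re.findall(r"-?\d+", model_output.replace(",", ""))
    if not matches:
        return False  # no integer found: counted as incorrect
    return int(matches[-1]) == reference

# Example: a response whose final stated value is 42
print(grade_integer_answer("... so the total count is 42.", 42))  # True
```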
Submission Number: 111