MMRC: Measuring Massive-Computational Math Reasoning with Code in LLMs

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: benchmark; math–code integrated reasoning
TL;DR: A fully human-collected and human-adapted benchmark of high-computational-load problems for math–code integrated reasoning.
Abstract: We introduce MMRC, a benchmark of 500 curated problems designed to evaluate large language models (LLMs) on their ability to integrate mathematical reasoning with code execution. Unlike existing benchmarks that emphasize either elementary or olympiad-level mathematics, MMRC focuses on university-level subjects, including calculus, discrete mathematics, linear algebra, linear programming, mathematical physics, numerical computation, and scientific computing. Each problem is deliberately adapted to impose a substantial computational workload, making text-only reasoning infeasible and requiring code execution for accurate solutions. We evaluate 120 open- and closed-source models on MMRC under two paradigms: Code-Invoked Reasoning (CIR), in which models generate Python scripts, and Math-Code Agent (MCA), in which models interact dynamically with a Python interpreter. We find that code integration consistently improves accuracy on complex tasks while reducing token usage. MMRC thus provides the first systematic benchmark for math–code integrated reasoning and a step toward LLMs that solve real-world computational problems.
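To make the CIR setting concrete, the sketch below shows how a model-generated Python script might be executed and scored against a reference answer. This is not the paper's evaluation harness: the function name run_cir_candidate, the convention that the script prints its final answer on the last stdout line, the exact-match scoring rule, and the timeout value are all illustrative assumptions.

```python
import math
import os
import subprocess
import sys
import tempfile


def run_cir_candidate(script: str, reference: str, timeout_s: int = 60) -> bool:
    """Execute a model-generated Python script (CIR-style) and check whether
    its final printed line matches the reference answer. Illustrative only."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # high-computation problems need a real time budget
        )
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
    lines = proc.stdout.strip().splitlines()
    predicted = lines[-1].strip() if lines else ""
    return predicted == reference.strip()


# Toy usage (not an MMRC problem): a computation that is tedious by hand
# but trivial once delegated to the interpreter.
script = "import math\nprint(math.comb(50, 25))"
print(run_cir_candidate(script, str(math.comb(50, 25))))  # expected: True
```

The MCA paradigm described in the abstract would replace this one-shot execution with an interactive loop in which the model issues code, observes interpreter output, and continues reasoning; that loop is not sketched here.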
Primary Area: datasets and benchmarks
Submission Number: 10756