Arithmetic-Bench: Evaluating Multi-Step Reasoning in LLMs with Basic Arithmetic

ICLR 2026 Conference Submission19026 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Reasoning, Arithmetic Benchmark
Abstract: We propose Arithmetic-Bench, a benchmark designed to evaluate the multi-step reasoning ability of large language models (LLMs) through basic arithmetic. The benchmark covers the fundamental operations of addition, subtraction, multiplication, and division, along with auxiliary subtasks such as copying, reversing, counting, and base conversion. Experimental results show that the accuracy of current LLMs drops sharply on operations involving more than 10 digits, indicating a failure to generalize in multi-step reasoning. We further analyze the root causes of these failures: while LLMs can acquire a limited degree of arithmetic generalization from training on bounded-length sequences, they do not generalize to arbitrary lengths. This stems from the inherent complexity of arithmetic tasks: true arithmetic generalization cannot rely on memorization alone but requires genuine reasoning mechanisms. Compared with other math benchmarks, Arithmetic-Bench provides a simple and fair framework; because the tasks are purely synthetic, they are easy to generate and largely free of human biases. We believe that arithmetic tasks are both fundamental and necessary for advancing reasoning models, and Arithmetic-Bench offers a principled way to evaluate them.
Primary Area: datasets and benchmarks
Submission Number: 19026
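
The abstract notes that the tasks are purely synthetic and easy to generate with controllable operand length. Below is a minimal sketch of how such examples might be produced; the task names, prompt format, and function names are illustrative assumptions, not the authors' released generation code.

```python
import random

def random_operand(num_digits: int) -> int:
    """Sample a non-negative integer with exactly `num_digits` digits."""
    if num_digits == 1:
        return random.randint(0, 9)
    return random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)

def make_example(task: str, num_digits: int) -> dict:
    """Generate one (prompt, answer) pair for a given task and operand length.

    Only a subset of the tasks mentioned in the abstract is sketched here.
    """
    a, b = random_operand(num_digits), random_operand(num_digits)
    if task == "addition":
        prompt, answer = f"{a} + {b} =", str(a + b)
    elif task == "subtraction":
        prompt, answer = f"{a} - {b} =", str(a - b)
    elif task == "multiplication":
        prompt, answer = f"{a} * {b} =", str(a * b)
    elif task == "reverse":
        s = str(a)
        prompt, answer = f"Reverse the digits of {s}:", s[::-1]
    elif task == "base_conversion":
        prompt, answer = f"Convert {a} to binary:", bin(a)[2:]
    else:
        raise ValueError(f"unknown task: {task}")
    return {"task": task, "digits": num_digits, "prompt": prompt, "answer": answer}

if __name__ == "__main__":
    random.seed(0)
    # Sweep operand lengths to probe length generalization,
    # e.g., the reported accuracy drop beyond 10 digits.
    for n in (5, 10, 20):
        print(make_example("addition", n))
```

Because the ground-truth answers are computed exactly, evaluation under this kind of setup reduces to string comparison, which is what makes the benchmark simple and largely free of human annotation bias.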