LongReasonArena: A Long Reasoning Benchmark for Large Language Models

ACL ARR 2026 January Submission 2265 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: benchmark, long context, long reasoning, test-time scaling
Abstract: Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, DeepSeek-R1 achieves only 7.5% accuracy on our tasks. Further analysis also reveals that accuracy declines linearly with the logarithm of the expected number of reasoning steps.
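For illustration only, here is a minimal Python sketch of the kind of controllable-length, multi-step retrieval task the abstract describes. The pointer-chasing setup, the make_task helper, and all parameters are hypothetical assumptions, not taken from the paper; the point is that the required number of sequential retrievals (and hence the expected reasoning length) is set directly by the input, here via num_hops.

```python
# Hypothetical sketch (not LongReasonArena's actual tasks): a pointer-chasing
# task whose solution requires num_hops sequential retrievals, so the
# expected reasoning length is controlled by the input size.
import random

def make_task(num_hops: int, num_keys: int = 1000, seed: int = 0):
    """Build a key -> key mapping containing one answer chain of length
    num_hops, padded with distractor mappings."""
    rng = random.Random(seed)
    keys = rng.sample(range(num_keys * 10), num_keys)
    chain = rng.sample(keys, num_hops + 1)
    mapping = {}
    # Lay the answer chain into the mapping first, then fill distractors
    # (setdefault never overwrites the chain edges).
    for a, b in zip(chain, chain[1:]):
        mapping[a] = b
    for k in keys:
        mapping.setdefault(k, rng.choice(keys))
    prompt = "\n".join(f"{k} -> {v}" for k, v in mapping.items())
    question = (f"Starting from {chain[0]}, follow the arrows "
                f"{num_hops} times. Which key do you reach?")
    return prompt, question, chain[-1]

prompt, question, answer = make_task(num_hops=50)
```

Under a construction like this, scaling num_hops scales the expected number of reasoning steps, which is the quantity the abstract reports accuracy declining against (linearly in its logarithm).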
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, chain-of-thought
Languages Studied: English
Submission Number: 2265