REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

ACL ARR 2026 January Submission2034 Authors

01 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM Evaluation, Reasoning, Benchmarks, Large Reasoning Models, Post-Training
Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress, yet their evaluation still relies on a narrow paradigm: evaluating one question at a time. This single-question setup suffers from two major limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0\% on MATH500), forcing the costly and perpetual creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present **REST** (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: _contextual priority allocation_ and _contextual interference_. Our evaluation of more than **30** advanced reasoning models on **9** reasoning benchmarks reveals several striking findings: even state-of-the-art (SOTA) models such as **_DeepSeek-R1 exhibit substantial performance degradation under stress testing_**, challenging the prevailing assumption that "LLMs are multi-problem solvers". Crucially, **_REST demonstrates stronger discriminative power_** than existing benchmarks, revealing clear performance gaps among models that exhibit similar, near-ceiling performance under traditional evaluation. Several key insights emerge from our analysis: (1) the **_"overthinking trap"_** is a critical factor contributing to the performance degradation; (2) models trained with the **_"Long2Short" technique preserve more of their single-problem accuracy_** under REST, outperforming their standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code is available at https://anonymous.4open.science/r/REST_ARR-9EB0.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2034