Keywords: LLM Evaluation, Reasoning, Benchmarks, Large Reasoning Models, Post-Training
Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress, yet their evaluation still relies on a narrow paradigm: evaluating one question at a time. This single-question setup suffers from two major limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0\% on MATH500), forcing the costly and perpetual creation of new questions with substantial human effort; and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment.
To bridge this gap, we present **REST** (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously.
Beyond basic reasoning, REST evaluates two under-tested capabilities: _contextual priority allocation_ and _contextual interference_.
Our evaluation of more than **30** advanced reasoning models on **9** reasoning benchmarks reveals several striking findings:
Even state-of-the-art (SOTA) models such as **_DeepSeek-R1 exhibit substantial performance degradation under stress testing_**, challenging the prevailing assumption that "LLMs are multi-problem solvers".
Crucially, **_REST demonstrates stronger discriminative power_** than existing benchmarks, revealing clear performance gaps among models that exhibit similar, near-ceiling performance under traditional evaluation. Two key insights emerge from our analysis: (1) the **_"overthinking trap"_** is a critical factor contributing to the performance degradation; (2) models trained with the **_"Long2Short" technique preserve more of their single-problem accuracy_** under REST, outperforming their standard-trained counterparts.
These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code is available at https://anonymous.4open.science/r/REST_ARR-9EB0.
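To make the stress-testing idea concrete, here is a minimal sketch of how a REST-style prompt might be assembled by packing several benchmark questions into a single query. The function name `build_rest_prompt`, the instruction wording, and the commented-out `ask_model` call are all hypothetical illustrations, not the authors' released implementation (see the repository linked above).

```python
def build_rest_prompt(questions):
    """Concatenate several benchmark questions into one prompt,
    asking the model to solve all of them in a single response."""
    header = (
        "Solve ALL of the following problems. "
        "Give a clearly labeled final answer for each.\n\n"
    )
    body = "\n\n".join(
        f"Problem {i + 1}: {q}" for i, q in enumerate(questions)
    )
    return header + body

# Example: a stress level of three problems in one prompt.
prompt = build_rest_prompt([
    "What is 17 * 24?",
    "Factor x^2 - 5x + 6.",
    "How many primes are less than 20?",
])
# response = ask_model(prompt)  # hypothetical inference call
```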
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2034