CURE: Advancing Reasoning via Self-Consistent Reward in Test-Time Experience

ACL ARR 2026 January Submission736 Authors

24 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Test-Time Scaling, Reasoning, Unlabeled Data
Abstract: Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of Large Language Models by allocating additional computation during inference. Existing TTS frameworks frequently rely on process reward models (PRMs) to improve performance, yet the substantial computational cost of training PRMs remains a major limitation. To address this limitation, we propose Context-Aware Unlabeled Reward Reasoning (CURE), a novel TTS framework designed for both reasoning-intensive and knowledge-intensive tasks. Given an input question, CURE first retrieves the most relevant questions from the test set. Conditioned on the retrieved questions, LLMs then perform Context-Reward Reasoning to generate candidate answers to the original question, and the final answer is obtained via majority voting over these candidates. Since the retrieved questions lack ground-truth labels, we sample multiple predictions and derive pseudo-labels via majority voting, which are then used to generate reward messages. CURE is evaluated on competitive reasoning and knowledge-intensive tasks, where it demonstrates state-of-the-art performance. For example, CURE improves Qwen2.5-7B by 25.29% on average. Crucially, smaller models augmented with CURE can outperform much larger baselines: Qwen2.5-7B exceeds the performance of Qwen2.5-72B by 2.08 points. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://anonymous.4open.science/r/CURE_cont_aware_reward_reasoning.
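The abstract describes a retrieve-then-vote pipeline. A minimal sketch of that flow, under loose assumptions: `llm` and `retrieve` are hypothetical placeholders for a model call and a similarity-based retriever (neither is part of the paper's released code), and majority voting is implemented with a simple frequency count.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among sampled candidates.
    The abstract uses this both to pseudo-label retrieved questions
    and to select the final answer."""
    return Counter(answers).most_common(1)[0][0]

def cure_answer(question, test_set, llm, retrieve, k=4, n_samples=8):
    """Hypothetical sketch of the CURE pipeline from the abstract.
    `llm(q, context=...)` and `retrieve(q, pool, k)` are assumed interfaces."""
    # 1. Retrieve the k most relevant questions from the unlabeled test set.
    neighbors = retrieve(question, test_set, k)
    # 2. Pseudo-label each retrieved question by majority voting over samples.
    context = []
    for q in neighbors:
        preds = [llm(q) for _ in range(n_samples)]
        context.append((q, majority_vote(preds)))
    # 3. Context-Reward Reasoning: sample candidate answers conditioned
    #    on the pseudo-labeled retrieved questions.
    candidates = [llm(question, context=context) for _ in range(n_samples)]
    # 4. Final answer via majority voting over the candidates.
    return majority_vote(candidates)
```

This is only an illustration of the voting logic; the actual reward-message construction and retrieval details are described in the paper itself.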
Paper Type: Long
Research Area: Language Models
Research Area Keywords: chain-of-thought, retrieval-augmented generation, scaling
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 736