DREAM: Decoupled Reinforcement Learning with Reward Measurement for Large Language Model Test-time Training

20 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reinforcement Learning, Test-Time Training, Large Language Model
Abstract: This paper studies the problem of large language model (LLM) test-time training, which aims to enhance the reasoning ability of LLMs using unlabeled test data. Recent works typically rely on majority voting to infer pseudo-labels that guide the reinforcement learning process, which can be inaccurate and biased, with potential error accumulation. To this end, we propose a novel approach named Decoupled Reinforcement Learning with Reward Measurement (DREAM) for LLM test-time training. The core of DREAM is to decouple reward estimation from reinforcement learning with enhanced calibration. In particular, DREAM trains an LLM-based calibration model that takes both questions and answers as input and outputs calibration scores. To mitigate overconfident results, this calibration model is trained on an independent reference dataset with positive and negative question-answer pairs. The reference-based calibration scores are then incorporated into the voting-based reward estimation to reduce potential biases, enabling more reliable test-time training. Extensive experiments on benchmark datasets validate the superiority of the proposed DREAM over competing baselines.
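The abstract describes combining voting-based reward estimation with reference-trained calibration scores. Below is a minimal, purely illustrative sketch of one way such a combination could be wired up; the function names, the multiplicative blending, and the `calibrator(question, answer)` interface are assumptions for illustration, not the paper's actual method.

```python
from collections import Counter
from typing import Callable, List


def calibrated_vote_reward(
    question: str,
    sampled_answers: List[str],
    calibrator: Callable[[str, str], float],
) -> List[float]:
    """Hypothetical sketch: weight majority-vote pseudo-labels by an
    external calibration score, in the spirit described in the abstract.

    `calibrator` is assumed to return a score in [0, 1] estimating how
    trustworthy an answer is for the given question.
    """
    # Plain majority voting over the sampled answers (the usual pseudo-label signal).
    counts = Counter(sampled_answers)
    vote_frac = {ans: c / len(sampled_answers) for ans, c in counts.items()}

    # Blend the vote fraction with the reference-trained calibration score,
    # so an overconfident but poorly calibrated majority yields a reduced reward.
    rewards = []
    for ans in sampled_answers:
        score = calibrator(question, ans)
        rewards.append(vote_frac[ans] * score)
    return rewards
```

A caller would pass the sampled rollouts for a test question together with the calibration model, e.g. `rewards = calibrated_vote_reward(q, answers, calibrator)`, and feed the resulting rewards into the downstream RL update.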
Primary Area: reinforcement learning
Submission Number: 24201