Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning
Keywords: Test-Time Reinforcement Learning; Self-rewarding Mechanism; Process Reward; Outcome Reward
Abstract: Recently, Reinforcement Learning (RL) has empowered frontier Large Language Models (LLMs) to solve challenging math, science, and coding problems. This paper concentrates on RL for reasoning tasks in LLMs on data without explicit labels. The core challenge of this setting is reward estimation during inference in the absence of ground-truth information. In this work, we propose COMPASS: Composite Path and Answer Self-Scoring, a novel method for training LLMs with RL on unlabeled test data. COMPASS consists of a Dual-Calibration Answer Reward (DCAR) and a Decisive Path Reward (DPR), which enable self-evolution of LLMs by fully exploiting the priors in the pre-trained model as intrinsic rewards. We find that by simultaneously reinforcing trustworthy consensus answers and chains of thought that yield high model decisiveness over its generated responses, the model improves its reasoning ability. Our experiments demonstrate that COMPASS consistently improves performance across a variety of tasks and models, marking a further step toward learning from continuous streams of experience.
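The abstract describes a composite self-scoring reward built from an answer-level signal (DCAR) and a path-level signal (DPR). The sketch below is not the authors' implementation; it is a minimal illustration under stated assumptions: the answer reward is approximated by agreement with the majority-vote consensus over sampled rollouts, and the path reward by the model's mean token confidence over its chain of thought. All function names, signatures, and the blending weight `alpha` are hypothetical.

```python
# Hedged sketch of a composite path-and-answer self-scoring reward.
# Assumptions (not from the paper): DCAR ~ majority-vote agreement,
# DPR ~ mean token probability of the reasoning path.
import math
from collections import Counter
from typing import List


def consensus_answer_reward(answers: List[str]) -> List[float]:
    """Reward each rollout 1.0 if its final answer matches the majority vote."""
    consensus, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == consensus else 0.0 for a in answers]


def path_decisiveness_reward(token_logprobs: List[List[float]]) -> List[float]:
    """Reward each rollout by the mean token probability over its chain of thought."""
    return [sum(math.exp(lp) for lp in lps) / max(len(lps), 1)
            for lps in token_logprobs]


def composite_reward(answers: List[str],
                     token_logprobs: List[List[float]],
                     alpha: float = 0.7) -> List[float]:
    """Blend the answer-level and path-level self-scores into one scalar reward."""
    dcar = consensus_answer_reward(answers)
    dpr = path_decisiveness_reward(token_logprobs)
    return [alpha * a + (1 - alpha) * p for a, p in zip(dcar, dpr)]
```

In a test-time RL loop, such a scalar could stand in for the missing ground-truth reward when updating the policy on unlabeled problems; the exact calibration and weighting used by COMPASS are described in the paper itself.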
Primary Area: reinforcement learning
Submission Number: 5532