Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning
Keywords: Test-Time Reinforcement Learning; Self-rewarding Mechanism; Process Reward; Outcome Reward
Abstract: Recently, Reinforcement Learning (RL) has empowered frontier Large Language Models (LLMs) to solve challenging math, science, and coding problems. This paper concentrates on RL for reasoning tasks in LLMs on data without explicit labels. The core challenge of this setting is reward estimation during inference in the absence of ground-truth information. In this work, we propose COMPASS: Composite Path and Answer Self-Scoring, a novel method for training LLMs with RL on unlabeled test data. COMPASS consists of a Dual-Calibration Answer Reward (DCAR) and a Decisive Path Reward (DPR), which enable self-evolution of LLMs by fully exploiting the priors in the pre-trained model as intrinsic rewards. We find that by simultaneously reinforcing trustworthy consensus answers and chains of thought that yield high model decisiveness over its generated responses, the model improves its reasoning ability. Our experiments demonstrate that COMPASS consistently improves performance across a variety of tasks and models, marking a further step toward learning from continuous streams of experience.
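The abstract describes a composite self-scoring reward built from an answer-level signal (DCAR) and a path-level signal (DPR). The sketch below is not the authors' implementation; it is a minimal illustration under stated assumptions: the answer reward is approximated by agreement with the majority-vote consensus over sampled rollouts, and the path reward by the model's mean token confidence over its chain of thought. All function names, signatures, and the blending weight `alpha` are hypothetical.

```python
# Hedged sketch of a composite path-and-answer self-scoring reward.
# Assumptions (not from the paper): DCAR ~ majority-vote agreement,
# DPR ~ mean token probability of the reasoning path.
import math
from collections import Counter
from typing import List


def consensus_answer_reward(answers: List[str]) -> List[float]:
    """Reward each rollout 1.0 if its final answer matches the majority vote."""
    consensus, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == consensus else 0.0 for a in answers]


def path_decisiveness_reward(token_logprobs: List[List[float]]) -> List[float]:
    """Reward each rollout by the mean token probability over its chain of thought."""
    return [sum(math.exp(lp) for lp in lps) / max(len(lps), 1)
            for lps in token_logprobs]


def composite_reward(answers: List[str],
                     token_logprobs: List[List[float]],
                     alpha: float = 0.7) -> List[float]:
    """Blend the answer-level and path-level self-scores into one scalar reward."""
    dcar = consensus_answer_reward(answers)
    dpr = path_decisiveness_reward(token_logprobs)
    return [alpha * a + (1 - alpha) * p for a, p in zip(dcar, dpr)]
```

In a test-time RL loop, such a scalar could stand in for the missing ground-truth reward when updating the policy on unlabeled problems; the exact calibration and weighting used by COMPASS are described in the paper itself.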
Primary Area: reinforcement learning
Submission Number: 5532