Keywords: LLM Reasoning, Verifiers, Test-time scaling, Reinforcement Learning
TL;DR: We propose to improve inference-time scaling by jointly training a unified model for both reasoning and verification.
Abstract: Prevalent reinforcement learning~(RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on the value function for verification.
In this work, we propose RL$^V$, which augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier on RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20\% with parallel sampling and enables $8-32\times$ more efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization to both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.5-2\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
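A minimal sketch of how such a joint objective could look, assuming the verifier signal is a correctness token predicted by the same LM and that the combined loss is a weighted sum of an RL term and a verification term; the function and parameter names (joint_rlv_loss, lambda_verify) are illustrative and not taken from the paper:

```python
# Hypothetical sketch of a joint reasoner + generative-verifier objective.
# Assumption: the combined loss is a weighted sum of a value-free RL loss
# (GRPO-style advantage-weighted log-likelihood) and a cross-entropy loss for
# predicting a correctness ("Yes"/"No") token on RL-generated solutions.
import torch
import torch.nn.functional as F

def joint_rlv_loss(policy_logprobs, advantages, verify_logits, verify_labels,
                   lambda_verify=1.0):
    """policy_logprobs: (B,) summed log-probs of the sampled solutions
    advantages:      (B,) group-normalized returns (no learned value function)
    verify_logits:   (B, 2) logits of the same LM for the correctness token
    verify_labels:   (B,) 1 if the RL-generated solution is correct, else 0
    lambda_verify:   weight on the verification loss (illustrative)
    """
    # Value-free policy gradient: maximize advantage-weighted log-likelihood.
    rl_loss = -(advantages.detach() * policy_logprobs).mean()
    # Generative verification: the same LM predicts whether a solution is correct.
    verify_loss = F.cross_entropy(verify_logits, verify_labels)
    return rl_loss + lambda_verify * verify_loss

# Shape-only usage with random tensors standing in for model outputs.
B = 4
loss = joint_rlv_loss(torch.randn(B, requires_grad=True),
                      torch.randn(B),
                      torch.randn(B, 2, requires_grad=True),
                      torch.randint(0, 2, (B,)))
loss.backward()
```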
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 527