Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Published: 09 Jul 2025, Last Modified: 16 Jul 2025
Venue: AI4Math@ICML25 Poster
License: CC BY-NC-SA 4.0
Keywords: reasoning, reinforcement learning, large language models, inference scaling
TL;DR: We propose learning a unified generative verifier during RL finetuning for reasoning, allowing for more efficient test-time scaling without added training or inference overhead.
Abstract: In this work, we propose RL$^V$, which augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20\% with parallel sampling and enables $8$-$32\times$ more efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization to both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2$-$1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
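To make the parallel test-time scaling concrete, below is a minimal illustrative sketch (not the authors' implementation) of verifier-weighted best-of-$N$ selection, where a single unified model is assumed to both sample solutions and score them. The helpers `generate_solution` and `verify_score` are hypothetical stand-ins for prompting that model in its reasoner and verifier roles, respectively.

```python
# Illustrative sketch of verifier-weighted best-of-N selection with a unified
# reasoner/verifier model. All helper callables are hypothetical placeholders.
from collections import defaultdict
from typing import Callable, Tuple


def best_of_n(
    question: str,
    generate_solution: Callable[[str], Tuple[str, str]],  # -> (reasoning, final answer)
    verify_score: Callable[[str, str, str], float],       # -> estimated P(correct) in [0, 1]
    n: int = 8,
) -> str:
    """Sample n candidate solutions in parallel and pick an answer by
    verifier-weighted voting over the candidates' final answers."""
    weights: dict = defaultdict(float)
    for _ in range(n):
        reasoning, answer = generate_solution(question)
        # The same unified model scores its own candidate solution.
        weights[answer] += verify_score(question, reasoning, answer)
    # Return the answer whose candidates accumulated the highest total score.
    return max(weights, key=weights.get)
```

Because the verifier shares weights with the reasoner, scoring a candidate under this scheme costs roughly one extra forward pass per sample rather than a separate reward model, which is the sense in which verification is added without significant overhead.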
Submission Number: 106