VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

Published: 09 Jul 2025, Last Modified: 25 Jul 2025, AI4Math@ICML25 Poster, CC BY-NC-SA 4.0
Keywords: verifier, rl, test time verification
Abstract: Recent test-time compute approaches with verifier models have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). While this generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL), current verifiers typically rely on supervised fine-tuning rather than temporal-difference methods. We propose VerifierQ, a novel approach that integrates offline Q-learning into LLM verifier models to address three well-known challenges of applying RL to LLMs: utterance-level Markov Decision Processes (MDPs), large action spaces, and overestimation bias. VerifierQ introduces a modified bounded Bellman update that keeps Q-values in $[0,1]$, incorporates Implicit Q-learning (IQL) for efficient approximation of $\max Q$ over the large action space, and proposes an adjustable two-expectile formulation of Conservative Q-learning (CQL). Our method is among the first attempts to integrate offline Q-learning into LLM verifiers, and its adjustable conservatism tightens Q-value estimates around the true $\max Q$ more flexibly than IQL alone. Experimental results on mathematical reasoning tasks demonstrate VerifierQ's superior performance compared to supervised fine-tuning approaches.
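To make the abstract's ingredients concrete, here is a minimal sketch of the two mechanisms it names: an IQL-style expectile loss (where an upper expectile acts as a soft $\max Q$) and a Bellman target clamped to $[0,1]$. This assumes PyTorch; the helper names (`expectile_loss`, `bounded_bellman_target`) and the toy usage are illustrative, not taken from the paper, and the paper's exact update (including its two-expectile CQL term) is more involved.

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float) -> torch.Tensor:
    # Asymmetric L2 loss from IQL: tau > 0.5 upweights positive residuals,
    # so the fitted value approaches an upper expectile (a soft max over actions).
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def bounded_bellman_target(reward: torch.Tensor,
                           next_value: torch.Tensor,
                           gamma: float = 1.0) -> torch.Tensor:
    # Bellman target clamped to [0, 1], echoing the bounded update the abstract
    # describes for step-level correctness scores (exact form per the paper).
    return torch.clamp(reward + gamma * next_value, 0.0, 1.0)

# Toy usage: fit V(s) toward an upper expectile of sampled Q(s, a) estimates.
q_values = torch.rand(32)                 # stand-in Q(s, a) values in [0, 1]
v_estimate = torch.full_like(q_values, 0.5)
loss = expectile_loss(q_values - v_estimate, tau=0.9)
```

With tau near 0.5 the loss reduces to ordinary regression; pushing tau toward 1 makes the estimate track the maximum more aggressively, which is the knob the adjustable-conservatism idea turns.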
Submission Number: 10