Keywords: LLM, test time compute, Reinforcement Learning, Q-Learning, verifier
TL;DR: VerifierQ integrates Q-learning into LLM verifier models, enhancing test time compute for improved multi-step reasoning capabilities.
Abstract: Recent test time compute approaches with verifier models have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). While this generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL), the verifiers currently in use rely on supervised fine-tuning rather than on temporal difference learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: utterance-level Markov Decision Processes (MDPs), large action spaces, and overestimation bias. VerifierQ introduces a modified Bellman update, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced overestimation. Our method is among the first to apply Q-learning to LLM verifiers. This integration of RL principles into verifier models complements existing advancements in generator techniques. Experimental results on mathematical reasoning tasks demonstrate VerifierQ's superior performance compared to supervised fine-tuning approaches.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12703