TL;DR: We revisit token-level reward assignment by decoupling reward modeling from language generation, deriving a token-level reward model through the optimization of a discriminative policy on preference data.
Abstract: Process reward models (PRMs) provide more nuanced supervision than outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignment. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks.
For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12× faster than ORM on GSM8K and 11× faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.
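As a rough illustration of the decoupled design described above, the sketch below trains a per-token scalar head with a pairwise (Bradley-Terry style) preference loss, so that reward scoring is separated from token generation. The class and field names, the masked-sum aggregation over response tokens, and the HF-style backbone interface are assumptions made for illustration; the exact Q-RM objective and Q-function derivation are given in the paper and the repository above.

```python
# Minimal sketch: a discriminative token-level reward model trained from
# sequence-level preference pairs. Assumes a Hugging Face-style backbone that
# returns `.last_hidden_state`; aggregation and naming are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenLevelRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                 # causal LM backbone (no LM head needed)
        self.q_head = nn.Linear(hidden_size, 1)  # one scalar score per token position

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.q_head(hidden).squeeze(-1)   # (batch, seq_len) token-level scores


def preference_loss(model: TokenLevelRewardModel, chosen: dict, rejected: dict) -> torch.Tensor:
    """Pairwise preference loss over aggregated token-level scores (illustrative)."""
    q_chosen = model(chosen["input_ids"], chosen["attention_mask"])
    q_rejected = model(rejected["input_ids"], rejected["attention_mask"])
    # Aggregate scores over response tokens only (masked sum; an assumption here).
    s_chosen = (q_chosen * chosen["response_mask"]).sum(dim=-1)
    s_rejected = (q_rejected * rejected["response_mask"]).sum(dim=-1)
    # Bradley-Terry objective: push the chosen response's score above the rejected one's.
    return -F.logsigmoid(s_chosen - s_rejected).mean()
```

At inference time, the per-token scores from such a head can be passed to PPO or REINFORCE as dense, token-level rewards instead of a single outcome reward, which is the usage pattern evaluated in the experiments above.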
Lay Summary: Training AI to solve complex problems, like math puzzles, often involves giving feedback on each step it takes rather than just the final answer. However, current methods that provide this step-by-step feedback can clash with how the AI generates text, leading to unstable training and unreliable learning.
To fix this, we developed a new approach called the Q-function Reward Model (Q-RM). Instead of mixing feedback with text generation, Q-RM evaluates each step independently, like a coach focusing on individual moves rather than the whole game.
When tested on math challenges, AI trained with Q-RM scored significantly higher than methods using only final-answer feedback. It also learned faster, reaching solutions 12 times quicker on some tasks. By making training both more efficient and effective, Q-RM helps AI tackle complicated problems with greater reliability.
Link To Code: https://github.com/homzer/Q-RM
Primary Area: Deep Learning->Large Language Models
Keywords: Reward Model, Reinforcement Learning, LLM
Submission Number: 4644