PSPO: Trainable Potential-Based Reward Shaping with Internal Model Signals for Post-Training Policy Optimization of Large Language Models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, Large Language Models, Reward Shaping, Policy Optimization, Critic-Free Methods, Internal Model Signals
TL;DR: We introduce PSPO, a critic-free RLHF framework that uses trainable potential functions over internal LLM signals to convert sparse scalar rewards into dense token-level feedback.
Abstract: Reinforcement learning from human feedback (RLHF) has become the de facto paradigm for aligning large language models (LLMs), yet mainstream algorithms either incur the memory overhead of a value head (PPO) or remain vulnerable to sparse and miscalibrated rewards (GRPO, DPO). We propose Potential-Shaped Policy Optimization (PSPO), a lightweight, critic-free framework that converts coarse scalar feedback into dense, context-aware signals by learning a trainable potential function. A 22.7M-parameter MiniLM network (the Potential Network) ingests inexpensive internal model signals (token embeddings, attention entropy, policy entropy) to produce adaptive shaping terms, while an alternating optimization scheme stably co-trains the policy and the potential without extra rollouts. On eight English and Chinese mathematical-reasoning benchmarks, a Qwen2.5-14B model trained with PSPO achieves strong accuracy under a shared 300M-token RLHF budget (68.1\% on GSM8K; 41.6\% on MATH) and exceeds PPO/DPO/GRPO by up to about 10 accuracy points across these benchmarks in this matched setting. Beyond math, PSPO also improves open-ended instruction following on ShareGPT and HelpfulQA under the same backbone. PSPO remains critic-free and adds less than 3\% wall-clock overhead in our measurements, while yielding interpretable token-level reward attributions. Taken together, these results highlight signal-aware reward shaping as a practical route toward more efficient and stable RLHF for decoder-only language models in long-horizon, sparse-reward settings.
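The abstract's core mechanism is classical potential-based shaping with a learned potential: a sequence-level scalar reward is redistributed into per-token terms of the form r_t = γΦ(s_{t+1}) − Φ(s_t). The sketch below illustrates that idea only; the `PotentialNet` MLP, the feature layout, and all names are hypothetical stand-ins (the paper's actual MiniLM potential network and internal-signal featurization are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

class PotentialNet:
    """Tiny MLP standing in for the paper's MiniLM potential network (hypothetical)."""
    def __init__(self, dim, hidden=16):
        self.W1 = rng.normal(scale=0.1, size=(dim, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, 1))

    def __call__(self, feats):
        # feats: (T+1, dim) per-step internal signals -> (T+1, 1) potentials
        return np.tanh(feats @ self.W1) @ self.W2

def shape_rewards(sparse_reward, feats, phi, gamma=1.0):
    """Turn one sequence-level scalar reward into dense per-token rewards
    via potential-based shaping: r_t = gamma * Phi(s_{t+1}) - Phi(s_t),
    with the sparse reward attached to the final transition."""
    pot = phi(feats).ravel()                 # Phi(s_0), ..., Phi(s_T)
    dense = gamma * pot[1:] - pot[:-1]       # shaping term per token/transition
    dense[-1] += sparse_reward               # terminal scalar feedback
    return dense

# Toy rollout: 8 transitions, 4-dim stand-in features
# (e.g. embeddings, attention entropy, policy entropy).
T, dim = 8, 4
feats = rng.normal(size=(T + 1, dim))
phi = PotentialNet(dim)
dense = shape_rewards(1.0, feats, phi)

# With gamma = 1 the shaping telescopes, so the total return equals the
# sparse reward plus only the boundary term Phi(s_T) - Phi(s_0); this is
# the standard policy-invariance property of potential-based shaping.
pot = phi(feats).ravel()
assert np.isclose(dense.sum(), 1.0 + pot[-1] - pot[0])
```

The telescoping check at the end is the reason potential-based shaping can densify feedback without changing the optimal policy; PSPO's contribution, per the abstract, is making Φ trainable from cheap internal signals and co-training it with the policy.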
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 8558