Keywords: Reinforcement learning
TL;DR: Task-Adaptive Policy Optimization via Parameterized Length Normalization
Abstract: Policy optimization for LLM reasoning faces a core trade-off between length sensitivity and update stability. Methods that preserve length sensitivity, such as
GRPO without length normalization, keep valuable signal for deep multi-step
reasoning but lead to high-variance, unstable updates. Methods that enforce
rigid length normalization, such as GSPO/GMPO, stabilize training but become
length-blind and suppress credit for thorough reasoning. We introduce P-GSPO
(Parameterized Group Sequence Policy Optimization), a single-parameter framework that turns this dilemma into a tunable axis. Instead of all-or-nothing normalization, P-GSPO applies a power-law normalization whose strength is controlled by a parameter, directly regulating how sequence length scales the policy
update. This recovers the unstable, fully length-sensitive regime and the stable,
length-blind regime as endpoints, while exposing a spectrum of balanced operating points. Integrated into masked diffusion LLMs within the d1 framework,
P-GSPO yields large gains where length-blindness is most damaging (+19.9 on
Countdown, +15.9 on Sudoku) and consistent improvements on math benchmarks
(GSM8K, MATH). The takeaway is simple: explicitly modeling
and controlling the influence of length is key to achieving both stable training and
strong reasoning. All code will be released.
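
The abstract does not give the exact formulation, so the following is only a minimal sketch of one plausible reading of the described power-law length normalization: the sequence-level importance ratio is taken as exp(sum of token log-ratios / length^alpha). The names alpha, p_gspo_ratios, and p_gspo_loss are hypothetical, not the authors' released API.

# Hypothetical sketch of a parameterized length normalization that
# interpolates between length-sensitive and length-blind sequence ratios.
import torch

def p_gspo_ratios(logp_new, logp_old, mask, alpha):
    """Sequence-level importance ratios with tunable length normalization.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the
        current and behavior policies.
    mask: (batch, seq_len) 1.0 for response tokens, 0.0 for padding.
    alpha: normalization strength; 0 keeps the raw log-ratio sum,
        1 divides by sequence length (geometric-mean, GSPO-style).
    """
    log_ratio_sum = ((logp_new - logp_old) * mask).sum(dim=-1)
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    return torch.exp(log_ratio_sum / lengths.pow(alpha))

def p_gspo_loss(logp_new, logp_old, mask, advantages, alpha=0.5, clip_eps=0.2):
    """Clipped surrogate loss using the parameterized sequence-level ratios
    and group-relative advantages (one scalar advantage per sequence)."""
    ratios = p_gspo_ratios(logp_new, logp_old, mask, alpha)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

Under this reading, alpha = 0 recovers the unstable, fully length-sensitive endpoint (raw log-ratio sum), alpha = 1 recovers the stable, length-blind endpoint, and intermediate values expose the balanced operating points the abstract describes.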
Primary Area: reinforcement learning
Submission Number: 3232