P-GSPO: Parameterized Group Sequence Policy Optimization for Length-Sensitive Reasoning

09 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reinforcement learning
TL;DR: Task-Adaptive Policy Optimization via Parameterized Length Normalization
Abstract: Policy optimization for LLM reasoning faces a core trade-off between length sensitivity and update stability. Methods that preserve length sensitivity, such as GRPO without length normalization, retain a valuable signal for deep multi-step reasoning but produce high-variance, unstable updates. Methods that enforce rigid length normalization, such as GSPO/GMPO, stabilize training but become length-blind and suppress credit for thorough reasoning. We introduce P-GSPO (Parameterized Group Sequence Policy Optimization), a single-parameter framework that turns this dilemma into a tunable axis. Instead of all-or-nothing normalization, P-GSPO applies a power-law length normalization whose strength is controlled by a single parameter, directly regulating how sequence length scales the policy update. This recovers the unstable, fully length-sensitive regime and the stable, length-blind regime as endpoints, while exposing a spectrum of balanced operating points in between. Integrated into masked diffusion LLMs within the d1 framework, P-GSPO yields large gains where length-blindness is most damaging (+19.9 on Countdown, +15.9 on Sudoku) and consistent improvements on math benchmarks (GSM8K, MATH). The takeaway is simple: explicitly modeling and controlling the influence of length is key to achieving both stable training and strong reasoning. All code will be released.
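A minimal sketch of the idea described in the abstract, under stated assumptions: the exact formulation is not given here, so the exponent name `alpha`, the function name `p_gspo_weights`, and the placement of the normalizer (dividing the summed per-token log-ratios by |y|^alpha) are illustrative choices, not the paper's definition. With that caveat, alpha = 0 would correspond to the fully length-sensitive, unnormalized regime and alpha = 1 to the length-blind, fully normalized (GSPO-like) regime, with intermediate values forming the tunable axis.

```python
import torch

def p_gspo_weights(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   mask: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Sequence-level importance weights with power-law length normalization (sketch).

    logp_new, logp_old: [batch, seq_len] token log-probs under the current and
    behavior policies; mask: [batch, seq_len] float, 1.0 for response tokens.
    """
    log_ratio = (logp_new - logp_old) * mask           # per-token log importance ratios
    seq_log_ratio = log_ratio.sum(dim=-1)              # length-sensitive sum over the response
    length = mask.sum(dim=-1).clamp(min=1.0)           # response length |y|
    # Power-law normalization: |y|**alpha interpolates between no normalization
    # (alpha=0, GRPO-like) and full per-token normalization (alpha=1, GSPO-like).
    return torch.exp(seq_log_ratio / length.pow(alpha))
```

In a GRPO/GSPO-style surrogate objective, weights of this kind would multiply group-normalized advantages and be clipped as usual; those surrounding pieces are omitted here.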
Primary Area: reinforcement learning
Submission Number: 3232